Time-to-first-instruction

fwsGonzo
5 min read · Dec 15, 2022

It matters when your game's scripts actually start running

Hey all. This post is about a common issue you may be facing when sandboxing your game's scripts. Maybe this specific problem will be familiar to you: I want to talk about execution timeouts.

Execution timeouts

If you are going to sandbox something, having an execution timeout is a necessity. Even if security is not important, or even a consideration, you must have a timeout so that you don't end up with a thread in your engine just spinning in a loop forever. So, let's enumerate the most common ways to do timeouts:

  1. Signalling or otherwise interrupting a thread with a running simulation.
  2. Counting instructions or jumps.
  3. sigsetjmp and friends (but it’s a wasp nest).

Maybe there are others, but the first two are the paradigms that matter for this post. If you add KVM to the mix there is actually a fourth option, but I will not go into that here. In wasmtime the first is supported by the epoch API, and the second by the fuel API.
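
To make option 2 concrete: a minimal sketch, with a made-up opcode set (not taken from any particular emulator), of a bytecode dispatch loop that counts instructions and traps on a deadline could look like this:

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical opcode set; a real emulator decodes full instructions here.
enum class Opcode : std::uint8_t { Nop, Add, Halt };

struct Machine {
    std::uint64_t counter = 0;
    std::uint64_t max_instructions = 1'000'000;

    void run(const std::vector<Opcode>& code) {
        std::size_t pc = 0;
        while (pc < code.size()) {
            // The deadline check: one increment and compare per executed
            // instruction (or per taken branch, if counting jumps instead).
            if (++counter > max_instructions)
                throw std::runtime_error("execution timeout");

            switch (code[pc++]) {
            case Opcode::Halt:
                return;
            case Opcode::Nop:
            case Opcode::Add: // ... execute the rest of the instruction set
                break;
            }
        }
    }
};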

Now that we know the two ways we can stop execution, let's look at the performance characteristics of each one.

Execution in a secondary thread

threadpool: task median 5364ns    lowest: 4701ns      highest: 6060ns

A casual micro-benchmark of a gold-standard C++11 threadpool task that returns a future with the return value and supports forwarding exceptions. On a live system the overhead is going to be much higher than this, probably 2–5x. Let's just round it up to 10 microseconds, as that will also cover the more minimalistic implementations people might have, as well as relatively idle systems running undemanding games. Remember that the people playing your game are running with ondemand frequency scaling, so C-states will change all the time. You might even find that your task took 100 microseconds just to land on the thread.

threadpool: task median 11152ns    lowest: 5556ns      highest: 14649ns

The same benchmark under the ondemand CPU frequency scaling governor. Seems about right for a micro-benchmark.
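
For contrast, here is a minimal sketch of the first approach, with std::async standing in for a real threadpool (the benchmark above used a proper threadpool, so treat this only as an illustration of the synchronization involved):

#include <chrono>
#include <future>
#include <stdexcept>
#include <utility>

// Run a task on another thread and give up waiting after a deadline.
// The handoff, wakeup and future synchronization is what costs microseconds.
template <typename F>
auto run_with_timeout(F&& task, std::chrono::microseconds deadline) {
    auto fut = std::async(std::launch::async, std::forward<F>(task));
    if (fut.wait_for(deadline) != std::future_status::ready) {
        // NB: std::async futures block in their destructor until the task
        // finishes, so a real implementation must also be able to interrupt
        // the simulation itself. That is the hard part.
        throw std::runtime_error("execution timeout");
    }
    return fut.get(); // rethrows any exception forwarded from the task
}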

Execution in same thread

Lua and LuaJIT support running from your current thread by hooking up instruction counting that traps on a deadline. It trashes performance, but that's just an implementation detail. My RISC-V emulator also uses this method, and there the counting does not affect performance to anywhere near the same degree.
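
For reference, this is roughly how such a hook is installed through the standard Lua C API. The count of 10000 and the error message are arbitrary choices of mine, not from any particular engine:

#include <lua.hpp>

static void timeout_hook(lua_State* L, lua_Debug*) {
    // Raises a regular Lua error, which unwinds back out to lua_pcall.
    luaL_error(L, "execution timeout");
}

void install_timeout(lua_State* L) {
    // Invoke the hook every 10000 VM instructions; the hook can check a
    // real deadline, or simply abort as it does here.
    lua_sethook(L, timeout_hook, LUA_MASKCOUNT, 10000);
}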

This allows the simulation to reliably stop without the overheads related to thread synchronization. With proper timeouts (e.g. SO_RCVTIMEO) and timers in the system call emulation, we can also prevent system calls from blocking forever.
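
As a sketch of the socket case, assuming the emulation layer has access to the host file descriptor (the 100ms value is just an example):

#include <sys/socket.h>
#include <sys/time.h>

// Cap blocking reads on a guest socket, so that an emulated read
// system call cannot stall the script thread forever.
bool cap_receive_timeout(int fd) {
    struct timeval tv {};
    tv.tv_sec  = 0;
    tv.tv_usec = 100'000; // 100ms
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}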

Crucially, this method adds zero overhead both before and after emulation, which is important in this context, as it allows us to skip thread synchronization completely.

API-centric emulation

Have you written a game engine or a complex game that uses a lot of scripting? If so, you probably know exactly what I'm going to write now: you overwhelmingly make API calls into the game engine (combined with memory sharing, if your scripting solution supports that), and so the only things that matter to you are:

  1. The overhead of beginning the simulation (e.g. luascript.call("my function"))
  2. The overhead of making an API call (e.g. engine.do_something(object))
  3. The overhead of stopping the simulation and returning the result back to (and resuming) the caller.

So, if you have a budget of 1 microsecond, which is a long-ass time when you think about it, then it is no longer convenient to synchronize handing tasks to another thread just for execution timeouts. Maybe you skip it altogether, and just hope for the best.

Or, maybe you decide to hook up the instruction counting hook for Lua? If you do: despite the performance drop in Lua and LuaJIT, you will be way better off. The extra call overhead and lower performance still give you a big win over trying to execute the script in another thread.

I made some measurements of each: LuaJIT and Lua.

LuaJIT gets a slight call overhead increase, from 55ns to 80ns, when instruction counting is used. I'm not exactly sure why. Still, it's no big deal. Like other emulators, including WebAssembly emulators, it's practically free to make API calls, and there is no cost to stopping the emulation.
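
For the curious, a call overhead measurement of this kind can be approximated with nothing but the Lua C API; the function name and iteration count below are mine, not from the original benchmark:

#include <chrono>
#include <cstdio>
#include <lua.hpp>

// Estimate the per-call overhead by timing N calls into an empty function.
void measure_call_overhead(lua_State* L) {
    luaL_dostring(L, "function my_function() end");
    constexpr long N = 1'000'000;

    const auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; i++) {
        lua_getglobal(L, "my_function");
        lua_pcall(L, 0, 0, 0);
    }
    const auto t1 = std::chrono::steady_clock::now();

    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("call overhead: ~%ldns\n", (long)(ns / N));
}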

How much can you do in 1 microsecond?

Most in-game events are just simple functions that manage timers, cameras, and story- and plot-related things. Nothing compute-heavy; just simple programming of in-engine constructs. I work with one of the WASM founders, and even he told me that we generally outsource compute-heavy work to WASM modules.

So, how many bytecodes can the world's slowest interpreter chew through in just 1 microsecond? Well, it's possible to speculate, but I just ran a rotate-around vector system call instead. The following code is my script running a benchmark (on a function in the script):

static float angle = 0.0f;
static vec2 v {1.0f, 1.0f};
measure("Vector rotate", [] {
    v = rotate_around(v.x, v.y, angle);
    angle += PI / 4;
});

It measures the time it takes to rotate a vector by a given angle. It's implemented as a system call that returns two float values. This is how the script invokes the rotate-around system call:

inline vec2 rotate_around(float dx, float dy, float angle) {
    // Forward the three floats to the engine and unpack the two returned.
    const auto [x, y] = fsyscallff(ECALL_VEC_LENGTH, dx, dy, angle);
    return {x, y};
}

> lowest 19ns  median: 21ns  highest: 21ns
[level1] Measurement "Vector rotate" median: 21

The implementation in the engine is pretty standard:

APICALL(api_vector_rotate_around)
{
    // Unpack the three float arguments passed by the guest.
    auto [dx, dy, angle] = machine.sysargs<float, float, float>();
    const float x = std::cos(angle) * dx - std::sin(angle) * dy;
    const float y = std::sin(angle) * dx + std::cos(angle) * dy;
    // Hand both result floats back to the guest.
    machine.set_result(x, y);
}

It uses std::cos and std::sin four times, which takes some time to perform, so it's a decent test. Subtracting a 65ns overhead for entering and leaving the script, (1000ns - 65ns) / 21ns gives us 44 calls per microsecond. That's way more work than your average script event would need to do. Make that 473 calls while waiting for that 10µs thread task to even land.

But wait, surely some events must do I/O-related things? Maybe not. The engine could handle all I/O in dedicated subsystems with their own threads, and resume the script at the appropriate time. The less error handling in the scripts, the better, right? And it's easy enough to switch between and resume scripted events. Perhaps I'm wrong and reality gets in the way here and there, but that would be my ideal structure.

There is, of course, nothing wrong with fire-and-forget events that do run in threads.

Conclusion

You can make an absurd number of engine API calls in your script events, using even a slow, interpreted, bytecode-based scripting backend, before you reach the overhead of a single thread-synchronized task.

-gonzo
