Lowering the latency of the lowest-latency emulator

fwsGonzo

Prepared calls in libriscv


Most function calls into the VM are not random: the call address is well known ahead of time. But until now I have never been able to use this knowledge to reduce overall latency, simply because interpreter overheads are already insanely low, usually around 3–4 ns per call.

Lately I’ve been experimenting with using libtcc as a JIT-compiler, and since it is now considered stable in my emulator, I’ve been trying to use it in my game. Unfortunately, the call overhead made it ultimately worse than interpreter mode for many functions. Game script functions are usually tiny, argument-heavy and similarly heavy on host calls. Such functions are low-latency in both interpreter mode and with binary translation, but entering a binary-translated function currently means wading through some extra logic in dispatch. Binary translation does especially well when there are argument-heavy host calls. My terminology may seem mixed up, but the libtcc JIT-compilation uses the binary translator as its source.
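For context, here is roughly what a regular (unprepared) VM call looks like in libriscv. This is a minimal sketch: the binary path, the guest function name and its arguments are placeholders, and the setup calls shown are just one typical configuration.

```cpp
#include <libriscv/machine.hpp>
#include <string>
#include <vector>

// Stand-in for reading the RISC-V ELF binary from disk.
extern std::vector<uint8_t> read_file(const std::string& path);

int main()
{
    const auto binary = read_file("gamescript.elf"); // path is illustrative
    riscv::Machine<riscv::RISCV64> machine { binary };
    machine.setup_linux_syscalls();
    machine.setup_linux({"gamescript"}, {"LC_ALL=C"});
    machine.simulate(); // run through main() so the program is initialized

    // A regular vmcall: look up the symbol, set up arguments, enter dispatch.
    const auto result = machine.vmcall("my_function", 123, 456);
    return (int)result;
}
```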

Prepared VM calls

My idea was to try to make something I term prepared calls. They cannot improve interpreter latency, but ideally they should add no overhead to the interpreter when binary translation isn’t enabled; and when it is enabled, they should short-circuit dispatch and call the translated function directly. This turned out to be extremely complicated, and landing on a safe-but-low-latency solution was much harder than I thought it would be.
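Conceptually, usage looks something like the sketch below. Treat the type name, template signature and constructor as illustrative rather than the exact API: the point is that the call target is resolved once, up front, so each invocation can skip the lookup.

```cpp
// Illustrative sketch: resolve the call target once, up front, so that
// each invocation can skip symbol lookup and, when binary translation is
// enabled, short-circuit dispatch and jump straight into translated code.
riscv::PreparedCall<riscv::RISCV64, int(int, int)> my_func { machine, "my_function" };

// Hot path: no lookup, and ideally a direct call into the translation.
const int result = my_func(123, 456);
```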

It’s dangerous to store stale information and then try to use it later. All kinds of hell can be experienced if any assumptions are wrong. So, I’ve been working on it and testing it in every single project that uses the emulator, and it took a while to stabilize it fully. So many surprises.

Finally, there’s a drawback with the current implementation: if the translated function has to return to dispatch, it instead returns to the short-circuiting logic, which then has to re-enter real dispatch.

The left side is the low-latency fast-path.

The usual vmcall logic is on the right side, where we enter dispatch, read a bytecode and very quickly enter a translated function. The translated function often ends the simulation by returning to a function that stops the machine, which also has to be executed. To take advantage of this knowledge, I made it possible to call the translated function directly (left side), and upon return check whether we need to go into dispatch to continue emulation or whether we can exit right away (knowing the address of the exit function). It turns out that being able to stop immediately after the translated function was the biggest improvement, and it even pushed TCC latencies below interpreted.
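In simplified C++, the short-circuit looks something like this. It is a self-contained model of the logic described above, not libriscv’s actual internals, and all names are illustrative.

```cpp
#include <cstdint>

// Simplified model of the prepared-call fast path; all names are made up.
struct Cpu { uint64_t pc = 0; };

using TranslatedFn = void (*)(Cpu&);

struct PreparedCall {
    uint64_t     call_address = 0;       // guest address of the function
    uint64_t     exit_address = 0;       // known address of the exit function
    TranslatedFn translated   = nullptr; // set when a translation exists
};

// Stand-in for the regular bytecode dispatch loop.
void dispatch_at(Cpu& cpu, uint64_t pc);

void prepared_vmcall(Cpu& cpu, const PreparedCall& prep)
{
    if (prep.translated == nullptr) {
        // No binary translation: plain dispatch, no added overhead.
        dispatch_at(cpu, prep.call_address);
        return;
    }
    // Fast path (left side): call the translated function directly,
    // bypassing dispatch on the way in.
    prep.translated(cpu);

    if (cpu.pc == prep.exit_address) {
        // The guest returned to the known exit function: we can stop
        // right here, without ever entering dispatch.
        return;
    }
    // The drawback from above: the translated code needed to return to
    // dispatch, so we must re-enter the real dispatch loop from here.
    dispatch_at(cpu, cpu.pc);
}
```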

Results

Using full binary translation: It’s definitely an improvement.

> lowest 10ns  median: 10ns  highest: 10ns
[std] Measurement "Block::find" median: 10

> lowest 11ns median: 11ns highest: 11ns
[std] Measurement "Block::isInGroup" median: 11

> lowest 18ns median: 18ns highest: 20ns
[std] Measurement "Rainbow Color" median: 18

> lowest 4ns median: 4ns highest: 4ns
[std] Measurement "Game::is_client()" median: 4

> lowest 4ns median: 4ns highest: 4ns
[std] Measurement "Overhead" median: 4

The new results are above ^.^

> lowest 12ns  median: 12ns  highest: 12ns
[std] Measurement "Block::find" median: 12

> lowest 14ns median: 14ns highest: 17ns
[std] Measurement "Block::isInGroup" median: 14

> lowest 21ns median: 21ns highest: 22ns
[std] Measurement "Rainbow Color" median: 21

> lowest 5ns median: 5ns highest: 5ns
[std] Measurement "Game::is_client()" median: 5

> lowest 5ns median: 5ns highest: 5ns
[std] Measurement "Overhead" median: 5

Above are the old results from my previous blog post, also using full binary translation.

I instrumented Event, adding counters for how many events directly call a translated function versus the total events created so far. I got 112/123 on the server, which is ~91%, and 144/155 on the client, so around ~93%. Pretty good!
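The counting itself is trivial; something along these lines (a sketch only, since the real Event class lives in the game code):

```cpp
#include <cstddef>

// Illustrative counters: Events created in total vs. Events that can call
// a translated function directly through their prepared call.
static size_t g_total_events  = 0;
static size_t g_direct_events = 0;

void count_event(bool calls_translation_directly)
{
    g_total_events++;
    if (calls_translation_directly)
        g_direct_events++;
    // Server measured 112/123 (~91%) direct; client 144/155 (~93%).
}
```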

Putting all this on a graph:

Interpreter mode is no longer the best choice for default.

Now we can see that the old binary translation results have been improved upon with prepared calls. More importantly, TCC now has improved baseline call overhead, finally making it decidedly better than interpreted. It was always much better for longer functions, but those are few in game scripting.

In order to be sure of things, I redid an old benchmark (from Part 8):

If you’ve ever written an interpreter before, you know that performance is all over the place, and this time it just happened to be golden. What JIT and binary translation bring to the table is consistent performance. Ignoring the golden interpreter dispatch, we see that libtcc with a very basic register allocator does quite OK. This benchmark is with the new prepared calls, and we actually gained 1 ns with full translation. It gives me confidence in the prepared-call implementation.

Conclusion

Making JIT-compilation with libtcc have low enough latency to make sense as a default has been a real challenge. But now, finally, we can say that it always makes sense. Another benefit is that libtcc is also available on Windows. The final work is to unify TCC and full binary translation, so that JIT-compilation can be enabled by default, but if the server transmits a full binary-translation DLL, we use that instead; a sketch of that fallback order follows the list below.

  • The default used to be interpreter mode, because it allowed me the highest iteration speed when making my game.
  • With JIT-compilation via libtcc stabilized, and overheads generally reduced, it can now become the new default when working on the game.
  • When the game is shipped, the game script will use full binary translation, falling back to libtcc only when not available.
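As a sketch of that planned fallback order (hypothetical; the names below are made up and not libriscv’s actual options API):

```cpp
// Hypothetical selection of execution mode for the unified setup.
enum class ExecMode { FullBinaryTranslation, LibtccJIT, Interpreter };

ExecMode choose_mode(bool have_server_dll, bool have_libtcc)
{
    if (have_server_dll)   // server transmitted a pre-built translation DLL
        return ExecMode::FullBinaryTranslation;
    if (have_libtcc)       // JIT-compile the translator's C output at runtime
        return ExecMode::LibtccJIT;
    return ExecMode::Interpreter; // always-available fallback
}
```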

I gave libtcc a round of fuzzing and it never found anything, so I think my worries about it being scary in production may have been wrong. We’ll see.

-gonzo
