libriscv: RISC-V Binary Translation, part 2

fwsGonzo
10 min read · Jun 2, 2024

Low-latency, high-performance emulation on end-user systems

Previously: Using C++ as a Scripting Language, part 13

Also: An Introduction to Low-Latency Scripting for Game Engines

Sandboxing

It’s hard to achieve good performance with something that is primarily trying to be a safe sandbox. Sandboxes have to place limits on everything in order to keep untrusted code from running rampant on your system. In my case the goal is also low latency. Binary translation adds complexity on top of that, because it has to uphold the same guarantees.

Binary translations ideally work across processes and across compilers, and sometimes even across systems. This is because generating generic, freestanding C has always worked fairly well. Sometimes you hit a long-vs-64-bit-integer bug when trying to run on Windows, but that’s OK: it’s the first thing you check for if you have an issue. I think the last one I had was 1UL << shift_value, which is never going to match a grep for long. What saved me was compiling it manually with MinGW, where the compiler warned that a shift width was larger than the operand’s size.
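
A minimal reconstruction of that kind of bug (not the actual translator output) looks like this:

// On Linux x86-64 (LP64), unsigned long is 64 bits; on 64-bit Windows
// (LLP64) it is 32 bits, so a shift_value of 32 or more is out of range
// and MinGW warns that the shift width exceeds the operand's size.
#include <cstdint>

uint64_t make_mask_buggy(unsigned shift_value)
{
    return 1UL << shift_value;         // 32-bit operand on Windows
}

uint64_t make_mask_fixed(unsigned shift_value)
{
    return uint64_t(1) << shift_value; // 64 bits on every platform
}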

Current state of binary translation

I’ve made some progress on the performance and reliability of the binary translation lately. It now passes the unit test suite with both Newlib and glibc RISC-V toolchains. It also reaches 85% of my native CoreMark score, which is nice. The JIT-compiled mode is slower than a full translation, but faster than interpreted, and since it starts instantly it’s OK to use it while developing the game.

Lately I’ve been working towards supporting binary translation on Windows. Specifically MinGW, since my game has a build for that.

High-performance emulation for end-users

The normal state of the emulator is interpreter mode. It’s the safest (lowest attack surface), it has the highest iteration rate when working on my game, and so on. Even though interpreter mode is not slow by any means, it’s still at least 10x off what binary translation can do. For example, my interpreter has a CoreMark score of 3550, while binary translation with safe (shippable) settings reaches 32k. JIT is nice, of course, but it’s dangerous for sandboxing. So dangerous, in fact, that the V8 team declared that switching from C++ to Rust would make no difference, and instead added a sandbox layer on top of V8, simply called the V8 Sandbox. So, while my emulator has a JIT mode using libtcc, I think for shipping a final product I will be using my pre-compiled binary translation.

For gaming, most users are going to be on Windows, at least for the initial release of the game. However, I don’t really want to attempt to compile anything on end-user systems. So, my thinking was that I could compile a generic shared library on the game server, and since the server already sends other resources to clients, why not one more?

DLLs and shared objects are nothing special. They are just binaries with a different setup for how to load and run them. In this case we want to match the ABI of the game client on Windows, so we should cross-compile a Windows DLL. Hopefully the result is a very generic, dependency-free shared library. I knew from before that there is a 64-bit Windows MinGW cross-compiler in my Linux distro. So, I used it on the translator’s output, and I tried various settings like -ffreestanding and -nostdlib, but in the end I found that it was enough to just build it without any fancy arguments. I ran strings on the DLL created by my MinGW cross-compiler and found only msvcrt.dll and kernel32.dll as dependencies, which I believe everyone has. Correct me if I’m wrong. (EDIT: I’ve since checked, and yes, msvcrt.dll comes with your OS, unlike msvcrtXX.dll, which does not!)

So, I think it’s possible for the server, at any time during execution, whenever it feels like it, to produce this extra .dll for Windows clients. Once it has it, it can be sent as part of the introduction payload, or perhaps even later. Send it over a low-priority channel? A binary translation DLL can also be applied to the emulator at any time, since it is a matter of writing an aligned value in each location where we want to enter a binary translated function instead of executing a bytecode.
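
In rough code, that last point looks something like the sketch below. The names and layout are purely illustrative, not the emulator’s actual internals: each decoded instruction slot holds a handler index, and activating a translated function is one aligned store, which is safe to do while the machine is running.

#include <atomic>
#include <cstdint>

// Hypothetical decoder-cache entry; the real layout in the emulator differs.
struct DecoderEntry {
    std::atomic<uint32_t> handler; // index into the bytecode handler table
    uint32_t instr;                // original RISC-V instruction bits
};

// Point one entry at a binary-translated function instead of a bytecode.
// An aligned 32-bit store is atomic on the host, so the interpreter either
// sees the old handler or the new one; both are valid at all times.
void activate_translation(DecoderEntry& entry, uint32_t translated_handler)
{
    entry.handler.store(translated_handler, std::memory_order_relaxed);
}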

I added a feature in my emulator that lets you build MinGW DLLs during translation. Then, on Windows systems I made it possible to load binary translation DLLs (but not compile them). I used an emulated dlopen/dlclose in order to avoid extra complexity in the loading code (a sketch of that shim follows after the log line below). Unfortunately, the shared libraries are quite big, even with Zstd:

* Received resource file bintr/rvbintr-2829B278.so with length 1109961 => 6486104
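
The shim itself is trivial: it just maps the POSIX-style calls onto the Win32 loader. A minimal sketch of that idea (not the emulator’s actual code):

#if defined(_WIN32)
#include <windows.h>

// Minimal dlopen/dlsym/dlclose shims over the Win32 loader, so the
// DLL-loading path can look the same on Windows as on Linux.
static void* dlopen_shim(const char* path)
{
    return (void*)LoadLibraryA(path);
}
static void* dlsym_shim(void* handle, const char* symbol)
{
    return (void*)GetProcAddress((HMODULE)handle, symbol);
}
static int dlclose_shim(void* handle)
{
    return FreeLibrary((HMODULE)handle) ? 0 : -1;
}
#endif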

I tried stripping the shared libraries, but no dice. We’ll see what I can do later. I might run an async thread in the background that compresses them at a much higher level than 16. As the compression level increases, the compression time goes up exponentially, and it’s not something I’m willing to wait for when starting the server.
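
That background pass could look roughly like this, a sketch that assumes the DLL bytes are already in memory and uses level 19 just as an example:

#include <zstd.h>
#include <cstdint>
#include <future>
#include <vector>

// Re-compress the translation DLL at a high Zstd level off the hot path,
// so server startup is not blocked by the much slower compression pass.
std::future<std::vector<uint8_t>> recompress_async(std::vector<uint8_t> dll, int level = 19)
{
    return std::async(std::launch::async, [dll = std::move(dll), level] {
        std::vector<uint8_t> out(ZSTD_compressBound(dll.size()));
        const size_t n = ZSTD_compress(out.data(), out.size(),
                                       dll.data(), dll.size(), level);
        if (ZSTD_isError(n))
            return std::vector<uint8_t>{}; // caller falls back to the original payload
        out.resize(n);
        return out;
    });
}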

I made the server compile and link a Linux .so and a MinGW .dll at startup:

* Loading program std from file 'mod/std/programs/std.elf'
MinGW Command: x86_64-w64-mingw32-gcc -O2 -s -std=c99 -fPIC -shared -x c -fexceptions -DRISCV_ARENA_OFF=1808 -DRISCV_MAX_COUNTER_OFF=1864 -DRISCV_INS_COUNTER_OFF=1856 -DRISCV_ARENA_ROEND=4857281 -DRISCV_MAX_SYSCALLS=600 -DRISCV_ARENA_END=29360128 -DRISCV_TRANSLATION_DYLIB=8 -DARCH=HOST_AMD64 -pipe -o rvbintr-2829b278.dll /tmp/rvtrcode-ykmmwV 2>&1

It only produces files that are missing, and they are cached in a subfolder. To make this possible, there is now a struct dedicated to cross-compilation, which can be tacked onto the MachineOptions struct during instantiation:

struct MachineTranslationCrossOptions
{
    /// @brief Provide a custom binary-translation compiler in order
    /// to produce a secondary binary that can be loaded on Windows machines.
    /// @example "x86_64-w64-mingw32-gcc"
    std::string cross_compiler = "x86_64-w64-mingw32-gcc";

    /// @brief Provide a custom prefix for the mingw PE-dll output.
    /// @example "rvbintr-"
    std::string cross_prefix = "rvbintr-";

    /// @brief Provide a custom suffix for the mingw PE-dll output.
    /// @example ".dll"
    std::string cross_suffix = ".dll";
    ...
};
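
Hooking it up looks roughly like the sketch below. The MachineOptions field names translate_enabled and cross_compile are assumptions made for the example; check the library headers for the real ones:

#include <libriscv/machine.hpp>
#include <cstdint>
#include <vector>

// Sketch: instantiate a machine with binary translation and MinGW
// cross-compilation enabled. Field names marked below are assumptions.
void instantiate_with_cross_compile(const std::vector<uint8_t>& binary)
{
    riscv::MachineTranslationCrossOptions cross;
    cross.cross_compiler = "x86_64-w64-mingw32-gcc";
    cross.cross_prefix   = "rvbintr-";
    cross.cross_suffix   = ".dll";

    riscv::MachineOptions<riscv::RISCV64> options;
    options.translate_enabled = true;  // assumed option name
    options.cross_compile = cross;     // assumed option name; may be a list in the real API

    riscv::Machine<riscv::RISCV64> machine { binary, options };
    // ... set up system calls and run as usual
}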

The server will output the .dll relative to itself, so that it can load it on-demand and transfer it to new clients. I prefer storing shared objects on disk so that I can inspect them, if necessary.

The game client can receive resources from the server, and they are not tagged in any special way; they are just relative filenames plus a payload. So, the game client doesn’t know whether it got a matching binary translation as part of the login sequence. And that’s fine, because ultimately, for the sake of our sanity, the client must instantiate a machine and generate a checksum in order to figure out which file it would want to load into memory. If it loads a mismatching .dll, symptoms can range from the benign to horrible, mysterious crashes.

I used a CRC32-C checksum for this, computed over the execute segment, and then XORed in build options and major emulator settings.
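
In rough code, the lookup name could be derived like this (a sketch using a plain bitwise CRC32-C; the exact values that get XORed in are illustrative):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Bitwise CRC32-C (Castagnoli, reflected polynomial 0x82F63B78).
static uint32_t crc32c(const uint8_t* data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Checksum the execute segment, XOR in build options and major emulator
// settings, and use the result as the translation filename.
std::string translation_filename(const uint8_t* exec_segment, size_t exec_len,
                                 uint32_t build_options, uint32_t emulator_settings)
{
    const uint32_t sum = crc32c(exec_segment, exec_len) ^ build_options ^ emulator_settings;
    char buf[32];
    std::snprintf(buf, sizeof(buf), "rvbintr-%08X.dll", sum);
    return std::string(buf);
}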

It’s possible that the server sends a .dll and the Windows client ends up not using it. And that’s fine, because it means the game will continue to work as normal, just without binary translation. The alternative, trying to use a potentially slightly incompatible .dll, could cause random crashes and other unpredictable effects.

MinGW woes

The numbers look good except on one benchmark:

> lowest 18ns   median: 19ns    highest: 21ns
[std] Measurement "Block::find" median: 19

> lowest 29ns median: 30ns highest: 32ns
[std] Measurement "Block::isInGroup" median: 30

> lowest 111ns median: 111ns highest: 113ns
[std] Measurement "Rainbow Color" median: 111

> lowest 6ns median: 6ns highest: 6ns
[std] Measurement "Game::is_client()" median: 6

> lowest 5ns median: 5ns highest: 5ns
[std] Measurement "Overhead" median: 5

Above is me running my game client with Wine. Below is native Linux:

> lowest 12ns  median: 12ns  highest: 12ns
[std] Measurement "Block::find" median: 12

> lowest 14ns median: 14ns highest: 17ns
[std] Measurement "Block::isInGroup" median: 14

> lowest 21ns median: 21ns highest: 22ns
[std] Measurement "Rainbow Color" median: 21

> lowest 5ns median: 5ns highest: 5ns
[std] Measurement "Game::is_client()" median: 5

> lowest 5ns median: 5ns highest: 5ns
[std] Measurement "Overhead" median: 5

We can clearly see that the rainbow calculation takes roughly 5x as much time on Windows (111ns vs 21ns). I tried running the Windows executable on a real system, and through Wine and Proton: they all largely agree on the performance. I also tried building the MinGW .dll in an Ubuntu 24.04 distrobox instance, in order to get a binary translation DLL using a newer GCC version. The main game executable is built using 64-bit MinGW, which has up-to-date compilers. LTO changed nothing. Switching between Clang-18 and GCC-14 changed nothing. The rainbow calculation is fairly simple: we enter the sandbox, call sinf three times, and return the result packed into 32 bits. It’s simple enough to follow what it does from start to finish. So, where is all the overhead coming from?

Here is the function (from a previous blog post):

uint32_t stdRainbowBlockColor(Block, int x, int z)
{
    static constexpr float period = 0.5f;
    const int r = api::sin(x * period) * 127 + 128;
    const int g = api::sin(z * period) * 127 + 128;
    const int b = api::sin((x + z) * period + 4.0f) * 127 + 128;
    return 255 << 24 | r << 16 | g << 8 | b;
}

Well, one easy check is to just replace the host-side implementation of the api::sin function with nothing. It should reduce the benchmark to mostly call overhead. And it does: it goes from 113ns to 13ns (call overhead + 3 empty system calls). So, I looked for a decent sin function for gamedev (and I do know that there’s a sin function for every kind of use-case out there). And voilà, both Windows and Linux agree on the performance of my script. It’s slower than std::sin on Linux, but that’s OK. So, this time it wasn’t me. I had to update every single benchmark, but I was happy to do so!
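
For reference, the kind of gamedev-friendly sine I mean is something like the classic parabola approximation below. It is a representative sketch, not necessarily the exact function I ended up with:

#include <cmath>

// A well-known cheap sine approximation: a parabola fit over [-pi, pi)
// with one extra precision pass, accurate to roughly 0.1%.
static inline float fast_sinf(float x)
{
    constexpr float PI = 3.14159265358979f;
    // Wrap x into [-pi, pi)
    x = std::fmod(x + PI, 2.0f * PI);
    if (x < 0.0f) x += 2.0f * PI;
    x -= PI;

    // y = (4/pi)x - (4/pi^2)x|x|, then refine with P = 0.225
    float y = (4.0f / PI) * x + (-4.0f / (PI * PI)) * x * std::fabs(x);
    return 0.225f * (y * std::fabs(y) - y) + y;
}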

Results

I run the Windows executable from Steam, via Proton.
Proton 9.0.1 on Linux running my game with sandboxing & binary translation.

The easiest way to test it is to run it in Proton. It’s working well, but it’s stressing my CPU fans a lot. I really need to put the brakes on the physics thread, and probably also check whether the FPS is capped properly on Windows.

Running in Wine on Linux, we can see that binary translation is enabled.

Running it in Wine locally also worked. Roughly the same latency numbers, but I think Proton 9.0 is much newer (and overall faster). I realize now that I need a computer for Windows testing.

Running natively on Linux using my Ryzen 7950X. Insanely low latencies.

The measurement numbers I’m used to: natively running on Linux. Butter-smooth, too. These measurements run at startup of both the client and the server. They don’t really affect startup time, which is currently 0.11s for the server and 1.2s for the client, so code iteration speed is still quite high.

I did spend some time learning about the Windows ABI, and to the best of my knowledge, as long as I’m calling a function with 4 or fewer integer arguments, they should be passed in registers. But I am worried about the cost of the shadow stack. According to The Performance Cost of Shadow Stacks and Stack Canaries by Thurston H.Y. Dang et al., it is not insignificant (~10%). Can the shadow stack explain the (small) call-overhead difference between Linux and Windows? I also know that I am being inefficient when returning from binary translation, because I return a 16-byte struct. That is very efficient on Linux, but probably questionable on Windows.
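
To make that last point concrete: under the SysV x86-64 ABI on Linux, a 16-byte struct is returned in the RAX:RDX register pair, while the Windows x64 ABI returns anything larger than 8 bytes through a hidden caller-provided pointer, adding a memory round trip on every return. The struct below is purely illustrative, not the emulator’s actual return type:

#include <cstdint>

// Illustrative 16-byte return type (not the emulator's real one).
// SysV x86-64: returned in RAX:RDX. Windows x64: returned via a hidden
// pointer argument, so every call pays an extra store/load.
struct ExitInfo {
    uint64_t instruction_counter;
    uint64_t max_counter;
};

// A binary-translated function returning such a struct to the dispatcher.
extern "C" ExitInfo translated_block(void* cpu, uint64_t pc);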

Verification on real Windows 10

I resurrected my old Dell laptop and ran two benchmarks that, well, spin the CPU a little bit.

My old Dell laptop had Windows 10

As you can see, the binary translation makes quite a dent in the Fibonacci calculation, while giving a modest 2x run-time reduction to the rainbow-color calculation, which is a real function that my game is using. It was great to see that once I modified the script on the server (my Linux machine), the Windows client received both it and the matching binary translation, and it just worked. Quite nice!

A previous benchmark with binary translation

In a previous blog post I benchmarked a rainbow color function that I was working on. This time I can add binary-translated libriscv running on Linux through Steam Proton. 😅

I resurrected my old Dell laptop with Win10 for this. All benchmarks were run on the Ryzen 7950X, except ye olde Dell.

The Dell laptop measurement is with binary translation, for the record. It’s a bit slower than my desktop computer, and it shows. But that’s OK — the game is very playable, and that’s all that matters. libriscv is a comfortable 5.25x faster than wasmtime (7.25x for sinf host-function). Wouldn’t be much of a low-latency framework if it wasn’t actually low latency, now would it!

I think the binary translation is quite fast now, and more importantly, it’s reliable enough for me to start shipping it enabled by default. Being able to use it on end-user systems is a nice achievement, too. I wonder why I didn’t think of this before!

-gonzo
