libriscv: RISC-V Binary Translation
Translating RISC-V binaries to efficient native code
I’ve had some thoughts about how to translate instructions one by one into C++ code, and then try to run that. I knew that I would want to translate RISC-V to some simple low-level code, compile it and then load it as a shared object. The hope was that this would provide a significant speedup by removing some of the emulation overhead. libriscv is already a fairly fast emulator that decodes every instruction into a function pointer for a single execute segment; however, that is naturally far behind native performance. With binary translation the hope is to come within 1/4 of native performance, although not in the first version.
This all began when I was reading the famous Popek & Goldberg paper on the requirements of virtual machines, and I realized I had to at least try to create some kind of efficient code for maybe the most important parts of the programs.
v1: Translating to C++
In the first version I started translating the RISC-V instructions one by one into C++, which could directly transform the state of the RISC-V machine. For the tiny executables I was working with this was not a problem. I quickly found out that code-block detection is really an art. I ended up stopping at every kind of branch and system instruction. In the end this provided very little speedup, for several reasons. One was that I didn’t support enough instructions yet, as I was incrementally adding more and more. Another was that the translated blocks ended up being too short, adding overhead from entering and leaving the native code. Worst of all, the compilation times were horrible on bigger programs, and I was dependent on headers from the RISC-V emulator’s own project!
v2: Larger code blocks
By only exiting the translated native code when a branch was actually taken, I could translate longer blocks, which provided a bigger speedup. I also generated a custom API header which I embedded into the shared object, freeing myself of unneeded dependencies. Compilation times dropped significantly, but C++ remained complex to parse. I checked the -ftime-report output, and there wasn’t much I could do other than contemplate moving on to C code generation. I seriously looked into how to reduce compilation times in general for both C and C++, and I couldn’t find much that I didn’t know already: compile on tmpfs, reduce header includes, avoid templates and so on.
v3: Translating to C, loop detection
By rewriting the whole thing to generate standard C, the compilation times were now low enough to make this usable for bigger executables. I also finished all the normal instructions, and I saw a significant speedup compared to my own untranslated benchmarks.
However, the biggest change was detecting loops! I was unable for the longest time to be faster than LuaJIT for fib(40) due to not being able to return to native code at loop locations. By scanning for negative-offset branch instructions and going back later to generate code for them I was able to fly right by LuaJIT. That is of course not very surprising when done properly, because I’m translating an optimized executable, while LuaJIT is doing everything alone — very quickly too! Truth be told, I’m comparing against Lua simply because of how easy it is to integrate. And it serves as a good target to compare against for a hobby project!
Nevertheless, this has been an incredibly rewarding and fun thing to do.
You can have a look at the work-in-progress binary translation here.
Since compilation can take some time depending on the size of the program, it would be nice if we could just keep the result for later in a temporary location. So, if you enable the feature in the library, it will CRC32 the code and compiler arguments and use the checksum as part of the filename. We can then find the same file again later. We don’t want obscure bugs where some minor change causes a cached shared object to behave erroneously, and this is a great way to avoid that. I call this the translation cache, and it consists of shared libraries compiled from translated RISC-V. The code generation is quite fast, and it can be made faster still with C++20’s std::format whenever that is widely available. The check-summing is also very fast, as I made sure to use the CRC32-C instructions when SSE4.2 is available.
It’s certainly very interesting seeing the translator ignoring your new binary if all you changed was the names of some variables and some comments. Sometimes the code is the same, and the checksum will reflect that.
So let’s talk about some of the hurdles that remain, and might not ever be solved. First off, when you scan for code blocks to translate you have to use a threshold for minimum block size. If you don’t, you can sit there translating forever, increasing compile times with very likely little benefit.
Another is that there is a maximum number of code blocks and instructions (counted separately) that you are willing to translate. Surprisingly, this works out very well, because the code you care about tends to sit higher up in the executable layout! I experimented with many combinations of code block and instruction counts, and beyond a certain number you don’t get anything back anymore. Maybe you get it back in startup time, but for my projects I don’t care about that — especially if I’m using binary translation with an expensive compilation stage.
Every register change inevitably has to be saved in the emulator’s register file. This means extra instructions and extra stores, and while optimizers can improve on this, especially in tight loops, in the end these stores have to be committed even though most of them are never read afterwards.
There are significant differences in compile times between compilers. The fastest so far has been clang with -fuse-ld=lld. Reducing the time it takes to compile the translated code is the biggest hurdle right now. Small executables finish almost instantly, but larger executables see exponential scaling in compile times. For example, a C++ program with exceptions and everything enabled will take a few seconds to compile with optimizations enabled (-O2), less so with optimizations off (-O0). I made it so that you can pass CC and CFLAGS to the process that hosts libriscv to control these things.
On the flipside, you can just turn it off and get instant execution. It depends on your needs.
I’m sure many of you are wondering why I’m not just generating assembly directly. That is completely out of the question. It’s not portable, and it would take months, maybe years, to get everything right. It does have nearly instant linking time, but I would miss the big picture all the time: I wouldn’t be able to generate optimal assembly without reinventing the optimization wheel, since the most common way to optimize assembly is peephole optimization. Instead, there is now a portable solution to binary translation where you can tune the optimization levels. It will work on your ARM server, your desktop and your mobile phone! For iOS and Nintendo Switch you can just turn off binary translation.
If any stage of binary translation fails the emulator will clean up temporaries and fall back to regular emulation. I’m not sure if it’s the right thing to do, but there is a boolean getter in Machine that tells you if the currently running program is using a natively compiled shared object or not.
I actually don’t have any set plans other than sitting on this feature branch for a week to see if it’s stable and if the compilation times can be improved upon. I was considering trying out libtcc, so I could translate and compile to amd64 directly into memory. I already tried using it to manually compile the output from binary translation and loading that, however there is clearly some miscompilation going on, as the guest fails on every program I tested. I’m not sure if I have some unclear C code or TCC is just being a bit buggy, but it compiles without warnings and all the public symbols are present. So I have put it on a shelf for now. The first thing I would check is whether it miscompiles sign-extension casts, for example (int64_t)(int8_t)value, or whether it misbehaves when I’m using unions to modify floating-point values.
There are also still many improvements I can make to simplify the generated C code. I tried caching function prologues/epilogues, but I failed to see much of an improvement in actual run-time, which just reaffirms that C compilers are really good, or maybe it’s the hardware too. I’m glad I chose C in the end, because it’s a language where you can seriously reduce code size as needed; that was one of its primary goals in the beginning. That’s why you have the ternary operator, ++, and single-character operators (^) instead of words (xor). It’s perfect for my use case, and I can probably improve compile times somewhat by reducing the overall code size.