Making remote VM function calls easy, transparent and reliable
Hello, it’s insanity hour again. Last time I used inline assembly to speed up building block functions like memcpy, memset, strcmp and friends as system calls. memcpy-as-a-service. It gave an unbelievably huge speedup at the cost of learning how to make very complicated inline assembly, which could produce nasty surprises if not done correctly. Which it wasn’t.
Before that, I tried to make my game scripts communicate with each other transparently. Imagine you have 1000 levels in a game, and you want to have some common code as well as some shared mutable memory in order to keep more stuff directly in the scripts. For example, how enemies act could be stored in a single script shared and accessible by all levels.
Also, imagine that you want to remember certain things like if the player has opened a certain door, and you want to store that programmatically in the script instead of in the engine.
struct GameState
{
bool key_for_sand_dungeon = false;
...
} state;
void player_got_sand_key() {
state.key_for_sand_dungeon = true;
}
Simplified, but exactly like the above. This code you would find in a script that always exists and is accessible to all levels. So, in order to get the sand key you can remotely call the player_got_sand_key() function from the level. It would execute in another script, which we can think of as shared. That’s it. Now the player has the key no matter which other levels they go to, and the engine only facilitates the scripts and the calling between them.
If I lose you a little along the way, don’t worry about it. This stuff is complicated and dense. Oh, and if you were wondering, my so-called game scripts are RISC-V static ELF binaries. It allows me to use systems languages for game engine scripting, and it works very well, thank you!
The dream
I’ve made repeated attempts until now to make communication between programs automatic and natural. The dream has always been to have a completely separate program that you can interact with as if it was code in your current script, passing arguments and data, and returning the same way. In the last attempt I made, I used a common shared memory area (that I still have now) in order to pass arguments. I had a simple wrapper that took a structure and wrote it to the area, and that was that. It would look the same on all machines, and for plain data it would work fine.
/* This incantation creates a callable object that when called, tells
the engine to find the "gameplay" machine, and then make a call
into it with the provided function and arguments. */
ExecuteRemotely somefunc("gameplay", gameplay_function);
/* We will need to put the struct on a shared memory area, so that it
will be visible and writable on both sides. */
SharedMemoryArea shm;
auto& some = shm(SomeStruct{
.string = "Hello 123!",
.value = 42
});
/* This is the actual remote function call. It has to match the
definition of gameplay_function, or you get a compile error. */
int r = somefunc(1234, some);
print("Back again in the start() function! Return value: ", r, "\n");
print("Some struct string: ", some.string, "\n");
As you can see above, there are a few incantations going on in order to make a remote function call.
I think it can be solved in a better way. I especially think it’s not easy to reason about what happens to all the things going back and forth. What if it becomes a point in the codebase that you always have to go back to and re-check when something breaks? Or you accidentally pass an invalid pointer around? What if you want to use modern C++ facilities across the boundaries?
From now on I will use the word script to refer to a RISC-V static executable that is loaded and used in my game engine as if it was any other scripting language. This is largely just an internal wording I have kept because I used to use Lua.
Execute traps
In order to understand what’s going to happen we must talk about execute traps. It’s basically a way to decide what happens when a specific part of memory is about to be executed.
I recently merged a new variation of execute traps in my RISC-V emulator codebase, and this time they don’t affect performance as they are not in the fast path. They will only be considered if you are 1. jumping to a new, unseen, un-optimized execute segment (which processes and activates it), or 2. faulting when jumping to a page without +execute permission.
Specifically, we are interested in the second condition, with a twist:
Handling jumps to any page in a specific, but remote area of memory.
After merging the feature I started thinking about how you can Make Remote Execution Look Completely Seamless Part IV, or whatever it is now. Let’s start with just having a known address that is completely outside the range of the current level’s image space, and go from there. The image space is the area that a program will realistically be using without special calls to mmap, e.g. the executable code, followed by read-only data, followed by data and zeroed data, then followed by stack and heap pages. This is without ASLR. The address space is gigantic, and it should be easy to keep two programs separate without worrying about overlaps.
So, let’s put each and every level script at the same place in memory, let’s say it starts at around 4MB. And then we put another program at 2GB+. That should keep both programs within the nice 32-bit space (for the purposes of saving on instructions on RISC-V), and also far away from each other. So, how can we know what space a program spans? Well, a userspace emulator can control the entire address space and it always sets up the initial environment, including stack and heap (BRK + mmap).
In order for this to work we absolutely have to make this convenient outside of the engine’s code too. Ideally we should have it fully automated in the build system.
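The "is this address outside the level's image space?" question can be sketched in a few lines. This is only an illustration: `ImageSpace` and `is_remote_target` are hypothetical names, and the bounds used in the usage note are made up rather than taken from the engine.

```cpp
#include <cassert>
#include <cstdint>

// Half-open address range [begin, end) that a script realistically occupies.
// Hypothetical type for illustration, not an emulator API.
struct ImageSpace {
    std::uint64_t begin;  // first address of the image
    std::uint64_t end;    // one past the last address of the image

    constexpr bool contains(std::uint64_t addr) const {
        return addr >= begin && addr < end;
    }
};

// A jump target outside the level's own image space is a candidate for a
// remote call; anything inside it is the level's own business.
inline bool is_remote_target(const ImageSpace& level, std::uint64_t target) {
    assert(level.begin < level.end);
    return !level.contains(target);
}
```

With an assumed level range of `{0x400000, 0x10000000}`, a program counter such as `0x400670` counts as local, while an address far above the range, such as `0x500043e4`, counts as remote.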
GNU’s binutils (binary utilities)
The ld linker has a wealth of options and one of the more obscure ones is the --just-symbols command-line option. It allows you to import just the symbols of a given binary, and nothing else. That said, it’s not just the address, it’s the type and size too, so it’s quite nice that way. See: https://refspecs.linuxbase.org/elf/gabi4+/ch4.symtab.html
Now, remember: the goal here is to take symbols from our to-be-shared script and make them available in each level. We can’t just import the whole shared script as-is, as it contains everything: int main(), strcpy etc. Thankfully, we can use objcopy to reduce it down to just the symbols we want, by creating a new binary that only contains symbols, and only the ones we want:
add_custom_command(
TARGET ${NAME} POST_BUILD
COMMAND ${CMAKE_OBJCOPY} -w --extract-symbol --strip-symbol=!${WILDCARD} --strip-symbol=* ${CMAKE_CURRENT_SOURCE_DIR}/${NAME} ${NAME}.syms
)
This CMake snippet executes objcopy with wildcard support. It reduces the output down to just symbols, and it removes every symbol that does not match the wildcard. E.g. if your shared script is named gameplay, we would use a *gameplay* wildcard, and add a ! before it, negating the stripping.
13: 00000000500043e4 140 FUNC GLOBAL DEFAULT 1 gameplay_function
The resulting binary has a global symbol of type function whose name contains gameplay. This is fully automatic!
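To get a feel for which names survive such a filter, here is a small sketch that applies a shell-style pattern to a list of symbol names using POSIX `fnmatch`. It only imitates the "keep what matches" idea and does not call objcopy; `keep_matching` is a hypothetical helper, and the `*[Gg]ameplay*` pattern in the usage note is an assumption about how one might also catch capitalized C++ names.

```cpp
#include <cassert>
#include <fnmatch.h>  // POSIX shell-style pattern matching
#include <string>
#include <vector>

// Returns the symbol names that match the pattern, i.e. the ones a
// "strip everything except PATTERN" pass would leave behind.
// Hypothetical helper for illustration only.
std::vector<std::string> keep_matching(const std::string& pattern,
                                       const std::vector<std::string>& symbols)
{
    assert(!pattern.empty());
    std::vector<std::string> kept;
    for (const auto& name : symbols) {
        // fnmatch returns 0 when the name matches the pattern.
        if (fnmatch(pattern.c_str(), name.c_str(), 0) == 0)
            kept.push_back(name);
    }
    return kept;
}
```

Given `main`, `strcpy`, `gameplay_function` and the mangled name `_ZN8Gameplay10set_actionEb`, the pattern `*gameplay*` keeps only `gameplay_function`, because matching is case-sensitive. A pattern like `*[Gg]ameplay*` keeps the mangled member function as well.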
Now we can take this new binary and import it into each level by using --just-symbols:
add_dependencies(${NAME} ${PROGRAM})
set_property(TARGET ${NAME} PROPERTY LINK_DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/${PROGRAM})
target_link_libraries(${NAME} -Wl,--just-symbols=${SYMFILE})
The resulting level scripts should now have the gameplay_function listed above:
$ riscv64-unknown-elf-readelf -a level1.elf | grep gameplay
84: 00000000500043e4 140 FUNC GLOBAL DEFAULT ABS gameplay_function
Yay!
Automatically imported function symbols
So, what is all this good for, exactly? I already have remote function calls, and so on. Well, I think they were clunky. I also had a larger dream of using a shared machine as the game’s complete save state. Indeed, the VM itself could be the save game. This has pros and cons, but I really like the idea and I want to try it out. One negative could be that in order to update the shared script, we would then also need to re-link all the levels in order to update the symbols. Perhaps it can be mitigated with some tricks, but the linking is fairly fast for files that are ~125k in size, so for now I am not going to spend time on it.
Now we have all the build system bits in place. We now get relevant symbols from one shared script automatically added to all levels. This allows us to do this:
struct SomeStruct {
char string[64];
int value;
};
extern "C" long gameplay_function(float value, SomeStruct& some);
Just a regular function declaration. It doesn’t even have to be extern "C" as the call will be calling-convention and language agnostic. This builds and links just fine.
Performing the remote call
Now that we have working programs that can reference each other, we need to implement the remote call.
In order to handle the call we can use an execute space protection fault handler that checks whether the jump was outside of the level’s image space, and if it was we can try executing a function in other scripts. After the call we can return back to executing in the level script using the return address (RA register), ultimately making it function like a regular function call.
This is the procedure I came up with:
- Take the 2GB+ address function that is shared
- Make a regular C/C++ function call
- Experience an execution space protection fault
- The fault calls the custom handler instead, and we check if the address is outside the level script’s image space (otherwise it’s a regular fault).
.. and now it gets complicated:
- Copy all the RISC-V general-purpose registers to the destination script (including floating point registers).
- Set custom write- and read- page fault handlers that check the address in order to loan pages from the level script to the destination script. But, we are not actually putting the pages in the destination, just using them each time a fault happens. Faults happen when the destination machine tries to read and write to areas that we strongly suspect belong to the level script that is currently calling.
- Execute the function call in the remote script
- Forward the return registers back to the caller script
And that’s it. With this we have a way to temporarily hook up the ability to read and write memory from the source level script from inside the destination script, and then properly take it down again afterwards.
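The later steps of the procedure can be modelled with a toy. This is a sketch under heavy assumptions and not the emulator's real API: `ToyMachine`, `remote_call` and the sparse word-addressed memory are all invented for illustration, floating point registers are left out, and a single `page_fault` hook stands in for the separate read and write handlers.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <unordered_map>

using u64 = std::uint64_t;
constexpr int REG_A0 = 10, REG_A1 = 11;  // RISC-V a0/a1 are x10/x11

// Toy stand-in for an emulated script (hypothetical, illustration only).
struct ToyMachine {
    u64 image_begin = 0, image_end = 0;
    std::array<u64, 32> regs{};                // integer registers only
    std::unordered_map<u64, u64> memory;       // sparse "pages"
    std::unordered_map<u64, std::function<void(ToyMachine&)>> functions;
    std::function<u64&(u64)> page_fault;       // consulted for foreign addresses

    bool owns(u64 addr) const { return addr >= image_begin && addr < image_end; }

    // One accessor for reads and writes, to keep the sketch short.
    u64& at(u64 addr) {
        if (owns(addr)) return memory[addr];
        if (page_fault) return page_fault(addr);
        throw std::runtime_error("protection fault");
    }
};

// What the execute fault handler does once the caller jumps to `target`.
void remote_call(ToyMachine& caller, ToyMachine& remote, u64 target) {
    assert(&caller != &remote);
    if (caller.owns(target))
        throw std::runtime_error("regular execute fault");
    auto fn = remote.functions.find(target);
    if (fn == remote.functions.end())
        throw std::runtime_error("no remote function at target");

    remote.regs = caller.regs;  // arguments travel in the registers
    // Loan the caller's memory: nothing is copied, each access is redirected.
    remote.page_fault = [&caller](u64 addr) -> u64& {
        if (!caller.owns(addr)) throw std::runtime_error("protection fault");
        return caller.memory[addr];
    };
    try {
        fn->second(remote);     // run the function in the remote script
    } catch (...) {
        remote.page_fault = nullptr;
        throw;
    }
    remote.page_fault = nullptr;                 // take the loan down again
    caller.regs[REG_A0] = remote.regs[REG_A0];   // forward return registers
    caller.regs[REG_A1] = remote.regs[REG_A1];
}
```

In this model a remote function that dereferences a pointer passed in a1 ends up touching the caller's memory through the loaned handler, and whatever it leaves in a0 shows up in the caller as the return value.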
The source and destination scripts will see this whole operation as a regular function call, with only one caveat: You cannot send a pointer back into the caller script after the call ends, so the caller must provide somewhere for the remote script to store it.
struct SomeStruct {
char string[64];
int value;
};
This is why the string is a 64-byte array — although it could be anything, really. Let’s try that anything and see how far we can take it.
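As a plain C++ illustration of that caveat, here is the caller-provides-storage pattern on its own, without any remote machinery. `describe_value` is a hypothetical stand-in for a remote function: it writes its text result into the caller's struct instead of returning a pointer to memory it owns.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

struct SomeStruct {
    char string[64];
    int value;
};

// Hypothetical stand-in for a remote function. The result is written into
// storage that belongs to the caller, so nothing that lives on the callee's
// side has to stay reachable after the call has ended.
long describe_value(SomeStruct& out) {
    // snprintf truncates and always null-terminates within the given size.
    int written = std::snprintf(out.string, sizeof(out.string),
                                "value is %d", out.value);
    assert(written >= 0);
    return out.value * 2L;
}
```

The caller creates a `SomeStruct`, sets `value`, makes the call, and then reads `string` from its own memory.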
Using C++ containers instead of buffers
If it really is agnostic and fully transparent, couldn’t it just be a std::string or a std::vector? Alright, let’s see what happens.
On the level script:
struct SomeStruct {
std::string string;
int value;
};
extern "C" long gameplay_function(float value, SomeStruct& some);
...
SomeStruct ss {
.string = "Hello World",
.value = 42
};
// Make a remote call??
long r = gameplay_function(1234.5, ss);
print("Back again! Return value: ", r, "\n");
print("Some struct string: ", ss.string, "\n");
On the destination:
PUBLIC(long gameplay_function(float value, SomeStruct& some))
{
print("Hello Remote World! value = ", value, "!\n");
print("Some struct string: ", some.string, "\n");
print("Some struct value: ", some.value, "\n");
some.string = "Hello 456!";
return value * 2;
}
Produces:
>>> [gameplay] says: Hello Remote World! value = 1234.5!
>>> [gameplay] says: Some struct string: Hello World
>>> [gameplay] says: Some struct value: 42
>>> [level1] says: Back again! Return value: 2469
>>> [level1] says: Some struct string: Hello 456!
So, we actually got Hello 456 back. I assume it wouldn’t work if it had to resize the string. I made the returned string much longer, and as predicted:
Script::call exception: Possible double-free for freed pointer (data: 0x511321c0)
>>> [0x400670] 00000073 SYS ECALL
>>> Machine registers:
[PC 00400670] [RA 0040270C] [SP 0151BD90] [GP 0041CA38] [TP 0041CC90]
[LR 00000800] [TMP1 00415D2C] [TMP2 00000012] [SR0 0151BDB8] [SR1 00000005]
[A0 FFFFFFFFFFFFFFFF] [A1 00000025] [A2 00000024] [A3 00000000] [A4 0159F1C0]
[A5 0151B5B4] [A6 00000002] [A7 0000023D] [SR2 004020DC] [SR3 00416000]
[SR4 00415E78] [SR5 00000000] [SR6 00000000] [SR7 00000000] [SR8 00000000]
[SR9 00000000] [SR10 00000000] [SR11 00000000] [TMP3 00000001] [TMP4 00000000]
[TMP5 0151BE31] [TMP6 00000000]
Program page: [0x0000000000400670] Readable: [x] Writable: [ ] Executable: [x]
Stack page: [0x000000000151BD90] Readable: [x] Writable: [x] Executable: [ ]
Function call: start
-> [0] 0x0000000000400224 + 0x44c: exit
-> [1] 0x000000000040218C + 0x580: start
-> [-] 0x000000000040218c + 0x000: start
The host-controlled heap-as-a-service sees the freed pointer was never allocated by the caller script (as it came from the remote script) and gives us a wrist slap. Very nice… I guess. Control test:
Source:
ss.string.resize(256);
ss.string = "Hello";
long r = gameplay_function(1234.5, ss);
Destination:
some.string = "Hello 456! This string is very long!";
Output:
>>> [level1] says: Some struct string: Hello 456! This string is very long!
Yep. This works because the compiler is the same. Finally, I removed the extern "C", making the function a regular C++ function, and it worked! I’m sure I can continue to improve on this feature, but I am feeling like things are starting to come together.
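The control test also depends on the remote side never having to allocate: the caller pre-sizes the string, so the later assignment fits in a buffer the caller's heap already owns. The sketch below checks that behaviour locally. The C++ standard does not promise that assignment keeps the old buffer, so treat this as an observation about common standard library implementations; `assign_within_capacity` and `ReuseReport` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

struct ReuseReport {
    bool capacity_kept;  // capacity is still at least the pre-sized amount
    bool buffer_reused;  // data() points at the same allocation as before
};

// Mirrors the control test: pre-size the string on the caller side, then
// assign text that fits within the existing capacity.
inline ReuseReport assign_within_capacity(const char* incoming) {
    std::string s;
    s.resize(256);   // the allocation happens here, on the caller's heap
    s = "Hello";
    assert(std::strlen(incoming) <= s.capacity());
    const auto before = reinterpret_cast<std::uintptr_t>(s.data());
    s = incoming;    // what the remote script effectively does
    const auto after = reinterpret_cast<std::uintptr_t>(s.data());
    return { s.capacity() >= 256, before == after };
}
```

When both flags come back true, no allocation or free took place during the assignment, which is the situation in which the remote script can safely write into a string owned by the level.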
> median 91ns lowest: 88ns highest: 99ns
>>> Measurement "Call remote function" median: 91 nanos
Internal micro benchmarks show it took 91 - 17 (benchmark overhead) = 74 ns. That’s not the real time spent when running the actual game, but it’s meaningful as a comparison relative to the other remote calling implementations I’ve had over the years. Around 30 ns is spent loaning the fault handlers and putting the originals back in place after. I think the only way to drastically improve this functionality is to use a custom std::function implementation with a fixed-size capture storage in order to avoid heap allocation. *sigh*
… In the name of performance I brought back the custom capture storage backed functions. A huge boost in performance:
> median 58ns lowest: 56ns highest: 64ns
>>> Measurement "Call remote function" median: 58 nanos
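For readers who have not seen one, a function wrapper with fixed-size capture storage can be sketched roughly like this. It is not the implementation used in the engine: `InlineFunction` is a hypothetical name, the 32-byte default capacity is arbitrary, and the sketch only accepts trivially copyable callables so that it needs no destructor or custom copy logic.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template <typename Signature, std::size_t Capacity = 32>
class InlineFunction;  // only the R(Args...) specialization is defined

template <typename R, typename... Args, std::size_t Capacity>
class InlineFunction<R(Args...), Capacity> {
public:
    InlineFunction() = default;

    template <typename F,
              typename = typename std::enable_if<
                  !std::is_same<typename std::decay<F>::type, InlineFunction>::value>::type>
    InlineFunction(F&& f) {
        using Fn = typename std::decay<F>::type;
        static_assert(sizeof(Fn) <= Capacity, "capture too large for inline storage");
        static_assert(alignof(Fn) <= alignof(std::max_align_t), "over-aligned capture");
        static_assert(std::is_trivially_copyable<Fn>::value,
                      "this sketch only supports trivially copyable callables");
        // Construct the callable inside our own buffer: no heap allocation.
        ::new (static_cast<void*>(storage_)) Fn(std::forward<F>(f));
        // A captureless lambda remembers the concrete type for the call.
        invoke_ = [](void* self, Args... args) -> R {
            return (*static_cast<Fn*>(self))(std::forward<Args>(args)...);
        };
    }

    R operator()(Args... args) {
        assert(invoke_ != nullptr && "calling an empty InlineFunction");
        return invoke_(storage_, std::forward<Args>(args)...);
    }

    explicit operator bool() const { return invoke_ != nullptr; }

private:
    alignas(std::max_align_t) unsigned char storage_[Capacity] = {};
    R (*invoke_)(void*, Args...) = nullptr;
};
```

A lambda such as `[base](int x) { return base + x; }` can be stored in an `InlineFunction<int(int)>` and called like a normal function. The trade-off is that a capture larger than the capacity is rejected at compile time instead of silently falling back to the heap.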
Now that the basics are covered, let’s try to make general reads and writes to the remote script work without even having to do remote calls.
Remote reads and writes
We already support page read/write fault handlers, and so if we set them properly in the caller script, we should be able to directly read and write to the shared script’s data structures. This is important not just because we can extern structures, but also because we need it in order to call member functions!
I won’t bore you with the details of setting up the fault handlers, but I wrote a simple C++ structure to test with:
struct Gameplay {
bool action = false;
void set_action(bool a);
bool get_action();
};
extern Gameplay gameplay_state;
With the structure stored and the member functions implemented in the remote script:
Gameplay gameplay_state;
void Gameplay::set_action(bool a)
{
print("Setting action to ", a, "\n");
this->action = a;
}
bool Gameplay::get_action()
{
return this->action;
}
The general rule is that everything needs to contain Gameplay or gameplay, for the automatic symbol imports, so for the member functions to be visible we need to name the structure something with Gameplay too.
gameplay_state.set_action(true);
print("Action: ", gameplay_state.get_action(), "\n");
gameplay_state.set_action(false);
print("Action: ", gameplay_state.get_action(), "\n");
Now we can try to use it from our level script. I don’t know if I should say expected, but it seems to Just Work ™️:
>>> [gameplay] says: Setting action to true
>>> [level1] says: Action: true
>>> [gameplay] says: Setting action to false
>>> [level1] says: Action: false
That’s quite something. Not sure how to feel about it. Just seems like I did something crazy and unexpected. I think that concludes remote calls!
> median 62ns lowest: 60ns highest: 80ns
>>> Measurement "Call remote C++ member function" median: 62 nanos
Thanks for reading!
-gonzo