Using Fast Virtual Machines In A Game Engine

fwsGonzo
17 min read · May 14, 2020


What are virtual machines, and why so many?

First off, what is a virtual machine?

A virtual machine is a CPU or system emulator that handles logic completely separately from the host machine, in this case the game engine. It has its own separate memory and no way to access the outside world unless it uses system calls. System calls are the only way for a virtual machine guest (the program running inside the machine) to do things it normally wouldn’t be able to do, and only if the host allows it! One example would be printing text to the terminal, which requires the game engine to write to stdout. The virtual machine guest cannot do that by itself; it can only write to stdout if there is a system call that, when handled, does so. The guest running inside the virtual machine can do its own calculations and read and write its own memory, but if it wants to perform some action that is visible on the outside, that action has to be done by the engine when handling a system call. Virtual machines are like tiny computers, and we can have many of them.

The game engine calls into the virtual machine, which needs help to print something in the terminal, so it invokes a system call. The system call gets handled by the engine, which makes a familiar system call to the operating system in turn.
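A minimal sketch of this boundary might look like the following. All the names here (Machine, SYSCALL_PRINT, the a0 register index) are hypothetical, not the real emulator's API: the point is only that the guest can trigger a numbered system call, and the engine decides what it does.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Sketch of the syscall boundary: the guest only reaches the outside
// world through numbered system calls that the engine handles.
struct Machine {
    std::vector<uint8_t> memory;            // the guest's private memory
    uint64_t regs[32] = {};                 // guest registers (a0 = regs[10])
    std::map<int, std::function<void(Machine&)>> syscall_handlers;

    // Copy a zero-terminated string out of guest memory.
    std::string read_string(uint64_t addr) const {
        std::string s;
        while (addr < memory.size() && memory[addr] != 0)
            s += char(memory[addr++]);
        return s;
    }
    void system_call(int number) {
        auto it = syscall_handlers.find(number);
        if (it != syscall_handlers.end()) it->second(*this);
    }
};

static constexpr int SYSCALL_PRINT = 1;     // hypothetical syscall number
static std::string g_last_output;           // captured for demonstration

void install_default_handlers(Machine& m) {
    // The engine decides what "print" means; the guest only passes a pointer.
    m.syscall_handlers[SYSCALL_PRINT] = [](Machine& m) {
        g_last_output = m.read_string(m.regs[10]); // a0 = string address
        std::fputs(g_last_output.c_str(), stdout);
    };
}
```

The guest simply loads the string address into a0 and executes the syscall instruction; everything visible happens on the engine side.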

Why would you use a virtual machine to run the engine’s scripts?

  • The virtual machine is a sandbox that protects the engine from silly mistakes that don’t matter for the game overall. If a particular back alley level is misbehaving, it shouldn’t prevent players of your game from experiencing the rest. The virtual machine running the script could even run out of memory or get stuck in a loop, and the game would be fine.
  • The API to the scripting language is what you make it. If the API is simple, you could maybe even get non-technical people to contribute changes to the script. The script sources are generally very simple regardless of language.
  • We can have the game running while we make changes to the script, and by reloading the level we can see any changes we made applied. This makes it possible to iterate on and fine-tune levels quickly.
  • Embedding scripting languages into a game engine is usually fairly straight-forward. The same holds true for interfacing with the script, which is the separation between the game engine and the game’s logic. Your engine becomes a service.

One of the reasons I switched to Lua from TCC back in the day was because when running code natively the engine would crash or behave in mysterious ways. For me it was important to be able to keep the script separate from the engine, and try to make everything as straight-forward as possible.

In this series I am writing about how I am using a RISC-V emulator to add scripting functionality to a game engine.

Why would you use a RISC-V emulator?

  • Most of the events happening in the engine are small functions that do most of their work in system calls, which means JIT matters less than the overhead of calling into the VM and of handling system calls.
  • You can give calls into the virtual machine a budget by limiting the number of instructions allowed between each call. This prevents runaway loops and makes sure the frame budget is kept. The same applies to memory, allocations, thread restrictions and so on.
  • We can use any programming language we want that has a toolchain that can produce RISC-V machine code, such as Zig, C++, …
  • Modern languages have strong type safety, compared to traditional scripting languages which don’t. We can also choose to have no garbage collection, and avoid the higher overhead of traditional scripting languages.
  • When using the same language on both sides, we can share constants and other data between engine and binaries, reducing duplication and preventing stale enums and constants. If you are using C++, it has compile-time programming and is probably also the same language as in the engine, further reducing the need for translation between script and engine.
  • We can make direct C++ function calls into the virtual machine using C++17 lambda folds. We can get typed arguments passed to system calls using structured bindings. We can call a function by its address directly, which means that the script can pass its own addresses directly to function calls instead of strings. We can view and manipulate the VM memory directly, and create our own abstractions for our needs. All of these are low overhead abstractions.
  • We can have one machine for the GUI, one for gameplay events, one for the current map and one for game saves, as an example. We can save a machine and restore it later. We can also safely use memory sharing between machines, which means they can have a portion of memory that looks the same on all machines, at all times, with protections.
  • We can call into one machine from another, using the language of our choice, which means you would only have to link in the absolute minimum script functions for each level, leaving all the other script implementation in other machines. This means that even when the script is compiled ahead of time, the build times as you iterate on a level will be very low.
  • Due to having precise control of the machine language, there is potential for limiting allocations and storage in the VM itself beyond normal. We can for example offload most state to the in-engine objects themselves, or capture storage in engine subsystems, such as timers. Imagine, for example, that you have a custom Function (std::function-like) class, which can store 4 register-sized values (one of them being the function pointer itself). Now pass that storage to the engine along with a timer duration. In the system call handler we will copy this storage into the engine-timers capture storage, and it will be passed back to the caller inside the VM once the timer deadline is reached.
  • The RISC-V emulator is written using standard C++17. Compiled binaries are many times faster than Lua, ranging from 50% to 2250% faster.
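The typed-arguments idea from the list above can be sketched with a small template that unpacks the first few integer argument registers into a tuple, ready for structured bindings. The Registers layout and register names here are hypothetical stand-ins, not the emulator's real API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>

// Hypothetical view of the integer argument registers a0..a7.
struct Registers { uint64_t a[8] = {}; };

template <typename... Args, std::size_t... I>
std::tuple<Args...> get_args_impl(const Registers& r,
                                  std::index_sequence<I...>) {
    // Expand both packs in lockstep: cast register I to argument type I.
    return { static_cast<Args>(r.a[I])... };
}

// Unpack the first sizeof...(Args) argument registers as the given types.
// Assumes at most 8 integer arguments, matching the register file above.
template <typename... Args>
std::tuple<Args...> get_args(const Registers& r) {
    return get_args_impl<Args...>(r, std::index_sequence_for<Args...>{});
}
```

A system call handler can then write `auto [id, amount] = get_args<int, unsigned>(regs);` and get typed, named arguments in one line.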

And what are some negative aspects?

  • The lack of (costly) JIT means that if you are running lots of normal code, the performance gain may be lost after a certain threshold. I don’t have exact numbers, but the extremely cheap system calls mean that this is a solvable problem whenever it’s encountered.
  • You will usually compile the code ahead of time, unless you can generate RISC-V code on the fly (which I don’t recommend). Modern build systems combined with compiler optimizations still make fast iteration possible.
  • Limited debugging possibilities. The RISC-V emulator has an instruction-by-instruction stepping mode and can break on functions, reads, writes and so on. But that is fairly advanced and cumbersome debugging for most people. If you build your script API on a strong foundation that makes it easy to validate all in- and out-arguments, you might just catch most problems very early on. For example, in my engine every argument that refers to an object goes through a wrapper that does extensive checks. The same goes for all other entities.
  • To use RISC-V as a scripting backend you will have to write a tiny runtime environment and build system, as well as quite a few system calls to accelerate some common operations. I’ve already done the latter, but there is no repository ready to go for the former. It’s on my TODO list though.

With that out of the way, I will continue to write a bit about what I’ve done since part 2.

Thinner events

I realized that I didn’t have to store event callbacks in the engine as strings. Normally events are strings such as “some_enemy_death”, which requires looking up the event function address. When using strings like that we can take an object with us across levels and the event will still have meaning as long as that function still exists. The same is true with events as function addresses, provided that you have a machine not tied to the map that acts as a shared library. All we have to do is make it possible to tie an event to another machine. So, an event would be specified in JSON as something like:

"on_death": ["gameplay", "my_death_function"]

Where the engine translates “gameplay” to one of the loaded machines with that name, and the death string into a function address at load time, discarding both strings.

Now we have a low-overhead event system that checks all the boxes. We needed the ability to execute shared code, which is solved by having separate machines that can store shared functionality. And instead of strings internally in the engine we can make a direct call with no lookups.
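The load-time translation described above could be sketched like this, with ScriptMachine and the lookup maps as hypothetical stand-ins: the two JSON strings become a direct {machine, address} pair and are then discarded.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical machine with a table of its public function addresses.
struct ScriptMachine {
    std::map<std::string, uint64_t> symbols;
};

// The resolved event: a direct reference, no strings left.
struct EventHandler {
    ScriptMachine* machine = nullptr;
    uint64_t address = 0;
    explicit operator bool() const { return machine != nullptr && address != 0; }
};

// Translate ["gameplay", "my_death_function"] into a direct call target
// at load time; both strings can be discarded afterwards.
EventHandler resolve_event(std::map<std::string, ScriptMachine>& machines,
                           const std::string& mach, const std::string& func) {
    auto mit = machines.find(mach);
    if (mit == machines.end()) return {};
    auto fit = mit->second.symbols.find(func);
    if (fit == mit->second.symbols.end()) return {};
    return { &mit->second, fit->second };
}
```

At dispatch time the engine just calls straight into `handler.machine` at `handler.address`, with no lookups on the hot path.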

Far calls: Calling into other machines

By implementing a system call that takes a machine name (hashed), a function name (hashed) and arguments, we can make a regular function call into other machines in the game engine. This sharing of code is part of the solution to avoid having to link lots of things into each level’s script.

>>> Measurement "Function overhead" average: 7 nanos
>>> Measurement "farcall()" average: 71 nanos
>>> [towntest] says (27): Farcall overhead: 64 nanos
>>> [towntest] says (21): 2000 farcalls: 70 ns

First working farcalls, measured from inside another machine. The cost is fairly high, considering a single VM function call was 7 nanos (at the time of writing), and more than my estimate of 30 nanos. This is what a farcall looks like with arguments passed to the other machine:

dm::farcall(GAMEPLAY_MACHINE, "empty_function");

Normal function arguments follow after the function name. One idea is to use the already existing public API function names stored in a text-file, which we can read and create CRC32 hashes from. The functions we are calling would not be visible anyway if they were not listed in the symbols text-file that is part of the build system, so this isn’t such a round-about way to solve this. We shave off a healthy ~37% of the run-time by using compile-time hashes:

>>> Measurement "Function overhead" average: 7 nanos
>>> Measurement "farcall() overhead" average: 42 nanos
>>> [towntest] says (27): Farcall overhead: 35 nanos
>>> [towntest] says (21): 2000 farcalls: 39 ns

By hashing the public symbols for each binary we can directly translate a hash to a function address. Now the cost of a farcall is less than half the cost of a regular function call into a LuaJIT VM. In addition, having enum values for the machine IDs is not very good API, so let’s just make that a string-hash as well. Now the farcalls look like this:

FARCALL("gameplay", "unblock", tid);

Very promising! So this allows one VM to call into another without having to change anything in the engine. For me this was the alternative to dynamic linking with the added benefit of extra separate address spaces. If you call into another machine and it crashes, you will just be returned back to the caller machine and get a negative return value. With this I can build scalable shared APIs that have near-zero startup cost, due to not having to translate the machine code.

The reason CRC32 is so beneficial is that with constexpr programming in C++ we can create these hashes at compile time. Passing such a value from the script to the engine then requires only a single register.
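A compile-time CRC32 along these lines can be written as a plain constexpr function using the common zlib polynomial (the article's actual routine may differ in details), so that a call like crc32("add_work") collapses to a single constant in the generated code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise (reflected) CRC-32 with the zlib polynomial 0xEDB88320,
// usable in constant expressions since C++14.
constexpr uint32_t crc32(const char* data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; data[i] != '\0'; ++i) {
        crc ^= static_cast<uint8_t>(data[i]);
        for (int bit = 0; bit < 8; ++bit)
            // XOR in the polynomial only when the low bit is set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// The standard CRC-32 check value, verified at compile time.
static_assert(crc32("123456789") == 0xCBF43926u, "CRC-32 check value");
```

Because the hash is a constant expression, the farcall ends up passing a pre-computed 32-bit value instead of hashing a string at run time.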

Guard pages and sharing memory

I didn’t want to create lots of separate guard pages to trigger protection faults with when I can just share the same guard page across all the machines. So I will reuse one shared guard page in various places. This will help me catch more bugs:

auto& mem = machine().memory;
// Install our shared guard-page around the
// shared memory area, put the shared area in the middle.
mem.install_shared_page(shared_pageno-1, g_shared_guardpage);
mem.install_shared_page(shared_pageno, g_shared_area);
mem.install_shared_page(shared_pageno+1, g_shared_guardpage);
// Let's guard the heap as well.
mem.install_shared_page(heap_pageno-1, g_shared_guardpage);

The zero-page is automatically made inaccessible by default. This got me thinking about machines sharing more memory, multiple execution units and so on. I decided not to implement multiple execution units because of the headaches I would have to deal with. Instead I am just creating many machines that share the same executable, so all the read-only memory will be the same, and at the same addresses. Combine that with a shared writable area and anyone can use that area as a scratch space for remote function calls. Instead of multiple execution units, there should be one machine per long-running task, because machines are quite tiny.

I have also toyed with the idea of an ELF loader that produces shared read-only pages, stored in a cache where the key is the binary. Any time you want to create or re-create a machine you would have that list of shared pages at the ready. At that point you could just spin up a new machine for every task, as the setup cost would be inconsequential.
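The page-cache idea could be sketched as follows, with Page and the cache layout as hypothetical placeholders: pages are built once per binary and shared by every machine created from it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// A read-only page as produced by the (stubbed) ELF loader.
struct Page { std::vector<uint8_t> data; };
using SharedPages = std::vector<std::shared_ptr<const Page>>;

class PageCache {
    std::map<uint32_t, SharedPages> m_cache; // key: hash of the binary
public:
    const SharedPages& get(uint32_t binary_hash,
                           const std::vector<uint8_t>& binary) {
        auto it = m_cache.find(binary_hash);
        if (it != m_cache.end()) return it->second; // cache hit: no loading
        // Cache miss: slice the binary into 4 KiB read-only pages.
        // (A real loader would parse ELF segments here.)
        SharedPages pages;
        for (std::size_t off = 0; off < binary.size(); off += 4096) {
            auto page = std::make_shared<Page>();
            page->data.assign(binary.begin() + off,
                binary.begin() + std::min(binary.size(), off + 4096));
            pages.push_back(std::move(page));
        }
        return m_cache.emplace(binary_hash, std::move(pages)).first->second;
    }
};
```

Every machine created from the same binary then references the exact same pages, so re-creating a machine is just wiring up pointers.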

Long-running processes

Since machines can run with a given time-budget, and it was easy to add support for multiple machines in the engine, it would be interesting to see what having long-running processes would be like. These are processes that do some level of work and sit idle in between, waiting for events or more work.

For example, if you had a mini-game where time was running out, the logic for that could be running in its own machine. It would regularly update the counter on screen even when changing levels/maps. It would stop the mini-game after the time ran out, and show a message-box, for example, then stop.

The machine could share all code with other machines, so that each machine doesn’t have to have its own copy of all read-only and execute-only memory. Then, the difference would just be which task the machine was doing at a specific time. Each machine could be delegated a low number of instructions per frame to make sure that they didn’t blow any budgets.

The big problem with long-running processes is interrupting them to deliver an event, so that problem has to be solved for these processes to be really viable and easy to use.

Interruptions

Interrupting VMs turned out to be fairly straight-forward. By creating a new function called preempt that stores the current registers and some other state, then performs the function call, and then restores the registers afterwards, we can even have a machine interrupt itself. This makes the function very dependable. A far preemption, a preemptive call made from another running machine, was measured to cost 50 ns. Unlike vmcall, which resets the stack pointer on each call, preempt has to make some room on the current stack so as not to accidentally overwrite something the current machine was using.

Machines that are continually running will be inside an event loop, and go to sleep when there is no work. It then relies on preemptive (interrupting) calls from the game engine and other machines to add new work. Each frame these machines will get to run a few thousand instructions, or until they sleep again, whichever comes first. We don’t need to have separate binaries for each machine, and in fact we can just use the same binary for all of them. They will also run the same event loops, and the only difference will be in which events they handle.

At a high level (and in the actual implementation) the code looks like this:

void event_loop()
{
    while (true) {
        for (auto& ev : events) ev.handle();
        asm("ebreak" ::: "memory"); // HLT, basically
    }
}

bool add_work(const Events::Work& work)
{
    for (auto& ev : events)
        if (ev.delegate(work)) return true;
    return false;
}

What we are doing here is that instead of using a lockless ring buffer, there are just two simple ring buffers that the machine swaps between. If the machine is using one, interrupts will use the other, and vice versa. This works because there is only one CPU and at most one interrupt at play.
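A sketch of that double-buffering, with hypothetical sizes and types: interrupts append to one fixed-size buffer while the event loop drains the other, and the two are swapped at the start of each handling pass. With one CPU and at most one interrupt in flight, no locks are needed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>

struct WorkQueue {
    using Work = std::function<void()>;
    std::array<Work, 16> buffers[2]; // two fixed-size buffers
    std::size_t count[2] = {0, 0};
    int active = 0;                  // buffer interrupts currently write to

    // Called from the interrupt side; fails when the buffer is full.
    bool delegate(Work w) {
        if (count[active] == buffers[active].size()) return false;
        buffers[active][count[active]++] = std::move(w);
        return true;
    }
    // Called from the event loop: swap buffers, then drain the old one.
    std::size_t handle_all() {
        const int mine = active;
        active ^= 1;                 // new interrupts go to the other buffer
        std::size_t handled = 0;
        for (std::size_t i = 0; i < count[mine]; ++i, ++handled)
            buffers[mine][i]();
        count[mine] = 0;
        return handled;
    }
};
```

Work delegated while `handle_all()` is draining lands in the other buffer and is picked up on the next pass, which is exactly the swap described above.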

In the first draft I used function pointer + data, which is usually fine, however you can’t pass any data that is not readable by the other side. We also want to avoid any allocation, which rules out std::function completely. What’s left is an embedded variant of std::function I found, written by Ambroz Bizjak, that never allocates on the heap and has a configurable capture size, which we will increase to 3 registers for a total of 16 bytes for the whole structure. With that we can conveniently and safely pass work on to remote machines:

static void execute_remotely(uint32_t mhash, Function<void()> func)
{
    SharedMemoryArea shm;
    // copy function to shared memory area
    auto* work = shm.push(std::move(func));
    // send work to another machine
    dm::interrupt(mhash, crc32("add_work"), work);
}
#define REMOTE_EXEC(mach, ...) \
    execute_remotely(crc32(mach), __VA_ARGS__)

When the remote machine receives this work it will be completely self-contained. This is a safe and reliable way to execute code remotely:

int x = 44;
float y = 55.0f;
uint32_t z = crc32("something");
REMOTE_EXEC("events", [x, y, z] {
    dm::print("Hello Remote World! "
        "x = ", x, " y = ", y, " z = ", strf::hex(z), "\n");
});

We can observe the machine events output something in the logs:

>>> [events] says: Hello Remote World! x = 44 y = 55 z = 9da31fb

Crashes and recovery

Most VMs in a game engine will simply be running callbacks and helper functions where there is no continuity, and so errors and crashes matter less: all activity consists of short function calls that start and end in the same place, rarely doing more than manipulating engine state. Returning from the function stops the machine, passing on the return value. Very little static data is used, and the stack is reset on each call. Crashes will very rarely cause future problems. They might have side-effects in the event-chain they belong to, such as making a quest stall, but they won’t do serious damage unless the engine receives illegal input values and doesn’t validate them properly.

For long-term processes it gets harder, but they can be recovered from using machine resets. Let’s try to execute some remote “work”:

REMOTE_EXEC("events", nullptr);
REMOTE_EXEC("events", [] {
    dm::print("Hmmm... Still there?\n");
});

It should successfully crash on jumping to zero, reboot the machine, and then any events that were queued previously should be gone. Output from the logs:

[18:16:36] ERROR: Script::call exception: Execution space protection fault (data: 0)
[18:16:36] ERROR: >>> Machine registers:
...
[18:16:36] ERROR: Function call: (null)
-> [0] 0x00000000 + 0x000: (null)
-> [1] 0x00123bb8 + 0x070: event_loop
-> [-] 0x00000000 + 0x000: (null)
**** The machine "events" has crashed, rebooting... ****
>>> [events] says (35): 0 Running event on remote machine!
...
>>> [events] says (35): 1 Running event on remote machine!
...
>>> [events] says (35): 2 Running event on remote machine!

It was really hard to figure out how to recover from this because there was a timer in the engine that was executing something in the recovered machine. It had captured some data that had meaning until the machine reset. It ended up being recoverable because I could just copy the whole std::function-like storage into the engine’s internal timers, and then copy this data back into the machine when it was time to run the timer event. In the log above you can see the increasing number followed by “Running event on remote machine!”, which is a timer that continues operating after the crash.

There are actually surprisingly many things that can be passed to remote functions as raw pointers as long as the machines run the same binary: Functions, read-only data and string literals. Anything else goes in the shared area. And these will also work after a crash.

Fragmentation

A problem that almost blind-sided me was fragmentation on the machines that handle game logic. However, after some thinking, it turned out to be a simple matter of resetting the machine each time the level changes, which is a relatively cheap operation. It also frees dangling threads and ensures the machine memory is pristine, meaning any future problems you encounter would have to be solely due to errors in the level you are in. We don’t have to track allocations used by objects, or deal with hanging threads, as everything is just gone.

Virtual machines in practice

Let’s take one of the current machines in the game engine as an example: The gameplay machine. This will be a list of things that the engine does. My CRC32 scheme could easily be replaced by just using strings for function names and such, of course. It’s just a convenient optimization.

The game engine starts up, reads a JSON file which describes virtual machines. It then finds gameplay in there, followed by a RISC-V binary filename and an API symbols filename (which is just a list of function names that are public). Now it does the following to make the machine ready to use:

  • Translate everything it can into CRC32 hashes, such as the API symbols by looking into the symbol table of the binary. This is used to make everything go fast. We can translate both hashes and addresses back to function names if we need to print errors and such.
  • Create the machine, load the binary and attach all the system call handlers. This is the public API to the engine.
machine.install_syscall_handlers({
    {ECALL_SELF_TEST,   dm_self_test},
    {ECALL_ASSERT_FAIL, dm_assert_fail},
    {ECALL_WRITE,       dm_write},
    {ECALL_MEASURE,     dm_measure},
    {ECALL_FARCALL,     dm_farcall},
    ...
});
  • Set up guard pages to be able to catch bugs early on, and shared memory areas so that the engine and even other machines can pass arguments using a common scratch space that is visible everywhere using the same virtual addresses.
  • Run the initialization phase which is just setting up the standard C runtime environment: Initialize some registers, clear BSS, call global constructors. Finally check the “return value,” which is just a register value.
  • If everything has gone OK, maybe vmcall a self-test function in the machine and use a dedicated system call to verify a line of communication is up and running. Or just check the return value of the call.

The machine is now ready to run. Since this machine will be used for simple short and sweet function calls, it can be reset at any time should something go wrong. And we don’t have to care about the register-state once a function call has completed, which means the call setup and teardown will be quite fast.

An example vmcall

An entity in the engine just died and will be calling myobject_death, with itself as argument, in the gameplay machine. Those strings have been converted to CRC32 beforehand, so all we have are hashes. The hashes are looked up in fixed_maps and converted to a machine reference and a VM function address. Using that address we can make a vmcall into the machine, like so:

machine.vmcall<8'000>(function_address, object_address);

The object address is an offset into the object arena storage, shifted so that zero is an invalid object, which enables fast validation and lookup. Inside the machine script we have this:

void myobject_death(Object& obj)
{
    /* This enemy is particularly annoying because he opens
       a messagebox on death. */
    dm::message("Ugh!");
}
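The fast validation mentioned earlier could be sketched like this, with Object and the arena layout as hypothetical stand-ins: offset zero is reserved as invalid, so a range and alignment check is all the engine needs before trusting a script-provided handle.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Object { int health = 100; };

class ObjectArena {
    std::vector<Object> m_objects;
public:
    // Slot 0 is never handed out, so offset 0 is always invalid.
    explicit ObjectArena(std::size_t n) : m_objects(n + 1) {}

    // Translate a script-provided offset into a validated object, or null.
    Object* validate(uint64_t offset) {
        if (offset == 0) return nullptr;                  // reserved invalid
        if (offset % sizeof(Object) != 0) return nullptr; // misaligned
        const uint64_t index = offset / sizeof(Object);
        if (index >= m_objects.size()) return nullptr;    // out of range
        return &m_objects[index];
    }
    uint64_t offset_of(std::size_t index) const {
        return index * sizeof(Object);
    }
};
```

A system call handler would call `validate()` on every incoming object argument and reject the call if it returns null, which is the kind of wrapper-side checking mentioned in the debugging section.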

vmcall will set the return address to a function that stops the machine, so our public function call above will return into that, and then stop. Magic. And more crash-resilient! All vmcalls reset the machine registers to a good state.

The template parameter to vmcall is the instruction counter limit, which acts as a timeout, preventing runaway loops and ensuring that functions don’t take too long to complete.

Timeouts and threads

We can make the emulator throw an exception when it runs out of instructions for a particular call. Using that we can block any threads that were currently running, allowing us to resume them later, and that way have automatic pausing and resuming, as long as we have a way of finding the threads later. I deal with this by grouping all threads using numbers, so that each number represents one dedicated use for a thread. For example, each thread that gets to run once per frame gets its own number.
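The grouping could be as simple as buckets of thread IDs keyed by their purpose; ThreadId and the surrounding machinery here are hypothetical stand-ins for the emulator's actual thread support.

```cpp
#include <cassert>
#include <map>
#include <vector>

using ThreadId = int;

class ThreadGroups {
    std::map<int, std::vector<ThreadId>> m_groups;
public:
    // Record a thread as paused under its purpose number, e.g. when a
    // timeout exception interrupted it mid-call.
    void pause(int group, ThreadId tid) {
        m_groups[group].push_back(tid);
    }
    // Take all paused threads in a group so they can be resumed,
    // e.g. group 1 = "runs once per frame".
    std::vector<ThreadId> take(int group) {
        auto node = m_groups.extract(group);
        return node ? std::move(node.mapped()) : std::vector<ThreadId>{};
    }
};
```

Each frame, the scheduler takes the "once per frame" group, resumes each thread with a fresh instruction budget, and pauses it again if it runs out.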

Future research

Due to having direct memory access to the machine memory on the game engine side I am looking into sharing read-only pages which would contain useful information about the current level. This way I don’t have to provide lots of system calls for querying information about the current level or pass any information at machine startup. The information would be accessible at all times on all machines at no cost and no duplication of effort. Taking this a step further I could use this to make all entities available, but I don’t know yet how it would work in practice. Something to think about for sure.

Unfortunately, I could write about this forever, but who on earth is going to read all that stuff! So that’s it for me.

A screenshot from our WIP game. Wish us luck.

-gonzo
