Using C++ as a game engine scripting language

An adventure in game engine programming.

This all started when I suddenly decided I would write a RISC-V CPU emulator.

I was thinking a lot about how it would fare against Lua 5.3, which was the scripting language I was using in my game engine. I had run into cases where Lua was really slow, and it made me curious whether I could imitate Lua's excellent APIs and measure the differences in a meaningful way. Eventually I realized I would never stop thinking about this unless I just did it. After all, I had a solution and I was looking really hard for a problem.

First, I had to create a way to call into the virtual machine and get it to do stuff. I created vmcall, a function that points execution at a specific address and starts the machine. I had some weird problems after main returned, and realized that normal libcs call destructors after exiting main. Instead of removing the libc and using my own, I placed a trap at the end of main and made its handler stop execution. This was enough to test various things against Lua. Unexpectedly, the performance was horrible, and it took me a long time to devise ways to simplify instruction decoding and execution: page caches, an instruction cache, micro-benchmarking every change. It was also the first time that likely/unlikely macros made a huge difference for me.

Now we have a clunky function (vmcall) that works, kind of. The API is not great: it supports a few integer arguments. But what about passing floating-point values? Any number of integer values? Strings? Structs?

template <int W>
template <typename... Args>
constexpr inline void
Machine<W>::setup_call(address_t call_addr, Args&&... args)
{
    cpu.reg(RISCV::REG_RA) = memory.exit_address();
    int iarg = RISCV::REG_ARG0;
    int farg = RISCV::REG_FA0;
    ([&] {
        if constexpr (std::is_integral_v<Args>)
            cpu.reg(iarg++) = args;
        else if constexpr (is_stdstring<Args>::value)
            cpu.reg(iarg++) = stack_push(args.data(), args.size()+1);
        else if constexpr (is_string<Args>::value)
            cpu.reg(iarg++) = stack_push(args, strlen(args)+1);
        else if constexpr (std::is_floating_point_v<Args>)
            cpu.registers().getfl(farg++).set_float(args);
        else if constexpr (std::is_pod_v<std::remove_reference_t<Args>>)
            cpu.reg(iarg++) = stack_push(&args, sizeof(args));
        else
            static_assert(always_false<decltype(args)>, "Unknown type");
    }(), ...);
    cpu.jump(call_addr);
}

Using a C++17 lambda fold we can iterate over a parameter pack, inspect each type at compile time, and do the right thing. The maximum number of instructions to execute is a template parameter, to ensure we don't get stuck in an infinite loop inside the VM. Now it's a regular function call. Ish.

What about the other way around? The VM will need to be able to request services from the emulator, such as printing text to stdout. For that, RISC-V already has the ECALL (formerly SCALL) instruction, which invokes a system call. With some inline assembly magic, suddenly we can make things happen:

inline long
syscall(long n, long arg0)
{
    register long a0 asm("a0") = arg0;
    register long syscall_id asm("a7") = n;
    asm volatile ("scall" : "+r"(a0) : "r"(syscall_id));
    return a0;
}

On the other end of the system call is the system call handler, a callback function installed into the RISC-V emulator. How do you easily and safely handle a string passed to you from the VM? What about other kinds of buffers? At first I implemented a helper that copied guest memory out into a host buffer; however, it was clunky, and it caused me to write a lot of code just to use it safely. Additionally, it required setting aside a buffer to copy into, even though most of the time I just wanted to look at the memory. So, I implemented a pair of helpers, memstring among them, which have fast paths that don't allocate when the memory being read doesn't cross a page boundary.

template <int W>
std::string Memory<W>::memstring(address_t addr, const size_t max_len) const
{
    std::string result;
    size_t pageno = page_number(addr);
    // fast-path
    {
        ...
        // early exit
        if (reader < pgend) {
            return std::string(start, reader);
        }
        // we are crossing a page
        result.append(start, reader);
    }
    // slow-path: cross page-boundary
    ...
}

At this point I was really starting to leave Lua behind. All my benchmark results at the time were 2–6x faster than Lua's. What I had not yet tested was heap allocations, string passing and threads.

RISC-V self-test OK
* All benchmark results are measured in 2000 samples
libriscv: array append => median 43ns lowest: 42ns highest: 64ns
lua5.3: table append => median 200ns lowest: 193ns highest: 258ns
libriscv: many arguments => median 159ns lowest: 150ns highest: 199ns
lua5.3: many arguments => median 737ns lowest: 721ns highest: 818ns
libriscv: integer math => median 44ns lowest: 42ns highest: 61ns
lua5.3: integer math => median 247ns lowest: 242ns highest: 314ns
libriscv: syscall print => median 66ns lowest: 64ns highest: 129ns
lua5.3: syscall print => median 214ns lowest: 209ns highest: 265ns
libriscv: complex syscall => median 850ns lowest: 823ns highest: 949ns
lua5.3: complex syscall => median 1410ns lowest: 1390ns highest: 1493ns

All these measurements were done with Clang 11 from trunk. It has been reliably 20–25% faster than GCC in every single measurement I have done on the host side. In contrast, GCC produced faster RISC-V code. And Clang keeps crashing when compiling my RISC-V code, so there is that too, I suppose.

RISC-V self-test OK
* All benchmark results are measured in 2000 samples
libriscv: array append => median 55ns lowest: 53ns highest: 84ns
libriscv: many arguments => median 200ns lowest: 193ns highest: 236ns
libriscv: integer math => median 56ns lowest: 54ns highest: 87ns
libriscv: syscall print => median 80ns lowest: 77ns highest: 117ns
libriscv: complex syscall => median 1130ns lowest: 1084ns highest: 1238ns

As you can see, GCC 9.2 is simply slower here, even with the best settings I could find for it.

C++ has a large toolbox that lets you do some crazy things, provided you really think about what it is you need. There were a few things left to solve, though. For one, implementing system call handlers was a chore: too much boilerplate. Another was making a great system call API on the VM side. And maybe there are some crazy things we can still do?

Well, first off, we need to simplify system call handling. Let's go back to the C++17 folds and see if we can't iterate over the argument types again, place each one into a tuple based on its type, and return that tuple.

template <int W>
template<typename... Args, std::size_t... Indices>
inline auto Machine<W>::resolve_args(std::index_sequence<Indices...>) const
{
    std::tuple<std::decay_t<Args>...> retval;
    size_t i = 0;
    size_t f = 0;
    ([&] {
        if constexpr (std::is_integral_v<Args>)
            std::get<Indices>(retval) = sysarg<Args>(i++);
        else if constexpr (std::is_floating_point_v<Args>)
            std::get<Indices>(retval) = sysarg<Args>(f++);
        else if constexpr (is_stdstring<Args>::value)
            std::get<Indices>(retval) = sysarg<Args>(i++);
        else if constexpr (std::is_pod_v<std::remove_reference_t<Args>>)
            std::get<Indices>(retval) = sysarg<Args>(i++);
        else
            static_assert(always_false<Args>, "Unknown type");
    }(), ...);
    return retval;
}

With that, we should be able to use sysargs, like so:

static long dm_camera_center_on(machine_t& m) {
    auto [x, y] = m.template sysargs <double, double> ();
    game.camera().center(glm::vec2(x, y));
    return 0;
}

Now that is a nice system call handler.

Sadly, the malloc implementation in the RISC-V binaries' libc is simply too slow when it runs inside the emulator, no matter how good the allocator itself is. To combat that, we can do the chunk management outside of the VM. The memory won't go anywhere; it's still inside the VM. We just implement malloc, free and friends as system calls:

machine.install_syscall_handler(SYSCALL_MALLOC,
    [arena] (auto& machine) -> long
    {
        const size_t len = machine.template sysarg<address_type<W>>(0);
        return arena->malloc(len);
    });
...
machine.install_syscall_handler(SYSCALL_FREE,
    [arena] (auto& machine) -> long
    {
        const auto ptr = machine.template sysarg<address_type<W>>(0);
        return arena->free(ptr);
    });

Naturally, we give free the return value it always deserved to have. Now we have quite a fast heap implementation compared to before. Enough to not be afraid of using containers and strings.

So, perhaps it's time to benchmark string passing and such things? Well, it turns out that Lua doesn't really copy strings around much; it just passes pointers to them, while I have to copy them into the virtual memory. Lua beat me hands down, until I realized that in many cases you don't actually care about the dynamic string so much as just identifying it. So, with constexpr programming we can create compile-time CRC32 values of the strings that are constant, and just pass those around instead:

...
template <uint32_t POLYNOMIAL = 0xEDB88320>
inline constexpr auto crc32(const char* data)
{
    constexpr auto crc32_table = gen_crc32_table<POLYNOMIAL>();
    auto crc = 0xFFFFFFFFu;
    for (auto i = 0u; auto c = data[i]; ++i) {
        crc = crc32_table[(crc ^ c) & 0xFF] ^ (crc >> 8);
    }
    return ~crc;
}

Suddenly it's no contest anymore. For a game engine this is perfect, as I'll often be referring to things by name, and now I have a way that is 3x faster than Lua strings. But only for constant strings.

libriscv: syscall print => median 71ns  lowest: 67ns  highest: 128ns
lua5.3: syscall print => median 216ns lowest: 209ns highest: 292ns

What about threads? Lua has coroutines, which are very nice and, it turns out, hard to beat. Still, I managed to implement micro-threads that have the same performance as Lua coroutines.

The big time-saver was storing the arguments on the thread's stack to save a heap allocation. Using a std::tuple we can move the arguments into it and store it on the stack, which is much faster than capturing them:

template <typename T, typename... Args>
inline Thread* create(const T& func, Args&&... args)
{
    char* stack_bot = (char*) malloc(Thread::STACK_SIZE);
    if (stack_bot == nullptr) return nullptr;
    char* stack_top = stack_bot + Thread::STACK_SIZE;
    // store arguments on stack
    char* args_addr = stack_bot + sizeof(Thread);
    auto* tuple =
        new (args_addr) std::tuple<Args&&...>{std::forward<Args>(args)...};
    ...

So, now we can create fast co-operative threads, backed by system calls, using the same API as C++ threads.

auto* thread = microthread::create(
    [] (int a, int b, int c) -> long {
        printf("Hello from a microthread!\n"
               "a = %d, b = %d, c = %d\n",
               a, b, c);
        return a + b + c;
    }, 111, 222, 333);

These threads benchmark equal to or better than Lua coroutines, but not by a lot.

libriscv: micro thread args => median 711ns  lowest: 688ns  highest: 765ns
lua5.3: coroutine args => median 729ns lowest: 714ns highest: 820ns

One thing to note is that the micro-threads beat Lua coroutines by a wider margin when there are no arguments:

libriscv: micro threads => median 494ns  lowest: 481ns  highest: 704ns
lua5.3: coroutines => median 675ns lowest: 654ns highest: 793ns

Finally, I was annoyed by having to create a function for each type of system call. Worse, floats and integers travel in different registers, which forces you to write a separately named wrapper function for each variant. So I started investigating whether I could just pass parameters directly to one general-purpose system call. That turns out to not be so easy. In theory it should be extremely easy, because a RISC-V system call is just a regular function call with register A7 (the 8th integer argument register) set to the system call number, but there is no way to tie that inline assembly into a function call without losing guarantees. It still kinda works, but there are drawbacks:

extern "C" long syscall_enter(...);

template <typename... Args>
inline long apicall(long syscall_n, Args&&... args)
{
    static_assert(sizeof...(args) < 8,
        "There is a system call limit of 8 integer arguments");
    // The memory clobber prevents reordering of a7
    asm volatile ("li a7, %0" : : "i"(syscall_n) : "a7", "memory");
    return syscall_enter(std::forward<Args>(args)...);
}

First off, the variadic C function call promotes floats to doubles. I can live with that, I suppose, but it also requires me to implement syscall_enter like this:

.global syscall_enter
syscall_enter:
    scall
    ret

So, not only do I have to load the function address into a register to call it, I also have to execute two extra instructions. In contrast, the normal system call wrappers are fully inlined and get optimized very well. Is there any way to do better? Not really, but I found a solution that is good enough as a general-purpose function, while still keeping the regular system call wrappers where they make sense:

template <typename... Args>
inline long apicall(long syscall_n, Args&&... args)
{
    using function_t = long (*) (...);
    // The memory clobber prevents reordering of t6
    asm volatile ("li t6, %0" : : "i"(syscall_n) : "t6", "memory");
    return ((function_t) 0x2000) (std::forward<Args>(args)...);
}

What on earth is going on here? Well, if I implement execute traps in the RISC-V emulator and trap on the third page (at 0x2000), I can make that page handle system calls too, and also return to the caller automatically. It turns out to work just as well, and we save 3 instructions, since loading 0x2000 into a register is a single LUI instruction. It's fast because the trap only happens when execution enters the page at 0x2000. We still have to use a C variadic function call, as I didn't fully understand what C++ was doing when I cast the function to accept any parameter pack. I also stopped using A7, reclaiming an integer register, and used T6 instead, which is less likely to be clobbered during call setup. This will suffice, and it completes the journey from Lua to using C++ as a scripting language.

There are of course many things I still want to try. I am patiently waiting for coroutine support in the riscv-gnu-toolchain. Altogether, Lua is now beaten in all cases but one: dynamic strings. Still, with Lua you can just edit the script and never have to recompile anything. And no system call numbers can change meaning, as Lua handles the API by letting you publish functions into its namespace. There are plenty of reasons to keep using Lua.

So, what is the point of all this? Honestly, it was really fun to do, and that is a good enough reason by itself. I thought about just compiling the C++ natively, but nah. Consoles generally don't let you run a JIT, and I don't know if you're allowed to dynamically load code. Maybe someone with the hard facts can chime in on this. Are you allowed to JIT on Apple devices?

Thanks for reading.

— gonzo

Operating System Architect