Automation and sandbox games in general pretty much require running lots of small functions. Players are going to build until they can’t anymore!
I’ve been fascinated by interpreters for a long time, and have even made my own. In this blog post I will have a look at the overheads of quickly getting in and out of the VM to do the small amounts of scripting my game needs. In particular, I will pick a gold-standard WASM implementation to compare against.
Maybe it’s unfair to compare oneself to a 350k LOC JIT-based WASM emulator that uses a simpler and more straight-forward bytecode architecture than RISC-V. Maybe not. libriscv is a register machine after all, and that should mean lower latency. With binary translation disabled, libriscv is around 18k LOC according to scc, and 20k with it enabled.
RISC-V has its issues, but it is a register machine. Ultimately, let’s be frank, I wrote libriscv, and so it should be doing exactly what I need it to do. The same cannot be said for other people’s projects. So, this blog post is not intended as a critique of WASM or its implementations; rather it repeats something we already knew: If you write something for yourself, it’s going to better fit your own needs. Hopefully. 🙏
So, can we compare against a gold-standard WASM implementation and use it in a low latency scenario like game scripting?
Previously
- Using C++ as a game engine scripting language, Part 1
- Using C++ as a scripting language, Part 2
- Using C++ as a scripting language, Part 3
- Using C++ as a scripting language, Part 4
- Using C++ as a scripting language, Part 5
- Using C++ as a scripting language, Part 6
- Using C++ as a scripting language, Part 7
- Using C++ as a scripting language, Part 8
- Using C++ as a scripting language, Part 9
The C++ hill
In the last blog post I wrote about a certain function that gets called many millions of times as we move around the game. This time I will benchmark that code in my engine, as well as in a simple benchmark program that embeds wasmtime, the WebAssembly run-time. I am curious what latency it actually has in practice.
Why C++? It takes a very long time to master a language, and I happen to be very productive with C++ right now. So, I’ll probably die on that hill, which is easy enough since Norway’s topography is extremely all over the place. I even have a hill in my garden.
The code
In order to faithfully recreate the actual program behavior, we need to recreate the logic 1:1, and make sure that we also call a function in the host, to keep things fair. The benchmarked guest function invokes a host function when determining the visibility of the bottom of a cube. However, since that host function makes a lookup into a database, we will replace it with an empty host function call, which can easily be made on both systems.
Here is the RISC-V function, with one of the host calls replaced with an empty do-nothing host call for the purposes of benchmarking:
FPUB long stdBenchVisibility(Block src, Block dst, uint16_t facing)
{
    if (src.getID() == dst.getID())
        facing &= (1 | 2 | 16 | 32);
    else if (facing == 8) {
        isys_empty(); // 'empty' dyncall on facing == 8
        facing = 0;   // Remove bottom
    }
    return facing;
}
And, for WebAssembly I did the equivalent:
#include <stdbool.h>
#include <stdint.h>
extern bool is_transparent(int32_t);
#define GET_ID(x) ((x) & 65535)
__attribute__((used, retain, visibility("default")))
extern uint16_t stdBenchVisibility(uint32_t src, uint32_t dst, uint16_t facing)
{
    if (GET_ID(src) == GET_ID(dst))
        facing &= (1 | 2 | 16 | 32);
    else if (facing == 8 && !is_transparent(dst))
        facing = 0; // Remove bottom
    return facing;
}
On a related note, the amount of time I had to spend just getting that function to not get GC’d by the linker and also accepted by emcc is both shameful and infuriating. Look at all those function attributes I threw on there to no avail. I used emcc to create an executable .wasm file.
The benchmark
In order to exercise both the guest and the host, we will always execute the path with the host call.
libriscv
const std::string function = "stdBenchVisibility";
Event event { cppcraft::Scripts::get("std"), function };
static constexpr size_t SAMPLES = 1000;

// An event hooked up to a benchmark callback
auto lambda = [event] (const Block& src, const Block& dst, uint16_t facing) mutable -> uint16_t
{
    return event.call(src.toPackedValue(), dst.toPackedValue(), facing);
};

library::RollingAvg ra(32, [function] (auto& ra) {
    const long ans = ra.average() * (1000000000ULL / SAMPLES);
    fmt::print("[Forked call to {}] average={}ns samples={}\n", function, ans, ra.sample_count());
});

for (size_t i = 0; i < 128; i++)
{
    library::ScopedRollingAvg sra(ra);
    for (size_t j = 0; j < SAMPLES; j++)
    {
        lambda(Block(j), Block(0), 8);
    }
}
No binary translation:
> lowest 6ns median: 6ns highest: 6ns
[std] Measurement "Empty dynamic call" median: 6
> lowest 3ns median: 3ns highest: 3ns
[std] Measurement "Overhead" median: 3
[Forked call to stdBenchVisibility] median=10ns samples=32
[Forked call to stdBenchVisibility] median=10ns samples=64
[Forked call to stdBenchVisibility] median=10ns samples=96
[Forked call to stdBenchVisibility] median=10ns samples=128
libriscv has a known vmcall overhead of ~3-4ns, here measured at 3ns. That makes the “Empty dynamic call” itself 6-3 = 3ns, and the guest function 10-3-3 = 4ns.
With binary translation:
> lowest 6ns median: 6ns highest: 8ns
[std] Measurement "Empty dynamic call" median: 6
> lowest 4ns median: 4ns highest: 4ns
[std] Measurement "Overhead" median: 4
[Forked call to stdBenchVisibility] median=10ns samples=32
[Forked call to stdBenchVisibility] median=10ns samples=64
[Forked call to stdBenchVisibility] median=10ns samples=96
[Forked call to stdBenchVisibility] median=10ns samples=128
That makes the “Empty dynamic call” itself 6-4 = 2ns, and the guest function 10-2-4 = 4ns.
wasmtime
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <wasm.h>
#include <wasmtime.h>

inline struct timespec time_now();
inline long nanodiff(struct timespec start_time, struct timespec end_time);
// Error-reporting helper from the wasmtime C examples (definition elided)
static void exit_with_error(const char *message, wasmtime_error_t *error, wasm_trap_t *trap);

static wasm_trap_t* is_transparent_callback(
    void *env,
    wasmtime_caller_t *caller,
    const wasmtime_val_t *args,
    size_t nargs,
    wasmtime_val_t *results,
    size_t nresults
) {
    // The empty host call: always report the block as non-transparent
    results[0].kind = WASMTIME_I32;
    results[0].of.i32 = 0;
    return NULL;
}
int main() {
    wasmtime_error_t *error = NULL;

    wasm_config_t *config = wasm_config_new();
    assert(config != NULL);
    wasmtime_config_consume_fuel_set(config, true);

    // Create an *engine*, which is a compilation context, with our configured options.
    wasm_engine_t *engine = wasm_engine_new_with_config(config);
    assert(engine != NULL);
    wasmtime_store_t *store = wasmtime_store_new(engine, NULL, NULL);
    assert(store != NULL);
    wasmtime_context_t *context = wasmtime_store_context(store);

    // Load our input file to parse it next
    FILE* file = fopen("block.wasm", "r");
    if (!file) {
        printf("> Error loading file!\n");
        return 1;
    }
    fseek(file, 0L, SEEK_END);
    size_t file_size = ftell(file);
    fseek(file, 0L, SEEK_SET);
    wasm_byte_vec_t binary;
    wasm_byte_vec_new_uninitialized(&binary, file_size);
    if (fread(binary.data, file_size, 1, file) != 1) {
        printf("> Error loading module!\n");
        return 1;
    }
    fclose(file);

    // Compile.
    printf("Compiling module...\n");
    wasmtime_module_t *module = NULL;
    error = wasmtime_module_new(engine, (uint8_t*) binary.data, binary.size, &module);
    if (!module)
        exit_with_error("failed to compile module", error, NULL);
    wasm_byte_vec_delete(&binary);

    // The imported host function: is_transparent(i32) -> i32
    wasm_functype_t *hello_ty = wasm_functype_new_1_1(wasm_valtype_new_i32(), wasm_valtype_new_i32());
    wasmtime_func_t is_transparent;
    wasmtime_func_new(context, hello_ty, is_transparent_callback, NULL, NULL, &is_transparent);

    wasmtime_extern_t import;
    import.kind = WASMTIME_EXTERN_FUNC;
    import.of.func = is_transparent;

    wasm_trap_t *trap = NULL;
    wasmtime_instance_t instance;
    error = wasmtime_instance_new(context, module, &import, 1, &instance, &trap);
    if (error != NULL || trap != NULL)
        exit_with_error("failed to instantiate", error, trap);

    // Lookup our export function
    wasmtime_extern_t fib;
    bool ok = wasmtime_instance_export_get(context, &instance,
        "stdBenchVisibility", strlen("stdBenchVisibility"), &fib);
    assert(ok);
    assert(fib.kind == WASMTIME_EXTERN_FUNC);

    struct timespec t0 = time_now();

    static const uint64_t TIMES = 10000;
    // Call it repeatedly (bailing out if we ever run out of fuel)
    for (int n = 0; n < TIMES; n++)
    {
        error = wasmtime_context_set_fuel(context, 10000000);
        if (error != NULL)
            exit_with_error("failed to set fuel", error, NULL);

        wasmtime_val_t params[3];
        params[0].kind = WASMTIME_I32;
        params[0].of.i32 = n;
        params[1].kind = WASMTIME_I32;
        params[1].of.i32 = 0;
        params[2].kind = WASMTIME_I32;
        params[2].of.i32 = 8; // bottom of a cube
        wasmtime_val_t results[1];

        error = wasmtime_func_call(context, &fib.of.func, params, 3, results, 1, &trap);
        if (error != NULL || trap != NULL) {
            if (trap != NULL) {
                wasmtime_trap_code_t code;
                assert(wasmtime_trap_code(trap, &code));
                assert(code == WASMTIME_TRAP_CODE_OUT_OF_FUEL);
            }
            printf("Exhausted fuel computing\n");
            break;
        }
    }

    struct timespec t1 = time_now();
    printf("Nanodiff: %lu\n", nanodiff(t0, t1) / TIMES);

    // Clean up after ourselves at this point
    wasmtime_module_delete(module);
    wasmtime_store_delete(store);
    wasm_engine_delete(engine);
    return 0;
}
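The time_now and nanodiff helpers were only declared above; they are just thin clock_gettime wrappers. A representative implementation, as a sketch rather than the exact code I ran, would be:

#include <time.h>

// Monotonic timestamp (sketch; assumes CLOCK_MONOTONIC is available)
static inline struct timespec time_now()
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t;
}

// Difference between two timestamps, in nanoseconds
static inline long nanodiff(struct timespec start_time, struct timespec end_time)
{
    return (end_time.tv_sec - start_time.tv_sec) * 1000000000L
         + (end_time.tv_nsec - start_time.tv_nsec);
}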
Please let me know if I fumbled anything in the benchmark. The main function above is the actual code being run. For wasmtime the averaged results for 10'000 samples are:
Without doing a host call (facing == 0):
Nanodiff: 48
And with the host call (facing == 8):
Nanodiff: 73
I picked the lowest numbers for every benchmark and everything is built in release mode. The fewer samples used, the higher the latency I got, which is very concerning for wasmtime here. In the real game this function is not called a million times consecutively either; it’s called in between everything else, with a wide variety of inputs. I’m not sure how to model that fairly, so I’m just going to stick to micro-benchmarking. When I use 100 samples with libriscv I get the same number (10ns), but with wasmtime I get 164ns. Like I said, concerning, and I will look into this later.
Before this final benchmark, I measured the wasmtime call-time without the host call at 48ns. It might seem high, but I’ve gone over the benchmark code many times now, and for comparison LuaJIT has a similar overhead when making a call (~53ns).
Results
For libriscv we have a known ~4ns overhead for the VM function call, 3ns overhead for executing the function itself and a final 3ns for the dynamic host call. Let’s put it on a graph:
wasmtime wins handily on execution: in the third column, wasmtime spent ~1ns on the guest function itself, while libriscv needed 3ns (4ns when binary translated).
We can say that libriscv can check the sides of our cube 7 times before wasmtime has checked one side, in the case where a function is called many times. Without warming up:
Surprising results. There are error margins, but I ran the tests a few times to get a decent average. The first completely cold VM function call has a ~120ns-ish cost in libriscv and 7–8 microseconds in wasmtime. We can see that we need to make many calls into wasmtime before it is able to reach a stable call time at around 400ns. The wasmtime function reached an average of sub 100ns only after 20+ calls.
Alright, so if you call a function enough times it will eventually settle on a stable call time. But what about a second VM guest function, if we warm up the first function with 10k samples?
It looks like on wasmtime it takes around 400 +/- 100 ns for all 10 calls, and around 30 +/- 10 ns on libriscv. So, we can conclude that the first three VM calls into wasmtime on the first function are not typical, but the rest are. The latency is still very high, averaging around 400 ns. On a project with thousands of tiny functions this could be a serious hindrance. On libriscv we can see that having warmed up a previous call will also warm up a second call, and we can conclude that when there are thousands of tiny functions, simply calling one will warm another. I added binary translation just because I was curious, and it calls the VM guest function 17% faster on average.
We can say that libriscv can call an arbitrary function 11 times before wasmtime has called it once, with many cold functions. 13 times with binary translation enabled.
Note that this benchmark was very hard to do, and I have repeated it many times, gaining confidence in the process. I think it is very important not to focus too much on the numbers, but rather on the discovery that libriscv doesn’t backslide on latency after making cold calls.
Host function calls
It is encouraging to see that libriscv host calls (aka dynamic calls) have been reduced to 3ns now. I had a look at my original blog post, and there are some big latencies there that I have reduced slowly over time. It has been a very long ride chipping away at the overhead while still keeping them useful and reliable. After all, I am using them every day, and I am the victim if I overstate their performance.
It’s important that each call is properly named, stored separately in each binary and resolved at run-time in the engine, matching them one by one against the engine’s own list of dynamic calls. That way it almost works like dynamic linking, but fully typed, because it is based on one source of truth that the engine and the scripts both must agree on: the full function prototype, with named arguments. Like this:
"Timer::periodic": "int sys_timer_periodic (float, float, timer_callback, void*, size_t)"
The current, and probably last, iteration of dynamic calls uses a custom instruction where the 12-bit immediate value is the dynamic call number, removing the need for both a system call and register A7.
static const Instruction<MARCH> dyncall_instruction_handler {
    [](CPU<MARCH>& cpu, rv32i_instruction instr)
    {
        auto& scr = script(cpu.machine());
        scr.dynamic_call_array(instr.Itype.imm);
    },
    [](char* buffer, size_t len, auto&, rv32i_instruction instr)
    {
        return snprintf(
            buffer, len, "DYNCALL: 4-byte idx=%u (0x%X)", instr.Itype.imm,
            instr.whole);
    }};
Emitting custom instructions is fairly straight-forward with RISC-V:
sys_timer_periodic:
    .insn i 0b1011011, 0, x0, x0, 1
    ret
It’s a 4-byte I-type instruction with opcode=0b1011011, imm=1 and the rest zero (rd=x0, rs1=x0, funct3=0). Apparently it’s the second (imm=1) dynamic call! I am using an opcode reserved for 128-bit RISC-V (custom-2/rv128) for my custom shenanigans.
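For the curious, the resulting 32-bit instruction word can be computed by hand from the standard I-type layout. This is just a worked example, not code from the emulator:

#include <cstdint>

// RISC-V I-type layout: imm[31:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]
constexpr uint32_t dyncall_1 =
      (1u << 20)    // imm = 1, the dynamic call index
    | (0u << 15)    // rs1 = x0
    | (0u << 12)    // funct3 = 0
    | (0u << 7)     // rd = x0
    | 0b1011011u;   // the custom-2/rv128 opcode
static_assert(dyncall_1 == 0x0010005B, "encoding of '.insn i 0b1011011, 0, x0, x0, 1'");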
In order to actually use the instruction, I have to override the “unhandled instruction callback” in the emulator. By default, it will insert an invalid or unimplemented instruction handler for every unrecognized opcode.
CPU<MARCH>::on_unimplemented_instruction
    = [](rv32i_instruction instr) -> const Instruction<MARCH>&
{
    if (instr.opcode() == 0b1011011)
    {
        return dyncall_instruction_handler;
    }
    return CPU<MARCH>::get_unimplemented_instruction();
};
The guest program has a table of dynamic call descriptors embedded in the binary, storing their hash, name and index one by one. At run-time the engine reads that table, finds each hash, matches it to an in-engine host-side dynamic call, and creates a translation for it, per program. This means that every program stores the dynamic calls it knew about when it was built, making it possible for the engine to translate them across versions, as well as to report any missing implementations, collisions, and even mismatched arguments or return values.
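To make that concrete, one plausible shape for a descriptor entry is sketched below; the exact layout is whatever my code generator emits, so the names here are illustrative:

#include <cstdint>

// Illustrative guest-side dynamic call descriptor, one entry per host function
// the program was built against.
struct DyncallDescriptor {
    uint32_t    hash;  // hash of the full prototype string (the source of truth)
    uint32_t    index; // the immediate value baked into the custom instruction
    const char* name;  // e.g. "Timer::periodic", used for error reporting
};
// At load time the engine walks this table, matches each hash against its own
// list of dynamic calls, and records index -> host handler for that program.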
The current version of dynamic calls is likely the last. They now have strong typing, are easy to implement host-side, are automatically generated guest-side, and have very low latency. They can pass and return any arguments, and they can be fully inlined (with raw inline assembly) in the guest programs. What more can be done to improve on them? I honestly cannot think of anything, right now.
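As an example of the inlining mentioned above, a guest-side wrapper can emit the custom instruction directly with inline assembly. The function name and the index imm=2 below are made up for illustration; the real wrappers are generated from the dynamic call table:

// Illustrative inlined dynamic call: arguments stay in the normal argument
// registers, the custom instruction carries the call index in its immediate,
// and the result comes back in A0.
static inline long isys_example(long arg0, long arg1)
{
    register long a0 __asm__("a0") = arg0;
    register long a1 __asm__("a1") = arg1;
    __asm__ volatile(".insn i 0b1011011, 0, x0, x0, 2" // imm=2: made-up index
                     : "+r"(a0)
                     : "r"(a1)
                     : "memory");
    return a0;
}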
It might seem like a small thing to move from system calls using register A7 to a custom instruction just to save one register, but there are so many host functions in my game that actually use all 8 argument registers that I simply could not go back even if I wanted to. Pushing arguments onto the stack is part of the reason why other solutions have such high latencies, so we will avoid it when we can.
Conclusion
Real game scripting code is messy, with lots of jumping around and a ton of API calls into the engine. wasmtime performs well, but it has some very clear overheads that didn’t seem to go away even when benching with 10k samples. libriscv is clearly the opposite: always low latency, even with just 100 samples, but it has trouble chewing through instructions in interpreter mode. Ultimately, it is the best choice for game scripting by a large margin for me right now. Lua and LuaJIT were left in the dust a long time ago, and while I have briefly considered writing a WebAssembly emulator myself, it’s several years of work to start a new project like that.
My Christmas wishlist is a way to improve interpreters in general that is also compatible with our current set of compilers.
I love warm countries and live in Norway. My beard is frozen in the picture.
Merry Christmas!
-gonzo