Using C++ as a Scripting Language, part 13

4 min readMay 19, 2024

Optimizing heap-related functions

When you make a game, there are many parts that are never going to be a bottleneck, and in those sections you often write safe idiomatic code. That is usually code that contains temporary heap allocations.

I have a verbose setting for the heap and other system calls that I can use to monitor what the program is doing. However, the real story is in reading the RISC-V assembly. It turns out that when you call new in C++, it first jumps to an empty wrapper, and then to the real C++ new function, which then again would call my heap system call wrapper, which finally would invoke the system call. In C++ it is possible for new to throw an exception when the allocation fails, however that is completely irrelevant here as I am in control of everything. If an allocation fails, the emulator will throw an exception. Same with running out of memory, or running too long, and so on. So, I would like to avoid the new call chain, and ideally directly invoke the system call.

I created a simple benchmark for this:

static void bench_alloc_free()
{
  auto x = std::make_unique_for_overwrite<char[]>(1024);
  __asm__("" :: "m"(x[0]) : "memory");
}

Benchmarking it, it took around 50ns. That’s not too bad. But, it can be improved just by avoiding all the calls that do nothing but call another function.

So, the first thing to do is to call the system call wrapper directly. This meant that I had to forego the return value from free, because in C the free function doesn’t have a return value. Otherwise the A0 register would be clobbered, and I would have a very mysterious bug on my hands. Running it, I found that it heavily reduced the run-time, now at 31ns. A 38% run-time reduction.

The last thing to try, was to write inline functions for new and delete, which would call my inline assembly functions sys_malloc and sys_free:

inline void* sys_malloc(std::size_t size) {
  register void*   ret asm("a0");
  register size_t  a0  asm("a0") = size;
  register long syscall_id asm("a7") = SYSCALL_MALLOC;

  asm volatile ("ecall"
  : "=m"(*(char(*)[size]) ret), "=r"(ret)
  : "r"(a0), "r"(syscall_id));
  return ret;
}
inline void  sys_free(void* ptr)
{
  register void*  a0  asm("a0") = ptr;
  register long syscall_id asm("a7") = SYSCALL_FREE;

  asm volatile ("ecall"
  :
  : "r"(a0), "r"(syscall_id));
}

After remembering to specify that ret clobbers register A0, it all ran smoothly on all my tests, benchmarks and my game. So far so good. The run-time for the benchmark is now at an insane 19ns.

0000000050000b94 <_ZL16bench_alloc_freev>:
    50000b94:   40000513                li      a0,1024
    50000b98:   23a00893                li      a7,570
    50000b9c:   00000073                ecall
    50000ba0:   00050663                beqz    a0,50000bac <_ZL16bench_alloc_freev+0x18>
    50000ba4:   23d00893                li      a7,573
    50000ba8:   00000073                ecall
    50000bac:   00008067                ret

Looking at the assembly; it looks perfect, and probably cannot be improved anymore. No wonder it runs so quickly. It is practically native performance.

A very surprising final result. Inline assembly is insane.

Toggling binary translation always improves latencies. Almost to the point of hard-to-believe. But we can enable verbose malloc/free and observe that it does indeed correctly happen:

...
SYSCALL malloc(1024) = 0x51222800
SYSCALL free(0x51222800) = 0
SYSCALL malloc(1024) = 0x51222800
SYSCALL free(0x51222800) = 0
> lowest 2405ns  median: 2436ns  highest: 10867ns
[gameplay] Measurement "Allocate 1024-bytes, and free it" median: 2436

Terminal printing is the bane of many benchmarks!

Comparison with other emulators

I ran wasmtime with a malloc() -> free() combo and I found that it took on average 66ns, and subtracting out the call overhead of 48ns it gives us a comfortable 18ns run-time. That is very fast. Here’s the chart:

wasmtime ended up being faster vs. interpreted, but not against binary translation.

I do think that the interpreted variant should have won too, and I have a suspect: My homemade heap allocator. It’s the one that manages the heap of the guest program outside of the sandbox, so all it does is manage memory ranges.

I don’t think I have the energy to write a faster heap allocator, though. Especially when mine is robust and well-tested. It has run fine for a long time, and it doesn’t seem to be fragmenting memory too much. It’s always scary to change something like an underlying allocator, when what you’re really trying to do is make a game.

Anyway, that’s cool! Benchmarking is kinda fun, but it’s easy to get it wrong. More variation in the allocations is probably needed for anything conclusive.

However, I did make a note of something strange: If we go back and look at the assembly output we can also see that there’s no exception handling anymore, now instead replacing with a simple null-check to avoid calling free. I tried to get rid of it by using the hidden __returns_nonnull__ attribute that GCC has, but it wouldn’t budge.

-gonzo

Using C++ as a Scripting Language, part 13

Comparison with other emulators

Written by fwsGonzo

No responses yet