Using C++ as a scripting language, part 8

Jul 25, 2023

Improving API function calls using inline assembly

I have experimented with inline assembly before with some success. It is complicated and easy to get wrong, with potentially weird and mysterious side effects. I think that if I can auto-generate the inline assembly, then it would be very interesting to see what happens if I replace the build-generated opaque API function wrappers used by dynamic calls. And, if there is a bug, it can be fixed once and for all, for every dynamic call.

If you haven’t read any of my blog posts before, I recommend doing that as this will all seem very mysterious without that history. I am using interpreted RISC-V as a sandbox for my game scripts. It is working very well, and has turned out to be very useful over time. Not all of my blog posts are interesting for everyone — there’s a little bit of everything!

On dynamic calls (API function calls)

So, what is a dynamic call? Simply put, it’s a named function that is accessible in a game engine, or any other scripting host. For example, if I want to invoke “Game::exit()”, it could be a wrapper for the function call “sys_game_exit”, which is a build-time-generated dynamic call. The dynamic call implementation is simple enough: It’s a system call with arbitrary arguments and some extra temporary registers that identify the call both by hash and by name. That way, the engine can tell what you’re trying to do, and if something goes wrong, so can you, with rich error reporting.

A single opaque dynamic call:

Registers A0-A6 are function arguments (inputs, if you will)
Register T0 is the hash of the API function name (eg. crc32(Game::exit))
Register T1 is the (pointer to the) name (eg. “Game::exit\0”)
A7 is the “dynamic call” system call number (an inflexible number)
A0 can be (re-)used to return a value back to the script

And it all ends with a single invocation of the ecall instruction, which traps out of the VM and executes the system call in the game engine. This is all RISC-V as I have written about before. At the engine side the hash is looked up, and the callback function for Game::exit is then executed.
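The engine-side lookup can be pictured with a small host-only sketch. Everything here is hypothetical and made up for illustration (Registers, DynamicCalls, and the idea that the name behind T1 has already been copied out of guest memory). It is not the engine's actual code, just the shape of "look up the hash, run the callback, report the name on failure":

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the guest registers involved in a dynamic call.
struct Registers {
    uint64_t a[8] = {};    // A0-A6: arguments, A7: system call number
    uint64_t t0 = 0;       // hash of the API function name
    std::string t1_name;   // the name T1 points to (assumed already read from guest memory)
};

using Callback = std::function<void(Registers&)>;

// Hypothetical host-side table mapping hash -> callback.
struct DynamicCalls {
    std::unordered_map<uint32_t, Callback> table;

    // Called when the guest traps out with the "dynamic call" system call number.
    void handle(Registers& regs) const {
        auto it = table.find(static_cast<uint32_t>(regs.t0));
        if (it == table.end()) {
            // The name in T1 is what makes this error human-readable.
            throw std::runtime_error("Unknown dynamic call: " + regs.t1_name);
        }
        it->second(regs); // the callback may write a result back into A0
    }
};
```

A callback that never writes to a[0] leaves the first argument in place, which is why A0 doubles as the return register.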

So, basically it is a way to make the game engine do something, it is a part of the build system, and it always has a human-readable name just in case.

A dynamic call

inline bool Game::is_debugging()
{
  return sys_is_debug();
}

In the game engine it can be implemented like this:

 Script::set_dynamic_call(
   "Debug::is_debug", [](Script& script)
    {
      auto& machine = script.machine();
      machine.set_result(script.is_debug());
    });

The callback function gets access to the virtual machine, so that it can read arguments and write back a result. Here we just set the boolean “is_debug” as a result. Hence, the API function will now correctly query whether or not we are in debug mode at the time of the call.

Finally, there is a JSON element that generates the system call wrapper in each game script, as part of the build system:

{
  "Debug::is_debug": "int sys_is_debug ()",
  "": "..."
}

It’s a bit of a chore to create and implement a function, but at least there are no strange issues. If something is wrong or there are collisions, the build system will tell you early on. If a crash happens while running, again we will see the name of the problematic API function!
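The collision check is easy to sketch. I only say "crc32" above, so the exact variant is an assumption here: this uses the standard CRC-32 (reflected polynomial 0xEDB88320), and assign_hashes is a hypothetical helper name, not something from the real build system:

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Standard bitwise CRC-32 (IEEE 802.3). Assumption: the real build system
// may use a different CRC variant or a table-driven implementation.
inline uint32_t crc32(const std::string& text) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : text) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right, and XOR in the polynomial when the low bit was set
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical build step: hash every API name and fail early when two
// entries end up with the same hash (this also catches duplicate names).
inline std::map<uint32_t, std::string>
assign_hashes(const std::vector<std::string>& api_names) {
    std::map<uint32_t, std::string> by_hash;
    for (const auto& name : api_names) {
        auto result = by_hash.emplace(crc32(name), name);
        if (!result.second) {
            throw std::runtime_error(
                "Hash collision: " + name + " vs " + result.first->second);
        }
    }
    return by_hash;
}
```

Running something like this over the JSON keys before generating any wrappers means a collision stops the build instead of surfacing as a mysterious runtime bug.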

This works super well, and I’ve used it for a long time now. That said, it has certain fixed overheads. It loads a few extra registers and it requires an opaque call with a return. It is possible to skip the return instruction in the game engine (and I actually do that), but after thinking about this for a while I like the idea of a secondary implementation of each dynamic call with extra bang for buck. Some dynamic calls are invoked more than others etc. Ideally they would each get their own system call number, but I’ve tried that, and it creates a lot of versioning issues that are hard to track down.

An inline system call

Modern inline assembly for system call invocation is fairly straightforward. You use the register keyword to lock down some registers, and then you use these registers in a final system call invocation:

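inline long syscall(long n)
{
  register long a0 asm("a0");              // return value comes back in A0
  register long syscall_id asm("a7") = n;  // system call number goes in A7

  // "ecall" is the current name of the instruction formerly called "scall"
  asm volatile ("ecall" : "=r"(a0) : "r"(syscall_id));

  return a0;
}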

This invokes system call number n with no arguments, and it can return a value in A0. Note that if the game engine never changes register A0 when it handles this system call, then, not surprisingly, the return value is whatever A0 was before the system call was invoked! Could be any value, really. So, it’s just better if we can get the build system to generate this based on a specification.

You also have to manually handle this system call in the game engine. If you ever change the system call number, everything breaks in weird ways. Because of this, it's really only for things like Linux syscall emulation, and for special things like threads, multi-processing etc. where custom system calls make sense.

So, in order to make everyone’s life easier, a single system call is set aside for dynamic calls.

Inline assembly for an opaque dynamic call

Opaque dynamic calls are reliable and fairly optimal. They are generated by the build system, and they look something like this:

__asm__("\n\
.global sys_empty\n\
.func sys_empty\n\
sys_empty:\n\
  li t0, 0x68c73dc4\n\
  lui t1, %hi(sys_empty_str)\n\
  addi t1, t1, %lo(sys_empty_str)\n\
  li a7, 504\n\
  ecall\n\
  ret\n\
.endfunc\n\
.pushsection .rodata\n\
sys_empty_str:\n\
.asciz \"empty\"\n\
.popsection\n\
");

It’s hard to read, but what it does is create a global symbol of type function with the name sys_empty. The original specification is:

"empty":      "void sys_empty ()"

The RISC-V system call ABI is exactly like the C ABI (and even if it wasn’t, we will just mandate that it is!). The result is that no matter how many arguments or how many return values, it will appear as a C function call on both sides, despite going through a system call and requiring T0 and T1 for lookup and error handling. Quite low overhead, actually!

There is some redundancy here, though. We make an opaque function call which can create a lot of pushing and popping on the caller. The function call itself is not free, and we also use T0 and T1 registers. All in all, it’s about 6 or 7 redundant instructions.

Inline dynamic calls

What if we just use the system call number itself as the hash value, and then if n ≥ 600 (where regular system calls end), treat it as a dynamic call in the game engine? It’s possible because we can error out if the hash is colliding with “real” system calls at build time. We can also ditch T1 and error out with a vaguer error message, however it shouldn’t be much of an issue because when implementing an API call we should be starting out with the safe and reliable opaque version, and then switch over to the inline assembly variant only when everything works. Ideally!
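Here is a rough host-side sketch of that split. The 600 boundary and the two hashes come from this post. The Engine type, the two tables, and the choice to only look at the low 32 bits of A7 are assumptions for illustration (on RV64 a hash with the top bit set arrives sign-extended, as the disassembly further down shows):

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>

constexpr uint32_t SYSCALLS_MAX = 600; // regular system calls end here

// Build-time side: a hash below the boundary cannot double as a syscall number.
constexpr bool usable_as_syscall_number(uint32_t hash) {
    return hash >= SYSCALLS_MAX;
}
static_assert(usable_as_syscall_number(0x68c73dc4u),
              "hash collides with a regular system call");

// Hypothetical engine with one table per kind of call.
struct Engine {
    using Handler = std::function<uint64_t()>;
    std::unordered_map<uint32_t, Handler> regular;  // n < 600
    std::unordered_map<uint32_t, Handler> dynamic;  // n >= 600, keyed by hash

    uint64_t ecall(uint64_t a7) const {
        // Assumption: only the low 32 bits of A7 identify the call.
        const uint32_t n = static_cast<uint32_t>(a7);
        const auto& table = (n < SYSCALLS_MAX) ? regular : dynamic;
        auto it = table.find(n);
        if (it == table.end()) {
            // Without T1 there is no name, so the error is vaguer.
            throw std::runtime_error("Unhandled system call: " + std::to_string(n));
        }
        return it->second();
    }
};
```

The trade-off is visible in the error path: the opaque variant could print the API name, while this one can only print a number.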

So, the idea is to generate an inline assembly function based on the information in the JSON entries. An example:

extern unsigned sys_gui_label (unsigned, const char *);

Above: The opaque dynamic call header prototype of creating a new GUI label. The generated assembly looks exactly like every other opaque wrapper function as seen before.

static inline unsigned isys_gui_label (unsigned arg0,const char * arg1) {
  register unsigned ra0 asm("a0");
  register uint32_t a7 asm("a7") = 0xf08cd072;
  register unsigned a0 asm("a0") = arg0;
  register const char * a1 asm("a1") = arg1;
  asm("ecall" : "=r"(ra0) : "r"(a0),"r"(a1),"m"(*a1),"r"(a7) : );
  return ra0;
}

Above: The inline assembly variant is now also generated at build-time.

Inline assembly is difficult to always get right, but we will do our best. In the GUI label case, we have an unsigned return value in A0 (named ra0), an unsigned input argument in A0 (named a0), and a C-string in A1 (named a1). I decided to always split A0 into two statements since the types can differ, and we have to dereference the string in order to lock down both the register and the memory location. I learned about “m” the hard way, like many I assume.

As we can see from the inline function, the inlined version is just the same function with an i prepended. sys_gui_label becomes isys_gui_label and so on.
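To give an idea of what the generator does, here is a minimal sketch that turns one parsed entry into the text of an inline wrapper. ApiSpec and generate_inline_wrapper are hypothetical names, the JSON parsing is left out, and the real generator is a build script that handles more cases than this (void pointers, for example, cannot be dereferenced for the "m" constraint and would need special treatment):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical description of one JSON entry after parsing.
struct ApiSpec {
    std::string return_type;             // e.g. "unsigned" or "void"
    std::string name;                    // e.g. "sys_gui_label"
    std::vector<std::string> arg_types;  // e.g. {"unsigned", "const char *"}
    uint32_t hash;                       // hash of the API name
};

inline std::string generate_inline_wrapper(const ApiSpec& spec) {
    std::string params, regs, inputs;
    for (std::size_t i = 0; i < spec.arg_types.size(); ++i) {
        const std::string n = std::to_string(i);
        const std::string& type = spec.arg_types[i];
        if (i > 0) params += ", ";
        params += type + " arg" + n;
        // Pin each argument to its ABI register: arg0 -> a0, arg1 -> a1, ...
        regs += "  register " + type + " a" + n + " asm(\"a" + n + "\") = arg" + n + ";\n";
        inputs += "\"r\"(a" + n + "),";
        // Pointers also get an "m" input so the compiler knows memory is read.
        if (type.find('*') != std::string::npos) inputs += "\"m\"(*a" + n + "),";
    }
    char hash_text[16];
    std::snprintf(hash_text, sizeof hash_text, "0x%08x", static_cast<unsigned>(spec.hash));
    const bool has_result = spec.return_type != "void";

    std::string out = "static inline " + spec.return_type + " i" + spec.name + " (" + params + ") {\n";
    if (has_result) out += "  register " + spec.return_type + " ra0 asm(\"a0\");\n";
    out += std::string("  register uint32_t a7 asm(\"a7\") = ") + hash_text + ";\n";
    out += regs;
    out += std::string("  asm(\"ecall\" : ") + (has_result ? "\"=r\"(ra0)" : "")
         + " : " + inputs + "\"r\"(a7) : );\n";
    if (has_result) out += "  return ra0;\n";
    out += "}\n";
    return out;
}
```

Feeding it the GUI label entry (return type unsigned, arguments unsigned and const char *, hash 0xf08cd072) produces a wrapper with the same register setup and constraint list as the generated function shown above.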

Benchmarks

Inline dynamic calls benefit immensely from being called repeatedly, regardless of which call it is, while opaque dynamic calls will have a fixed overhead that cannot be optimized away.

In order to measure the real benefits, we must make a few calls sequentially, with and without arguments, and see how it relates on average to opaque dynamic calls.

The assembly for calling the API function 4 times is, as expected, optimal:

0000000050000610 <_ZL22inline_dyncall_handlerv>:
    50000610:   68c748b7                lui     a7,0x68c74
    50000614:   dc48889b                addiw   a7,a7,-572 # 68c73dc4 <__BSS_END__+0x18c550ac>
    50000618:   00000073                ecall
    5000061c:   00000073                ecall
    50000620:   00000073                ecall
    50000624:   00000073                ecall
    50000628:   00008067                ret

The hash is loaded into A7. The return instruction is a part of the benchmark, but the overhead of the benchmarking is measured beforehand and subtracted out.
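The overhead subtraction can be sketched like this. It is a simplified, hypothetical harness (time_ns and summarize are made-up names), not the one I actually use: time an empty function to get the fixed overhead, time the real thing, subtract, and then report median, lowest and highest:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct Stats {
    int64_t median;
    int64_t lowest;
    int64_t highest;
};

// Average time per call in nanoseconds over a number of rounds.
template <typename F>
int64_t time_ns(F&& fn, int rounds) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() / rounds;
}

// Subtract the previously measured overhead from every sample, then summarize.
inline Stats summarize(std::vector<int64_t> samples_ns, int64_t overhead_ns) {
    if (samples_ns.empty()) throw std::invalid_argument("no samples");
    for (auto& sample : samples_ns) sample -= overhead_ns;
    std::sort(samples_ns.begin(), samples_ns.end());
    return { samples_ns[samples_ns.size() / 2], samples_ns.front(), samples_ns.back() };
}
```

Using the median rather than the average keeps a single slow outlier from skewing the result.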

When mixing 8 functions, 4x with no arguments and 4x with 3 integral arguments, the inline version also looks extremely good:

000000005000062c <_ZL22inline_dyncall_args_x4v>:
    5000062c:   68c74737                lui     a4,0x68c74
    50000630:   dc47089b                addiw   a7,a4,-572 # 68c73dc4 <__BSS_END__+0x18c550a4>
    50000634:   00000073                ecall
    50000638:   e82517b7                lui     a5,0xe8251
    5000063c:   9f07889b                addiw   a7,a5,-1552 # ffffffffe82509f0 <__BSS_END__+0xffffffff98231cd0>
    50000640:   00100513                li      a0,1
    50000644:   00200593                li      a1,2
    50000648:   00300613                li      a2,3
    5000064c:   00000073                ecall
    50000650:   dc47089b                addiw   a7,a4,-572
    50000654:   00000073                ecall
    50000658:   9f07889b                addiw   a7,a5,-1552
    5000065c:   00000073                ecall
    50000660:   dc47089b                addiw   a7,a4,-572
    50000664:   00000073                ecall
    50000668:   9f07889b                addiw   a7,a5,-1552
    5000066c:   00000073                ecall
    50000670:   dc47089b                addiw   a7,a4,-572
    50000674:   00000073                ecall
    50000678:   9f07889b                addiw   a7,a5,-1552
    5000067c:   00000073                ecall
    50000680:   00008067                ret

Because the compiler is informed about which registers change value when performing each dynamic call, and all this is auto-generated by the build system, it does not reload the arguments more than once here. Very nice! It also switches between the two API calls in just one instruction. You can imagine the second test is something like this:

void mixed_test() {
    Game::something();
    Game::some_args(1, 2, 3);
    Game::something();
    Game::some_args(1, 2, 3);
    Game::something();
    Game::some_args(1, 2, 3);
    Game::something();
    Game::some_args(1, 2, 3);
}

A casual benchmark of the safe opaque calls vs the inlined assembly variants shows that the inlining is quite a bit faster.

The inlined variants are almost 3x faster, which is awesome to see. Most dynamic calls should be completely safe using the inlined variant, as they are usually just peddling integers. The first test is just repeatedly calling an empty function with no arguments, while the second one is mixing two API functions, one with no arguments and one with 3 arguments.

I have previous benchmarks with direct system calls and LuaJIT:

libriscv: syscall overhead median 2ns     lowest: 2ns      highest: 6ns
luajit: syscall overhead   median 11ns    lowest: 10ns     highest: 18ns
lua5.3: syscall overhead   median 23ns    lowest: 21ns     highest: 33ns

So, an argument-less API call required around 3ns when inlined, while a direct system call was only 2ns. For LuaJIT it was around 11ns. That is pretty good considering we have to do a hash lookup. Lua is also bytecode-interpreted, like libriscv. I suppose all of us have to do lookups to support user-friendly APIs.

My goal is to reach direct system call overhead with these dynamic calls (aka. API function calls). It would only be possible if we could number them in a way that didn’t break when you add and remove API functions over time, without having to recompile everything. (EDIT: I did this, and it was only marginally better with many downsides.)

Conclusion

So, why even optimize something that seems to be quite fast to begin with? Well, when running a real game, calling API functions back into the game engine is pretty much all the script is doing, apart from entering and leaving the guest VM (where the script is hosted). It must be good at this one thing, and it must be flexible and reliable. It helps when it’s part of the build system (or fully automatic, like in Lua), and it should error out as early as possible when things don’t align, preferably at build-time.

Now with the option of choosing the inlined variants as needed, I can pretty much halve the cost of whichever API functions are called often. I have two game projects going, and in the second game I am calling certain VM functions millions of times. Sometimes that warrants an algorithm change, and other times you keep calling it a million times because that’s what gives you the most creative options. 🌄

It is always interesting to see how well the compiler can optimize the assembly. Auto-generating these functions and just having them work each time in all kinds of combinations feels like an under-utilized way of getting free performance.

We are still working on our unnamed game! This is the work-in-progress playable overworld.

-gonzo