Using C++ as a scripting language, part 4

fwsGonzo
Mar 14, 2022


Accelerating standard functions to aid emulation

If you’ve read my previous writings in this series, you will have seen some interesting takes on how to solve things like relatively safe communication between virtual machines, just for the sake of scripting in a game engine. On top of that, it’s all in a RISC-V emulator I made, which makes everything extra hard because it has to emulate everything correctly, as fast as possible.

Now I’ve gone a few steps forward. And a few steps back in other places.

Accelerating standard functions

I want to tell you about .. uh accelerating standard functions. Make sure you’re sitting down. So, there’s these common functions in C/C++ that both languages (and many others) rely on, and so they are pervasive and must be extremely performant. Namely memcpy, memset, memcmp, strlen, strcmp and so on.

These functions are, simply put, not as fast as they could be when they are emulated. So, the only solution then is to make each and every one of them into a system call wrapper, and then implement system calls that do these things very fast. Should be no problem, right? Well…

I am already maintaining my own runtime environment for scripting based on newlib, the GNU C++ standard library and 64-bit RISC-V, with a dose of custom overrides, all in a CMake build system.

The first thing you do when you implement a feature such as this is to have a build option that disables it, replacing it with easy-to-verify defaults. A sanity checking option, if you will. These functions are special to the compiler, and one thing you will notice when you implement them yourself is that the compiler will start replacing, for example, the copy loop inside your memcpy implementation with a call to memcpy. So, the naive version that implements memcpy as a copy from one buffer to another will end up calling itself endlessly. The compiler is really smart and sees that the function could be improved by calling the much faster memcpy function instead! The solution is to add the -fno-builtin compiler flag to the module that contains these variants.

set_source_files_properties(libc.cpp
PROPERTIES COMPILE_FLAGS -fno-builtin)
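For the sanity-checking build, the defaults can be as plain as byte loops. Here is a minimal sketch of what that could look like. The names fallback_memcpy, fallback_memset and fallback_self_check are made up for illustration and are not taken from the actual runtime:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, for illustration only. In a real libc replacement
// these loops would be the bodies of memcpy and memset themselves, which is
// where -fno-builtin matters: without it the optimizer may recognize each
// loop and replace it with a call to the function being defined.
extern "C" void* fallback_memcpy(void* dest, const void* src, std::size_t size)
{
    auto* d = static_cast<unsigned char*>(dest);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < size; i++)
        d[i] = s[i];
    return dest;
}

extern "C" void* fallback_memset(void* dest, int ch, std::size_t size)
{
    auto* d = static_cast<unsigned char*>(dest);
    for (std::size_t i = 0; i < size; i++)
        d[i] = static_cast<unsigned char>(ch);
    return dest;
}

// A tiny self-check that a sanity build could run at startup.
inline void fallback_self_check()
{
    char src[8] = "abcdefg";
    char dst[8];
    fallback_memset(dst, 0, sizeof(dst));
    for (std::size_t i = 0; i < sizeof(dst); i++)
        assert(dst[i] == 0);
    fallback_memcpy(dst, src, sizeof(src));
    for (std::size_t i = 0; i < sizeof(src); i++)
        assert(dst[i] == src[i]);
}
```

Because the loops are trivial to read, any difference in behavior between this build and the accelerated one points at the system call path rather than at the program.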

The second thing you will find out is that if you want a decent standard library on top of all of this, such as newlib and libstdc++, which the GNU RISC-V toolchain helpfully builds for us, most of these system functions have strong symbols, so you can’t easily override them. Thankfully ld has an option called --wrap which lets you implement your own versions of the functions by simply prepending __wrap_. So memcpy becomes __wrap_memcpy, and then we implement the latter function.
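To make the naming concrete, here is a small sketch of a wrapped function, using strlen as the example. The loop body is only a placeholder to keep the sketch runnable anywhere; in the setup described here it would be the system call instead:

```cpp
#include <cassert>
#include <cstddef>

// Link with -Wl,--wrap=strlen so that undefined references to strlen
// resolve to __wrap_strlen. ld also makes the original implementation
// reachable as __real_strlen, which this sketch does not use.
extern "C" std::size_t __wrap_strlen(const char* str)
{
    assert(str != nullptr);
    std::size_t len = 0;
    while (str[len] != '\0')
        len++;
    return len;
}
```

The double underscore prefix is normally reserved for the implementation, but it is the name ld looks for, so there is no way around it here.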

Now, the third problem is how do you design this assembly to be as fast as possible? One safe way would be to just write a system call wrapper in assembly (not inline), so that the compiler doesn’t do any shenanigans:

asm(".global __wrap_memcpy\n"
"__wrap_memcpy:\n"
" li a7, " STRINGIFY(SYSCALL_MEMCPY) "\n"
" ecall\n"
" ret\n");

This is called global assembly, and the compiler will not try anything funny with it. What happens here is that since the C ABI and the system call ABI are the same, we don’t have to shuffle registers around to make the call. Instead we just set up the system call number in a7 and then invoke the system call. That’s all we have to do.
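The snippet uses STRINGIFY without showing it. A common way to define such a macro is the two-level form below. This is an assumption about how it might be defined, not a copy of the real one. The value 6 for SYSCALL_MEMCPY comes from later in this post:

```cpp
#include <cassert>
#include <cstring>

// Two-level expansion: STRINGIFY expands its argument first, then
// STRINGIFY_HELPER turns the result into a string literal.
#define STRINGIFY_HELPER(x) #x
#define STRINGIFY(x) STRINGIFY_HELPER(x)

// For contrast: a single level stringifies the macro name itself.
#define STRINGIFY_ONCE(x) #x

// System call 6 is memcpy, as stated later in this post.
#define SYSCALL_MEMCPY 6

// Adjacent string literals are concatenated by the compiler, which is what
// lets the number be spliced into the assembly text.
static const char load_syscall_number[] =
    " li a7, " STRINGIFY(SYSCALL_MEMCPY) "\n";
static const char unexpanded_name[] = STRINGIFY_ONCE(SYSCALL_MEMCPY);

inline void check_stringify()
{
    assert(std::strcmp(load_syscall_number, " li a7, 6\n") == 0);
    assert(std::strcmp(unexpanded_name, "SYSCALL_MEMCPY") == 0);
}
```

With only one level the assembler would see the text SYSCALL_MEMCPY instead of the number, which it knows nothing about.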

On the system call handler side (the outside host side), we have an implementation that copies this buffer from A to B in a jiffy. It also does it safely, so that anything that normally would cause a protection fault still does so. It looks something like this:

m.memory.foreach(src, len,
[dst] (auto& mem, address_t off, const uint8_t* data, size_t len) {
std::memcpy(dst + off, data, len);
});

We need to use helpers like foreach on memory because it is page-based, and the memory is not sequential in host memory.
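To show why a helper like this is needed, here is a rough sketch of page-based guest memory with a foreach that hands out one contiguous chunk per page. PagedMemory, map_page and copy_from_guest are hypothetical names made up for this sketch, and the callback signature is simplified. It is not the emulator’s real API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <unordered_map>
#include <vector>

constexpr std::uint64_t kPageSize = 4096;

// Hypothetical stand-in for guest memory: pages live in separate host
// allocations, so a guest range is generally not contiguous on the host.
struct PagedMemory {
    std::unordered_map<std::uint64_t, std::vector<std::uint8_t>> pages;

    // Map one zero-filled page covering the given guest address.
    void map_page(std::uint64_t addr) {
        auto& page = pages[addr / kPageSize];
        if (page.empty()) page.resize(kPageSize);
    }

    // Walk [addr, addr + len) one page at a time. The callback receives the
    // offset from the start of the range and a pointer to contiguous bytes.
    // An unmapped page throws, standing in for a protection fault.
    template <typename Callback>
    void foreach(std::uint64_t addr, std::size_t len, Callback callback) const {
        std::size_t done = 0;
        while (done < len) {
            const std::uint64_t cur = addr + done;
            const std::uint64_t in_page = cur % kPageSize;
            const std::size_t chunk =
                std::min<std::size_t>(kPageSize - in_page, len - done);
            assert(chunk > 0);
            auto it = pages.find(cur / kPageSize);
            if (it == pages.end())
                throw std::runtime_error("protection fault");
            callback(done, it->second.data() + in_page, chunk);
            done += chunk;
        }
    }
};

// Copy a guest range into a host buffer, chunk by chunk.
inline void copy_from_guest(const PagedMemory& mem, std::uint8_t* dst,
                            std::uint64_t src, std::size_t len)
{
    mem.foreach(src, len,
        [dst](std::size_t off, const std::uint8_t* data, std::size_t n) {
            std::memcpy(dst + off, data, n);
        });
}
```

Each chunk is copied with a single host memcpy, so the cost is per page rather than per byte, and a range that runs into an unmapped page still faults.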

Function calls are expensive

One drawback of using global assembly is that it isn’t visible to the compiler, and it will always generate a function call. Since function calls can do anything, the compiler has to save and restore registers before and after the function call. In an emulated environment that is costly. Here is a random example I found in objdump:

241228: ff010113 addi sp,sp,-16
24122c: 00813023 sd s0,0(sp)
241230: 00050413 mv s0,a0
241234: 01043603 ld a2,16(s0)
241238: 00853503 ld a0,8(a0)
24123c: 00113423 sd ra,8(sp)
241240: 40a60633 sub a2,a2,a0
241244: 138030ef jal ra,24437c <__wrap_memcpy>

There were others where it was saving more registers before doing the memcpy, but this is a simpler example. It saves the return address before calling memcpy. And the --wrap argument to ld seems to be working well!

These functions will not be inlineable, which is a problem for me because… (Author never had a good reason.)

Fully inlined standard functions

We can do better! First off, we need to enable LTO (Link-Time Optimization). We also need to --whole-archive the tiny libc that contains these functions to avoid multiple-definition errors. This will allow the compiler to inline the functions, which matters especially if we are going to attempt inline atomic standard memory functions: atomic in the sense that the whole memory operation is a single instruction to the compiler. And second, we will need to write, for example, memset as this kind of monstrosity:

void* memset(void* vdest, const int ch, size_t size)
{
register char* a0 asm("a0") = (char*)vdest;
register int a1 asm("a1") = ch;
register size_t a2 asm("a2") = size;
register long syscall_id asm("a7") = SYSCALL_MEMSET;
asm volatile ("ecall"
: "=m"(*(char(*)[size]) a0)
: "r"(a0), "r"(a1), "r"(a2), "r"(syscall_id));
return vdest;
}

As well as the strlen function:

size_t strlen(const char* str)
{
register const char* a0 asm("a0") = str;
register size_t a0_out asm("a0");
register long syscall_id asm("a7") = SYSCALL_STRLEN;
asm volatile ("ecall" : "=r"(a0_out) :
"r"(a0), "m"(*(const char(*)[4096]) a0), "r"(syscall_id));
return a0_out;
}

What happens here, in the memset version, is that we are telling the compiler that the vdest memory area of size bytes is being written. We also require the registers a0, a1 and a2 at the same time. The return value is the original vdest, and since there are no output registers (the host does not change any of the registers), the compiler will just assume that nothing really changed except the contents of memory at vdest. That is exactly what we want.
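The same pattern carries over to memcpy by adding a second memory operand for the source. The sketch below is extrapolated from the memset version above and is not the actual implementation. The name accelerated_memcpy is made up, and the non-RISC-V branch exists only so the sketch can be compiled and run on an ordinary host:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// System call 6 is memcpy, as stated later in this post.
#define SYSCALL_MEMCPY 6

// Hypothetical name; in the real runtime this would be memcpy itself.
inline void* accelerated_memcpy(void* vdest, const void* vsrc, std::size_t size)
{
    assert(size == 0 || (vdest != nullptr && vsrc != nullptr));
#if defined(__riscv)
    // Only meaningful inside an emulator whose host implements this call.
    register char* a0 asm("a0") = (char*)vdest;
    register const char* a1 asm("a1") = (const char*)vsrc;
    register std::size_t a2 asm("a2") = size;
    register long syscall_id asm("a7") = SYSCALL_MEMCPY;
    asm volatile ("ecall"
        : "=m"(*(char(*)[size]) a0)              // dest bytes are written
        : "r"(a0), "r"(a1), "r"(a2), "r"(syscall_id),
          "m"(*(const char(*)[size]) a1));       // src bytes are read
    return vdest;
#else
    // Any other target: plain library call, so the sketch still runs.
    return std::memcpy(vdest, vsrc, size);
#endif
}
```

The "m" input on the source tells the compiler that pending stores to that buffer must be completed before the ecall, the same way the "=m" output tells it the destination has changed afterwards.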

This turns out to be the absolute best way to handle these functions. The assembly is extraordinarily good, and quite an achievement for me:

240e0c: 01053603 ld a2,16(a0)
240e10: 00050793 mv a5,a0
240e14: 00853503 ld a0,8(a0)
240e18: 00600893 li a7,6
240e1c: 40a60633 sub a2,a2,a0
240e20: 00000073 ecall

The program is now lean. System call 6 is memcpy. I made the full inlining of the memory and heap system calls a CMake option so that if I want to debug the code I can just toggle a switch and they are back to named function calls.

There are some missed optimization opportunities now. I don’t know why these code snippets can’t be merged:

240e40: 00008067 ret
240e44: 00008067 ret
240e48: 00008067 ret
240e4c: 00400893 li a7,4
240e50: 00000073 ecall
240e54: 00008067 ret
240e58: 00008067 ret
240e5c: 00400893 li a7,4
240e60: 00000073 ecall
240e64: 00008067 ret
240e68: 00008067 ret
240e6c: 00400893 li a7,4
240e70: 00000073 ecall
240e74: 00008067 ret

System call 4 is free(void*). Inlining only shaved around 1 KB off my RISC-V program, but it’s around 1 KB of function calls and stored/restored registers that we didn’t want.

I also have a heap implementation outside of the VM that is pretty fast. So, naturally we can inline the heap functions too. We will need to override the C++ new header too:

#pragma once
#include_next <new>
#include <heap.hpp>
inline void* operator new(size_t size) {
return sys_malloc(size);
}
inline void* operator new[](size_t size) {
return sys_malloc(size);
}
inline void operator delete(void* ptr) {
sys_free(ptr);
}
inline void operator delete[](void* ptr) {
sys_free(ptr);
}
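The header above leans on sys_malloc and sys_free from heap.hpp, which isn’t shown here. A rough guess at their shape, following the same inline system call pattern as strlen: system call 4 is free as stated above, while the number 1 for malloc is only inferred from the disassembly further down. The non-RISC-V branches and heap_smoke_test are additions so the sketch runs on an ordinary host:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// free is system call 4 (stated in this post); malloc as 1 is inferred
// from "li a7,1" in the disassembly and may not match the real runtime.
#define SYSCALL_MALLOC 1
#define SYSCALL_FREE   4

inline void* sys_malloc(std::size_t size)
{
#if defined(__riscv)
    // The size goes in through a0 and the pointer comes back in a0.
    register std::size_t a0_in asm("a0") = size;
    register void* a0_out asm("a0");
    register long syscall_id asm("a7") = SYSCALL_MALLOC;
    asm volatile ("ecall" : "=r"(a0_out) : "r"(a0_in), "r"(syscall_id));
    return a0_out;
#else
    return std::malloc(size);
#endif
}

inline void sys_free(void* ptr)
{
#if defined(__riscv)
    register void* a0 asm("a0") = ptr;
    register long syscall_id asm("a7") = SYSCALL_FREE;
    asm volatile ("ecall" : : "r"(a0), "r"(syscall_id) : "memory");
#else
    std::free(ptr);
#endif
}

// Round trip used as a quick smoke test.
inline void heap_smoke_test()
{
    void* p = sys_malloc(16);
    assert(p != nullptr);
    sys_free(p);
}
```

Since both functions are inline and tiny, each call site can collapse to loading a7 and issuing the ecall.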

And that, at least, allows us to always inline C++ heap-related functions. Let’s test this:

int main()
{
operator delete(operator new(10));
}

The exception that might be thrown probably prevents optimizing away the whole thing, even if it is unused. The calls are indeed inlined:

2404ec: 00400513 li a0,4
2404f0: 00100893 li a7,1
2404f4: 00000073 ecall
2404f8: 00050663 beqz a0,240504 <main+0x38>
2404fc: 00400893 li a7,4
240500: 00000073 ecall

Fairly good assembly! Let’s try with malloc and free, and see if it was the exception that was keeping this from being eliminated:

int main()
{
std::free(std::malloc(44));
}

Which produces this assembly:

0000000000240024 <main>:
main():
240024: 00000513 li a0,0
240028: 00008067 ret

Alright, that confirms that suspicion! The compiler can eliminate unused heap memory created from inline system calls! Wow!

Results

The memory copying performance is improved 43x when using the memcpy system call instead of doing the copying manually inside the VM. The memset performance is improved 31x. These arrays are a little bit too small to give a complete picture, especially since the memset implementation has a special case for zeroes where it ignores zero pages, but we get the idea.

The point is that emulated instructions running in loops that read and write memory are extremely slow, and easy to improve on. Making system calls in the emulator is basically free: the overhead is the cost of a blind function call, around 1ns.

Unsurprisingly the benchmark memcpy is quite optimal, but it’s still nice to see:

000127b0 <test_syscall_memcpy>:
test_syscall_memcpy():
127b0: 84c18593 addi a1,gp,-1972
127b4: 79058513 addi a0,a1,1936
127b8: 25800613 li a2,600
127bc: 53058593 addi a1,a1,1328
127c0: 00600893 li a7,6
127c4: 00000073 ecall
127c8: 00008067 ret

So, this all looks fairly scary, but I have made several test programs over the course of this, including building with both Clang and GCC, 32- and 64-bit, and it seems to be working just fine. I will say no more lest I jinx myself.

-gonzo
