Simplifying the C++ API with blocking calls and creating wrappers around groups of system calls.
You can read Part 1 here: https://medium.com/@fwsgonzo/adventures-in-game-engine-programming-a3ab1e96dbde
Many things have changed since Part 1 was written. I enabled LTO in my RISC-V binaries, which improved performance. It was not easy to do, as the linker crashed regularly; the one thing that worked was to --whole-archive my tiny libc. But let's take a look at the benchmarks so far. I've added some new ones, and I am now benchmarking against LuaJIT:
Anyway, let’s continue.
Faster VM function calls
On the topic of function call overhead: some minor optimizations together with LTO made the array append benchmark run in 34 ns instead of 42 ns, a modest ~20% performance improvement.
Another thing I noticed is that the script can often refer to a function by its address directly, instead of using a string as the callback value. This is different from other scripting languages, where I don't believe you have this option. By calling the function directly, without having to look up its address at all, the same benchmark now runs in 25 ns, a ~40% performance improvement!
Passing strings from the VM to the host can be costly, and it is much more natural in C/C++ to simply pass the function itself rather than its name as a string. We will see later how I am passing an inline lambda directly to a system call.
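As a quick illustration (set_callback and its overloads are hypothetical here, not the engine's actual API), the difference in the script would be:

static void on_interact() {
    dm::print("Hi!\n");
}

obj.set_callback("on_interact"); // by name: the host must search the symbol table
obj.set_callback(&on_interact);  // by address: no lookup at all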
Dialogue API
I wanted to be able to use threads without having to manage them, so I made a thread constructor variant with the normal API that self-deletes when it returns. The drawback is that you have to return from the thread main function, but otherwise it's the same. I have written dialogue APIs before and I always thought they were annoying to use. If they are driven by, for example, JSON, they are easy to make but don't always have enough flexibility, and if they live in the script it becomes a callback hell. So I wanted to try to make something that looked like coroutines:
microthread::oneshot(
    [] (int a, int b) {
        dm::dialogue("Hello!");
        dm::dialogue("Hello again!");
        dm::dialogue(".. and again!");
        dm::print("a = ", a, " b = ", b, "\n");
        dm::dialogue("... and again again! Cya!");
    }, 111, 222);
The dialogue API function simply starts a message box and then blocks the thread using a made-up constant which can be used to unblock any thread that was blocked on it:
inline int dialogue(const char* text)
{
    apicall(ECALL_DIALOGUE_FUNCADDR, text, (void(*)()) [] {
        microthread::wakeup_one_blocked(REASON_DIALOGUE);
    });
    return microthread::block(REASON_DIALOGUE);
}
Each message box also has a next-callback, a function to call when the message box advances. That function is the inline lambda above, passed directly to the system call; when invoked, it unblocks any thread that was waiting. REASON_DIALOGUE here is just a number that is meaningful to threads waiting for the message box to continue. It would be possible to wake up every thread and have each one check whether its block condition had changed, but that is unnecessary. Maybe think of the number as a global futex word?
Accelerating Math
Simple system calls are just about the cheapest thing you can do in this scripting environment, so it’s only natural to accelerate math, such as sin, smoothstep and vector normalization.
dm::Map::daylight_event(
    [] (uint32_t ticks) {
        dm::Map::set_daylight(dm::smoothstep(
            0.0, 1.0, 0.6f + 0.4f * dm::sin(ticks * 0.02f))
        );
    });
The code above modulates the daylight in a pleasing manner, resting longer on the highs and lows.
So, how do these accelerated functions work out in practice, at the instruction level? Let's take vec2::normalize. We will create a wonky test where we hope for inlining, and ideally observe no function prologue or epilogue. That way we know the compiler has all the information it needs to do the absolute minimum amount of work.
__attribute__((noinline))
void normalize_test()
{
    vec2 v {1.0f, 2.0f};
    v = v.normalized();
    dm::sin(v.x);
    dm::sin(v.y);
}
Since we are using constants for the vector we should see some flw loads. Further, we should see that the compiler correctly realizes that everything happens in fa0 and fa1:
00010c24 <_Z14normalize_testv>:
   10c24: 8501a507   flw    fa0,-1968(gp) # 16054 <__SDATA_BEGIN__+0x4c>
   10c28: 8541a587   flw    fa1,-1964(gp) # 16058 <__SDATA_BEGIN__+0x50>
   10c2c: 1c000893   li     a7,448
   10c30: 00000073   ecall
   10c34: 1bb00893   li     a7,443
   10c38: 00000073   ecall
   10c3c: 20b58553   fmv.s  fa0,fa1
   10c40: 00000073   ecall
   10c44: 00008067   ret
Perfect! It loads the constants from .data and then uses the normalize system call directly.
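For reference, a guest-side wrapper that produces exactly this pattern can be written with explicit register variables (a sketch; ECALL_SIN stands in for the real system call number, 443 in the listing above):

inline float fast_sin(float x)
{
    // x arrives in fa0; the host computes sinf() and writes the result back into fa0
    register float ret asm("fa0") = x;
    register long  num asm("a7")  = ECALL_SIN;
    asm volatile("ecall" : "+f"(ret) : "r"(num));
    return ret;
}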
Wrapping API objects in classes
Objects are easy to work with because they are just 32-bit integral arena offsets that are produced by the engine. An object address comes from a single arena, and is aligned to the object size which makes it easy to verify that the address belongs to a real object. Still, it would have been great to have classes with member functions. To simplify working with objects we can just wrap a struct around the address, like so:
struct Entity {
    vec2 xy() const;
    vec2 size() const;
    void move(float x, float y);

    const uint32_t addr;
};
struct Object : public Entity {
    static Object DB(uint32_t hash, float x, float y, int floor);
    ...
};
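As an aside, the verification mentioned earlier (arena bounds plus alignment) is just a couple of comparisons on the host side. A minimal sketch, with assumed names for the arena constants:

// Sketch: a 32-bit object address is valid if it lies inside the
// object arena and is aligned to the object size.
bool is_valid_object(uint32_t addr)
{
    return addr >= OBJECT_ARENA_BEGIN
        && addr <  OBJECT_ARENA_END
        && (addr - OBJECT_ARENA_BEGIN) % OBJECT_SIZE == 0;
}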
To produce an object we use recipes, by giving a hash of the object's name:
#define OBJECT_DB(str, ...) dm::Object::DB(crc32(str), __VA_ARGS__)
The CRC32 value is constexpr and can be calculated at compile-time. The engine will look up the object in its database and create it at the specified position. Sadly, I could not find a way to pass constexpr strings as function arguments at all, even when using user-defined string literals, such as “mystring”_hashed.
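For the curious, a compile-time CRC32 can be as small as this (my own minimal sketch, not necessarily the exact implementation used here):

constexpr uint32_t crc32(const char* s)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (*s) {
        crc ^= (uint8_t) *s++;
        for (int i = 0; i < 8; i++) // one bit at a time, reflected polynomial
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    }
    return ~crc;
}

constexpr uint32_t hash = crc32("barrel"); // fully evaluated at compile-time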
Timers need not be complicated:
struct Timer {
    using Callback = std::function<void(Timer)>;

    static Timer oneshot(double time, Callback);
    static Timer periodic(double period, Callback);
    static Timer periodic(double time, double period, Callback);

    void stop() const;

    const int id;
};
They really only need a stop function, for periodic timers. With that we can implement sleep:
inline long sleep(double seconds)
{
    const int self = microthread::gettid();
    Timer::oneshot(seconds, [self] (auto) {
        microthread::unblock(self);
    });
    return microthread::block();
}
If we implement a dolly-mode in the camera we can make the thread block until a destination has been reached, which lets us do something like this:
microthread::oneshot([] {
    dm::sleep(2.0);
    dm::Camera::dolly({
        .speed = 8.0,
        .dest = {400, 150}
    });
    dm::sleep(1.0);
    dm::Camera::dolly({
        .speed = 3.0,
        .dest = {400, 600}
    });
    dm::Camera::set_mode(Camera::PLAYER);
});
As you can see, I chose designated initializers for dollying because they let me see the member names, which makes the code much less magical. With this we can implement cut-scenes and many future effects using tiny threads. The total memory overhead is very low, as unused thread stack pages are copy-on-write zero pages.
Backtrace functionality
One thing that I felt was necessary was access to backtrace functionality. Things can go really wrong in these precise languages. I tried to use __builtin_return_address for this with levels higher than 0, but it doesn't seem to be implemented for RISC-V on my GCC version. That's a bummer, because tracing backwards is complicated. That said, RISC-V does at least have the RA register, which means we can show information on at least two functions in the call stack. Also, having the backtrace functionality outside the guest means we have access to it at any time, such as when a CPU exception happens. The feature is implemented by looking at PC as well as RA and matching these locations to symbols in .symtab, which lets us put a name and an offset to them. For C++, as a last step, we also have to demangle the name using __cxa_demangle. To test this, let's just jump to zero:
void (*you_hate_to_see_it)() = nullptr;
you_hate_to_see_it();
Which resulted in:
[22:08:08] ERROR: Script::call exception: Execution space protection fault
-> [0] 0x00000000 + 0x000: (null)
-> [1] 0x00013ac4 + 0x068: start
-> [-] 0x00013ac4 + 0x000: start
The last entry in the backtrace is simply the VM call itself that is still waiting for a return value. Interestingly, if we do the same jump after a call to sleep inside a thread:
[22:30:38] ERROR: Script::call exception: Execution space protection fault
-> [0] 0x00000000 + 0x000: (null)
-> [1] 0x00010c24 + 0x02c: std::_Function_handler<void (), microthread::oneshot<start::{lambda()#1}>(start::{lambda()#1} const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
-> [-] 0x00013980 + 0x000: dm::Timer::oneshot(double, std::function<void (dm::Timer)>)::{lambda(int, void*)#1}::_FUN(int, void*)
So, we can see that we jumped to zero, where we jumped from, and which function resumed the thread in the first place. Better than nothing! The demangling is a little bit wonky, and I assume that is because the compilers are different. If I had no idea where in the code this was happening, I could add line tables to the executables and just inspect the assembly with objdump -drl. The addresses will appear alongside functions and line numbers, even with heavy inlining, LTO and GC-sections.
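For reference, the host-side demangling step is a single libstdc++ call (a minimal sketch, without the symbol lookup itself):

#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Returns the demangled name, or the raw name if demangling fails.
static std::string demangle(const char* name)
{
    int status = 0;
    char* result = abi::__cxa_demangle(name, nullptr, nullptr, &status);
    std::string out = (status == 0 && result) ? result : name;
    std::free(result); // __cxa_demangle allocates with malloc
    return out;
}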
Measuring performance in production
Any time I implement a system call for some logic, I wonder whether I'm actually optimizing the code, and whether it's worth adding another system call at all. I realized that it would be fairly trivial to provide a way to benchmark a function inside the VM as a system call. Let's call it measure. I could not 100% guarantee that it would have no side effects on the scripting environment itself, so the system call simply makes a new temporary machine to run the measurements on, while using the same binary. As an example, I measured the sinf() math function against sin via a system call:
>>> Measurement "x86 native sin()" average: 29 nanos
>>> Measurement "RISC-V emulated sin()" average: 639 nanos
As we can see the native handling is 22x faster than the emulated one. Seems worth it!
The measure function takes a test name and a function address, and returns the number of nanoseconds spent on average over 2000 samples. With this we can test for pessimizations during the self-testing phase at startup. We can also use it to validate accelerating other tasks through system calls. The test function accepted by measure can be any function, because it has to be able to tie its output to something so that the compiler doesn't optimize the work away. And since the machine is thrown away after use, it doesn't matter if we clobber some registers.
template <typename T>
void measure(const char* testname, T testfunc)
We can make sure that T is a function pointer by static-casting it to another function pointer. This allows us to measure any normal function and capture-less lambdas:
static_cast<void(*)()>(testfunc)
If you want to do these kinds of measurements yourself, be careful to use clock_gettime with CLOCK_PROCESS_CPUTIME_ID as the clock id, so that time spent while scheduled out is not counted. Weirdly, I measured empty function calls to be 9 ns in practice, which is faster than in my other benchmarks. On a related note, one drawback of the measure function is that it includes the function call overhead.
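A minimal sketch of such a timing loop, assuming 2000 samples as described (the real measure system call additionally runs this inside a throwaway machine):

#include <ctime>
#include <cstdint>

template <typename F>
long nanos_per_call(F&& testfunc, int samples = 2000)
{
    timespec t0 {}, t1 {};
    // CPU time only, so time spent scheduled out is not counted
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    for (int i = 0; i < samples; i++) testfunc();
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
    const int64_t ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                     + (t1.tv_nsec - t0.tv_nsec);
    return ns / samples;
}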
Better Assertions, possibly
Asserts are important in all C and C++ code bases, regardless of exception rules. One thing that annoyed me was that the asserts didn't go away even when the compiler could see they would never fire. They also don't show the values used in the expression, which is probably possible to do with some intimidatingly complex code. I wanted to see if I could use a tiny implementation to solve some of my problems.
template <typename Expr, typename T>
constexpr inline void check_assertion(
    Expr expr, const T& left, const T& right, const char* exprstring,
    const char* file, const char* func, const int line)
{
    if (UNLIKELY(!expr())) {
        print("assertion (", left, ") ", exprstring, " (", right,
            ") failed: \"", file, "\", line ", line,
            func ? ", function: " : "", func ? func : "", '\n');
        syscall(SYSCALL_BACKTRACE);
        _exit(-1);
    }
}

#define ASSERT(l, op, r) \
    check_assertion([&] { return (l op r); }, l, r, \
        #l " " #op " " #r, \
        __FILE__, __FUNCTION__, __LINE__)
Alright, so it’s not pretty. I have considered replacing ASSERT with something like ASSERT_EQUAL, ASSERT_LESS and so on, but this also works. And you have to use it like this:
const int test = 1;
ASSERT(test, ==, 0);
Maybe there is a way to overload macros to at least add support for asserting on a single value. But it should be immensely helpful in the long run:
assertion (1) test == 0 (0) failed: "/path/to/file", line 44, function: start
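As for that single-value variant, the classic way to overload a macro on argument count would look roughly like this (a sketch; the picker macro is my own invention, and ASSERT2 is intentionally left undefined so that two-argument uses fail to compile):

#define ASSERT_PICK(_1, _2, _3, NAME, ...) NAME
// Single-value variant reuses check_assertion by comparing the value to itself
#define ASSERT1(x) \
    check_assertion([&] { return static_cast<bool>(x); }, x, x, #x, \
        __FILE__, __FUNCTION__, __LINE__)
#define ASSERT3(l, op, r) \
    check_assertion([&] { return (l op r); }, l, r, \
        #l " " #op " " #r, __FILE__, __FUNCTION__, __LINE__)
// Selects ASSERT1 or ASSERT3 based on the number of arguments
#define ASSERT(...) ASSERT_PICK(__VA_ARGS__, ASSERT3, ASSERT2, ASSERT1)(__VA_ARGS__)

ASSERT(test);        // expands to ASSERT1
ASSERT(test, ==, 0); // expands to ASSERT3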
The magic that makes the whole assert go away, when the compiler can see that it can never fire, is the lambda that captures the expression. By deferring the evaluation of the expression, the compiler can remove the entire function call. The solution came from this gentleman: https://foonathan.net/2016/09/assertions/
Text formatting library
I ended up choosing strf as a text formatting library. My binary sizes increased a tiny bit, but the performance gain was immediately apparent when I compared run-times with my old logs. One startup function became 20% faster just from that change alone, compared to using a tiny C printf library.
Importantly, it has floating-point support, and since it doesn’t use C variadics it will actually print 32-bit floats as-is!
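For reference, usage is roughly like this (going from strf's documented basic API; verify against the version you use):

#include <strf.hpp>

void example(float x)
{
    // No C variadic promotion: x stays a 32-bit float all the way through
    auto str = strf::to_string("x = ", x);
    dm::print(str.c_str(), "\n");
}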
Closing words
So far, overall, I am incredibly happy with C++ as a scripting language. All the hard foundational problems are solved and I have now what feels like a reliable sandbox to run part of the game logic inside of. I just need to continue porting the old API and be on the lookout for other ways to simplify things.
One of the big missing features compared to Lua is the lack of dynamic loading. In Lua you can just include other files and that's that. Here there are several ways to go about it, but none are trivial: implementing a partial dynamic loader, or simply loading another binary that is statically linked to another part of memory, and so on. The second option is a potential candidate because almost everything happens via system calls, which doesn't require sharing anything in .data or .bss. I would much rather have some kind of enforced separation, though. So, I am investigating what it would take to have two separate machines that operate on the same world. It is probably the best option, and it opens the door for even more address space separation in the future as things start to mature again. For example, since machines can be serialized, savegame functionality could simply be a machine.
I also have some ideas regarding communication between machines. Each machine will get a machine id, and then, using a C variadic function call, the engine will copy the registers of one machine into another, look up the address of the given function, and vmcall it. This allows passing integral and floating-point values between machines, as long as nothing refers to the stack, heap or static areas. However, and this is where it gets interesting: we can, in theory, share pages between machines, which means that we could have a shared scratch area for these remote function calls. Suddenly you have a fairly robust C++ function call API for communication between machines. Imagine farcall(savegame_machine, &myscratchdata).
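A sketch of what the guest-side wrapper could look like (entirely hypothetical at this point; neither farcall nor ECALL_FARCALL exists yet):

template <typename... Args>
inline long farcall(uint32_t machine_id, const char* function, Args... args)
{
    // The engine looks up the function in the remote machine, copies our
    // argument registers over, and vmcalls it. Nothing in args may point
    // into our own stack, heap or static areas; only into the shared
    // scratch area.
    return apicall(ECALL_FARCALL, machine_id, function, args...);
}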
That said, it's dangerous to take a function argument and translate it just because it looks like a scratch area address; the match could be accidental. But we can solve that by explicitly using a helper function for each argument. We'll see.
-gonzo