Prepared VM function calls

fwsGonzo
4 min read · Jan 4, 2023

Preparing function arguments for script function calls

In a previous post I wrote about how I used lambda folds to get exceptionally good codegen for VM function calls. The parameter pack method I have been using for years now has lowered the overhead of making calls into the VM to almost nothing.

A function call with 8 arguments is often 26x faster than an equivalent call in Lua. The test looks like this:

    int ret = machine->vmcall("test_args",
        "This is a string", test,
        333, 444, 555, 666, 777, 888);
    if (ret != 666) abort();

The function name is “test_args”, and it takes 8 arguments, two of which require storing a string and a struct (test) on the stack.
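To make the register/stack split concrete, here is a minimal sketch of how a parameter pack can be folded over so that integers land in argument registers while strings and structs are copied to the guest stack and passed by address. ToyMachine, ToyStruct, setup_call and the push helpers are hypothetical names invented for this illustration; this is not the libriscv API, and the toy does no bounds checking.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical toy guest: 8 integer argument registers and a stack
// that grows downwards from the top of a small memory area.
struct ToyMachine {
    std::array<uint64_t, 8> arg_regs{};
    std::vector<uint8_t> memory = std::vector<uint8_t>(4096);
    uint64_t sp = 4096;

    // Copy raw bytes to the stack (8-byte aligned), return their address.
    uint64_t push_bytes(const void* data, size_t len) {
        sp = (sp - len) & ~uint64_t(7);
        std::memcpy(&memory[sp], data, len);
        return sp;
    }
    // Copy a string plus a zero terminator to the stack.
    uint64_t push_string(std::string_view sv) {
        sp = (sp - (sv.size() + 1)) & ~uint64_t(7);
        std::memcpy(&memory[sp], sv.data(), sv.size());
        memory[sp + sv.size()] = 0;
        return sp;
    }
};

struct ToyStruct { int a; float b; };

template <typename... Args>
void setup_call(ToyMachine& m, const Args&... args) {
    static_assert(sizeof...(Args) <= 8, "the toy has only 8 argument registers");
    unsigned iarg = 0;
    // The fold invokes one lambda per argument, left to right.
    ([&] {
        using T = std::decay_t<Args>;
        if constexpr (std::is_integral_v<T>) {
            // Integers go straight into the next register
            m.arg_regs[iarg++] = static_cast<uint64_t>(args);
        } else if constexpr (std::is_convertible_v<const Args&, std::string_view>) {
            // Strings live on the stack, the register holds the address
            m.arg_regs[iarg++] = m.push_string(args);
        } else {
            static_assert(std::is_trivially_copyable_v<T>, "structs are copied bytewise");
            m.arg_regs[iarg++] = m.push_bytes(&args, sizeof(T));
        }
    }(), ...);
}
```

Calling setup_call(m, "This is a string", some_struct, 333, 444, ...) on this toy leaves two stack addresses in the first two registers and the plain integers in the rest. Because every branch is chosen at compile time, the compiler is free to reduce the integer cases to simple stores.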

libriscv: many arguments    median 17ns     lowest: 16ns     highest: 32ns
lua5.3: many arguments      median 452ns    lowest: 437ns    highest: 504ns
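For readers who want to gather numbers in the same median/lowest/highest shape, the following is one possible harness. It is an assumption about how such figures could be collected, not the benchmark code behind the results above; Summary, summarize and measure are hypothetical names, and the round counts are arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <utility>
#include <vector>

struct Summary { int64_t median, lowest, highest; };

// Sort the per-round averages and pick out the three reported values.
Summary summarize(std::vector<int64_t> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return { samples[samples.size() / 2], samples.front(), samples.back() };
}

// Time `callable` in several rounds. Each sample is the average
// nanoseconds per call within one round, which smooths out the
// resolution limits of the clock for very short calls.
template <typename F>
Summary measure(F&& callable, int rounds = 200, int calls_per_round = 1000) {
    using clock = std::chrono::steady_clock;
    std::vector<int64_t> samples;
    samples.reserve(rounds);
    for (int r = 0; r < rounds; r++) {
        const auto t0 = clock::now();
        for (int i = 0; i < calls_per_round; i++)
            callable();
        const auto t1 = clock::now();
        const auto ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        samples.push_back(ns / calls_per_round);
    }
    return summarize(std::move(samples));
}
```

Averaging over many calls per round matters here because a single call that takes a few nanoseconds is shorter than what most clocks can resolve.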

I started thinking about how I could potentially lower this even more if the arguments I wanted to use were complex and known beforehand. And it occurred to me that I could just steal some of the stack for the call, and never give it back. The stack pointer used for a VM function call is flexible, and I can change it. So, I started out with a class that stores all the necessary bits in itself (such as registers), but crucially pushes all stack arguments to the real stack. After this, the stack gets permanently moved a little bit further down.
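The stack-stealing idea can be sketched with a toy model: push the stack data once, lower the point where every later call starts its stack, and keep the register values in the object. Everything here (ToyMachine, ToyPreparedCall, stack_base, setup) is a hypothetical stand-in for illustration and not how libriscv is actually implemented; the toy also fixes the argument list to one string and two integers to stay short.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical toy guest with a movable stack base.
struct ToyMachine {
    std::array<uint64_t, 8> arg_regs{};
    std::vector<uint8_t> memory = std::vector<uint8_t>(4096);
    uint64_t stack_base = 4096; // where every new call starts its stack
    uint64_t sp = 4096;
};

class ToyPreparedCall {
public:
    // One-time work: copy the string to the stack, then move the
    // stack base below it so later calls never overwrite the data.
    void prepare(ToyMachine& m, std::string_view text, uint64_t a1, uint64_t a2) {
        const uint64_t addr = (m.stack_base - (text.size() + 1)) & ~uint64_t(7);
        std::memcpy(&m.memory[addr], text.data(), text.size());
        m.memory[addr + text.size()] = 0;
        m.stack_base = addr; // stolen for good, never given back
        m_regs[0] = addr;
        m_regs[1] = a1;
        m_regs[2] = a2;
        m_count = 3;
        m_machine = &m;
    }
    bool is_prepared() const { return m_machine != nullptr; }

    // Per-call work: reset the stack pointer and copy saved registers.
    // No string or struct copying happens here anymore.
    void setup() const {
        m_machine->sp = m_machine->stack_base;
        for (unsigned i = 0; i < m_count; i++)
            m_machine->arg_regs[i] = m_regs[i];
    }

private:
    ToyMachine* m_machine = nullptr;
    std::array<uint64_t, 8> m_regs{};
    unsigned m_count = 0;
};
```

The trade-off is visible in prepare(): each prepared call permanently shrinks the stack available to the guest, so this only makes sense for a bounded number of prepared calls.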

Using the class was fairly straightforward. First, create a local static variable, then check whether it has been prepared yet. Of course this can be improved by preparing it elsewhere and never checking, but for now we just want to see if it improves over the old way of making VM function calls:

    static riscv::PreparedCall<ARCH> prepper;
    if (!prepper.is_prepared()) {
        prepper.prepare(machine, "test_args",
            "This is a string", test,
            333, 444, 555, 666, 777, 888);
    }
    int ret = prepper.vmcall();

This turned out to be much faster, by more than 3x on the median (17ns down to 5ns):

libriscv: many arguments    median 17ns    lowest: 16ns    highest: 32ns
libriscv: prepared args     median 5ns     lowest: 5ns     highest: 5ns

Importantly, the performance is also more stable, as it is doing far fewer memory operations.

Now, while this class is fairly nice, it has to store all 8 integer argument registers and all 8 floating-point argument registers no matter what, plus perhaps a counter for how many register arguments are actually in use, as a small improvement. We also lose that godlike codegen that produced perfect call setup code, because the actual call no longer goes through a parameter pack. Indeed, the vmcall() function in the PreparedCall class takes only one argument: the maximum instruction count for the call (a timeout).

Improving codegen again

By making a regular vmcall with 8 integer arguments I could measure that it took 3ns, compared to the new method which needed 5ns. It’s clear that there is some room for improvement.

So, how can we get back our parameter pack call? Well, it turns out that you can forward parameter packs to a lambda. So, what if I replace the whole class with just a std::function<int(uint64_t)>? Then I could capture everything by value with [=] and let the compiler decide what is important.

Pseudo-code:

    template <typename... Args>
    void PreparedCall::prepare(address_t call_addr, Args&&... args)
    {
        std::array<address_t, 8> gpr;
        unsigned iarg = 0;
        ([&] {
            // Push strings and structs to stack
            // Store register values in gpr[] array
        }(), ...);

        this->m_func =
        [=] (uint64_t imax) mutable -> saddr_t
        {
            ([&] {
                // Actually set registers here, using gpr[] array
                // Crucially, still using the parameter pack here
                // for better codegen
            }(), ...);
            machine.simulate(imax);
            // Hand the guest function's result (register A0) back to the caller
            return machine.return_value<saddr_t>();
        };
    }

So, the prepare function now copies all stack-pushed arguments (like strings and structs) to the stack, but does nothing for all other arguments. Instead, we forward the arguments to our std::function, and just let the codegen take over and pray that plain integers get turned into store-immediate etc.
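The capture-the-pack technique can be tried in isolation. The sketch below is a compilable toy, assuming integer-only arguments and a made-up ToyMachine whose simulate() stands in for running guest code by returning the fourth argument register; none of these names come from libriscv.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>
#include <type_traits>

// Hypothetical toy guest with 8 integer argument registers.
struct ToyMachine {
    std::array<uint64_t, 8> arg_regs{};
    // Stand-in for executing guest code: this pretend guest
    // function simply returns its 4th argument.
    int64_t simulate(uint64_t /*imax*/) {
        return static_cast<int64_t>(arg_regs[3]);
    }
};

class ToyPreparedCall {
public:
    template <typename... Args>
    void prepare(ToyMachine& machine, Args... args) {
        static_assert((std::is_integral_v<Args> && ...),
                      "this sketch only handles integer arguments");
        static_assert(sizeof...(Args) <= 8, "only 8 argument registers");
        // [=] copies the whole parameter pack into the closure.
        // The machine is captured by reference, so it must outlive
        // this prepared call.
        m_func = [=, &machine](uint64_t imax) -> int64_t {
            unsigned iarg = 0;
            // The fold still sees each argument with its concrete type,
            // so the compiler can treat the values as plain constants
            // stored in the closure.
            ((machine.arg_regs[iarg++] = static_cast<uint64_t>(args)), ...);
            return machine.simulate(imax);
        };
    }

    int64_t vmcall(uint64_t imax = 1'000'000) { return m_func(imax); }

private:
    std::function<int64_t(uint64_t)> m_func;
};
```

The std::function erases the pack from the class's type, which is what lets vmcall() keep its single-argument signature while the lambda body is still instantiated once per distinct argument list.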

This turned out to be extremely good:

libriscv: many arguments    median 17ns     lowest: 16ns     highest: 32ns
libriscv: prepared args     median 3ns      lowest: 3ns      highest: 5ns
lua5.3: many arguments      median 452ns    lowest: 437ns    highest: 504ns

Indeed, we are now 150x faster than Lua at making this 8-argument function call into a VM.

So, how useful are prepared calls? Well, I am working on a plugin for the Godot engine right now where this idea came up, so I am hopeful that I will be able to use it in some capacity soon. Also, it's always possible to share memory with the VM in order to avoid stack pushes, with the added benefit of having up-to-date data at all times. It does, however, carry the risk of trampling memory used by the engine.

In any case, it’s very cool to me that it is possible to do these kinds of things with standard C++, and it continues to draw me to C++. Thanks for reading!

-gonzo
