Passing dynamic arguments across the virtual machine boundary
Hello again. I had already written part 6, but it turned out so well that I am turning it into a research paper instead, and it will be re-posted afterwards. So, instead I will tackle the relatively boring task of type-checked system calls. Many people (really just me) want to use systems languages as scripting languages in their engines; however, the many attempts and variations that exist all involve deducing the types that cross the boundary.
For example, I personally like the simple ABI where neither side knows about the other, but a generated API ensures the function definitions and the raw assembly match my types in the engine. It’s not checked all the way, in that I can’t currently match the types on the engine side, but it looks something like this:
struct Timer {
using Callback = Function<void(Timer)>;
static Timer oneshot(float time, Callback);
...
The public API.
using timer_callback = void (*) (int, void*);
inline Timer timer_periodic(float time, float period, timer_callback callback, void* data, size_t size)
{
return {sys_timer_periodic(time, period, callback, data, size)};
}
inline Timer timer_oneshot(float time, timer_callback callback, void* data, size_t size)
{
return timer_periodic(time, 0.0f, callback, data, size);
}
inline Timer Timer::oneshot(float time, Function<void(Timer)> callback)
{
return timer_oneshot(time,
[] (int id, void* data) {
(*(decltype(&callback)) data) ({id});
}, &callback, sizeof(callback));
}
The implementation is a little bit involved for just an example, but basically I am passing capture storage to the engine so that when it calls back into the C++ script, we can have stateful lambdas as timer callbacks. I’ve gone over this before. Eventually we call sys_timer_periodic, which is a system call generated from the engine’s system API:
__asm__("\n\
.global sys_timer_periodic\n\
.func sys_timer_periodic\n\
sys_timer_periodic:\n\
li t0, 0x811acfe9\n\
la t1, sys_timer_periodic_str\n\
li a7, 504\n\
ecall\n\
ret\n\
.endfunc\n\
.pushsection .rodata\n\
sys_timer_periodic_str:\n\
.asciz \"Timer::periodic\"\n\
.popsection\n\
");
Not easy to read, but it puts a hash in t0 and, as a fallback, the address of a string in t1 that helps the engine identify the call; the rest are regular system call arguments. On RISC-V, regular function arguments line up with system call arguments, so there is no shuffling registers around, just straight into the system call. Very neat.
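I won’t show the engine’s dispatcher for system call 504 here, but the idea is roughly the following. This is a hypothetical sketch: the handlers map, the hash_of function and the raw register indices are placeholders, not the engine’s actual API (t0 is x5 and t1 is x6 in the RISC-V ABI).
// Hypothetical dispatch for the named system call: try the hash in t0,
// fall back to hashing the zero-terminated name pointed to by t1.
void handle_dyncall(Script& script)
{
	auto& machine = script.machine();
	const uint32_t hash = machine.cpu.reg(5);  // t0: precomputed hash
	auto it = handlers.find(hash);             // handlers: hash -> handler
	if (it == handlers.end()) {
		const auto name = machine.memory.memstring(machine.cpu.reg(6)); // t1
		it = handlers.find(hash_of(name));     // fallback: hash the string
	}
	if (it == handlers.end())
		throw std::runtime_error("Unknown dynamic call");
	it->second(script);
}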
Now, on the engine’s side we read out the arguments like so:
Script::set_dynamic_calls({
{"Timer::stop", [] (Script& script) {
// Stop timer
const auto [timer_id] = script.machine().sysargs<int> ();
timers.stop(timer_id);
}},
{"Timer::periodic", [] (Script& script) {
// Periodic timer
auto& machine = script.machine();
const auto [time, peri, addr, data, size] =
machine.sysargs<float, float, gaddr_t, gaddr_t, gaddr_t> ();
std::array<uint8_t, 32> capture;
if (UNLIKELY(size > sizeof(capture))) {
throw std::runtime_error("Timer data must be 32-bytes or less");
}
machine.memory.memcpy_out(capture.data(), data, size);
... call into engine timer API here ...
machine.set_result(id);
}},
});
Using a list of templated types, we can extract the system call arguments, including strings, (common forms of) vectors and such. Very handy, but still disconnected from what the script might be passing.
Perhaps we can do a little bit better? After all, we have had some insanely well-formed assembly generated by constexpr machinery before. It would be sad if we had to loop over arguments or pass them as strings. You know, we are in 2022 now.
Fully dynamic system call arguments
I started out with my end goal, and this is what I want to do in my script:
print("** Fully dynamic system calls **\n");
dynamic_call("my_dynamic_call", 1234, 5678.0, "nine-ten-eleven-twelve!");
It’s easy to create an alias so that my_dynamic_call can be a regular function (a minimal sketch follows the list below). The caveat is that we only take specific supported arguments:
- Integers and floating-point values
- Zero-terminated strings (including cvref std::string)
This is all I need for now.
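The alias itself can be a thin forwarding wrapper; a minimal sketch (my_dynamic_call is just the example name used in the script above):
#include <utility>

// Sketch: expose one dynamic call under a regular function name.
template <typename... Args>
inline void my_dynamic_call(Args&&... args)
{
	dynamic_call("my_dynamic_call", std::forward<Args>(args)...);
}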
The reason why we choose zero-terminated strings is to increase the chance that the compiler will reduce our monstrosity down to very simple instructions, and not a ton of looping, counting and shifting.
But what is the plan here? Well, I am hoping to use a parameter pack with a lambda fold, loop over the types at compile time, put the values into arrays (in register-consumption form), and then use custom RISC-V instructions to inform the engine of the type and value of each argument. Indeed, we are going to emit separate inline assembly for each argument, so that we don’t break any rules.
For example, a 64-bit integer fits nicely in a regular integer register, so to store it for later we just put the whole value in our temporary register array and move on. For a string we put the pointer to the string in the register array, and again move on. The simplicity of what we’re doing might allow the compiler to see through this mess.
This is what the monstrosity looks like:
template <typename... Args> inline constexpr
void dynamic_call(const std::string& name, Args&&... args)
{
[[maybe_unused]] unsigned argc = 0;
std::array<uint8_t, 8> type {};
std::array<uintptr_t, 8> gpr {};
std::array<float, 8> fpr {};
([&] {
if constexpr (std::is_integral_v<std::remove_reference_t<Args>>) {
gpr[argc] = args;
type[argc] = 0b001;
argc++;
}
else if constexpr (std::is_floating_point_v<std::remove_reference_t<Args>>) {
fpr[argc] = args;
type[argc] = 0b010;
argc++;
}
else if constexpr (is_stdstring<std::remove_cvref_t<Args>>::value)
{
gpr[argc] = (uintptr_t)args.data();
type[argc] = 0b111;
argc++;
}
else if constexpr (is_string<Args>::value)
{
gpr[argc] = (uintptr_t)const_cast<const char *>(args);
type[argc] = 0b111;
argc++;
}
}(), ...);
register long a0 asm("a0");
register float fa0 asm("fa0");
register long t0 asm("t0");
register long syscall_id asm("a7") = ECALL_DYNCALL2;
for (unsigned i = 0; i < argc; i++)
{
t0 = i;
if (type[i] == 0b001) {
a0 = gpr[i];
asm(".word 0b001000000001011" : : "r"(t0), "r"(a0));
} else if (type[i] == 0b010) {
fa0 = fpr[i];
asm(".word 0b010000000001011" : : "r"(t0), "f"(fa0));
} else if (type[i] == 0b111) {
a0 = gpr[i];
asm(".word 0b111000000001011" : : "r"(t0), "r"(a0));
}
}
register const char * name_ptr asm("a0") = name.data();
register const size_t name_len asm("a1") = name.size();
asm("ecall" : : "m"(*name_ptr), "r"(name_ptr), "r"(name_len), "r"(syscall_id));
}
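For completeness: ECALL_DYNCALL2 is the engine’s system call number for this mechanism (505 in the disassembly below), and the is_stdstring / is_string helpers are not shown above. A minimal sketch of what the two traits could look like:
#include <string>
#include <type_traits>

// The caller already applies std::remove_cvref_t, so a plain comparison suffices.
template <typename T>
struct is_stdstring : std::is_same<T, std::string> {};

// True for char pointers and char arrays (e.g. string literals).
template <typename T>
struct is_string
	: std::is_same<std::remove_cv_t<std::remove_pointer_t<std::decay_t<T>>>, char> {};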
The goal is to keep it simple, do lots of obvious things, and hope the compiler sees what I’m doing. We make three arrays and, depending on the type we are currently folding over, we put the value in the correct array, record the type, and move on to the next argument. Then we emit, for each argument, the custom instruction whose data bits correspond to the type, with the value placed in a0 or fa0 depending on that type. At the very end, we perform the system call with the name of this dynamic call as the argument.
Looking at the assembly, it looks quite magical:
403dcc: 00000293 li t0,0
403dd0: 4d200513 li a0,1234
403dd4: 0000100b .4byte 0x100b
403dd8: 1a81a507 flw fa0,424(gp) # 41f660 <__SDATA_BEGIN__+0xa0>
403ddc: 00100293 li t0,1
403de0: 0000200b .4byte 0x200b
403de4: 00200293 li t0,2
403de8: 00078513 mv a0,a5
403dec: 0000700b .4byte 0x700b
403df0: 03013583 ld a1,48(sp)
403df4: 02813503 ld a0,40(sp)
403df8: 00000073 ecall
While I’m sure it can be done better now that the idea is there, it is quite strange to see how insanely few instructions the compiler needed to fully make a dynamic call with 4 arguments: the name of the call, the integer, the float and the string. The instructions that stand out here are the loads of the temporary register I am using to verify which parameter index I am configuring, and they are not necessary.
Without the temporary register as a sanity check:
403dc4: 00813783 ld a5,8(sp)
403dc8: 1f900893 li a7,505
403dcc: 4d200513 li a0,1234
403dd0: 0000100b .4byte 0x100b
403dd4: 1a81a507 flw fa0,424(gp) # 41f660 <__SDATA_BEGIN__+0xa0>
403dd8: 0000200b .4byte 0x200b
403ddc: 00078513 mv a0,a5
403de0: 0000700b .4byte 0x700b
403de4: 03013583 ld a1,48(sp)
403de8: 02813503 ld a0,40(sp)
403dec: 00000073 ecall
It can be performed in 11 instructions, which is nice. It may be possible to make it even smaller by using more argument registers. Regardless, I am beyond happy. We avoided looping over the arguments and branching on the custom instructions to emit.
Custom instructions
Custom instructions should be straightforward to add to an emulator, as a fallback for when an instruction is not understood during decoding.
According to the RISC-V opcode map there are 4 major opcodes we can use for our own needs. I am using the 0b0001011 opcode for indicating arguments. Opcodes are 7 bits and instructions are 32-bit, so there are 25 bits left in the instruction for customization.
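For reference, the fields we care about sit at fixed positions in the 32-bit word; a small sketch of extracting them by hand (the emulator’s rv32i_instruction union below does this for us):
#include <cstdint>

// I-type layout: [6:0] opcode, [11:7] rd, [14:12] funct3, [19:15] rs1, [31:20] imm
static inline uint32_t opcode_of(uint32_t w) { return w & 0x7F; }
static inline uint32_t funct3_of(uint32_t w) { return (w >> 12) & 0x7; }
static inline int32_t  imm12_of(uint32_t w)  { return (int32_t)w >> 20; } // sign-extended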
static const Instruction<MARCH> custom_instruction_handler
{
[] (CPU<MARCH>& cpu, rv32i_instruction instr) {
auto& scr = script(cpu.machine());
// Select type and retrieve value from argument registers
switch (instr.Itype.funct3)
{
case 0b001: // 64-bit signed integer
scr.dynargs().push_back(
(int64_t)cpu.reg(riscv::REG_ARG0));
break;
case 0b010: // 32-bit floating point
scr.dynargs().push_back(
cpu.registers().getfl(riscv::REG_FA0).f32[0]);
break;
case 0b111: // std::string
scr.dynargs().push_back(
cpu.machine().memory.memstring(cpu.reg(riscv::REG_ARG0)));
break;
default:
throw "Implement me";
}
},
[] (char* buffer, size_t len, auto&, rv32i_instruction instr) {
return snprintf(buffer, len, "CUSTOM: 4-byte 0x%X (0x%X)",
instr.opcode(), instr.whole);
}
};
We add each argument as a std::any, which is probably not the fastest choice, but it solves the problem neatly. We also get the argument count from the vector size.
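The dynargs() storage can be as simple as a vector of std::any hanging off the Script object; a possible shape (the member names are mine):
#include <any>
#include <vector>

class Script {
public:
	// Arguments accumulated by the custom instruction handler,
	// consumed by the dynamic call handler.
	std::vector<std::any>& dynargs() { return m_dynargs; }
	// ...
private:
	std::vector<std::any> m_dynargs;
};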
std::any
std::any is a weird one. Perhaps the most under-utilized type in all of C++, but it still has its uses. I actually tried to use std::variant here, but the std::any API just ended up being friendlier. I don’t know if the performance is the greatest, nor is it that important, as we can switch implementations at will.
Using it was easy:
fmt::print("Dynamic call: {}\n", name.to_string());
auto& args = script(machine).dynargs();
for (size_t i = 0; i < args.size(); i++)
{
if (args[i].type() == typeid(std::string))
{
fmt::print("Argument {} is a string: {}\n",
i, std::any_cast<std::string> (args[i]));
}
else if (args[i].type() == typeid(int64_t))
{
fmt::print("Argument {} is a 64-bit int: {}\n",
i, std::any_cast<int64_t> (args[i]));
}
else if (args[i].type() == typeid(float))
{
fmt::print("Argument {} is a 32-bit float: {}\n",
i, std::any_cast<float> (args[i]));
}
else if (args[i].type() == typeid(double))
{
fmt::print("Argument {} is a 64-bit float: {}\n",
i, std::any_cast<double> (args[i]));
}
else
{
fmt::print("Argument {} is unknown type: {}\n",
i, args[i].type().name());
}
}
Which resulted in the output:
Dynamic call: my_dynamic_call
Argument 0 is a 64-bit int: 1234
Argument 1 is a 32-bit float: 5678
Argument 2 is a string: nine-ten-eleven-twelve!
The performance of std::any seems to be completely fine. Most of the time is in the setup to the call, and not the call itself:
> median 19ns lowest: 16ns highest: 19ns
>>> Measurement "Benchmark overhead" median: 19 nanos
> median 85ns lowest: 82ns highest: 89ns
>>> Measurement "Dynamic call (no arguments)" median: 85 nanos
> median 109ns lowest: 107ns highest: 123ns
>>> Measurement "Dynamic call (4x arguments)" median: 109 nanos
I swapped out std::any for my own union, and I got basically the same numbers. The difference was small enough that I am choosing to keep the simpler API, which is std::any by a mile. So, the overhead of the dynamic call mechanism is around 85 - 19 = 66 ns. It can be massively improved by passing a hash directly instead of the function name as a character array that gets hashed on system call entry.
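The EDIT below does exactly that. As an illustration (the engine’s actual hash function isn’t shown here; only that it is a 32-bit value, like the 0x811acfe9 earlier), a constexpr FNV-1a would let the script compute the hash at compile time:
#include <cstdint>

// Example only: constexpr FNV-1a over the call name, folded at compile time.
constexpr uint32_t fnv1a(const char* str, uint32_t hash = 0x811C9DC5u)
{
	return (*str == '\0')
		? hash
		: fnv1a(str + 1, (hash ^ (uint8_t)*str) * 0x01000193u);
}

constexpr uint32_t my_call_hash = fnv1a("my_dynamic_call");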
EDIT: By doing the hashing at constexpr time in the script programs, we can just pass the hash as a register value. Further, we can encode small integers into our custom instruction directly, leading to gains. And finally, I lowered the call overhead significantly:
> median 4ns lowest: 4ns highest: 4ns
>>> Measurement "Benchmark overhead" median: 4 nanos
> median 16ns lowest: 15ns highest: 16ns
>>> Measurement "Dynamic call (no arguments)" median: 16 nanos
> median 43ns lowest: 43ns highest: 48ns
>>> Measurement "Dynamic call (4x arguments)" median: 43 nanos
The new time ends up at 16 - 4 = 12 ns, which is faster than Lua (29 ns) and LuaJIT (14 ns) for no-argument calls. Both Lua and LuaJIT have a massive overhead when there are arguments, but I will perhaps write about that another time. Encoding the integer as a 12-bit immediate can be simplified using assembler directives:
// A 12-bit signed immediate covers -2048..2047
if ((int64_t)gpr[i] >= -2048 && (int64_t)gpr[i] < 2048) {
	asm(".insn i 0b0001011, 0, x0, x0, %0" :: "I"(gpr[i]));
}
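On the engine side, the matching handler then needs one more case that reads the value out of the instruction word itself rather than a register; a sketch, slotted into the switch shown earlier (instr.whole is the raw 32-bit word):
case 0b000: // small integer encoded directly in the 12-bit immediate
	scr.dynargs().push_back(
		(int64_t)((int32_t)instr.whole >> 20)); // bits [31:20], sign-extended
	break;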
The compiler still generates tiny sequences, and now the performance is quite good! Might as well use this method exclusively!
-gonzo