Improving performance and using it in a demanding game
I’m currently using my RISC-V emulator as a very efficient scripting system in a very demanding game I am working on. It’s a story too long to begin telling here, but I’ll go over some things I’ve done recently.
Very briefly, when a client connects to the game server, the server will send script programs (ELF binaries) and assets compressed over the wire. The client will then be in sync with the server, and other players. These binaries are heavily instrumented. Every memcpy, malloc/free, new/delete, strlen — and so on, is an efficient system call invoked through LTO as inlined assembly that again executes native code, which is native performance. The script programs are AOT-compiled C++20 w/exceptions and RTTI for RV64gb (I disabled compressed instructions).
Zero-copy lookups
I’ve made an API function call convention that I just nickname dynamic calls. These are based on a JSON document of function definitions, which are parsed at build-time and produces inline assembly functions in the guest programs, and which ensures the host and guest speak the same language. I’ve written about it before. As an example, the function Timer::periodic() will CRC32-C checksum to 0x811acfe9, which is used in-place of a system call number (< 600). CRC32 with the Castagnoli polynomial has dedicated instructions on modern processors. Hash collisions are checked and avoided at build time.
As an example, I measured finding an item from the script by providing a string name, which is used as a key in a database lookup inside the engine takes around 20ns from start to end:
> lowest 24ns median: 24ns highest: 27ns
[std] Measurement "Block::find" median: 24
> lowest 4ns median: 4ns highest: 6ns
[std] Measurement "Overhead" median: 4
It used to be much higher because std::string maps and unordered_maps in C++ used to require lookup only through a string, and that is not something I can pass from the script to the host without actually creating a new temporary string. However, since C++14 you can add support for std::string_view, and it allows the script to make API function calls where any of the arguments is a zero-copy string_view in the host. Using that we can look things up using standard C++:
struct string_hash
{
using hash_type = std::hash<std::string_view>;
using is_transparent = void;
std::size_t operator()(const char* str) const { return hash_type{}(str); }
std::size_t operator()(std::string_view str) const { return hash_type{}(str); }
std::size_t operator()(std::string const& str) const { return hash_type{}(str); }
};
std::unordered_map<std::string, T, string_hash, std::equal_to<>> my_map;
With the above code we can use my_map.find(string_view)!
Script::set_dynamic_call("Blocks::find",
[] (Script& script)
{
auto [name] = script.machine().sysargs<std::string_view>();
script.machine().set_result(db::getb(name));
});
The read-write arena
Inspired by the simplicity of WASM, I’ve experimented with a linear memory area that is always read-write. However, I want all existing features, so I made the arena work with everything else, and it ended with 1 comparison per memory operation. It is a well known idiom, and it goes like this:
// a <= x < b
(unsigned)x - a < (unsigned)b - a
This allows me to have my cake and eat it too.
I can continue having exceptions on null (0x0) reads and writes, page protections for the ELF binary is still in place, and so on. This is because loadable segments in ELF binaries always start with a read-only section, followed by a read-write section. My experience with this feature is that ~100% of all memory operations fall within the single comparison. Neat!
Anything within the arena is fully readable and writable by everyone, and that also includes forked machines. However, anything outside of the arena works just like normal. This added a bit of a problem with scripts that were used by many threads at the same time.
Calling into the script billions of times per second
If you have 16 threads working on producing really expensive terrain, and some parts of the terrain have properties that are dynamic in that the decision is made by a callback into a script program, then the first thing you need is just as many emulators (or vCPUs), so that there are no shared resources. Not just that, but there can be many individual script programs. And each program can be the destination of a callback from any part of the engine, which is heavily multi-threaded.
My emulator supports forking, which is a feature where I clone an existing emulator into a new one that doesn’t use memory until you start changing things (aka copy-on-write). I am using these forks to allow the game script to make decisions that usually would be too costly for scripting in general. Sure, if you just had one program you would just make several emulators each running the same program, and that would be fine, but scalability comes in many shapes, and if you want your game to be moddable, you want people to be able to add their own programs, each with callbacks, and you may find yourself with thousands of programs competing for resources. Even worse, you may have to use synchronization primitives in your scripts. Now, multiply that by the core count on a server.
One such function is comparing the sides of a cactus block to a neighbor which we don’t know until we check:
FPUB long stdCactusVisibility(Block src, Block dst, uint16_t facing)
{
// Remove top and bottom when cacti meet
if (src.getID() == dst.getID())
facing &= (1 | 2 | 16 | 32);
else if (facing == 8 && !dst.isTransparent())
facing = 0; // Remove bottom when meeting a solid
return facing;
}
The assembly looks OK. It could probably be better:
0000000000400f54 <stdCactusVisibility>:
stdCactusVisibility():
400f54: 0805453b zext.h a0,a0
400f58: 0805c7bb zext.h a5,a1
400f5c: 02f50a63 beq a0,a5,400f90 <stdCactusVisibility+0x3c>
400f60: 00800793 li a5,8
400f64: 00f60663 beq a2,a5,400f70 <stdCactusVisibility+0x1c>
400f68: 00060513 mv a0,a2
400f6c: 7ff00073 .insn 4, 0x7ff00073 # return_fast
400f70: 46ede8b7 lui a7,0x46ede
400f74: 9048889b addw a7,a7,-1788 # 46edd904 <__BSS_END__+0x46ab7bb4>
400f78: 00058513 mv a0,a1
400f7c: 00000073 ecall
400f80: 0005051b sext.w a0,a0
400f84: fe0512e3 bnez a0,400f68 <stdCactusVisibility+0x14>
400f88: 00000613 li a2,0
400f8c: fddff06f j 400f68 <stdCactusVisibility+0x14>
400f90: 03367613 and a2,a2,51
400f94: fd5ff06f j 400f68 <stdCactusVisibility+0x14>
If you’ve read my time-to-first-instruction blog post, you will remember the overhead of communicating with other threads on live systems can be quite high. These emulators are plenty, plenty small, and can be used from anywhere. Whichever worker requires the script to do something, will have direct access to a script instance:
// Early checking that the program exists
Event event { Scripts::get(program), function };
block.visibilityComp = [event] (const Block& src, const Block& dst, uint16_t facing) mutable -> uint16_t
{
// Pick existing thread-local emulator, and make the call
return event.forked_call(src.toPackedValue(), dst.toPackedValue(), facing);
};
The event points to the original program, but the forked_call will pick a local fork of the program, appropriate for running on the current thread. Naturally, on a live system things take more time than in a micro-benchmark, but let’s have a look anyway:
// An event that is hooked up to a program
Event event { cppcraft::Scripts::get(program), function };
static constexpr size_t SAMPLES = 1000;
// An event hooked up to a cactus block visibility callback
auto lambda = [event] (const Block& src, const Block& dst, uint16_t facing) mutable -> uint16_t
{
// Make a function call into a script program
return event.forked_call(src.toPackedValue(), dst.toPackedValue(), facing);
};
library::RollingAvg ra(32, [] (auto& ra) {
const long ans = ra.median() * (1000000000ULL / SAMPLES);
fmt::print("[Forked call] median={}ns samples={}\n", ans, ra.sample_count());
});
for (size_t i = 0; i < 128; i++)
{
library::ScopedRollingAvg sra(ra);
for (size_t j = 0; j < SAMPLES; j++)
{
lambda(Block(0), Block(0), 0);
}
}
[Forked call] median=8ns samples=32
[Forked call] median=8ns samples=64
[Forked call] median=8ns samples=96
[Forked call] median=8ns samples=128
[Forked call] median=12ns samples=32
[Forked call] median=12ns samples=64
[Forked call] median=12ns samples=96
[Forked call] median=12ns samples=128
So, with 128 * 1000 samples, we can say that we calculated the visibility of the side of a cactus in ~8-12ns by running interpreted RISC-V. System clocks are imprecise for sub microsecond timings, and so I am unfortunately not able to measure the live values without RDTSC, but this is good enough to see the magnitude of the call.
When the game is running, the callback function will be invoked from any of the 16 worker threads. Out of the 205M voxel area around me in the picture, the engine was doing around 30k fluid simulation function calls per second. I counted it over a minute or so:
[std] says: Fluid simulation calls: 1160192
So, bottom line is: Have an emulator ready to go. Don’t share anything. Make entering, leaving and invoking API functions (from the script into the engine) low cost operations. That’s easier said than done, but once you have it, you can involve the script in things you couldn’t previously.
Fluid simulation
Fluid is currently running on timers. If the fluid hasn’t run out of steam, it will attempt to flow after a certain time:
Timer::oneshot(FLOW_TIME,
[x, y, z, fluid_id] (auto)
{
...
}
But, how can you pass capture storage to the game engine in emulated C++? Turns out you can make your own variant of std::function that never allocates on the heap, and then copy its contents into the timer system. When the timer triggers, we can make a temporary guest heap allocation (because yes, we do control the guest heap), copy the capture storage back, and call the lambda function. Just like that.
[](Script& script)
{
auto& machine = script.machine();
auto [time, peri, addr, data, size]
= machine.sysargs<float, float, gaddr_t, gaddr_t, gaddr_t>();
auto capture = CaptureStorage::get(machine, data, size);
// Avoid extremely high-frequent timers
const int m_time = int(time * 1000);
int m_peri = int(peri * 1000);
if (m_peri < 250) peri = 250;
int id = SimTimers::periodic(
millis(m_time), millis(m_peri),
[addr = (gaddr_t)addr, capture, &script](int id)
{
auto alloc = script.scoped_guest_alloc(capture.size());
script.machine().copy_to_guest(
alloc.get(), capture.data(), capture.size());
script.call(addr, id, alloc.get());
});
machine.set_result(id);
}
The dynamic call handler for Timer::periodic. It takes a duration, a period, a lambda function address, and capture storage as arguments. The capture storage is copied into the engines timer system (via capture storage!), and when it’s time to call back into the script, the capture storage is brought back via a temporary (scoped) guest allocation and copy. I could probably have devised a solution that allocated in the guest and just kept it there the whole time, but that’s for a time when this shows up in the call-graph. This solution makes the script more stateless, and overall more robust. :)
It’s hard to write about things that are hard to work on to begin with. Who in their right mind is making an engine, the emulator to run game scripting with, and a game on top of that? Sounds crazy to me.
-gonzo