An Introduction to Low-Latency Scripting for Game Engines, Part 5

fwsGonzo
8 min read · May 31, 2024


Scripting with Rust

You can find the project here. With the RISC-V toolchain installed, it should build and run without modification.

This is a series, starting with Part 1 here. The first two parts are C++, while part 3 is about Nelua, and part 4 is about Nim. Now, onto Rust.

Basic setup

In order to keep symbols we care about, apparently we need a nightly feature. Incantation time:

rustup +nightly target add riscv64gc-unknown-linux-gnu

This (hopefully) installs the nightly version of the riscv64gc toolchain with standard library support. I’m not a Rust expert, so please let me know if I am wrong about anything here.

Let’s cargo init a new, empty project in a folder and drop a toolchain config in .cargo/config.toml:

[build]
target = "riscv64gc-unknown-linux-gnu"

[target.riscv64gc-unknown-linux-gnu]
linker = "riscv64-linux-gnu-gcc-12"
rustflags = ["-C", "target-feature=+crt-static","-Zexport-executable-symbols"]

This will use our system RISC-V linker, stop rustc from dropping our exported symbols, and make the binaries static. It is strange to me that a Rust toolchain can’t link a program on its own, but we have RISC-V toolchains from the packaging system. It’s all good.

Let’s see if a basic Rust program works:

use std::thread;

const NTHREADS: u32 = 10;

// This is the `main` thread
fn main() {
    // Make a vector to hold the children which are spawned.
    let mut children = vec![];

    for i in 0..NTHREADS {
        // Spin up another thread
        children.push(thread::spawn(move || {
            println!("this is thread number {}", i);
        }));
    }

    for child in children {
        // Wait for the thread to finish. Returns a result.
        let _ = child.join();
    }
}

And then build it with the nightly toolchain:

cargo +nightly build --release

Leading to:

this is thread number 0
this is thread number 1
this is thread number 2
this is thread number 3
this is thread number 4
this is thread number 5
this is thread number 6
this is thread number 7
this is thread number 8
this is thread number 9
>>> myscript initialized.

Seems to work! We can work with this!

Creating callable functions

To be callable from the host engine, a function needs an unmangled symbol with the C calling convention:

#[no_mangle]
extern "C" fn test1() {
    println!("test1");
}

In order to verify that we can call it, we will grep for test1:

$ riscv64-linux-gnu-readelf -a target/riscv64gc-unknown-linux-gnu/release/rust_program | grep test1
103232: 0000000000014cda 98 FUNC GLOBAL DEFAULT 4 test1

It’s a long incantation, but it works. The symbol is there!

I tried creating a fair allocation benchmark:

#[no_mangle]
extern "C" fn test2() {
    // Benchmark the global allocator: allocate and free a 1 KiB block.
    let layout = std::alloc::Layout::from_size_align(1024, 8).expect("Invalid layout");
    unsafe {
        let raw: *mut u8 = std::alloc::alloc(layout);
        std::hint::black_box(&raw);
        std::alloc::dealloc(raw, layout);
    }
}

But I’m not a Rust expert. It clocked in at ~320ns. One very nice thing about Rust is that it’s excellent at embedded work. Looking at custom allocators, it seems like it’s possible to override the global allocator.

extern crate alloc;
use alloc::alloc::Layout;
use alloc::alloc::GlobalAlloc;
use std::arch::asm;

const NATIVE_SYSCALLS_BASE: i32 = 490;

struct SysAllocator;

unsafe impl GlobalAlloc for SysAllocator {
    #[inline]
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ret: *mut u8;
        asm!("ecall", in("a7") NATIVE_SYSCALLS_BASE + 0,
            in("a0") layout.size(), in("a1") layout.align(),
            lateout("a0") ret);
        return ret;
    }
    #[inline]
    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 {
        let ret: *mut u8;
        asm!("ecall", in("a7") NATIVE_SYSCALLS_BASE + 1,
            in("a0") layout.size(), in("a1") 1, lateout("a0") ret);
        return ret;
    }
    #[inline]
    unsafe fn realloc(&self, ptr: *mut u8, _layout: Layout, new_size: usize) -> *mut u8 {
        let ret: *mut u8;
        asm!("ecall", in("a7") NATIVE_SYSCALLS_BASE + 2,
            in("a0") ptr, in("a1") new_size, lateout("a0") ret);
        return ret;
    }
    #[inline]
    unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
        asm!("ecall", in("a7") NATIVE_SYSCALLS_BASE + 3,
            in("a0") ptr, lateout("a0") _);
    }
}

#[global_allocator]
static A: SysAllocator = SysAllocator;

While this isn’t appropriate for a guide, I’m learning here too. What this does is invoke the scripting system’s native heap-related system calls for alloc, calloc, realloc and free. I have written extensively about these before, so I won’t go into detail here, but their purpose is to accelerate heap operations. The new timing for the benchmark is now:

Benchmark: std::make_unique[1024] alloc+free  Elapsed time: 27ns

.. which is a 12x reduction in run-time. A brutal improvement. Now we’re not afraid of using heap allocations like any normal Rust program.
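As a quick illustration (my own sketch, not part of the benchmark suite): once the #[global_allocator] is registered, ordinary heap types route through the host system calls with no further changes:

// Hypothetical demo: Vec and Box now allocate through SysAllocator's ecalls.
fn allocator_demo() {
    let v: Vec<u8> = Vec::with_capacity(1024); // alloc system call
    let b = Box::new([0u8; 64]);               // another alloc system call
    std::hint::black_box((&v, &b));
} // dealloc system calls fire here as v and b drop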

One thing I didn’t quite understand: the only way I could figure out to get the allocator into my final program was to add this line at the top of main.rs:

#[path = "sysalloc.rs"] mod sysalloc;

Is there a better way? I’m OK with it, but it would be nice to know what the idiomatic way of bringing in another file looks like.
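For reference, my understanding of the idiomatic form: when sysalloc.rs sits in src/ next to main.rs, a plain module declaration is enough, and #[path] is only needed for files in non-standard locations:

// src/main.rs
mod sysalloc; // rustc resolves this to src/sysalloc.rs (or src/sysalloc/mod.rs)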

Moving on to the data test:

#[derive(Debug)]
#[repr(C)]
struct Data {
    a: i32,
    b: i32,
    c: i32,
    d: i32,
    e: f32,
    f: f32,
    g: f32,
    h: f32,
    i: f64,
    j: f64,
    k: f64,
    l: f64,
    buffer: [u8; 32],
}

#[no_mangle]
extern "C" fn test4(d: Data) {
    println!("test4 {:?}", d);
}

The data test turned out to be the shortest among all languages so far. The only difference is that it prints the fixed-size string as a byte array. But that’s OK, we can see that the string is there:

test4 Data { a: 1, b: 2, c: 3, d: 4, e: 5.0, f: 6.0, g: 7.0, h: 8.0, i: 9.0, j: 10.0, k: 11.0, l: 12.0,
buffer: [72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }
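If we ever wanted the text rather than the raw bytes, a small helper (my own sketch, not part of the test suite) could slice the buffer at the first NUL:

use std::borrow::Cow;

// Render a NUL-terminated fixed-size buffer as text.
fn buffer_as_text(buffer: &[u8]) -> Cow<'_, str> {
    let end = buffer.iter().position(|&b| b == 0).unwrap_or(buffer.len());
    String::from_utf8_lossy(&buffer[..end])
}

For the buffer above, buffer_as_text(&d.buffer) yields "Hello, World!".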

I was not able to build this Rust project with RISC-V compressed instructions disabled, which affected latencies a bit, but not overly much:

#[no_mangle]
extern "C" fn measure_overhead() { }

#[no_mangle]
extern "C" fn bench_dyncall_overhead() {
    unsafe { dyncall3() };
}

With the results:

Call overhead: 11ns
Benchmark: Overhead of dynamic calls Elapsed time: 10ns

I think we should have a look at the assembly, just in case. These numbers are abnormally high.

0000000000014dd4 <measure_overhead>:
measure_overhead():
14dd4: 8082 ret

Looks good!

0000000000014dd6 <bench_dyncall_overhead>:
bench_dyncall_overhead():
14dd6: fffff317 auipc t1,0xfffff
14dda: 3a030067 jr 928(t1) # 14176 <dyncall3>

That’s an amazing 2 instructions: build an address and jump to it. I’m surprised. Alright, where is the problem then? I simply forgot to add the fast_exit helper function. In theory libriscv could just patch the program, but I try to avoid that if I can. What happens instead, by default, is that a second execute segment containing just an exit function gets added. However, it is always faster when the exit function (used to end function calls into the VM) lives in the main execute segment. Hence, if you have a symbol called fast_exit, libriscv will use it.
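In Rust, one way to provide such a symbol is a tiny global_asm! stub. This is just my sketch, assuming the host accepts a plain exit system call (93 on RISC-V Linux) to end the VM call; check the libriscv examples for the exact sequence your engine expects:

use std::arch::global_asm;

// Sketch: a fast_exit symbol placed in the main execute segment.
// Assumption: ending a VM call is a regular exit system call.
global_asm!(
    ".global fast_exit",
    "fast_exit:",
    "li a7, 93", // exit syscall number on RISC-V Linux
    "ecall"      // trap into the host, ending the VM function call
);

With a fast_exit symbol in place, the numbers improve: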

Call overhead: 4ns
Benchmark: std::make_unique[1024] alloc+free Elapsed time: 25ns
Benchmark: Overhead of dynamic calls Elapsed time: 9ns

As an added bonus, the alloc+free benchmark is also faster.

Alright, we’ve matched the other programs except test5. However, I have no doubt that Rust is quite capable of passing data back and forth. Instead, what I’m wondering is if it makes sense to accelerate memory operations.

$ riscv64-unknown-elf-objdump -d rust_program | grep memset | wc -l
185
$ riscv64-unknown-elf-objdump -d rust_program | grep memcpy | wc -l
699

Looks like it does!

Accelerating memory operations

The easiest way to override memset, memcpy and memmove is the linker’s --wrap feature. So let’s start by passing the linker arguments to rustflags in .cargo/config.toml:

"-C", "link_args=-Wl,--wrap=memcpy,--wrap=memmove,--wrap=memset"

The binary should now correctly no longer build:

/bin/ld: (.text+0x170a): undefined reference to `__wrap_memset'

A bunch of undefined references to the functions we are going to get native performance from. Alright, if you can’t follow this next part, don’t worry. It’s a bit of inline assembly that invokes the system calls related to memory operations. I did my best to follow the inline assembly documentation:

use std::arch::asm;

// Each wrapper forwards to a host system call. They are extern "C" so the
// calling convention matches the C symbols the linker redirects here.
#[no_mangle]
pub extern "C" fn __wrap_memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8
{
    unsafe {
        asm!(
            "li a7, 495+0",
            "ecall",
            in("a0") dest,
            in("a1") src,
            in("a2") n,
            options(nostack)
        );
    }
    return dest;
}

#[no_mangle]
pub extern "C" fn __wrap_memset(s: *mut u8, c: i32, n: usize) -> *mut u8
{
    unsafe {
        asm!(
            "li a7, 495+1",
            "ecall",
            in("a0") s,
            in("a1") c,
            in("a2") n,
            options(nostack)
        );
    }
    return s;
}

#[no_mangle]
pub extern "C" fn __wrap_memmove(dest: *mut u8, src: *const u8, n: usize) -> *mut u8
{
    unsafe {
        asm!(
            "li a7, 495+2",
            "ecall",
            in("a0") dest,
            in("a1") src,
            in("a2") n,
            options(nostack)
        );
    }
    return dest;
}

#[no_mangle]
pub extern "C" fn __wrap_memcmp(s1: *const u8, s2: *const u8, n: usize) -> i32
{
    let result: i32;
    unsafe {
        asm!(
            "li a7, 495+3",
            "ecall",
            in("a0") s1,
            in("a1") s2,
            in("a2") n,
            lateout("a0") result,
            options(nostack, readonly)
        );
    }
    return result;
}
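With --wrap active, the linker redirects every call to these functions, including the copies and fills the compiler generates on its own, to the __wrap_ versions. A small sketch of code that exercises them implicitly:

// Hypothetical demo: large fills and copies lower to memset/memcpy calls,
// which the linker now routes to the wrappers above.
fn wrappers_demo() {
    let a = [1u8; 4096];
    let mut b = [0u8; 4096]; // zero-initialization may lower to memset
    b.copy_from_slice(&a);   // slice copy lowers to memcpy
    std::hint::black_box(&b);
}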

With this, the performance of memory operations should improve. Let’s change the allocation benchmark to also set each byte to zero:

    let raw: *mut u8 = alloc::alloc(layout);
    for i in 0..1024 {
        *raw.add(i) = 0;
    }
    std::hint::black_box(&raw);
    std::alloc::dealloc(raw, layout);

The .add() function confused me at first, but it’s a helper that offsets a pointer by a number of elements, scaled by the size of the pointee type. Docs.
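A tiny example of the semantics, since .add(i) advances the pointer by i elements of the pointee type (for a *mut u8, that is i bytes):

fn add_demo() {
    let mut buf = [0u8; 4];
    let p: *mut u8 = buf.as_mut_ptr();
    unsafe { *p.add(2) = 7; } // offset of 2 * size_of::<u8>() = 2 bytes
    assert_eq!(buf, [0, 0, 7, 0]);
}

Benchmarking the zeroing loop without the linker wrap gives us a costly memory operation: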

Benchmark: std::make_unique[1024] alloc+memset+free  Elapsed time: 251ns

And re-enabling the helper system calls, we see that it’s a little bit more expensive than just allocating/deallocating, but not by much:

Benchmark: std::make_unique[1024] alloc+memset+free  Elapsed time: 41ns

Subtracting the known allocation cost gives us 41 - 25 = 16ns for the zeroing itself, down from 251 - 25 = 226ns before: roughly a 14x run-time reduction. That’s awesome! I guess that concludes the acceleration (can we call it that?) of the Rust environment. It wouldn’t do to have it perform worse than the C/C++ environments just because of a lack of care.
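As an aside (my own note, not something the benchmarks above measure): since SysAllocator implements alloc_zeroed, asking for zeroed memory directly would fold the allocation and the zeroing into a single calloc-style system call:

// Sketch: allocate 1024 zeroed bytes in one ecall via alloc_zeroed.
let layout = std::alloc::Layout::from_size_align(1024, 8).expect("Invalid layout");
unsafe {
    let raw = std::alloc::alloc_zeroed(layout);
    std::hint::black_box(&raw);
    std::alloc::dealloc(raw, layout);
}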

Results

With all this in place, it seems the basic Rust environment is fully in place. Due to my lack of Rust skills, I won’t try to create anything fancy with this. But maybe you will?

mod sysalloc;
mod dyncalls;
use std::alloc;
use std::thread;
use dyncalls::*;

const NTHREADS: u32 = 10;

// This is the `main` thread
fn main() {
    // Make a vector to hold the children which are spawned.
    let mut children = vec![];

    for i in 0..NTHREADS {
        // Spin up another thread
        children.push(thread::spawn(move || {
            println!("this is thread number {}", i);
        }));
    }

    for child in children {
        // Wait for the thread to finish. Returns a result.
        let _ = child.join();
    }

    let i = unsafe { dyncall1(0x12345678) };
    println!("dyncall1: {}", i);
}

#[no_mangle]
extern "C" fn test1(a: i32, b: i32, c: i32, d: i32) -> i32 {
    println!("test1 was called with {}, {}, {}, {}", a, b, c, d);
    return a + b + c + d;
}

#[no_mangle]
extern "C" fn test2() {
    // Benchmark global allocator
    let layout = std::alloc::Layout::from_size_align(1024, 8).expect("Invalid layout");
    unsafe {
        let raw: *mut u8 = alloc::alloc(layout);
        std::hint::black_box(&raw);
        std::alloc::dealloc(raw, layout);
    }
}

#[derive(Debug)]
#[repr(C)]
struct Data {
    a: i32,
    b: i32,
    c: i32,
    d: i32,
    e: f32,
    f: f32,
    g: f32,
    h: f32,
    i: f64,
    j: f64,
    k: f64,
    l: f64,
    buffer: [u8; 32],
}

#[no_mangle]
extern "C" fn test4(d: Data) {
    println!("test4 {:?}", d);
}

#[no_mangle]
extern "C" fn measure_overhead() { }

#[no_mangle]
extern "C" fn bench_dyncall_overhead() {
    unsafe { dyncall3() };
}

The final program looks quite OK. All the environment-related and API-related code lives in separate modules. And running it:

$ bash rust_and_run.sh 
~/github/libriscv/examples/gamedev/rust_program ~/github/libriscv/examples/gamedev
Compiling rust_program v0.1.0 (/home/gonzo/github/libriscv/examples/gamedev/rust_program)
Finished `release` profile [optimized] target(s) in 1.02s
this is thread number 0
this is thread number 1
this is thread number 2
this is thread number 3
this is thread number 4
this is thread number 5
this is thread number 6
this is thread number 7
this is thread number 8
this is thread number 9
dyncall1 called with argument: 0x12345678
dyncall1: 42
>>> myscript initialized.
test1 was called with 1, 2, 3, 4
test1 returned: 10
Call overhead: 4ns
Benchmark: std::make_unique[1024] alloc+free Elapsed time: 25ns
test4 Data { a: 1, b: 2, c: 3, d: 4, e: 5.0, f: 6.0, g: 7.0, h: 8.0, i: 9.0, j: 10.0, k: 11.0, l: 12.0, buffer: [72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }
Benchmark: Overhead of dynamic calls Elapsed time: 10ns
test5

We have sandboxed Rust! Thanks for reading!

-gonzo
