Using C++ as a Scripting Language, part 11

The STOP instruction

13 min read · Dec 23, 2023

I’ve had a custom instruction that I simply call STOP in my RISC-V emulator for probably a few years now. However, I haven’t really written down how it works and what it can do until today.

I actually got curious myself. Although I do have an idea of what it does for me, I want to see if I can pin down exactly how it should be used, or whether it should be used at all.

In order to reduce latency to the script, it’s important that most of it is reduced to register operations. That means making use of the highly developed and advanced GCC compiler suite: inlining, LTO, heavy usage of inline assembly, and code size reduction through linker garbage collection. Today I’ll write just a little bit about how being able to stop anywhere on the dot can also result in more performant code generation.

The STOP instruction

STOP is implemented as a custom RISC-V SYSTEM instruction:

void halt()
{
  asm (".insn i SYSTEM, 0, x0, x0, 0x7ff");
}

FYI: An asm statement without output operands (no ::: sections) is implicitly treated as volatile.

The SYSTEM instruction has a dedicated range of custom variants, where the highest bit is set. Unfortunately, the .insn i directive takes a signed 12-bit immediate, so a positive value cannot set the highest bit. I am just going to ignore that and set all the bits that I can, landing us on 0x7ff. Bits 28–29 are the privilege level, ranging from user mode to machine level, making this technically a machine-level SYSTEM instruction.
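To make the encoding concrete, here is a small sketch that packs the I-type fields by hand. The helper name encode_itype and the constant names are mine, not from the emulator. The resulting word is the same 0x7ff00073 that shows up as a .4byte in the disassembly further down.

```cpp
#include <cstdint>

// Field layout of a RISC-V I-type instruction, from high to low bits:
//   imm[11:0] | rs1 | funct3 | rd | opcode
// encode_itype is a hypothetical helper, written only for illustration.
constexpr std::uint32_t encode_itype(std::uint32_t opcode, std::uint32_t funct3,
                                     std::uint32_t rd, std::uint32_t rs1,
                                     std::uint32_t imm12)
{
    return ((imm12 & 0xFFFu) << 20) | ((rs1 & 0x1Fu) << 15) |
           ((funct3 & 0x7u) << 12) | ((rd & 0x1Fu) << 7) | (opcode & 0x7Fu);
}

constexpr std::uint32_t OPCODE_SYSTEM = 0x73; // major opcode for SYSTEM

// .insn i SYSTEM, 0, x0, x0, 0x7ff
constexpr std::uint32_t STOP_WORD = encode_itype(OPCODE_SYSTEM, 0, 0, 0, 0x7ff);

static_assert(STOP_WORD == 0x7ff00073u, "same word as in the disassembly");
static_assert(((STOP_WORD >> 28) & 0x3u) == 0x3u, "bits 28-29 are both set");
static_assert((STOP_WORD >> 31) == 0u, "the highest bit stays clear");
```

With an immediate of zero, the same helper yields 0x00000073, which is the plain ecall seen in the first listing.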

Why a SYSTEM instruction? It’s because the decoder will essentially look for blocks, and STOP ends up as a block-ending instruction with certain properties: it can reveal the instruction counter (RDCYCLE), and since it ends a block, it can also terminate dispatch. In fact, it is so effective at stopping dispatch that I am using it to return from functions in my script. The bytecode is really just this:

INSTRUCTION(RV32I_BC_STOP, rv32i_stop) {
  REGISTERS().pc = pc + 4;
  return true;
}

Only having to write the final PC value (pointing just past the instruction) makes it quite fast.
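To show how a handler that returns true ends dispatch, here is a toy model. It is not code from the emulator: ToyMachine, ToyOp, ToyInsn, execute, dispatch and demo are all made-up names, and the real decoder works on blocks of real RISC-V instructions rather than this two-opcode array.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy machine state (hypothetical, far smaller than a real register file).
struct ToyMachine {
    std::uint64_t pc = 0;
    std::uint64_t a0 = 0;
    std::uint64_t executed = 0; // instructions dispatched so far
};

enum class ToyOp : std::uint8_t { AddA0, Stop };

struct ToyInsn {
    ToyOp op;
    std::int64_t imm;
};

// Each handler returns true when dispatch should end.
inline bool execute(ToyMachine& m, const ToyInsn& insn)
{
    switch (insn.op) {
    case ToyOp::AddA0:
        m.a0 += static_cast<std::uint64_t>(insn.imm);
        m.pc += 4;
        return false;
    case ToyOp::Stop:
        // Same shape as the STOP handler: write the PC that follows
        // the instruction, then ask the loop to stop.
        m.pc += 4;
        return true;
    }
    return true; // unknown opcode: stop instead of running on
}

// Runs from m.pc until a handler asks to stop or the program ends.
template <std::size_t N>
void dispatch(ToyMachine& m, const std::array<ToyInsn, N>& program)
{
    while (m.pc / 4 < N) {
        ++m.executed;
        if (execute(m, program[m.pc / 4]))
            break;
    }
}

// Small usage example: the instruction after Stop is never dispatched.
inline ToyMachine demo()
{
    const std::array<ToyInsn, 4> program{{
        {ToyOp::AddA0, 5},
        {ToyOp::AddA0, 7},
        {ToyOp::Stop, 0},
        {ToyOp::AddA0, 100}, // never reached
    }};
    ToyMachine m;
    dispatch(m, program);
    return m; // a0 == 12, pc == 12, executed == 3
}
```

The point of the sketch is only that the stop handler does almost no work: one store to the PC and a flag that ends the loop.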

Root functions

A root function is a function that never gets called from anywhere else (from normal code, at least), and typically unwinding would stop at a root function. For example, a regular program usually begins at _start, directly jumps to and ends at some kind of internal libc_start, like so:

void libc_start(int argc, char **argv, char **envp, auxvec_t *auxvec)
{
    ...
    exit(main(argc, argv, envp));
    __builtin_unreachable();
}

_start:
    setup Linux-loader arguments to libc_start
    jmp libc_start

Pseudo-code, but you get the gist of it. Basically, since exit() is noreturn, and we add an unreachable there to make sure, libc_start will not need an epilogue. An epilogue is a bunch of instructions that wind down the function so that it can properly return to where it came from. However, since we are not returning anywhere, there is no point. The noreturn attribute, or an unreachable statement, allows the optimizer to omit the entire epilogue. There is talk about a root attribute, but it seems to have made no progress in GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92086

Making a VM call

I’ve written about VM calls before, many years ago, and probably have mentioned them quite a bit since then, but a primer is in order:

It is a function call into a VM guest program that begins and ends with that same function. From the host:

vm.vmcall("my_function");

From the guest:

void my_function() {
    printf("Hello World!\n");
}

The VM call sets everything up for a function call into the program, and also makes sure that when you return from the function, it stops the VM automatically and also forwards (or makes it possible to get) return values. It is, in essence, a regular function call, but into a sandbox.

Returning fast

When making low latency calls into a VM guest, we will essentially call a function which does not know it is a root function, and which must also be able to return. It is almost the same as the example with libc, in that we don’t want to save callee-saved registers or see any epilogues generated. So, can we do something about this?

Let’s bring in the STOP instruction, like so:

#include <type_traits>

// Return from the VM call without a value: STOP ends dispatch immediately.
inline __attribute__((noreturn)) void return_fast()
{
  asm volatile(".insn i SYSTEM, 0, x0, x0, 0x7ff");
  __builtin_unreachable();
}

// Return a single register-sized value in a0, then STOP.
template <typename T>
inline __attribute__((noreturn)) void return_fast(T t)
{
  static_assert(std::is_standard_layout_v<T> && std::is_trivial_v<T> && sizeof(T) <= sizeof(__UINTPTR_TYPE__),
    "Return value must be trivial and fit in a register");
  // Pin the value to a0, the first RISC-V return-value register.
  register T a0 asm("a0") = t;
  asm volatile(".insn i SYSTEM, 0, x0, x0, 0x7ff" :: "r"(a0));
  __builtin_unreachable();
}

With return_fast() and return_fast<T>(T t) we can return from a function by calling one of them in place of a regular return statement. If the compiler can see that every path ends by invoking a noreturn function, it will omit the function teardown and perhaps not save any callee-saved registers either, saving a bunch of instructions. However, since this is not a visible return, the compiler will not run destructors for local objects, so be aware of that.
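To get a feel for which types pass the static_assert in return_fast<T>(), here is a host-side sketch of the same condition. The names fits_return_fast_v, ToyBlock and TwoWords are hypothetical, and ToyBlock's layout is only a guess at something Block-like, not the real type.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Same condition as the static_assert in return_fast<T>(),
// pulled out into a (hypothetical) variable template.
template <typename T>
inline constexpr bool fits_return_fast_v =
    std::is_standard_layout_v<T> && std::is_trivial_v<T> &&
    sizeof(T) <= sizeof(std::uintptr_t);

// Hypothetical stand-in for a small block type: 4 bytes in total.
struct ToyBlock {
    std::uint16_t id;
    std::uint16_t extra;
};

// Trivial, but twice the size of a register.
struct TwoWords {
    std::uintptr_t lo;
    std::uintptr_t hi;
};

static_assert(fits_return_fast_v<int>);
static_assert(fits_return_fast_v<ToyBlock>);
static_assert(!fits_return_fast_v<TwoWords>);    // too large for one register
static_assert(!fits_return_fast_v<std::string>); // not trivial: has a destructor
```

Note that the check only guards the value being returned. Locals with destructors elsewhere in the function are not covered by it.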

Let’s start with a function that isn’t slow by any means, stdBuildDoor. What it does is all the things needed after a door has been built: play the right sounds, and return any changes to the door depending on what you need.

Block stdBuildDoor(int x, int y, int z, Block blk, Block old)
{
  if (blk.getID() == old.getID()) {
    if (blk.getExtra() & 4)
      Sound::play("door_open", x, y, z);
    else
      Sound::play("door_close", x, y, z);
  }
  return blk;
}

I haven’t really made the doors very fancy, I admit. They change form when you activate them, which triggers this callback you see above, where it’s the same block being built on the same block, but with different extra bits. In short, it’s a callback that sees the door was modified and plays a sound.

Deceptively short in code, but horribly long in machine instructions, and as you will see it has a prologue and epilogue ending with a return instruction:

0000000000407700 <stdBuildDoor>:
stdBuildDoor():
  407700:       f7010113                addi    sp,sp,-144
  407704:       08813023                sd      s0,128(sp)
  407708:       08113423                sd      ra,136(sp)
  40770c:       00068413                mv      s0,a3
  407710:       0807473b                zext.h  a4,a4
  407714:       0806c7bb                zext.h  a5,a3
  407718:       00f70c63                beq     a4,a5,407730 <stdBuildDoor+0x30>
  40771c:       08813083                ld      ra,136(sp)
  407720:       00040513                mv      a0,s0
  407724:       08013403                ld      s0,128(sp)
  407728:       09010113                addi    sp,sp,144
  40772c:       00008067                ret
  407730:       06913c23                sd      s1,120(sp)
  407734:       07213823                sd      s2,112(sp)
  407738:       07313423                sd      s3,104(sp)
  40773c:       07413023                sd      s4,96(sp)
  407740:       05513c23                sd      s5,88(sp)
  407744:       05613823                sd      s6,80(sp)
  407748:       02d41793                slli    a5,s0,0x2d
  40774c:       00050913                mv      s2,a0
  407750:       00058493                mv      s1,a1
  407754:       00060693                mv      a3,a2
  407758:       0807ce63                bltz    a5,4077f4 <stdBuildDoor+0xf4>
  40775c:       00429ab7                lui     s5,0x429
  407760:       04010a13                addi    s4,sp,64
  407764:       03413823                sd      s4,48(sp)
  407768:       010a8513                addi    a0,s5,16 # 429010 <_ZTVN10__cxxabiv117__class_type_infoE+0x1420>
  40776c:       24400893                li      a7,580
  407770:       010a8b13                addi    s6,s5,16
  407774:       00000073                ecall
  407778:       00f00793                li      a5,15
  40777c:       00050993                mv      s3,a0
  407780:       10a7ee63                bltu    a5,a0,40789c <stdBuildDoor+0x19c>
  407784:       00100793                li      a5,1
  407788:       000a0513                mv      a0,s4
  40778c:       0cf99c63                bne     s3,a5,407864 <stdBuildDoor+0x164>
  407790:       06400793                li      a5,100
  407794:       04f10023                sb      a5,64(sp)
  407798:       01350533                add     a0,a0,s3
  40779c:       03313c23                sd      s3,56(sp)
  4077a0:       00050023                sb      zero,0(a0)
  4077a4:       03013503                ld      a0,48(sp)
  4077a8:       00048613                mv      a2,s1
  4077ac:       00090593                mv      a1,s2
  4077b0:       b70f90ef                jal     ra,400b20 <sys_sound_play>
  4077b4:       03013503                ld      a0,48(sp)
  4077b8:       01450863                beq     a0,s4,4077c8 <stdBuildDoor+0xc8>
  4077bc:       04013583                ld      a1,64(sp)
  4077c0:       00158593                addi    a1,a1,1
  4077c4:       7fd010ef                jal     ra,4097c0 <_ZdlPvm>
  4077c8:       08813083                ld      ra,136(sp)
  4077cc:       00040513                mv      a0,s0
  4077d0:       08013403                ld      s0,128(sp)
  4077d4:       07813483                ld      s1,120(sp)
  4077d8:       07013903                ld      s2,112(sp)
  4077dc:       06813983                ld      s3,104(sp)
  4077e0:       06013a03                ld      s4,96(sp)
  4077e4:       05813a83                ld      s5,88(sp)
  4077e8:       05013b03                ld      s6,80(sp)
  4077ec:       09010113                addi    sp,sp,144
  4077f0:       00008067                ret
  4077f4:       00429ab7                lui     s5,0x429
  4077f8:       02010a13                addi    s4,sp,32
  4077fc:       01413823                sd      s4,16(sp)
  407800:       000a8513                mv      a0,s5
  407804:       24400893                li      a7,580
  407808:       000a8b13                mv      s6,s5
  40780c:       00000073                ecall
  407810:       00f00793                li      a5,15
  407814:       00050993                mv      s3,a0
  407818:       0aa7e463                bltu    a5,a0,4078c0 <stdBuildDoor+0x1c0>
  40781c:       00100793                li      a5,1
  407820:       000a0513                mv      a0,s4
  407824:       04f99e63                bne     s3,a5,407880 <stdBuildDoor+0x180>
  407828:       06400793                li      a5,100
  40782c:       02f10023                sb      a5,32(sp)
  407830:       01350533                add     a0,a0,s3
  407834:       01313c23                sd      s3,24(sp)
  407838:       00050023                sb      zero,0(a0)
  40783c:       01013503                ld      a0,16(sp)
  407840:       00048613                mv      a2,s1
  407844:       00090593                mv      a1,s2
  407848:       ad8f90ef                jal     ra,400b20 <sys_sound_play>
  40784c:       01013503                ld      a0,16(sp)
  407850:       f7450ce3                beq     a0,s4,4077c8 <stdBuildDoor+0xc8>
  407854:       02013583                ld      a1,32(sp)
  407858:       00158593                addi    a1,a1,1
  40785c:       735010ef                jal     ra,409790 <_ZdlPvm>
  407860:       f69ff06f                j       4077c8 <stdBuildDoor+0xc8>
  407864:       f2098ae3                beqz    s3,407798 <stdBuildDoor+0x98>
  407868:       010a8593                addi    a1,s5,16 # 429010 <_ZTVN10__cxxabiv117__class_type_infoE+0x1420>
  40786c:       00098613                mv      a2,s3
  407870:       23f00893                li      a7,575
  407874:       00000073                ecall
  407878:       03013503                ld      a0,48(sp)
  40787c:       f1dff06f                j       407798 <stdBuildDoor+0x98>
  407880:       fa0988e3                beqz    s3,407830 <stdBuildDoor+0x130>
  407884:       000a8593                mv      a1,s5
  407888:       00098613                mv      a2,s3
  40788c:       23f00893                li      a7,575
  407890:       00000073                ecall
  407894:       01013503                ld      a0,16(sp)
  407898:       f99ff06f                j       407830 <stdBuildDoor+0x130>
  40789c:       00c13423                sd      a2,8(sp)
  4078a0:       04054463                bltz    a0,4078e8 <stdBuildDoor+0x1e8>
  4078a4:       00150513                addi    a0,a0,1
  4078a8:       02054e63                bltz    a0,4078e4 <stdBuildDoor+0x1e4>
  4078ac:       284030ef                jal     ra,40ab30 <_Znwm>
  4078b0:       00813683                ld      a3,8(sp)
  4078b4:       02a13823                sd      a0,48(sp)
  4078b8:       05313023                sd      s3,64(sp)
  4078bc:       fadff06f                j       407868 <stdBuildDoor+0x168>
  4078c0:       02054463                bltz    a0,4078e8 <stdBuildDoor+0x1e8>
  4078c4:       00150513                addi    a0,a0,1
  4078c8:       00054e63                bltz    a0,4078e4 <stdBuildDoor+0x1e4>
  4078cc:       00c13423                sd      a2,8(sp)
  4078d0:       260030ef                jal     ra,40ab30 <_Znwm>
  4078d4:       00813683                ld      a3,8(sp)
  4078d8:       00a13823                sd      a0,16(sp)
  4078dc:       03313023                sd      s3,32(sp)
  4078e0:       fa5ff06f                j       407884 <stdBuildDoor+0x184>
  4078e4:       8ddf80ef                jal     ra,4001c0 <_ZSt17__throw_bad_allocv>
  4078e8:       00428537                lui     a0,0x428
  4078ec:       0b050513                addi    a0,a0,176 # 4280b0 <_ZTVN10__cxxabiv117__class_type_infoE+0x4c0>
  4078f0:       a8df80ef                jal     ra,40037c <_ZSt20__throw_length_errorPKc>
  4078f4:       00050413                mv      s0,a0
  4078f8:       01010513                addi    a0,sp,16
  4078fc:       bd9fc0ef                jal     ra,4044d4 <_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE10_M_disposeEv>
  407900:       00040513                mv      a0,s0
  407904:       1c1130ef                jal     ra,41b2c4 <_Unwind_Resume>
  407908:       00050413                mv      s0,a0
  40790c:       03010513                addi    a0,sp,48
  407910:       bc5fc0ef                jal     ra,4044d4 <_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE10_M_disposeEv>
  407914:       00040513                mv      a0,s0
  407918:       1ad130ef                jal     ra,41b2c4 <_Unwind_Resume>

It has a prologue and an epilogue (144 bytes of stack), exception-throwing code (std::bad_alloc and std::length_error), a regular return in the middle, and so on. I will say, though, that it is not necessarily slow: I measured the function at ~72ns when I benchmarked it. This is purely a premature optimization that I’m doing so that I can benchmark return_fast() against a real script function. It is not like a previous blog post, where I was optimizing a guest function that was called an obscene number of times.

As I alluded to earlier, we cannot use return_fast() in this function yet, as there are temporary strings with construction and destruction code. _ZdlPvm is the sized delete operator, for example. So, just for this, I will change the std::string to a std::string_view in order to avoid allocations.
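A minimal host-side sketch of that signature change follows. The two function names are hypothetical and are not the real Sound::play. The lengths they report, 9 and 10, are the same values that get loaded into a1 next to each string address in the optimized listing, presumably as the length half of the view.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Before: passing a string literal constructs a temporary std::string,
// which also has to be destroyed when the call returns.
inline std::size_t name_length_string(const std::string& name)
{
    return name.size();
}

// After: a string_view is just a pointer and a length, with no
// constructor or destructor work around the call.
constexpr std::size_t name_length_view(std::string_view name)
{
    return name.size();
}

static_assert(name_length_view("door_open") == 9);
static_assert(name_length_view("door_close") == 10);
```

Because the view variant needs no cleanup after the call, nothing stands between the call and a return_fast().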

Fully inlined, noreturn variant

I made it use return_fast(), I stopped using const std::string& and instead used std::string_view, and finally I made Sound::play use inline assembly:

0000000000406de4 <stdBuildDoor>:
stdBuildDoor():
  406de4:       00068793                mv      a5,a3
  406de8:       00060893                mv      a7,a2
  406dec:       00050313                mv      t1,a0
  406df0:       00058693                mv      a3,a1
  406df4:       0807c63b                zext.h  a2,a5
  406df8:       0807483b                zext.h  a6,a4
  406dfc:       01060663                beq     a2,a6,406e08 <stdBuildDoor+0x24>
  406e00:       00078513                mv      a0,a5
  406e04:       7ff00073                .4byte  0x7ff00073 # return_fast()
  406e08:       02d79713                slli    a4,a5,0x2d
  406e0c:       02075063                bgez    a4,406e2c <stdBuildDoor+0x48>
  406e10:       00428837                lui     a6,0x428
  406e14:       4d080513                addi    a0,a6,1232 # 4284d0 <_ZTVN10__cxxabiv117__class_type_infoE+0x13e0>
  406e18:       00900593                li      a1,9
  406e1c:       00030613                mv      a2,t1
  406e20:       00088713                mv      a4,a7
  406e24:       0040005b                .4byte  0x40005b # sys_sound_play()
  406e28:       fd9ff06f                j       406e00 <stdBuildDoor+0x1c>
  406e2c:       00428837                lui     a6,0x428
  406e30:       4e080513                addi    a0,a6,1248 # 4284e0 <_ZTVN10__cxxabiv117__class_type_infoE+0x13f0>
  406e34:       00a00593                li      a1,10
  406e38:       00030613                mv      a2,t1
  406e3c:       00088713                mv      a4,a7
  406e40:       0040005b                .4byte  0x40005b # sys_sound_play()
  406e44:       fbdff06f                j       406e00 <stdBuildDoor+0x1c>

So, it was reduced to less than a fifth of its size, from 135 instructions down to 25, and the function is just as safe to use and works exactly the same way. All the changes took about a minute of tinkering, because the inline assembly for Sound::play is already generated at build time. Every dynamic call has an opaque and an inline variant, and I just have to pick one. I often start with the opaque variant for non-trivial functions, as it is more reliable than the inline assembly, which is still a little bit in the testing stage.

Benchmarking it

Alright, let’s take this final function and measure the difference between return_fast and a simple return statement.

Block stdBuildDoor1(int x, int y, int z, Block blk, Block old)
{
 if (blk.getID() == old.getID()) {
  if (blk.getExtra() & 4)
   isys_for_test("door_open", 9, x, y, z);
  else
   isys_for_test("door_close", 10, x, y, z);
 }
 return blk;
}
Block stdBuildDoor2(int x, int y, int z, Block blk, Block old)
{
 if (blk.getID() == old.getID()) {
  if (blk.getExtra() & 4)
   isys_for_test("door_open", 9, x, y, z);
  else
   isys_for_test("door_close", 10, x, y, z);
 }
 return_fast(blk);
}

I created a dynamic call for testing that takes the same arguments as the sound-playing function, except it does nothing. In essence, we are producing the exact same assembly and will have the same overhead as before (with the exception of playing sounds). The changes did have a measurable impact:

> lowest 3ns  median: 3ns  highest: 3ns
[std] Measurement "Overhead" median: 3

> lowest 69ns  median: 72ns  highest: 76ns
[std] Measurement "BuildDoor" median: 72

> lowest 16ns  median: 16ns  highest: 18ns
[std] Measurement "BuildDoor1" median: 16

> lowest 15ns  median: 15ns  highest: 17ns
[std] Measurement "BuildDoor2" median: 15

“Overhead” here is the time cost of benchmarking an empty function, and BuildDoor is the original function before I slapped some optimizations onto it. So, with return_fast() (BuildDoor2) it is 15 - 3 = 12ns, and with a regular return (BuildDoor1) it’s 16 - 3 = 13ns.

I’ve attached the assembly of the inlined variants here:

0000000000406d68 <stdBuildDoor1>:
stdBuildDoor1():
  406d68:       00068793                mv      a5,a3
  406d6c:       0807483b                zext.h  a6,a4
  406d70:       00050313                mv      t1,a0
  406d74:       00058693                mv      a3,a1
  406d78:       00060893                mv      a7,a2
  406d7c:       0807c73b                zext.h  a4,a5
  406d80:       00e80663                beq     a6,a4,406d8c <stdBuildDoor1+0x24>
  406d84:       00078513                mv      a0,a5
  406d88:       00008067                ret
  406d8c:       02d79713                slli    a4,a5,0x2d
  406d90:       02075263                bgez    a4,406db4 <stdBuildDoor1+0x4c>
  406d94:       00428837                lui     a6,0x428
  406d98:       43880513                addi    a0,a6,1080 # 428438 <_ZTVN10__cxxabiv117__class_type_infoE+0x1248>
  406d9c:       00900593                li      a1,9
  406da0:       00030613                mv      a2,t1
  406da4:       00088713                mv      a4,a7
  406da8:       03d0005b                .4byte  0x3d0005b
  406dac:       00078513                mv      a0,a5
  406db0:       00008067                ret
  406db4:       00428837                lui     a6,0x428
  406db8:       44880513                addi    a0,a6,1096 # 428448 <_ZTVN10__cxxabiv117__class_type_infoE+0x1258>
  406dbc:       00a00593                li      a1,10
  406dc0:       00030613                mv      a2,t1
  406dc4:       00088713                mv      a4,a7
  406dc8:       03d0005b                .4byte  0x3d0005b
  406dcc:       00078513                mv      a0,a5
  406dd0:       00008067                ret

0000000000406dd4 <stdBuildDoor2>:
stdBuildDoor2():
  406dd4:       00068793                mv      a5,a3
  406dd8:       00060893                mv      a7,a2
  406ddc:       00050313                mv      t1,a0
  406de0:       00058693                mv      a3,a1
  406de4:       0807c63b                zext.h  a2,a5
  406de8:       0807483b                zext.h  a6,a4
  406dec:       01060663                beq     a2,a6,406df8 <stdBuildDoor2+0x24>
  406df0:       00078513                mv      a0,a5
  406df4:       7ff00073                .4byte  0x7ff00073 # return_fast()
  406df8:       02d79713                slli    a4,a5,0x2d
  406dfc:       02075063                bgez    a4,406e1c <stdBuildDoor2+0x48>
  406e00:       00428837                lui     a6,0x428
  406e04:       43880513                addi    a0,a6,1080 # 428438 <_ZTVN10__cxxabiv117__class_type_infoE+0x1248>
  406e08:       00900593                li      a1,9
  406e0c:       00030613                mv      a2,t1
  406e10:       00088713                mv      a4,a7
  406e14:       03d0005b                .4byte  0x3d0005b
  406e18:       fd9ff06f                j       406df0 <stdBuildDoor2+0x1c>
  406e1c:       00428837                lui     a6,0x428
  406e20:       44880513                addi    a0,a6,1096 # 428448 <_ZTVN10__cxxabiv117__class_type_infoE+0x1258>
  406e24:       00a00593                li      a1,10
  406e28:       00030613                mv      a2,t1
  406e2c:       00088713                mv      a4,a7
  406e30:       03d0005b                .4byte  0x3d0005b
  406e34:       fbdff06f                j       406df0 <stdBuildDoor2+0x1c>

The return_fast() variant unfortunately jumps around a bit. I’m guessing the compiler is not treating it exactly as a return. We observe only an 8% run-time reduction (13ns → 12ns). Oh well.

It gets way more interesting with an opaque function call in the mix. Turning the Sound::play() dynamic call into a regular function call:

0000000000406e58 <stdBuildDoor>:
stdBuildDoor():
  406e58:       ff010113                addi    sp,sp,-16
  406e5c:       00813023                sd      s0,0(sp)
  406e60:       00113423                sd      ra,8(sp)
  406e64:       080747bb                zext.h  a5,a4
  406e68:       00068413                mv      s0,a3
  406e6c:       0806c73b                zext.h  a4,a3
  406e70:       00e78c63                beq     a5,a4,406e88 <stdBuildDoor+0x30>
  406e74:       00813083                ld      ra,8(sp)
  406e78:       00040513                mv      a0,s0
  406e7c:       00013403                ld      s0,0(sp)
  406e80:       01010113                addi    sp,sp,16
  406e84:       00008067                ret
  406e88:       02d41793                slli    a5,s0,0x2d
  406e8c:       00060713                mv      a4,a2
  406e90:       00058693                mv      a3,a1
  406e94:       00050613                mv      a2,a0
  406e98:       0207d463                bgez    a5,406ec0 <stdBuildDoor+0x68>
  406e9c:       00428537                lui     a0,0x428
  406ea0:       00900593                li      a1,9
  406ea4:       55050513                addi    a0,a0,1360 # 428550 <_ZTVN10__cxxabiv117__class_type_infoE+0x1260>
  406ea8:       c79f90ef                jal     ra,400b20 <sys_sound_play>
  406eac:       00813083                ld      ra,8(sp)
  406eb0:       00040513                mv      a0,s0
  406eb4:       00013403                ld      s0,0(sp)
  406eb8:       01010113                addi    sp,sp,16
  406ebc:       00008067                ret
  406ec0:       00428537                lui     a0,0x428
  406ec4:       00a00593                li      a1,10
  406ec8:       56050513                addi    a0,a0,1376 # 428560 <_ZTVN10__cxxabiv117__class_type_infoE+0x1270>
  406ecc:       c55f90ef                jal     ra,400b20 <sys_sound_play>
  406ed0:       00813083                ld      ra,8(sp)
  406ed4:       00040513                mv      a0,s0
  406ed8:       00013403                ld      s0,0(sp)
  406edc:       01010113                addi    sp,sp,16
  406ee0:       00008067                ret

0000000000406ee4 <stdBuildDoor2>:
stdBuildDoor2():
  406ee4:       ff010113                addi    sp,sp,-16
  406ee8:       00813023                sd      s0,0(sp)
  406eec:       00113423                sd      ra,8(sp)
  406ef0:       00068413                mv      s0,a3
  406ef4:       080747bb                zext.h  a5,a4
  406ef8:       0806c6bb                zext.h  a3,a3
  406efc:       00f68663                beq     a3,a5,406f08 <stdBuildDoor2+0x24>
  406f00:       00040513                mv      a0,s0
  406f04:       7ff00073                .4byte  0x7ff00073
  406f08:       02d41793                slli    a5,s0,0x2d
  406f0c:       00060713                mv      a4,a2
  406f10:       00058693                mv      a3,a1
  406f14:       00050613                mv      a2,a0
  406f18:       0007dc63                bgez    a5,406f30 <stdBuildDoor2+0x4c>
  406f1c:       00428537                lui     a0,0x428
  406f20:       00900593                li      a1,9
  406f24:       55050513                addi    a0,a0,1360 # 428550 <_ZTVN10__cxxabiv117__class_type_infoE+0x1260>
  406f28:       bf9f90ef                jal     ra,400b20 <sys_sound_play>
  406f2c:       fd5ff06f                j       406f00 <stdBuildDoor2+0x1c>
  406f30:       00428537                lui     a0,0x428
  406f34:       00a00593                li      a1,10
  406f38:       56050513                addi    a0,a0,1376 # 428560 <_ZTVN10__cxxabiv117__class_type_infoE+0x1270>
  406f3c:       be5f90ef                jal     ra,400b20 <sys_sound_play>
  406f40:       fc1ff06f                j       406f00 <stdBuildDoor2+0x1c>

And I created a graph with inlined and opaque return/return_fast:

So, we only get a tiny 1ns (8%) reduction in run-time when everything is inlined, but a hefty 4ns (22%) run-time reduction when there are opaque function calls in the function. A lot of instructions are gone in the opaque return_fast() variant compared to the one with a regular return, and it seems to be mostly epilogue related.

Conclusion

Even though most of my C++ scripting gets inlined, as I have more or less gone that route now, I do think this feature is valuable as opaque dynamic calls are very reliable, and this feature did meaningfully improve the run-time of those functions. We can see that a lot of instructions are missing in the return_fast() variant with opaque function calls, including an epilogue. A 22% run-time reduction is nothing to sneeze at, and 8% for the fully inlined variant is still something. Just use it with caution.

It was still peanuts compared to converting const std::string& to std::string_view. String views are zero-copy in the host and can also be forwarded to standard library containers when heterogeneous lookups are enabled (available for unordered containers since C++20). Just from the numbers in this blog post, you can calculate the difference that the std::string_view change alone made: 72ns → 18ns is a brutal 75% run-time reduction.
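For anyone who wants to redo the arithmetic, here is a tiny sketch. The helper reduction_percent and the constant names are mine. The inputs are the medians from the benchmark output, with the 3ns overhead subtracted for the inlined pair.

```cpp
// Percentage of run-time saved when going from before_ns to after_ns.
// reduction_percent is a hypothetical helper, written only for this check.
constexpr double reduction_percent(double before_ns, double after_ns)
{
    return 100.0 * (before_ns - after_ns) / before_ns;
}

constexpr double OVERHEAD_NS = 3.0; // median of the "Overhead" measurement

// Inlined variants: BuildDoor1 (16ns) vs BuildDoor2 (15ns), minus overhead.
constexpr double INLINED_GAIN =
    reduction_percent(16.0 - OVERHEAD_NS, 15.0 - OVERHEAD_NS); // 13ns -> 12ns

// The std::string_view change: 72ns -> 18ns, as quoted in the text.
constexpr double STRING_VIEW_GAIN = reduction_percent(72.0, 18.0);

static_assert(INLINED_GAIN > 7.6 && INLINED_GAIN < 7.8, "roughly 8%");
static_assert(STRING_VIEW_GAIN == 75.0, "exactly 75%");
```

The 22% figure for the opaque variants fits the same formula if the 4ns saving is taken against a baseline of about 18ns, which is my assumption based on the 72ns → 18ns comparison.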

Still, we discovered that there are uses for the STOP instruction, particularly when more performance is needed, as it did improve every function, even those that are simple register operations. And we also saw that it optimizes codegen, leaving out the epilogue. So, perhaps a good guideline is to use it only for those functions that are called millions of times.

-gonzo

Written by fwsGonzo