Using C++ as a Scripting Language, part 11

The STOP instruction

fwsGonzo
13 min read · Dec 23, 2023

I’ve had a custom instruction that I simply call STOP in my RISC-V emulator for probably a few years now. However, I haven’t really written down how it works and what it can do until today.

I actually got curious myself. Although I do have an idea of what it does for me, I want to see if I can pin down exactly how it should be used, or if it should be used at all.

In order to reduce latency to the script, it’s important that most of it is reduced to register operations. That means making use of the highly developed GCC compiler suite: inlining, LTO, heavy usage of inline assembly, and code size reduction through linker garbage collection. Today I’ll write just a little bit about how being able to stop anywhere, on the dot, can also result in more performant code generation.

The STOP instruction

STOP is implemented as a custom RISC-V SYSTEM instruction:

void halt()
{
    asm (".insn i SYSTEM, 0, x0, x0, 0x7ff");
}

FYI: An inline assembly statement without output operands (such as one with no ::: part at all) is automatically assumed to be volatile.

The SYSTEM instruction does have a number of dedicated custom variants, where the highest bit is set. Unfortunately the .insn i directive does not allow you to set the highest bit, so I am going to ignore that and set all the bits that I can, landing us on 0x7ff. Bits 28–29 are the privilege level, ranging from user mode to machine level, making this technically a machine-level SYSTEM instruction.
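To make the encoding concrete, here is a small sketch that packs the fields of a RISC-V I-type instruction by hand. The helper names (encode_itype, privilege_bits) are made up for illustration and are not part of the emulator. The result matches the 0x7ff00073 word that shows up in the disassembly later in this post.

```cpp
#include <cstdint>

// RISC-V I-type layout, from high to low bits:
//   imm[11:0] (31..20) | rs1 (19..15) | funct3 (14..12) | rd (11..7) | opcode (6..0)
// encode_itype is a hypothetical helper, written only for this example.
constexpr std::uint32_t encode_itype(std::uint32_t opcode, std::uint32_t funct3,
                                     std::uint32_t rd, std::uint32_t rs1,
                                     std::uint32_t imm12)
{
    return ((imm12 & 0xFFFu) << 20) | ((rs1 & 0x1Fu) << 15)
         | ((funct3 & 0x7u) << 12) | ((rd & 0x1Fu) << 7) | (opcode & 0x7Fu);
}

// SYSTEM is the major opcode 0b1110011
constexpr std::uint32_t SYSTEM_OPCODE = 0x73;

// Same operands as ".insn i SYSTEM, 0, x0, x0, 0x7ff"
constexpr std::uint32_t STOP_WORD = encode_itype(SYSTEM_OPCODE, 0, 0, 0, 0x7ff);

// Extract bits 28-29 of an instruction word
constexpr std::uint32_t privilege_bits(std::uint32_t insn)
{
    return (insn >> 28) & 0x3u;
}

static_assert(STOP_WORD == 0x7ff00073u, "matches the word seen in the disassembly");
static_assert(privilege_bits(STOP_WORD) == 3u, "both privilege bits are set");
static_assert((STOP_WORD >> 31) == 0u, "the highest bit stays clear with 0x7ff");
```

With an immediate of 0x7ff, every immediate bit except the top one is set, which is why bits 28–29 both end up as ones.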

Why a SYSTEM instruction? It’s because the decoder will essentially look for blocks, and STOP will end up as a block-ending instruction that has certain properties: It can reveal the instruction counter (RDCYCLE) and since it ends a block, it can also terminate dispatch. In fact, it is so effective at stopping dispatch, that I am using it to return from functions in my script. The bytecode is really just this:

INSTRUCTION(RV32I_BC_STOP, rv32i_stop) {
    REGISTERS().pc = pc + 4;
    return true;
}

Only having to write the final PC value (the address just after the instruction) makes it quite fast.
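As a rough illustration of how a block-ending handler terminates dispatch, here is a toy interpreter loop. Everything in it (Op, Insn, ToyCpu, run) is a simplified stand-in invented for this example, not the emulator's real dispatch code. The idea it borrows is that a handler returns true when dispatch should end, and that the STOP handler writes nothing but the PC.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical toy bytecodes, not the emulator's real set
enum class Op : std::uint8_t { AddImm, Stop };

struct Insn {
    Op op;
    std::int64_t imm;
};

struct ToyCpu {
    std::uint64_t pc = 0;       // byte address, 4 bytes per instruction
    std::int64_t a0 = 0;        // stand-in for the a0 register
    std::uint64_t executed = 0; // number of instructions dispatched
};

// Each handler returns true when dispatch should end
inline bool execute(ToyCpu& cpu, const Insn& insn)
{
    switch (insn.op) {
    case Op::AddImm:
        cpu.a0 += insn.imm;
        cpu.pc += 4;
        return false;
    case Op::Stop:
        cpu.pc += 4; // the only write: PC ends just past STOP
        return true;
    }
    return true; // unknown bytecode: stop to be safe
}

// Dispatch until a handler asks to stop or the program runs out
inline void run(ToyCpu& cpu, const std::vector<Insn>& program)
{
    while (cpu.pc / 4 < program.size()) {
        const Insn& insn = program[cpu.pc / 4];
        ++cpu.executed;
        if (execute(cpu, insn))
            break;
    }
}

// Runs a small program where the last instruction sits behind a STOP
inline ToyCpu run_demo()
{
    const std::vector<Insn> program = {
        {Op::AddImm, 40}, {Op::AddImm, 2}, {Op::Stop, 0}, {Op::AddImm, 100},
    };
    ToyCpu cpu;
    run(cpu, program);
    return cpu;
}
```

In run_demo() the final AddImm is never reached, because the STOP before it ends dispatch.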

Root functions

A root function is a function that never gets called from anywhere else (from normal code, at least), and typically unwinding would stop at a root function. For example, a regular program usually begins at _start, directly jumps to and ends at some kind of internal libc_start, like so:

void libc_start(int argc, char **argv, char **envp, auxvec_t *auxvec)
{
    ...
    exit(main(argc, argv, envp));
    __builtin_unreachable();
}

_start:
    setup Linux-loader arguments to libc_start
    jmp libc_start

Pseudo-code, but you get the gist of it. Basically, since exit() is noreturn, and we add an unreachable there to make sure, libc_start will not need an epilogue. An epilogue is a bunch of instructions that wind down the function so that it can properly return to where it came from. However, since we are not returning anywhere, there is no point. The noreturn attribute, or an unreachable statement, allows the optimizer to skip the entire epilogue. There is talk about a root attribute, but it seems to have made no progress in GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92086

Making a VM call

I’ve written about VM calls before, many years ago, and probably have mentioned them quite a bit since then, but a primer is in order:

It is a function call into a VM guest program that begins and ends with that same function. From the host:

vm.vmcall("my_function");

From the guest:

void my_function() {
    printf("Hello World!\n");
}

The VM call sets everything up for a function call into the program, and also makes sure that when you return from the function, it stops the VM automatically and also forwards (or makes it possible to get) return values. It is, in essence, a regular function call, but into a sandbox.
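Here is a minimal sketch of what such a call could look like inside a toy machine. It assumes the host points the return address at an exit stub containing a STOP, which is one plausible way to make a returning function halt the VM. The ToyMachine class and its three bytecodes are invented for this example and are not the emulator's API.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy bytecodes: add to a0, return via ra, stop the machine
enum class Op : std::uint8_t { AddImm, Ret, Stop };

struct Insn {
    Op op;
    std::int64_t imm;
};

class ToyMachine {
public:
    ToyMachine(std::vector<Insn> code,
               std::map<std::string, std::uint64_t> symbols,
               std::uint64_t exit_address)
        : code_(std::move(code)), symbols_(std::move(symbols)),
          exit_address_(exit_address) {}

    // Call a guest function by name with one argument and get a0 back
    std::int64_t vmcall(const std::string& name, std::int64_t arg)
    {
        const auto it = symbols_.find(name);
        if (it == symbols_.end())
            throw std::runtime_error("unknown guest function: " + name);
        a0_ = arg;           // first argument register
        ra_ = exit_address_; // returning lands on the STOP stub
        pc_ = it->second;
        stopped_ = false;
        while (!stopped_)
            step();
        return a0_;          // forward the return value
    }

private:
    void step()
    {
        const Insn& insn = code_.at(pc_ / 4); // throws if PC leaves the program
        switch (insn.op) {
        case Op::AddImm: a0_ += insn.imm; pc_ += 4; break;
        case Op::Ret:    pc_ = ra_; break;
        case Op::Stop:   pc_ += 4; stopped_ = true; break;
        }
    }

    std::vector<Insn> code_;
    std::map<std::string, std::uint64_t> symbols_;
    std::uint64_t exit_address_;
    std::uint64_t pc_ = 0;
    std::uint64_t ra_ = 0;
    std::int64_t a0_ = 0;
    bool stopped_ = true;
};

// Builds a machine with an exit stub at 0x0 and a guest function at 0x4
inline ToyMachine make_demo_machine()
{
    std::vector<Insn> code = {
        {Op::Stop, 0},   // 0x0: exit stub
        {Op::AddImm, 7}, // 0x4: add_seven
        {Op::Ret, 0},    // 0x8: return to ra
    };
    std::map<std::string, std::uint64_t> symbols;
    symbols["add_seven"] = 4;
    return ToyMachine(std::move(code), std::move(symbols), 0);
}
```

Calling make_demo_machine().vmcall("add_seven", 35) runs the guest function, returns into the exit stub, stops, and hands back the value left in a0.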

Returning fast

When making low latency calls into a VM guest, we will essentially call a function which does not know it is a root function, and which must also be able to return. It is almost the same as the example with libc, in that we don’t want to save callee-saved registers or see any epilogues generated. So, can we do something about this?

Let’s bring in the STOP instruction, like so:

#include <type_traits>

// Return from a VM call without a return value
inline __attribute__((noreturn)) void return_fast()
{
    asm volatile(".insn i SYSTEM, 0, x0, x0, 0x7ff");
    __builtin_unreachable();
}
// Return from a VM call with a register-sized value
template <typename T>
inline __attribute__((noreturn)) void return_fast(T t)
{
    static_assert(std::is_standard_layout_v<T> && std::is_trivial_v<T> && sizeof(T) <= sizeof(__UINTPTR_TYPE__),
        "Return value must be trivial and fit in a register");
    // Pin the value to a0, the first return-value register
    register T a0 asm("a0") = t;
    asm volatile(".insn i SYSTEM, 0, x0, x0, 0x7ff" :: "r"(a0));
    __builtin_unreachable();
}

With return_fast() and return_fast<T>(T t) we can return from a function as we normally would, using those functions in place of a regular return statement. If the compiler can see that we always end up invoking a noreturn function, then it will omit the function teardown, and perhaps not save any callee-saved registers, saving a bunch of instructions. However, since this is not a visible return, the compiler will not run destructors for local objects, so be aware of that.

Let’s start with a function that isn’t slow by any means, stdBuildDoor. What it does is all the things needed after a door has been built: play the right sounds and return any changes to the door, depending on what you need.
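To see which types would pass the static_assert in return_fast<T>, here is a small compile-time sketch. The fits_return_fast trait and the ToyBlock and TooBig structs are hypothetical names made up for this example. ToyBlock is only a guess at something Block-like, not the real Block type.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Mirrors the condition used in return_fast<T>: trivial, standard layout,
// and no larger than a pointer-sized register.
template <typename T>
constexpr bool fits_return_fast =
    std::is_standard_layout_v<T> && std::is_trivial_v<T>
    && sizeof(T) <= sizeof(std::uintptr_t);

// Hypothetical stand-in for a small block value (an ID plus some extra bits)
struct ToyBlock {
    std::uint16_t id;
    std::uint16_t extra;
};

// Two 64-bit fields cannot fit in a single register
struct TooBig {
    std::uint64_t a;
    std::uint64_t b;
};

static_assert(fits_return_fast<int>, "plain integers are fine");
static_assert(fits_return_fast<ToyBlock>, "small trivial structs are fine");
static_assert(!fits_return_fast<std::string>, "not trivial: it has a destructor");
static_assert(!fits_return_fast<TooBig>, "too large for one register");
```

The std::string case is a reminder of the destructor caveat: anything that needs cleanup is a poor fit for a return path the compiler cannot see.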

Block stdBuildDoor(int x, int y, int z, Block blk, Block old)
{
    if (blk.getID() == old.getID()) {
        if (blk.getExtra() & 4)
            Sound::play("door_open", x, y, z);
        else
            Sound::play("door_close", x, y, z);
    }
    return blk;
}

I haven’t really made the doors very fancy, I admit. They change form when you activate them, which triggers this callback you see above, where it’s the same block being built on the same block, but with different extra bits. In short, it’s a callback that sees the door was modified and plays a sound.

Deceptively short in code, but horribly long in machine instructions, and as you will see it has a prologue and epilogue ending with a return instruction:

0000000000407700 <stdBuildDoor>:
stdBuildDoor():
407700: f7010113 addi sp,sp,-144
407704: 08813023 sd s0,128(sp)
407708: 08113423 sd ra,136(sp)
40770c: 00068413 mv s0,a3
407710: 0807473b zext.h a4,a4
407714: 0806c7bb zext.h a5,a3
407718: 00f70c63 beq a4,a5,407730 <stdBuildDoor+0x30>
40771c: 08813083 ld ra,136(sp)
407720: 00040513 mv a0,s0
407724: 08013403 ld s0,128(sp)
407728: 09010113 addi sp,sp,144
40772c: 00008067 ret
407730: 06913c23 sd s1,120(sp)
407734: 07213823 sd s2,112(sp)
407738: 07313423 sd s3,104(sp)
40773c: 07413023 sd s4,96(sp)
407740: 05513c23 sd s5,88(sp)
407744: 05613823 sd s6,80(sp)
407748: 02d41793 slli a5,s0,0x2d
40774c: 00050913 mv s2,a0
407750: 00058493 mv s1,a1
407754: 00060693 mv a3,a2
407758: 0807ce63 bltz a5,4077f4 <stdBuildDoor+0xf4>
40775c: 00429ab7 lui s5,0x429
407760: 04010a13 addi s4,sp,64
407764: 03413823 sd s4,48(sp)
407768: 010a8513 addi a0,s5,16 # 429010 <_ZTVN10__cxxabiv117__class_type_infoE+0x1420>
40776c: 24400893 li a7,580
407770: 010a8b13 addi s6,s5,16
407774: 00000073 ecall
407778: 00f00793 li a5,15
40777c: 00050993 mv s3,a0
407780: 10a7ee63 bltu a5,a0,40789c <stdBuildDoor+0x19c>
407784: 00100793 li a5,1
407788: 000a0513 mv a0,s4
40778c: 0cf99c63 bne s3,a5,407864 <stdBuildDoor+0x164>
407790: 06400793 li a5,100
407794: 04f10023 sb a5,64(sp)
407798: 01350533 add a0,a0,s3
40779c: 03313c23 sd s3,56(sp)
4077a0: 00050023 sb zero,0(a0)
4077a4: 03013503 ld a0,48(sp)
4077a8: 00048613 mv a2,s1
4077ac: 00090593 mv a1,s2
4077b0: b70f90ef jal ra,400b20 <sys_sound_play>
4077b4: 03013503 ld a0,48(sp)
4077b8: 01450863 beq a0,s4,4077c8 <stdBuildDoor+0xc8>
4077bc: 04013583 ld a1,64(sp)
4077c0: 00158593 addi a1,a1,1
4077c4: 7fd010ef jal ra,4097c0 <_ZdlPvm>
4077c8: 08813083 ld ra,136(sp)
4077cc: 00040513 mv a0,s0
4077d0: 08013403 ld s0,128(sp)
4077d4: 07813483 ld s1,120(sp)
4077d8: 07013903 ld s2,112(sp)
4077dc: 06813983 ld s3,104(sp)
4077e0: 06013a03 ld s4,96(sp)
4077e4: 05813a83 ld s5,88(sp)
4077e8: 05013b03 ld s6,80(sp)
4077ec: 09010113 addi sp,sp,144
4077f0: 00008067 ret
4077f4: 00429ab7 lui s5,0x429
4077f8: 02010a13 addi s4,sp,32
4077fc: 01413823 sd s4,16(sp)
407800: 000a8513 mv a0,s5
407804: 24400893 li a7,580
407808: 000a8b13 mv s6,s5
40780c: 00000073 ecall
407810: 00f00793 li a5,15
407814: 00050993 mv s3,a0
407818: 0aa7e463 bltu a5,a0,4078c0 <stdBuildDoor+0x1c0>
40781c: 00100793 li a5,1
407820: 000a0513 mv a0,s4
407824: 04f99e63 bne s3,a5,407880 <stdBuildDoor+0x180>
407828: 06400793 li a5,100
40782c: 02f10023 sb a5,32(sp)
407830: 01350533 add a0,a0,s3
407834: 01313c23 sd s3,24(sp)
407838: 00050023 sb zero,0(a0)
40783c: 01013503 ld a0,16(sp)
407840: 00048613 mv a2,s1
407844: 00090593 mv a1,s2
407848: ad8f90ef jal ra,400b20 <sys_sound_play>
40784c: 01013503 ld a0,16(sp)
407850: f7450ce3 beq a0,s4,4077c8 <stdBuildDoor+0xc8>
407854: 02013583 ld a1,32(sp)
407858: 00158593 addi a1,a1,1
40785c: 735010ef jal ra,409790 <_ZdlPvm>
407860: f69ff06f j 4077c8 <stdBuildDoor+0xc8>
407864: f2098ae3 beqz s3,407798 <stdBuildDoor+0x98>
407868: 010a8593 addi a1,s5,16 # 429010 <_ZTVN10__cxxabiv117__class_type_infoE+0x1420>
40786c: 00098613 mv a2,s3
407870: 23f00893 li a7,575
407874: 00000073 ecall
407878: 03013503 ld a0,48(sp)
40787c: f1dff06f j 407798 <stdBuildDoor+0x98>
407880: fa0988e3 beqz s3,407830 <stdBuildDoor+0x130>
407884: 000a8593 mv a1,s5
407888: 00098613 mv a2,s3
40788c: 23f00893 li a7,575
407890: 00000073 ecall
407894: 01013503 ld a0,16(sp)
407898: f99ff06f j 407830 <stdBuildDoor+0x130>
40789c: 00c13423 sd a2,8(sp)
4078a0: 04054463 bltz a0,4078e8 <stdBuildDoor+0x1e8>
4078a4: 00150513 addi a0,a0,1
4078a8: 02054e63 bltz a0,4078e4 <stdBuildDoor+0x1e4>
4078ac: 284030ef jal ra,40ab30 <_Znwm>
4078b0: 00813683 ld a3,8(sp)
4078b4: 02a13823 sd a0,48(sp)
4078b8: 05313023 sd s3,64(sp)
4078bc: fadff06f j 407868 <stdBuildDoor+0x168>
4078c0: 02054463 bltz a0,4078e8 <stdBuildDoor+0x1e8>
4078c4: 00150513 addi a0,a0,1
4078c8: 00054e63 bltz a0,4078e4 <stdBuildDoor+0x1e4>
4078cc: 00c13423 sd a2,8(sp)
4078d0: 260030ef jal ra,40ab30 <_Znwm>
4078d4: 00813683 ld a3,8(sp)
4078d8: 00a13823 sd a0,16(sp)
4078dc: 03313023 sd s3,32(sp)
4078e0: fa5ff06f j 407884 <stdBuildDoor+0x184>
4078e4: 8ddf80ef jal ra,4001c0 <_ZSt17__throw_bad_allocv>
4078e8: 00428537 lui a0,0x428
4078ec: 0b050513 addi a0,a0,176 # 4280b0 <_ZTVN10__cxxabiv117__class_type_infoE+0x4c0>
4078f0: a8df80ef jal ra,40037c <_ZSt20__throw_length_errorPKc>
4078f4: 00050413 mv s0,a0
4078f8: 01010513 addi a0,sp,16
4078fc: bd9fc0ef jal ra,4044d4 <_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE10_M_disposeEv>
407900: 00040513 mv a0,s0
407904: 1c1130ef jal ra,41b2c4 <_Unwind_Resume>
407908: 00050413 mv s0,a0
40790c: 03010513 addi a0,sp,48
407910: bc5fc0ef jal ra,4044d4 <_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE10_M_disposeEv>
407914: 00040513 mv a0,s0
407918: 1ad130ef jal ra,41b2c4 <_Unwind_Resume>

It has a prologue and an epilogue (144 stack bytes), exception throwing code (std::bad_alloc), a regular return statement in the middle there. And so on. I will say though, that it is not necessarily slow. I measured the function to cost ~72ns when I benchmarked it. This is purely a premature optimization I’m doing so that I can use and benchmark the return_fast() function against a real script function. This is not like a previous blog post where I was optimizing a guest function that was called an obscene amount of times.

As I alluded to earlier, we cannot use return_fast() in this function yet, as there are temporary strings with construction and destruction code. _ZdlPvm is the sized delete operator, for example. So, just for this I will change the std::string to a std::string_view in order to avoid allocations.
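A quick sketch of why the string view helps: it is just a pointer and a length, and for a string literal both can be known at compile time, with no owning buffer to construct or destroy. SoundCall and describe are hypothetical names for this example, and they only record what a host call would receive.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical record of what a sound call would be handed: pointer + length
struct SoundCall {
    const char* data;
    std::size_t length;
};

// Taking a std::string_view copies two words and allocates nothing
constexpr SoundCall describe(std::string_view name)
{
    return SoundCall{name.data(), name.size()};
}

constexpr std::string_view door_open = "door_open";
constexpr std::string_view door_close = "door_close";

// Lengths are compile-time constants
static_assert(door_open.size() == 9, "length known at compile time");
static_assert(door_close.size() == 10, "length known at compile time");
static_assert(describe(door_open).length == 9, "the view is forwarded as-is");
```

Since there is nothing to destruct, a function that only handles string views has no cleanup code for return_fast() to skip.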

Fully inlined, noreturn variant

I made it use return_fast(), I stopped using const std::string& and instead used std::string_view, and finally I made Sound::play use inline assembly:

0000000000406de4 <stdBuildDoor>:
stdBuildDoor():
406de4: 00068793 mv a5,a3
406de8: 00060893 mv a7,a2
406dec: 00050313 mv t1,a0
406df0: 00058693 mv a3,a1
406df4: 0807c63b zext.h a2,a5
406df8: 0807483b zext.h a6,a4
406dfc: 01060663 beq a2,a6,406e08 <stdBuildDoor+0x24>
406e00: 00078513 mv a0,a5
406e04: 7ff00073 .4byte 0x7ff00073 # return_fast()
406e08: 02d79713 slli a4,a5,0x2d
406e0c: 02075063 bgez a4,406e2c <stdBuildDoor+0x48>
406e10: 00428837 lui a6,0x428
406e14: 4d080513 addi a0,a6,1232 # 4284d0 <_ZTVN10__cxxabiv117__class_type_infoE+0x13e0>
406e18: 00900593 li a1,9
406e1c: 00030613 mv a2,t1
406e20: 00088713 mv a4,a7
406e24: 0040005b .4byte 0x40005b # sys_sound_play()
406e28: fd9ff06f j 406e00 <stdBuildDoor+0x1c>
406e2c: 00428837 lui a6,0x428
406e30: 4e080513 addi a0,a6,1248 # 4284e0 <_ZTVN10__cxxabiv117__class_type_infoE+0x13f0>
406e34: 00a00593 li a1,10
406e38: 00030613 mv a2,t1
406e3c: 00088713 mv a4,a7
406e40: 0040005b .4byte 0x40005b # sys_sound_play()
406e44: fbdff06f j 406e00 <stdBuildDoor+0x1c>

So, it went from 135 instructions down to 25, less than a fifth of its original size, and the function is just as safe to use and works exactly the same way. All the changes I made took about a minute of tinkering, because the inline assembly for Sound::play is already generated at build time. Every dynamic call has an opaque and an inline variant, and I just have to pick one. I often start with the opaque variant for non-trivial functions, as it is more reliable than the inline assembly, which is still a little bit in the testing stage.

Benchmarking it

Alright, let’s take this final function and measure the difference between return_fast and a simple return statement.

Block stdBuildDoor1(int x, int y, int z, Block blk, Block old)
{
    if (blk.getID() == old.getID()) {
        if (blk.getExtra() & 4)
            isys_for_test("door_open", 9, x, y, z);
        else
            isys_for_test("door_close", 10, x, y, z);
    }
    return blk;
}
Block stdBuildDoor2(int x, int y, int z, Block blk, Block old)
{
    if (blk.getID() == old.getID()) {
        if (blk.getExtra() & 4)
            isys_for_test("door_open", 9, x, y, z);
        else
            isys_for_test("door_close", 10, x, y, z);
    }
    return_fast(blk);
}

I created a dynamic call for testing that takes the same arguments as the sound playing function, except it does nothing. In essence, we are producing the exact same assembly, and will have the same overhead as before (with the exception of playing sounds), and the changes did have a measurable impact:

> lowest 3ns  median: 3ns  highest: 3ns
[std] Measurement "Overhead" median: 3

> lowest 69ns median: 72ns highest: 76ns
[std] Measurement "BuildDoor" median: 72

> lowest 16ns median: 16ns highest: 18ns
[std] Measurement "BuildDoor1" median: 16

> lowest 15ns median: 15ns highest: 17ns
[std] Measurement "BuildDoor2" median: 15

“Overhead” here is the time cost of benchmarking an empty function, and BuildDoor is the original function before I slapped some optimizations onto it. So, with return_fast() (BuildDoor2) it is 15–3 = 12ns and with a regular return (BuildDoor1) it’s 16–3 = 13ns.
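The arithmetic is simple enough to write down as a sketch. The helper names below (median_ns, net_cost_ns, reduction_percent) are invented for this example and are not taken from the actual benchmark code. The numbers are the medians printed above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of a non-empty set of samples (upper median for even counts)
inline long median_ns(std::vector<long> samples)
{
    const auto mid = samples.begin()
        + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}

// Subtract the cost of calling an empty function
constexpr long net_cost_ns(long measured, long overhead)
{
    return measured - overhead;
}

// Relative run-time reduction, in percent
constexpr double reduction_percent(long before, long after)
{
    return 100.0 * static_cast<double>(before - after)
                 / static_cast<double>(before);
}

constexpr long overhead_ns = 3;
constexpr long build_door1_ns = 16; // regular return
constexpr long build_door2_ns = 15; // return_fast()

static_assert(net_cost_ns(build_door1_ns, overhead_ns) == 13, "regular return");
static_assert(net_cost_ns(build_door2_ns, overhead_ns) == 12, "return_fast()");
```

Going from 13ns to 12ns works out to roughly a 7.7% reduction, which rounds to 8%.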

I’ve attached the assembly of the inlined variants here:

0000000000406d68 <stdBuildDoor1>:
stdBuildDoor1():
406d68: 00068793 mv a5,a3
406d6c: 0807483b zext.h a6,a4
406d70: 00050313 mv t1,a0
406d74: 00058693 mv a3,a1
406d78: 00060893 mv a7,a2
406d7c: 0807c73b zext.h a4,a5
406d80: 00e80663 beq a6,a4,406d8c <stdBuildDoor1+0x24>
406d84: 00078513 mv a0,a5
406d88: 00008067 ret
406d8c: 02d79713 slli a4,a5,0x2d
406d90: 02075263 bgez a4,406db4 <stdBuildDoor1+0x4c>
406d94: 00428837 lui a6,0x428
406d98: 43880513 addi a0,a6,1080 # 428438 <_ZTVN10__cxxabiv117__class_type_infoE+0x1248>
406d9c: 00900593 li a1,9
406da0: 00030613 mv a2,t1
406da4: 00088713 mv a4,a7
406da8: 03d0005b .4byte 0x3d0005b
406dac: 00078513 mv a0,a5
406db0: 00008067 ret
406db4: 00428837 lui a6,0x428
406db8: 44880513 addi a0,a6,1096 # 428448 <_ZTVN10__cxxabiv117__class_type_infoE+0x1258>
406dbc: 00a00593 li a1,10
406dc0: 00030613 mv a2,t1
406dc4: 00088713 mv a4,a7
406dc8: 03d0005b .4byte 0x3d0005b
406dcc: 00078513 mv a0,a5
406dd0: 00008067 ret

0000000000406dd4 <stdBuildDoor2>:
stdBuildDoor2():
406dd4: 00068793 mv a5,a3
406dd8: 00060893 mv a7,a2
406ddc: 00050313 mv t1,a0
406de0: 00058693 mv a3,a1
406de4: 0807c63b zext.h a2,a5
406de8: 0807483b zext.h a6,a4
406dec: 01060663 beq a2,a6,406df8 <stdBuildDoor2+0x24>
406df0: 00078513 mv a0,a5
406df4: 7ff00073 .4byte 0x7ff00073 # return_fast()
406df8: 02d79713 slli a4,a5,0x2d
406dfc: 02075063 bgez a4,406e1c <stdBuildDoor2+0x48>
406e00: 00428837 lui a6,0x428
406e04: 43880513 addi a0,a6,1080 # 428438 <_ZTVN10__cxxabiv117__class_type_infoE+0x1248>
406e08: 00900593 li a1,9
406e0c: 00030613 mv a2,t1
406e10: 00088713 mv a4,a7
406e14: 03d0005b .4byte 0x3d0005b
406e18: fd9ff06f j 406df0 <stdBuildDoor2+0x1c>
406e1c: 00428837 lui a6,0x428
406e20: 44880513 addi a0,a6,1096 # 428448 <_ZTVN10__cxxabiv117__class_type_infoE+0x1258>
406e24: 00a00593 li a1,10
406e28: 00030613 mv a2,t1
406e2c: 00088713 mv a4,a7
406e30: 03d0005b .4byte 0x3d0005b
406e34: fbdff06f j 406df0 <stdBuildDoor2+0x1c>

The return_fast() variant unfortunately jumps around a bit. I’m guessing the compiler is not treating it exactly as a return. We observe only an 8% run-time reduction. Oh well.

It gets way more interesting with an opaque function call in the mix. Turning the Sound::play() dynamic call into a regular function call:

0000000000406e58 <stdBuildDoor>:
stdBuildDoor():
406e58: ff010113 addi sp,sp,-16
406e5c: 00813023 sd s0,0(sp)
406e60: 00113423 sd ra,8(sp)
406e64: 080747bb zext.h a5,a4
406e68: 00068413 mv s0,a3
406e6c: 0806c73b zext.h a4,a3
406e70: 00e78c63 beq a5,a4,406e88 <stdBuildDoor+0x30>
406e74: 00813083 ld ra,8(sp)
406e78: 00040513 mv a0,s0
406e7c: 00013403 ld s0,0(sp)
406e80: 01010113 addi sp,sp,16
406e84: 00008067 ret
406e88: 02d41793 slli a5,s0,0x2d
406e8c: 00060713 mv a4,a2
406e90: 00058693 mv a3,a1
406e94: 00050613 mv a2,a0
406e98: 0207d463 bgez a5,406ec0 <stdBuildDoor+0x68>
406e9c: 00428537 lui a0,0x428
406ea0: 00900593 li a1,9
406ea4: 55050513 addi a0,a0,1360 # 428550 <_ZTVN10__cxxabiv117__class_type_infoE+0x1260>
406ea8: c79f90ef jal ra,400b20 <sys_sound_play>
406eac: 00813083 ld ra,8(sp)
406eb0: 00040513 mv a0,s0
406eb4: 00013403 ld s0,0(sp)
406eb8: 01010113 addi sp,sp,16
406ebc: 00008067 ret
406ec0: 00428537 lui a0,0x428
406ec4: 00a00593 li a1,10
406ec8: 56050513 addi a0,a0,1376 # 428560 <_ZTVN10__cxxabiv117__class_type_infoE+0x1270>
406ecc: c55f90ef jal ra,400b20 <sys_sound_play>
406ed0: 00813083 ld ra,8(sp)
406ed4: 00040513 mv a0,s0
406ed8: 00013403 ld s0,0(sp)
406edc: 01010113 addi sp,sp,16
406ee0: 00008067 ret

0000000000406ee4 <stdBuildDoor2>:
stdBuildDoor2():
406ee4: ff010113 addi sp,sp,-16
406ee8: 00813023 sd s0,0(sp)
406eec: 00113423 sd ra,8(sp)
406ef0: 00068413 mv s0,a3
406ef4: 080747bb zext.h a5,a4
406ef8: 0806c6bb zext.h a3,a3
406efc: 00f68663 beq a3,a5,406f08 <stdBuildDoor2+0x24>
406f00: 00040513 mv a0,s0
406f04: 7ff00073 .4byte 0x7ff00073
406f08: 02d41793 slli a5,s0,0x2d
406f0c: 00060713 mv a4,a2
406f10: 00058693 mv a3,a1
406f14: 00050613 mv a2,a0
406f18: 0007dc63 bgez a5,406f30 <stdBuildDoor2+0x4c>
406f1c: 00428537 lui a0,0x428
406f20: 00900593 li a1,9
406f24: 55050513 addi a0,a0,1360 # 428550 <_ZTVN10__cxxabiv117__class_type_infoE+0x1260>
406f28: bf9f90ef jal ra,400b20 <sys_sound_play>
406f2c: fd5ff06f j 406f00 <stdBuildDoor2+0x1c>
406f30: 00428537 lui a0,0x428
406f34: 00a00593 li a1,10
406f38: 56050513 addi a0,a0,1376 # 428560 <_ZTVN10__cxxabiv117__class_type_infoE+0x1270>
406f3c: be5f90ef jal ra,400b20 <sys_sound_play>
406f40: fc1ff06f j 406f00 <stdBuildDoor2+0x1c>

And I created a graph with inlined and opaque return/return_fast:

So, we only get a tiny 1ns (8%) reduction in run-time when everything is inlined, but a hefty 4ns (22%) run-time reduction when there are opaque function calls in the function. A lot of instructions are gone in the inlined variant compared to the opaque, and it seems to be mostly epilogue related.

Conclusion

Even though most of my C++ scripting gets inlined, as I have more or less gone that route now, I do think this feature is valuable as opaque dynamic calls are very reliable, and this feature did meaningfully improve the run-time of those functions. We can see that a lot of instructions are missing in the return_fast() variant with opaque function calls, including an epilogue. A 22% run-time reduction is nothing to sneeze at, and 8% for the fully inlined variant is still something. Just use it with caution.

It was still peanuts compared to converting const std::string& to std::string_view. String views are zero-copy in the host and can also be forwarded to standard library containers when heterogeneous lookup is enabled (supported by the unordered containers since C++20). Just from the numbers in this blog post, you can calculate the difference that the std::string_view change alone made: 72ns → 18ns is a brutal 75% run-time reduction.
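For the heterogeneous lookup part, here is a minimal sketch. It uses an ordered std::map with the transparent comparator std::less<>, which as far as I know has worked since C++14. The unordered containers need a transparent hash and C++20 for the same thing. SoundTable and find_sound are hypothetical names, not the host's real lookup code.

```cpp
#include <functional>
#include <map>
#include <string>
#include <string_view>

// Hypothetical sound registry keyed by name. std::less<> is a transparent
// comparator, so find() accepts anything comparable with std::string.
using SoundTable = std::map<std::string, int, std::less<>>;

// Looks up a sound ID without constructing a temporary std::string
inline int find_sound(const SoundTable& table, std::string_view name)
{
    const auto it = table.find(name);
    return it != table.end() ? it->second : -1;
}

// Small table with the two door sounds from this post
inline SoundTable make_table()
{
    SoundTable table;
    table.emplace("door_open", 1);
    table.emplace("door_close", 2);
    return table;
}
```

A view coming straight from the guest can then be used as the lookup key without a copy.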

Still, we discovered that there are uses for the STOP instruction, particularly when more performance is needed, as it did improve every function, even those that are simple register operations. And we also saw that it optimizes codegen, leaving out the epilogue. So, perhaps a good guideline is to use it only for those functions that are called millions of times.

-gonzo
