libriscv: Basic vectorization with V-extension

fwsGonzo
6 min read · Aug 17, 2022


Using a tiny portion of the Vector instruction set extension to see what kind of performance improvements we can expect (in an emulator)

In my last post on this subject I was using multiprocessing to improve dot-product calculation times in a game engine scripting environment.

With the RISC-V Vector Extension now standardized, and with binutils able to assemble the instructions, I could implement parts of the spec in my emulator to see what kind of speedups to expect.

The old results

In the previous post I had the results shown in the graph there. Since then I have improved the performance of the emulator again, and the baseline multi-processing performance is now 37 microseconds, compared to something like 70-ish before. That is, without binary translation. So, for the new graph we should forget the old one and instead compare with and without vector instructions and multi-processing, while being mindful of multi-threading overhead.

Inline assembly

The current version of GCC (trunk) does not have RVV support (although binutils does), which actually made things easier for me: it won’t generate random RVV instructions all over the place and force me to implement half the spec. It was a bit painful not being able to use the vector registers at all in extended inline assembly, but for these benchmarks I could just hand-write everything. The spec is huge and complicated, so I was very happy to just write the instructions myself. I avoided the vector configuration instructions, and instead I rely on my known defaults. I ended up using v0 as all zeroes, although it is just a regular vector register.

union alignas(32) v256 {
    float f[8];

    // Clear the v1 accumulator register
    __attribute__((naked))
    static inline void zero_v1()
    {
        asm("vxor.vv v1, v1, v1");
        asm("ret");
    }

    // Reduce v1 to a single float and return it in fa0
    __attribute__((naked))
    inline float sum_v1() {
        asm("vfredusum.vs v1, v0, v1");
        asm("vse32.v v1, %1"
            : "=m"(this->f[0])
            : "m"(this->f[0]));
        asm("flw fa0, %0" : : "m"(f[0]) : "fa0");
        asm("ret");
    }
};

It’s a simple union that holds 8x 32-bit floats as a 256-bit vector, aligned to 32 bytes. In order to perform a dot-product I load the A and B arrays into v2 and v3 using vle32.v in the tight loop shown below, then multiply and add with vfmul.vv and vfadd.vv. I did see a dot-product instruction in earlier drafts of the spec, but I couldn’t find anything like it anymore, so I am just assuming that these ops are fusable on desktop processors. I’m not the steadiest hand when it comes to inline assembly, so please excuse any mistakes.

The instructions I am using are the basic floating-point vector multiply and add. vfmul.vv is a vector-vector multiplication with another vector as the destination. If it were vfmul.vf instead, it would be a vector-scalar operation, multiplying every element by a single value.

v256::zero_v1();
// Vectorized dot product, 16 elements per iteration
for (size_t i = start; i < end; i += 16) {
    v256 *a = (v256 *)&work.data_a[i];
    v256 *b = (v256 *)&work.data_b[i];
    v256 *c = (v256 *)&work.data_a[i + 8];
    v256 *d = (v256 *)&work.data_b[i + 8];
    // First 8 elements: v1 += a * b
    asm("vle32.v v2, %1"
        :
        : "r"(a->f), "m"(a->f[0]));
    asm("vle32.v v3, %1"
        :
        : "r"(b->f), "m"(b->f[0]));
    asm("vfmul.vv v2, v2, v3");
    asm("vfadd.vv v1, v1, v2");
    // Next 8 elements: v1 += c * d
    asm("vle32.v v2, %1"
        :
        : "r"(c->f), "m"(c->f[0]));
    asm("vle32.v v3, %1"
        :
        : "r"(d->f), "m"(d->f[0]));
    asm("vfmul.vv v2, v2, v3");
    asm("vfadd.vv v1, v1, v2");
}
// Sum the elements of v1 into a scalar
v256 vsum;
const float sum = vsum.sum_v1();

The last line probably looks funky. sum_v1() performs the summing with the vfredusum.vs instruction, which sums every element in v0 and v1 and stores the result in v1[0]. It then stores v1 into vsum and extracts the first element into fa0 (the first floating-point return register), so that I can safely retrieve it without much ado. As you may remember, I chose to use v0 as zeroes, and that allows me to avoid some extra setup.
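For reference, the reduction as described here boils down to something like this scalar sketch (eight fp32 lanes assumed; this mirrors the behaviour described above, not the emulator’s actual source):

// Sketch of the reduction described above: sum every element of the two
// source registers, and the result ends up in the first element of the
// destination register.
static float reduce_sum_sketch(const float v0[8], const float v1[8])
{
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += v0[i] + v1[i]; // v0 is all zeroes here, so this is just the sum of v1
    return sum;               // written into v1[0]
}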

I haven’t quite understood what the sanctioned set1ps-style idiom of RVV is yet, i.e. the blessed way to fill or zero a vector register. Perhaps it’s the 32-bit integer VXOR.VV? I ended up not using a naked assembly function for the main loop, so that I could unroll it a little bit, and I implemented VXOR.VV at the same time, using it to zero v1.

The self-tests in the program report that the dot-product of 4096 fp32 values all initialized to 1.0 is 4096.0. Nice!
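For completeness, the check is essentially this (a hypothetical stand-in for the real self-test in the benchmark program):

#include <cassert>
#include <vector>

// Hypothetical sanity check mirroring the self-test described above:
// 4096 fp32 values, all 1.0, should dot-product to exactly 4096.0.
static void selftest_dot_product()
{
    std::vector<float> a(4096, 1.0f), b(4096, 1.0f);
    float sum = 0.0f;
    for (size_t i = 0; i < a.size(); i++)
        sum += a[i] * b[i];
    assert(sum == 4096.0f);
}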

Internal implementation

Internally I have implemented the vector registers as 32-byte aligned unions whose elements can be accessed as any supported type. The instructions that operate element-wise do so in a plain for loop, for maximum portability. It is most likely being auto-vectorized by the host compiler; I have made every effort to make that happen.
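Roughly, the idea looks like this (a sketch of the approach with illustrative names, not libriscv’s actual definitions):

#include <cstdint>

// Sketch: one vector register as a 32-byte aligned union, viewable as
// any element type. Names are illustrative.
union alignas(32) VectorLane {
    uint8_t  u8[32];
    uint16_t u16[16];
    uint32_t u32[8];
    uint64_t u64[4];
    float    f32[8];
    double   f64[4];
};

// Element-wise instructions are plain for-loops that the host compiler
// is free to auto-vectorize. Example: a vfadd.vv over fp32 elements.
static void vfadd_vv_f32(VectorLane& vd, const VectorLane& vs1, const VectorLane& vs2)
{
    for (int i = 0; i < 8; i++)
        vd.f32[i] = vs1.f32[i] + vs2.f32[i];
}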

Due to the size of the registers I have put them behind a unique_ptr, and I am considering lazily allocating them on first use. For now, the RISCV_EXT_V CMake option is disabled by default. The vector extension makes each RISC-V machine consume about 1 KiB more memory per CPU.
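In broad strokes, reusing the VectorLane sketch from above (again, illustrative names only, not the real data structures):

#include <array>
#include <memory>

// Sketch: the 32 vector registers live behind a unique_ptr, so the
// ~1 KiB of state (32 registers x 32 bytes) is only paid for when the
// extension is in use, and could be allocated lazily on first access.
struct VectorState {
    std::array<VectorLane, 32> v;
};

struct CPU {
    std::unique_ptr<VectorState> rvv;

    VectorState& vectors() {
        if (!rvv) // lazy allocation on first use
            rvv = std::make_unique<VectorState>();
        return *rvv;
    }
};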

Loading and storing the vectors is done as a single memory operation against 32-byte aligned page memory, which resulted in speedups across the board on all operations, such as memory copying.
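Conceptually, a vle32.v then becomes a single 32-byte copy from the aligned page data, something along these lines (page_data is a stand-in for however the emulator resolves a guest address to host memory, and VectorLane is the sketch from above):

#include <cstring>

// Sketch: load a whole vector register with one memory operation from
// 32-byte aligned page memory.
static void vle32_sketch(VectorLane& vd, const uint8_t* page_data, size_t offset)
{
    std::memcpy(vd.u8, page_data + offset, sizeof(vd)); // one 32-byte copy
}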

The new results

It was crucial to measure and account for overhead, because the vectorized function is actually 4–5x faster than the SISD multi-processing function (with all the new fancy optimizations), and the multi-processing overhead makes the results look closer than they are due to the limited work size. In fact, the single-processing vectorized dot-product is the fastest overall because of that overhead, even though the actual work is 2x faster when multi-processing; with a larger work size the multi-processing variant would win out. With overhead subtracted, the numbers are 12 microseconds for the vectorized dot-product and 6 microseconds for the multi-processing vectorized variant. The naive single-processing version runs at an abysmal 85 microseconds, 7x slower than the vectorized one. I think these are incredible numbers, and it really shows how specialized loops can be a total game changer.

STREAM benchmark

I have a STREAM memory benchmark in the repository that I use to compare how the memory subsystem is faring. Not well, as it’s just a std::unordered_map of pages. Still, I like it that way: it makes it very easy to work with the emulator’s memory, and with the concept of loaning pages and copy-on-write, forking a machine is extremely quick as a result.
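In spirit, the layout is something like this (heavily simplified; field and type names are made up):

#include <array>
#include <cstdint>
#include <memory>
#include <unordered_map>

// Sketch of a page-based memory subsystem with loaned, copy-on-write
// pages. A fork shares the parent's page data and only copies a page
// when it is written to.
struct Page {
    std::shared_ptr<std::array<uint8_t, 4096>> data; // shared until written
    bool copy_on_write = true;
};

struct Memory {
    std::unordered_map<uint64_t, Page> pages; // page number -> page

    void write_u8(uint64_t addr, uint8_t value) {
        Page& p = pages[addr / 4096];
        if (!p.data) { // brand new, zeroed page
            p.data = std::make_shared<std::array<uint8_t, 4096>>();
            p.copy_on_write = false;
        } else if (p.copy_on_write) { // duplicate the loaned page before writing
            p.data = std::make_shared<std::array<uint8_t, 4096>>(*p.data);
            p.copy_on_write = false;
        }
        (*p.data)[addr % 4096] = value;
    }
};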

I decided to implement some of the vector-scalar instructions too, and with that I could add tuned functions for the STREAM benchmark.

#define VLOAD(e, vec)  asm("vle32.v "#vec", %1" : : "r"(e), "m"(e))
#define VSTORE(e, vec) asm("vse32.v "#vec", %1" : "=m"(e) : "m"(e))
#define FLOAD(reg, sc) asm("fmv.s.x "#reg", %0" : : "r"(sc) : #reg)

With a few helper macros to load vectors from memory and to place a 32-bit float in a known register, I could easily write the assembly. STREAM already supports tuned functions by simply defining TUNED.
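As an illustration, tuned Copy and Triad kernels along these lines might look like this (a sketch built on the macros above; the tuned hooks tuned_STREAM_Copy and friends, the arrays a, b, c and STREAM_ARRAY_SIZE all come from stream.c, and STREAM_TYPE is assumed to be float here since the loads are 32-bit):

// Sketch: tuned STREAM kernels using the macros above, 8 floats per
// vector register. Copy is c[j] = a[j]; Triad is a[j] = b[j] + scalar*c[j].
void tuned_STREAM_Copy()
{
    for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j += 8) {
        VLOAD(a[j], v2);  // load 8 floats from a
        VSTORE(c[j], v2); // store them into c
    }
}

void tuned_STREAM_Triad(STREAM_TYPE scalar)
{
    FLOAD(fa1, scalar); // place the scalar in a known float register
    for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j += 8) {
        VLOAD(c[j], v2);
        asm("vfmul.vf v2, v2, fa1"); // scalar * c[j..j+7]
        VLOAD(b[j], v3);
        asm("vfadd.vv v2, v2, v3");  // + b[j..j+7]
        VSTORE(a[j], v2);
    }
}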

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:               3548.1  0.018178     0.018038     0.018272
Scale:              3152.3  0.020406     0.020303     0.020519
Add:                3287.3  0.029464     0.029203     0.029562
Triad:              3038.5  0.031905     0.031595     0.033001

The results are pretty good, showing that lane-sized memory operations and caching in the memory subsystem are working well right now!

On the whole a great success. I look forward to perusing the spec further and seeing if there are other interesting things I can do with this.

-gonzo
