It finally happened! A Raspberry Pi-like device, with a RISC-V CPU supporting the v extension. Aka RVV. Aka vector instructions.

I bought one, and explored it a bit.

SIMD background

First some background on SIMD.

SIMD is a set of instructions allowing you to do the same operation to multiple independent pieces of data. As an example, say you had four 8-bit integers, and you wanted to multiply them all by 2, then add 1. You could do each step as a single operation on all four bytes at once, without any special instructions.

    ; x86 example assembly (NASM syntax).

    mov eax, [myvalues]  ; load our four bytes.
    mov ebx, 2           ; we want to multiply by two
    imul eax, ebx        ; single operation, multiple data!
                         ; eax now holds the bytes 2,4,6,8
                         ; (0x08060402, since x86 is little-endian)
    add eax, 0x01010101  ; single operation, multiple data!
                         ; eax now holds the bytes 3,5,7,9
    mov [myvalues], eax  ; store back the new value.

section .data
  myvalues db 1,2,3,4

Success, right? No, of course not. This naive code doesn’t handle over/underflow, and doesn’t even remotely work for floating point data. For that, we need special SIMD instructions.
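
To make the failure concrete, here's a quick Rust sketch of the same packed-bytes trick (just an illustration, not code from any repo mentioned here). It works as long as every byte stays small, but as soon as one byte overflows, the carry leaks into its neighbor:

  fn main() {
      // Pack four bytes into one u32, just like the assembly above.
      let ok = u32::from_le_bytes([1, 2, 3, 4]);
      let doubled = (ok * 2).wrapping_add(0x01010101);
      assert_eq!(doubled.to_le_bytes(), [3, 5, 7, 9]); // fine: no byte overflowed

      // Now put 200 in the first byte. 200 * 2 = 400 does not fit in 8 bits,
      // and the carry corrupts the neighboring byte.
      let bad = u32::from_le_bytes([200, 2, 3, 4]);
      let doubled = bad.wrapping_mul(2).wrapping_add(0x01010101);
      assert_ne!(doubled.to_le_bytes()[1], 5); // second byte is now 6, not the expected 5
  }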

x86 and ARM have gone the way of fixed-size registers. In 1997 Intel introduced MMX, to great fanfare. The PR went all “it’s multimedia!”. “Multimedia” was a buzzword at the time. This first generation gave you a whopping 64-bit register size, that you could use for one 64-bit value, two 32-bit values, four 16-bit values, or eight 8-bit values.

A “batch size” of 64 bits, if you prefer.

These new registers got a new set of instructions, to do these SIMD operations. I’m not going to learn the original MMX instructions, but it should look something like this:

  movq mm0, [myvalues]  ; load values (all eight bytes)
  movq mm1, [addconst]  ; load our const addition values.
  paddb mm0, mm0        ; add to itself means multiply by 2
  paddb mm0, mm1        ; add vector of ones.
  movq [myvalues], mm0  ; store the updated value.
  emms                  ; state reset.

section .data
  myvalues db 1,2,3,4,5,6,7,8
  addconst db 1,1,1,1,1,1,1,1

So far so good.

The problem with SIMD

The problem with SIMD is that it’s so rigid. With MMX, the registers are 64 bits. No more, no less. Intel followed up with SSE, adding floating point support and doubling the register size to 128 bits. That’s four 32-bit floats in one xmm register.
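
To see what fixed-size SIMD looks like from a high-level language, here's a minimal Rust sketch using the SSE intrinsics in std::arch (just an illustration): one xmm register holds four f32 values, and one instruction multiplies all four.

  #[cfg(target_arch = "x86_64")]
  fn times_two(values: &mut [f32; 4]) {
      use std::arch::x86_64::*;
      // SSE is part of the x86_64 baseline, so these intrinsics are always
      // available there; they are still `unsafe` because they take raw pointers.
      unsafe {
          let v = _mm_loadu_ps(values.as_ptr());   // load four f32s into one xmm register
          let two = _mm_set1_ps(2.0);              // broadcast 2.0 into all four lanes
          let doubled = _mm_mul_ps(v, two);        // one instruction, four multiplies
          _mm_storeu_ps(values.as_mut_ptr(), doubled);
      }
  }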

So now we have 64-bit mm registers, 128-bit xmm registers, and uncountably many instructions to work with these two sets of new registers.

Then we got SSE2, SSE3, SSE4. Then AVX, AVX2 (256 bit registers), and even AVX-512 (512 bit registers).

512 bit registers. Not bad. You can do 16 32-bit floating point operations per instruction with that.

But here’s the problem: you only get that speed if your code was written to be aware of these new registers and instructions! If your production environment has the best of the best, with AVX-512, you still can’t use it if your development/QA environment only has AVX2. Not without maintaining separate binaries.

Or you could maintain 20 different compiled versions, and dynamically choose between them. That’s what volk does. Or you could compile with -march=native (gcc) or -Ctarget-cpu=native (Rust), and create binaries that only work on machines at least as new as the one you built on.
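
In Rust, that “compile several versions and pick one at runtime” pattern looks roughly like this (a sketch with made-up function names, not volk's actual code):

  #[cfg(target_arch = "x86_64")]
  pub fn sum(xs: &[f32]) -> f32 {
      // Pick the best implementation this particular CPU supports.
      if std::arch::is_x86_feature_detected!("avx2") {
          // Safe to call because we just verified AVX2 is present.
          unsafe { sum_avx2(xs) }
      } else {
          sum_scalar(xs)
      }
  }

  // Compiled with AVX2 enabled for this one function only, so LLVM may
  // auto-vectorize the loop with 256-bit registers.
  #[cfg(target_arch = "x86_64")]
  #[target_feature(enable = "avx2")]
  unsafe fn sum_avx2(xs: &[f32]) -> f32 {
      xs.iter().sum()
  }

  fn sum_scalar(xs: &[f32]) -> f32 {
      xs.iter().sum()
  }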

But none of these options allow you to build a binary that will automatically take advantage of future processor improvements.

Vector instructions do.

Vector instructions

Instead of working with fixed sized batches, vector instructions let you specify the size of the data, and the CPU will do as many at once as it can.

Here’s an example:

  # Before we enter the loop, input registers are set thusly:
  # a0: number of elements to process.
  # a1: pointer to input elements.
  # a2: pointer to output elements.

  # Load 2.0 once. We'll need it later.
  la t2, two      # get the address of the constant,
  flw ft0, 0(t2)  # then load it.
loop:
  # prepare the CPU to process a batch:
  # * a0: of *up to* a0 elements,
  # * e32: each element is 32 bits. The spec calls this SEW.
  # * m1: group registers in size of 1 (I'll get to this). LMUL in the spec.
  # * ta & ma: ignore these flags, they're deep details.
  #
  # t0 will be set to the number of elements in a "batch"
  vsetvli t0, a0, e32, m1, ta, ma

  # Set t1 to be the number of bytes per "batch".
  # t1 = t0 << 2
  slli t1, t0, 2

  # Load a batch.
  vle32.v v0, (a1)

  # Multiply them all by 2.0.
  vfmul.vf v0, v0, ft0

  # Store them in the output buffer.
  vse32.v v0, (a2)

  # Update pointers and element counters.
  add a1, a1, t1
  add a2, a2, t1
  sub a0, a0, t0

  # Loop until a0 is zero.
  bnez a0, loop
  ret

.section .rodata
two:    .float 2.0

Write once, and when vector registers get bigger, your code will automatically perform more multiplies per batch. That’s great! You can use an old and slow RISC-V CPU for development, but then when you get your big beefy machine the code is ready to go full speed.

The RISC-V vector spec allows for vector registers up to 65536 bits = 8 KiB, or 2048 32-bit floats. And with m8 (see below), that allows e.g. 16384 32-bit floats being multiplied by 16384 other floats, and then added to yet another 16384 floats, in a single fused multiply-add instruction.

Even more batching

RISC-V has 32 vector registers. On a particular CPU, each register will be the same fixed size, called VLEN. But the instruction set allows us to group the registers, creating mega-registers. That’s what the m1 in vsetvli is about.

If we use m8 instead of m1, we get just four (grouped) vector registers: v0, v8, v16, and v24. But in return they are 8 times as wide.

The spec calls this batching number LMUL.

Basically a pairwise floating point vector multiplication vfmul.vv v0, v0, v8 in m8 mode effectively represents:

   vfmul.vv v0, v0, v8
   vfmul.vv v1, v1, v9
   vfmul.vv v2, v2, v10
   vfmul.vv v3, v3, v11
   vfmul.vv v4, v4, v12
   vfmul.vv v5, v5, v13
   vfmul.vv v6, v6, v14
   vfmul.vv v7, v7, v15

Bigger batching, at the cost of fewer registers. I couldn’t come up with a nice way to multiply two complex numbers with only four registers. Maybe you can? If so, please send a PR adding a mul_cvec_asm_m8_segment function to my repo. Until then, the m4 version is the biggest batch. m8 may still not be faster, since the m2 version of mul_cvec_asm_mX_segment is a little bit faster than the m4 version in my test.

Like with SIMD, there are convenient vector instructions for loading and handling data. For example, if you have a vector of complex floats, then you probably have real and imaginary values alternating. vlseg2e32.v v0, (a0) will then load the real values into v0, and the imaginary values into v1.

Or, if vsetvli was called with m4, the real values will be in v0 through v3, and imaginary values in v4 through v7.

Curiously, a “strided” load (vlse32.v v0, (a0), t1), which lets you do things like “load every second float”, doesn’t seem to perform very well. Maybe this is specific to the CPU I’m using. I would have expected the L1 cache to make the two approaches fairly equal, but apparently not.

So yes, it’s not perfect. On a future CPU, loads from L1 cache may be cheap enough that the optimal code would be more wasteful with vector registers. Maybe on a future CPU the strided load will be faster than the segmented load. There’s no way to know.

The Orange Pi RV2

It seems that the CPU, a Ky X1, isn’t known to LLVM yet. So you have to manually enable the v extension when compiling. But that’s fine.

$ cat ~/.cargo/config.toml
[target.riscv64gc-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native", "-Ctarget-feature=+v"]

I filed a bug with Rust about it, but it seems it may be a missing LLVM feature. It’s apparently not merely checking /proc/cpuinfo for the features in question, but needs the name of the CPU in code or something.

It seems that the vector registers (VLEN) on this hardware are 256 bits wide. This means that with m8 a single multiplication instruction can do 8*256/32=64 32-bit floating point operations. Multiplying two register groups in one instruction processes half a kibibyte of data (256 bytes per aggregate register).

64 32-bit operations is a lot. We started off in SIMD with just 2. And as I say above, when future hardware gets bigger vector registers, you won’t even have to recompile.

That’s not to say that the Orange Pi RV2 is some sort of supercomputer. It’s much faster than the VisionFive 2, but your laptop is much faster still.

So how much faster is it?

I started a Rust crate to test this out.

$ cargo +nightly bench --target  target-riscv64-no-vector.json -Zbuild-std
[…]
running 10 tests
test bench_mul_cvec_asm_m2_segment ... bench:       2,930.20 ns/iter (+/- 29.96)
test bench_mul_cvec_asm_m4_segment ... bench:       3,036.18 ns/iter (+/- 100.35)
test bench_mul_cvec_asm_m4_stride  ... bench:       4,713.20 ns/iter (+/- 55.09)
test bench_mul_cvec_asm_m8_stride  ... bench:       5,368.08 ns/iter (+/- 15.18)
test bench_mul_cvec_rust           ... bench:       9,957.66 ns/iter (+/- 76.39)
test bench_mul_cvec_rust_v         ... bench:       3,020.23 ns/iter (+/- 21.72)
test bench_mul_fvec_asm_m4         ... bench:         843.94 ns/iter (+/- 22.86)
test bench_mul_fvec_asm_m8         ... bench:         801.36 ns/iter (+/- 27.71)
test bench_mul_fvec_rust           ... bench:       4,097.09 ns/iter (+/- 29.77)
test bench_mul_fvec_rust_v         ... bench:       1,084.47 ns/iter (+/- 13.46)

The “rust” version is normal Rust code. The _v version is Rust code where the compiler is allowed to use vector instructions. As you can see, Rust (well, LLVM) is already pretty good. But my hand-coded vector assembly is faster still.
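
For comparison, the plain Rust complex multiply is nothing exotic. Roughly like this (a sketch, not necessarily the exact code in my crate):

  /// Pairwise complex multiply. `a`, `b`, and `out` are interleaved
  /// [re, im, re, im, ...] slices of the same length.
  fn mul_cvec(a: &[f32], b: &[f32], out: &mut [f32]) {
      let it = a.chunks_exact(2).zip(b.chunks_exact(2)).zip(out.chunks_exact_mut(2));
      for ((x, y), o) in it {
          // (x0 + x1*i) * (y0 + y1*i)
          o[0] = x[0] * y[0] - x[1] * y[1]; // real part
          o[1] = x[0] * y[1] + x[1] * y[0]; // imaginary part
      }
  }

Built with the v feature enabled, LLVM is allowed to auto-vectorize that loop, which is the idea behind the _v rows.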

As you can see, the answer for this small benchmark is “about 3-5 times faster”. That multiplier probably goes up the more operations that you do. These benchmarks just do a couple of multiplies.

Note that I’m cheating a bit with the manual assembly. It assumes that the input is an even multiple of the batch size.

Custom target with Rust

To experiment with target settings, I modified the target spec. This is not necessary for normal code (you can just force-enable the v extension, per above), but could be interesting to know. In my case I’m actually using it to turn vectorization off by default, since while Rust lets you enable a target feature per function (sketched after the commands below), it doesn’t let you disable it per function.

rustup install nightly
rustup default nightly
rustc  -Zunstable-options \
  --print target-spec-json \
  --target=riscv64gc-unknown-linux-gnu \
  > mytarget.json
rustup default stable
# edit mytarget.json
rustup component add --toolchain nightly rust-src
cargo +nightly bench --target mytarget.json -Zbuild-std
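
The per-function half of that, the part Rust does support, looks something like this. A sketch: the function names are made up, and whether the v feature is accepted in #[target_feature] may depend on your toolchain version.

  // With the v extension off by default (as in my edited target spec),
  // individual hot functions can still opt back in.
  #[cfg(target_arch = "riscv64")]
  #[target_feature(enable = "v")]
  unsafe fn scale_v(xs: &mut [f32], factor: f32) {
      // Inside this function LLVM may emit RVV instructions,
      // e.g. by auto-vectorizing this loop.
      for x in xs.iter_mut() {
          *x *= factor;
      }
  }

  #[cfg(target_arch = "riscv64")]
  fn scale(xs: &mut [f32], factor: f32) {
      // Assumption: this binary only ever runs on CPUs that have the
      // v extension, so calling the vector version is always fine.
      unsafe { scale_v(xs, factor) }
  }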

Did anything not “just work”?

I have found that the documentation for the RISC-V vector instructions is a bit lacking, to say the least. I’m used to reading specs, but this one is a bit extreme.

vlseg2e32.v v0, (a0) fails with an illegal instruction when in m8 mode. That’s strange. It works fine in m1 through m4.

Can we check the specs on the CPU? Doesn’t look like it. It’s a “Ky X1”. That’s all we’re told. Is it truly RVV 1.0? Don’t know. Who even makes it? Don’t know. I see guesses and assertions that it may be by SpacemiT, but they don’t list it on their website. Maybe it’s a variant of the K1? Could be. The English page doesn’t load, but the Chinese page seems to have similar marketing phrases to the ones Orange Pi uses for the RV2’s CPU.

Ah, maybe this is blocked by the spec:

The EMUL setting must be such that EMUL * NFIELDS ≤ 8, otherwise the instruction encoding is reserved. […] This constraint makes this total no larger than 1/4 of the architectural register file, and the same as for regular operations with EMUL=8.

So for some reason you can’t load half the register space in a single instruction. Oh well, I guess I have to settle for loading 256 bytes at a time.

It’s a strange requirement, though. The instruction encoding allows it, but it just doesn’t do the obvious thing.

ARM

ARM uses fixed-size SIMD (NEON), like Intel. I vaguely remember that it also has vector instructions (SVE), but I’ve not looked into it.

Can I test this out without hardware?

Yes, qemu has both RISC-V support and support for its vector instructions. I didn’t make precise notes when I tried this out months ago, but something like this should get you started:
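
(The flag spellings here are from memory and the binary path is a placeholder, so double-check against your qemu version.)

rustup target add riscv64gc-unknown-linux-gnu
# Build for RISC-V; you'll need a riscv64 cross linker configured for cargo.
cargo build --target riscv64gc-unknown-linux-gnu
# Run under qemu user-mode emulation, with the v extension enabled
# and 256-bit vector registers.
qemu-riscv64 -cpu rv64,v=true,vlen=256 \
  -L /usr/riscv64-linux-gnu \
  target/riscv64gc-unknown-linux-gnu/debug/yourbinary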

tl;dr

Vector instructions are great. I wasn’t aware of this register grouping, and I love it.