RISC-V optimization and -mtune

I’ve been getting into RISC-V optimization recently. Partly because I got my StarFive VisionFive 2, and partly because, unlike x86, the number of RISC-V instructions is so manageable that I may actually have a chance at beating the compiler.

I’m optimizing the inner loops of GNURadio, or in other words the volk library. I’ve been getting up to about a doubling of the speed compared to the compiled C code, depending on the function.

But it got me thinking how far I could tweak the compiler and its options, too.

Yes, I should have done this much sooner.

Many years ago now I built some data processing thing in C++, and thought it ran too slowly. Sure, I did a debug build, but how much slower could that be? Half speed? Nope. 20x slower.

Of course this time I never compared to a debug build, so don’t expect that kind of difference. Don’t expect it to reach my hand-optimized assembly either, imperfect as it may be.

The test code

This may look like a synthetic benchmark, in simplified C++:

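#include <complex>
#include <cstddef>
#include <vector>

using complex = std::complex<float>;  // volk's 32fc: complex of 32-bit floats
using std::vector;

complex volk_32fc_x2_dot_prod_32fc_generic(const vector<complex> &in1,
                                           const vector<complex> &in2)
{
  complex res;  // Default-constructed to 0+0i.
  for (std::size_t i = 0; i < in1.size(); i++) {
    res += in1[i] * in2[i];
  }
  return res;
}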

The actual C code is a bit more complex, because it’s been unrolled. Whether that’s needed or not, or indeed makes things worse, I don’t know.
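To give an idea of what "unrolled" means here, this is a minimal sketch of a 2x unrolled version. It is not the actual volk source; the name dot_prod_unrolled2, the cfloat alias, and the unroll factor are all made up for illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical 2x-unrolled dot product; not the actual volk source.
cfloat dot_prod_unrolled2(const std::vector<cfloat>& in1,
                          const std::vector<cfloat>& in2)
{
  assert(in1.size() == in2.size());
  const std::size_t n = in1.size();
  // Two independent accumulators, so consecutive multiply-adds don't
  // have to wait on each other's results.
  cfloat sum0{0.0f, 0.0f};
  cfloat sum1{0.0f, 0.0f};
  std::size_t i = 0;
  for (; i + 1 < n; i += 2) {
    sum0 += in1[i] * in2[i];
    sum1 += in1[i + 1] * in2[i + 1];
  }
  // Tail: at most one leftover element when n is odd.
  if (i < n) {
    sum0 += in1[i] * in2[i];
  }
  return sum0 + sum1;
}
```

Note that splitting the sum over two accumulators changes the order of the floating point additions, so the result can differ in the last bits from the simple loop.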

This is not a contrived benchmark example I’m optimizing. A doubling in performance directly corresponds to a doubling of the signal bandwidth that can be handled by a FIR filter without needing to drop samples.
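Why a FIR filter depends so directly on this function: each output sample is one dot product between the taps and a window of input samples. A rough sketch, assuming the taps are already stored in the order they get applied. The name fir_filter and its interface are hypothetical, not GNURadio's API.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical direct-form FIR filter: every output sample is one dot
// product between the taps and a window of the input.
std::vector<cfloat> fir_filter(const std::vector<cfloat>& samples,
                               const std::vector<cfloat>& taps)
{
  assert(!taps.empty());
  std::vector<cfloat> out;
  if (samples.size() < taps.size()) {
    return out;  // Not enough input for a single output sample.
  }
  const std::size_t n_out = samples.size() - taps.size() + 1;
  out.reserve(n_out);
  for (std::size_t start = 0; start < n_out; start++) {
    // This inner loop is the dot product kernel.
    cfloat acc{0.0f, 0.0f};
    for (std::size_t k = 0; k < taps.size(); k++) {
      acc += samples[start + k] * taps[k];
    }
    out.push_back(acc);
  }
  return out;
}
```

With this shape, the work per output sample is taps.size() complex multiply-adds, so if the dot product gets twice as fast, about twice as many samples per second fit in the same CPU budget.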

Ok, so in most cases I would use an FFT filter, whose performance is more dominated by the FFT and inverse FFT than by the volk parts.

Maybe optimizing the FFT library should be on my list…

GCC 12.2.0

First let’s see what good ol’ GCC will do:

Default options:

RUN_VOLK_TESTS: volk_32fc_x2_dot_prod_32fc(131071,1987)
sifive_u74 completed in 1938.08 ms  <--- Hand coded in assembly.
generic completed in 2718.17 ms    <--- C compiled version.
a_generic completed in 2700.02 ms
Best aligned arch: sifive_u74
Best unaligned arch: sifive_u74

Supposedly optimized for my CPU:

$ CXX=g++-12 \
CC=gcc-12 \
CXXFLAGS="-O3 -march=rv64gc -mtune=sifive-u74" \
CFLAGS="-O3 -march=rv64gc -mtune=sifive-u74" \
cmake -DCMAKE_INSTALL_PREFIX=$HOME/opt/volk ..
$ make -j4
$ make install
$ LD_LIBRARY_PATH=$HOME/opt/volk/lib  ~/opt/volk/bin/volk_profile -R 32fc_x2_dot_prod_32fc
RUN_VOLK_TESTS: volk_32fc_x2_dot_prod_32fc(131071,1987)
sifive_u74 completed in 2001.66 ms
generic completed in 2637.79 ms    <--- C
a_generic completed in 2630.31 ms
Best aligned arch: sifive_u74
Best unaligned arch: sifive_u74

3% better. Within the margin of error. Let’s just call it no difference.

clang 13.0.1

This is the clang that came with the VisionFive 2 root filesystem.

Default options, except CC/CXX set to clang/clang++.

RUN_VOLK_TESTS: volk_32fc_x2_dot_prod_32fc(131071,1987)
sifive_u74 completed in 1996.03 ms
generic completed in 5559.72 ms   <-- Yikes!
a_generic completed in 5534.04 ms
Best aligned arch: sifive_u74
Best unaligned arch: sifive_u74

That’s less than half the speed of GCC!

But with tuning:

RUN_VOLK_TESTS: volk_32fc_x2_dot_prod_32fc(131071,1987)
sifive_u74 completed in 2013.59 ms
generic completed in 2987.79 ms  <---
a_generic completed in 2939.02 ms
Best aligned arch: sifive_u74
Best unaligned arch: sifive_u74

Roughly 10-13% worse than GCC, depending on which GCC run you compare against. That’s much better than without -mtune, though. A huge difference, even though both clang builds target the same architecture.

Clang trunk

Commit: 73c258048e048b8dff0579b8621aa995aab408d4
Date: 2023-04-17

Build clang trunk

I followed the LLVM Getting Started guide.

$ git clone https://github.com/llvm/llvm-project.git
[…]
$ cd llvm-project
$ cmake \
  -S llvm \
  -B build \
  -G Ninja \
  -DLLVM_ENABLE_PROJECTS='clang;lld' \
  -DCMAKE_INSTALL_PREFIX=$HOME/opt/clang-trunk \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_PARALLEL_COMPILE_JOBS=$(nproc) \
  -DLLVM_PARALLEL_LINK_JOBS=$(nproc)
$ time ninja -C build -j$(nproc)
real    586m8.006s     <-- aka ~10h
user    2225m44.714s   <-- 37 CPU-hours
sys     95m51.166s
$ ninja -C build install

Yeah that took a while. Sure, I could have cross-compiled it, but I just started it in the morning before work, and it finished by the time I needed it.

Default settings:

RUN_VOLK_TESTS: volk_32fc_x2_dot_prod_32fc(131071,1987)
sifive_u74 completed in 1996.35 ms
generic completed in 5485.2 ms   <--- Yikes, still.
a_generic completed in 5473.75 ms
Best aligned arch: sifive_u74
Best unaligned arch: sifive_u74

Tuned:

RUN_VOLK_TESTS: volk_32fc_x2_dot_prod_32fc(131071,1987)
sifive_u74 completed in 1957.5 ms
generic completed in 2297.44 ms  <---- Yay!
a_generic completed in 2276.93 ms
Best aligned arch: sifive_u74
Best unaligned arch: sifive_u74

Oh wow, that’s pretty good. My hand-coded assembly is just ~15% better. Clang trunk beat GCC 12.2.0.

But also: phew, I didn’t waste my time. It would have sucked to see clang beating the hand-coded assembly.

But I am a bit surprised. The U74 is not a complex implementation. I’m surprised there’s anything to tune. But looking at the assembly, the untuned code is crap. Like, what’s this shit?

        fadd.s  fa3,fa3,fa1
        fneg.s  fa1,ft0
        fmul.s  fa1,ft1,fa1
        fmadd.s fa1,fa0,ft2,fa1
        fadd.s  fa4,fa4,fa1
        fmul.s  fa1,ft2,ft0
        fmadd.s fa1,fa0,ft1,fa1
        fadd.s  fa5,fa5,fa1

Additions? Negation? Non-fused multiplies? Of course that’s less efficient. More instructions than the hand-written version, too:

   87 clang-trunk-default.txt
  243 clang-trunk-tuned.txt
   51 hand-written.txt

The tuned version also has a bunch of needless instructions. This function should inherently boil down to only fused multiply-adds (fmadd.s and fnmsub.s), for the floating point.
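To show what I mean by "only fused multiply-adds", here is a sketch of the same loop written with std::fma, four fused operations per element. The function name dot_prod_fma is made up, and whether a given compiler actually emits fmadd.s and fnmsub.s for this is an assumption I have not verified. The comments only describe the shape of each operation.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical sketch: complex dot product as four fused
// multiply-adds per element, accumulating real and imaginary parts.
cfloat dot_prod_fma(const std::vector<cfloat>& in1,
                    const std::vector<cfloat>& in2)
{
  assert(in1.size() == in2.size());
  float re = 0.0f;
  float im = 0.0f;
  for (std::size_t i = 0; i < in1.size(); i++) {
    const float ar = in1[i].real();
    const float ai = in1[i].imag();
    const float br = in2[i].real();
    const float bi = in2[i].imag();
    re = std::fma(ar, br, re);   // re += ar*br  (fmadd.s shape)
    re = std::fma(-ai, bi, re);  // re -= ai*bi  (fnmsub.s shape)
    im = std::fma(ar, bi, im);   // im += ar*bi  (fmadd.s shape)
    im = std::fma(ai, br, im);   // im += ai*br  (fmadd.s shape)
  }
  return {re, im};
}
```

No separate fadd.s, fneg.s or fmul.s should be needed in this formulation: the negation is folded into the fused operation and the accumulation happens in the same instruction as the multiply.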

Summary

For my tiny sample here I can say that -mtune-ing for the sifive-u74 did nothing on GCC, but doubled the speed on clang.
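For anyone who wants to redo the arithmetic, the speedup is just the default time divided by the tuned time for the "generic" lines above. A trivial helper; the name speedup is mine:

```cpp
#include <cassert>

// How many times faster the "after" run is than the "before" run,
// given wall-clock times for the same workload.
double speedup(double before_ms, double after_ms)
{
  assert(before_ms > 0.0 && after_ms > 0.0);
  return before_ms / after_ms;
}
```

Plugging in the measurements: speedup(2718.17, 2637.79) is about 1.03 for GCC, speedup(5559.72, 2987.79) is about 1.86 for clang 13, and speedup(5485.2, 2297.44) is about 2.39 for clang trunk.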

Interesting.

Vector instructions

I’m really looking forward to the next generation of RISC-V hardware, which should have the vector instruction set. That’s likely going to give much more than a doubling of CPU speed for DSP (digital signal processing).

Vector instructions are like SIMD, but more general. In short, SIMD instructions let you take four input elements, do the same operation on these four at the same time, then store all four back to memory. Then the next generation of SIMD increases that to eight. But because it’s a new set of instructions to do “eight at a time”, all software needs to be rewritten to take advantage of the newest SIMD.

Vector instructions instead let the programmer tell the CPU to “take as many as you can” at a time. As new CPUs get the ability to do more, they automatically do, without any need for software updates.
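The "take as many as you can" loop can be modelled in plain C++ without any vector hardware. This is only a simulation: set_vector_length is a hypothetical stand-in for roughly what the vsetvli instruction does, and hw_max stands in for however many elements a given CPU can handle at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for asking the hardware how many elements it
// will process this iteration (roughly the role vsetvli plays).
std::size_t set_vector_length(std::size_t remaining, std::size_t hw_max)
{
  return std::min(remaining, hw_max);
}

// Scales `data` in place, one "vector" chunk per iteration. `hw_max`
// models the hardware's capacity; the loop itself never hardcodes it.
// Returns the number of loop iterations that were needed.
std::size_t scale_strip_mined(std::vector<float>& data, float factor,
                              std::size_t hw_max)
{
  assert(hw_max > 0);
  std::size_t iterations = 0;
  std::size_t i = 0;
  while (i < data.size()) {
    const std::size_t vl = set_vector_length(data.size() - i, hw_max);
    // A real vector unit would handle these vl elements in one instruction.
    for (std::size_t k = 0; k < vl; k++) {
      data[i + k] *= factor;
    }
    i += vl;
    iterations++;
  }
  return iterations;
}
```

The same loop handles 10 elements in 3 iterations with hw_max set to 4, and in 2 iterations with hw_max set to 8. There is no separate tail loop and nothing to rewrite when the hardware gets wider.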

Clang trunk seems to be able to generate the instructions already, which is great! I expect some optimization to still be possible manually, but there will likely be diminishing returns.

I’ll start experimenting with these vector instructions as soon as I get hardware for it.

Comments also on this reddit post.