I found some old C code of mine from around 2001 or so. I vaguely remember trying to make it as optimized as possible. Sure, I was still a teenager, so it’s not state of the art. But it’s not half bad. I suspect I could do better with more cache-friendly memory access, but it’s pretty good.

On my current laptop it does about 12 million passwords per second, single threaded.

Because I’m learning Rust, I decided to port it, and see how fast Rust is.

Imagine my surprise when even the first version in Rust was faster. (Yes, I rebuilt the old C code with a modern compiler and its optimizations.)

The first Rust version was about 13 million passwords per second.

Why is that? It’s basically the same as the C code. Maybe Rust can take advantage of knowing there’s no pointer aliasing (the reason usually quoted for why Fortran can be faster than C)? Or maybe the memory layout just happened to become more cache friendly?

In any case, I think we can already say that Rust is at least as fast as C.

The code is on GitHub.

SIMD (with Rust)

I realized, of course, that the main performance tool neither my C nor Rust code took advantage of was SIMD. In theory compilers could generate SIMD instructions, but for best performance the algorithm itself should be SIMD aware.

In my case the compiler may be able to do some smaller math operations in one step, but the real gain can only be had by having the main loop check batches of passwords, instead of one at a time.

I’ve never really done this. I am aware of SIMD libraries for C, where you basically call the exact CPU instructions you want. That means your code is neither ready for another architecture (ARM, RISC-V) nor ready for the future (when you upgrade your hardware and get AVX2 instructions).

Incidentally, this “being ready for the future” is why I’m looking forward to getting RISC-V hardware with vector instructions: unlike Intel SIMD instructions, they work on variable-length vectors. In theory your code automatically gets 2x faster when the CPU doubles its SIMD register size. Not even a recompile needed.

The second best thing in terms of speed is a programming API that doesn’t assume what instructions will be used. Rust has an experimental std::simd API. You just code with its batch sizes (variable length may come some day), and it magically becomes SIMD.

This brought my zip cracker to 36 million passwords per second. Almost a 3x speedup. And that’s on my first attempt. I don’t even know what batch size is best. I just went with 16 because it sounded reasonable. Maybe another batch size is faster.

What if your CPU doesn’t have SIMD?

Then the experimental Rust SIMD library will generate normal non-SIMD instructions. In theory it should then run at the same speed as if you’d not used the SIMD API at all.

But it turns out it can be faster anyway. My StarFive VisionFive 2 doesn’t have SIMD or vector instructions, but the code became 10-30% faster when using the Rust SIMD API.

Very likely this means that I could have manually tried passwords in batches, without std::simd. But why should I? If the CPU supports it, then extra performance is just a recompile away.

And that’s before multithreading it

My laptop has an Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz. That’s 6 cores, 12 hyperthreads.

With multithreading, and barely any communication between threads, that got me to about 280-290 million passwords per second. An 8x speedup. Not bad, considering that hyperthreads share some execution units, and they all share at least a bit of cache.

John the Ripper, a famous password cracker, does about 130 million passwords a second (john --test --format=PKZIP), making mine over 2x faster.

GPUs

I’m sure password cracking, like all parallelizable number crunching, is fastest when done on a GPU. But that was not the exercise here today.

General CPU code has the nice quality that it’ll work anywhere, and forever. GPU code, last I checked, is less portable than that.

That’s not to say that I wouldn’t want to port this to GPUs. But I would like that to be in Rust, too.

Update: Hashcat 6.2.6 does 1-2 billion passwords a second on my old GeForce 960, according to hashcat -b -m 17220/17225/17230. So yeah, not even close.