DIRSIG5 is a Monte Carlo numerical integration program, meaning pseudo random number generation (PRNG) is at the heart of its simulation routines.

For a long time, DIRSIG5 relied on the Mersenne Twister (MT) random number generator available in C++ STL implementations as std::mt19937. In July 2021 however, we swapped out the MT generator for a more modern Permuted Congruential Generator (PCG), leading to surprising performance gains—on average, 20-40% reduction in simulation runtime. The purpose of this document is to shine a light on why MT was hampering our performance, and how PCG fixed it.

What’s wrong with Mersenne Twister?

The MT generator has been around since 1997, and has become the default RNG for many programming languages and libraries. People were enamored with it when it first came out because of its insanely long period, which is the Mersenne prime 2^199937 - 1. While the size of that number is truly staggering, it doesn’t really matter because nearly everything else about the algorithm stinks by modern standards. It’s bulky and slow, it requires >2KB of RAM for a single instance of the generator, and it fails a bunch of randomness tests.

In a multi-threaded Monte Carlo simulation program like DIRSIG, the pitfalls of the MT generator work starkly against our best interest. To maximize performance, modern programs must be mindful of memory accesses. Your CPU clock might be measured in GHz, but your RAM clock is measured in MHz. This is especially true for ray tracing programs, where the name of the game is randomized incoherent access to tons of memory. When you couple that with a bunch of thread-local RNGs that have to frequently touch 2KB of state to produce random numbers intermittently, it is a recipe for disaster. It completely kills the remaining CPU cache performance.

Permuted Congruential Generators to the rescue

The family of space-efficient PCG generators manage to solve basically all of the problems we have with Mersenne Twister. The name Permuted Congruential Generator is short for Permuted Linear Congruential Generator (LCG), which is the bare-bones simplest RNG out there. While LCGs are statistically very bad and highly predictable, their simplicity means they represent the epitome of performance. It is literally an integer multiply and add, and that’s all. You can’t really do better than that in terms of machine instructions and memory accesses.

With that said, the statistically terrible nature of LCGs renders them unusable in practice. The novelty in the PCG approach is to feed the LCG output to a carefully-designed permutation function which scrambles the bits in such a way that it becomes statistically very good, i.e., passes empirical randomness test suites like TestU01, which MT fails. Actually coming up with the permutation function is tricky, but many such functions exist. All of the relevant theory is laid out in the PCG paper. We use the XOR-Shift + Random-Rotate (XSH-RR) output function, using 64-bit internal state and 32-bit output state.

Importantly, the permutation function is hardcoded with constants, meaning it incurs no further state or memory access and is inlined by the compiler. The total size of the generator is 16 bytes, being two 64-bit integers. One is the state and the other the LCG increment, which can be configured but is held constant for the lifetime of the generator.

The implementation looks like this,

uint32_t Pcg32::operator()() {
  uint64_t prev = state;
  state = state * multiplier + increment;
  return output(prev);
}

where state and increment are members of the Pcg32 data structure, and multiplier is just a big compile-time constant. The output function is our XSH-RR permutation. The optimized assembly of this function compiled on its own is as follows:

movq (%rdi), %rcx
movabsq $6364136223846793005, %rax
imulq %rcx, %rax      ; LCG multiply
addq 8(%rdi), %rax    ; LCG add
movq %rax, (%rdi)     ; Update internal state
movq %rcx, %rax       ; Feed previous state through permutation
shrq $18, %rax        ; ... apply random XOR
xorq %rcx, %rax
shrq $27, %rax        ; ... apply random rotation
shrq $59, %rcx
rorl %cl, %eax
ret

This applies the LCG multiply/add, updates the internal state of the generator, then feeds its previous state through the permutation function and returns the result. It’s around 10 instructions and is totally branchless.

It’s worth noting that a pure LCG implementation would be around 5 instructions (just imagine the permutation operations aren’t there). So the overhead of the permutation relative to the "floor" of the fastest possible LCG implementation is 5 register-only integer operations. The equivalent assembly for the Mersenne Twister is around 100 instructions (optimizing for either speed -O3 or size -Os, it doesn’t really matter) with lots of branching and looping, which can touch as much as 5 kilobytes on Linux (Ubuntu 20.04 LTS), where sizeof(std::mt19937) == 5000.