A while ago I was doing some profiling to find out ways to improve the load performance of my Open Diapason virtual organ project. The load process is roughly as follows:
- Load wave file from disk into memory.
- Apply a linear phase filter to all loaded samples (this is to compensate for the response of the resampler which occurs during playback).
- Normalise and quantise the filtered audio using TPDF Dither. The normalisation ensures that we use the full 16-bits of precision for each sample. Obviously, the scaling coefficient is stored along with the sample to ensure we play the content back at the intended level.
All of these steps are done across multiple threads to maximise performance (there is one thread that is reading audio files from disk ready to spawn threads which apply the filter and requantise the data)… this is one of those times where the speedup was roughly linear with the first couple of threads which I added.
Anyway, I was expecting that the vast majority of the cycles would be spent in the application of the filter, but this actually turned out not to be the case. The dithering and re-quantisation of the audio - which I assumed would be cheap - turned out to be chewing cycles on the order of 30% of the load time (all of my profiling was done on a mid 2014 MacBook Pro with an i7 processor).
The fix
Uniform random numbers are useful and often we need several of them. In my use-case (performing TPDF dither on a sample pair) I needed four random variables. Generating them needs to be fast but the “quality” of the randomness is not that important. The obvious way to do this is to use a Linear Congruential Generator which ends up forming a code pattern like this:
= (state * A + C) % M;
r0 = (r0 * A + C) % M;
r1 = (r1 * A + C) % M;
r2 = (r2 * A + C) % M;
r3 = r3; state
This forms a dependency chain where each random number requires knowledge of the previous one and could end up leading to pipeline issues (this is really dependent on how the compiler re-orders the code and the architecture of the platform). In the code generated on my machine, this turned out to be a bottleneck. We could get around the pipeline issues by creating four different LCGs:
= (state0 * A0 + C0) % M;
r0 = (state1 * A1 + C1) % M;
r1 = (state2 * A2 + C2) % M;
r2 = (state3 * A3 + C3) % M;
r3 = r0;
state0 = r1;
state1 = r2;
state2 = r3; state3
But this will require more registers to store the different LCG states and possibly more registers or code memory to hold all the new coefficients. It would be nice to have something a bit simpler…
LCGs are really cool toys. We can choose our \(A_n\) and \(C_n\) values in such a way that the pseudo-random sequence has period \(M\). Given the algorithm generates the next value from the previous, this must mean that the random sequence contains every integer between \(0\) and \(M-1\). The first part of the idea, is to modify the code as:
= (state0 * A0) % M;
r0 = (state1 * A1) % M;
r1 = (state2 * A2) % M;
r2 = (state3 * A3) % M;
r3 = (r0 + C0 % M) % M;
state0 = (r1 + C1 % M) % M;
state1 = (r2 + C2 % M) % M;
state2 = (r3 + C3 % M) % M; state3
This produces a different sequence in \(r_0\) to \(r_3\) because the accumulation now happens at the state assignment, but still has a period \(M\). Now for the dodgy part, we change the algorithm to this:
= (state0 * A0) % M;
r0 = (state0 * A1) % M;
r1 = (state0 * A2) % M;
r2 = (state0 * A3) % M;
r3 = (r0 + C0 % M) % M; state0
Where \(C_0 = 1\), \(M = 2^{32}\) and \(A_0\) to \(A_3\) are all different but satisfy the required conditions for an LCG to work. It should not be difficult to infer that \(r_0\) to \(r_3\) still all have period \(M\) - but produce different output sequences. I wrote a test program to look at the correlation of these sequences over their period and they are good enough. I doubt they make great random numbers, but for performing dither, they are completely fine.