The pcg-random methodology is very good. It is built on the following pattern:<p>#1: Have a "counter" of some kind. A simple counter trivially creates a maximum-length cycle. "seed++" is one such counter.<p>#2: Have a "hash" that converts the counter into a random number using a bijection (a 1-to-1 mapping). There are a great many bijections to choose from.<p>So your RNG is basically:<p><pre><code> oldSeed = seed;
seed = counterFunction(seed);
return hash(oldSeed);
</code></pre>
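<p>As a concrete but purely illustrative instance of the pattern above, here is a minimal portable C sketch. The names (counter_hash_rng, counter_function, hash64, counter_hash_next) are hypothetical. The counter adds an odd constant, and the hash borrows the SplitMix64 finalizer as a stand-in bijection; it is not the output function pcg-random uses, and not the AESRand hash.

```c
#include <stdint.h>

/* Hypothetical type: the whole generator state is one 64-bit counter. */
typedef struct { uint64_t seed; } counter_hash_rng;

/* Step #1: the counter. Adding any odd constant modulo 2^64 visits every
   64-bit value exactly once before repeating, so the cycle is 2^64 long. */
static uint64_t counter_function(uint64_t seed) {
    return seed + UINT64_C(0x9e3779b97f4a7c15);
}

/* Step #2: the hash. Each xor-shift and each multiply by an odd constant is
   invertible modulo 2^64, so the composition is a bijection. The constants
   are those of the SplitMix64 finalizer (an assumption made for this sketch). */
static uint64_t hash64(uint64_t z) {
    z = (z ^ (z >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94d049bb133111eb);
    return z ^ (z >> 31);
}

/* Same shape as the pseudocode: remember the old seed, advance, hash the old seed. */
static uint64_t counter_hash_next(counter_hash_rng *rng) {
    uint64_t old_seed = rng->seed;
    rng->seed = counter_function(old_seed);
    return hash64(old_seed);
}
```

<p>Usage would be something like: counter_hash_rng rng = { 42 }; uint64_t r = counter_hash_next(&rng); Because the hash is a bijection and the counter never repeats within its cycle, no 64-bit output repeats within 2^64 calls.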
------------<p>I used this methodology to make a pretty fast generator myself. The coding was done in a weekend, but it took some weeks to test: <a href="https://github.com/dragontamer/AESRand" rel="nofollow">https://github.com/dragontamer/AESRand</a><p>For AESRand:<p>"counterFunction" is a SIMD 2x64-bit add over the XMM register. I chose an odd constant arbitrarily (it is just the first 16 prime numbers: 0x01030507...); the counter will trivially cycle after 2^64 numbers.<p>"hash" is aesenc(aesenc(seed, constant), constant), where constant is arbitrary (I chose the same odd number that is used in the "counterFunction" step). The 2nd parameter to aesenc is just XORed into the result.<p>I also run a 2nd, parallel aesdec(aesenc(seed, constant), constant) to generate a 2nd set of 128-bit random numbers, so the hash produces 256 bits in total. Yes, a 128-bit seed "hashes" into a 256-bit random number. Since this passes statistical tests, it's probably fine, and it is a good source of instruction-level parallelism.<p>All in all, I achieved 30GB/sec of random numbers, or 256 bits of random numbers every 3.7 cycles, that pass TestU01 (including BigCrush) and PractRand.<p>--------<p>AES is a bijection: that's a necessary condition for AES to be decodable. Though I use AES, this function is NOT cryptographically sound. I'm just using AES because it's the fastest high-quality bijection in the entire modern CPU instruction set. x86, ARM, and POWER9 all implement single-cycle AES instructions.<p>It is possible to write a portable AESRand across the different CPUs (ARM, x86, and POWER9). I got far enough to prove that it's possible, but then my real work started to ramp up and I had to drop this hobby project. It's very tricky to do so: ARM, x86, and POWER9 implement AESENC slightly differently, shuffling the fundamental steps in different ways.<p>Or in the case of POWER9, it executes in big-endian mode instead of little-endian (like ARM / x86).
Redesigning the algorithm to be portable across the shuffled "SubBytes" and "MixColumns" steps between the CPUs is tricky but possible.
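<p>For concreteness, here is a minimal x86-only sketch in C of the counter and hash steps as described above. It is not the code from the AESRand repository: the names (aes_sketch_rng, aes_sketch_seed, aes_sketch_next) are hypothetical, and the constant is a placeholder with two odd 64-bit lanes, because the real constant is only partially quoted above. It assumes GCC or Clang and a CPU with AES-NI.

```c
#include <immintrin.h>
#include <stdint.h>

/* Enable AES-NI for these functions only (GCC/Clang attribute; an assumption). */
#define AES_TARGET __attribute__((target("aes,sse2")))

/* Hypothetical type: the state is one 128-bit XMM register. */
typedef struct { __m128i state; } aes_sketch_rng;

/* Placeholder constant, NOT the AESRand one. Each 64-bit lane is odd, so each
   lane of the counter cycles after 2^64 additions. */
AES_TARGET static __m128i aes_sketch_constant(void) {
    return _mm_set_epi64x((long long)0x9e3779b97f4a7c15ULL,
                          (long long)0xbf58476d1ce4e5b9ULL);
}

AES_TARGET static void aes_sketch_seed(aes_sketch_rng *rng, uint64_t hi, uint64_t lo) {
    rng->state = _mm_set_epi64x((long long)hi, (long long)lo);
}

/* Writes 256 bits (32 bytes) per call from a 128-bit state. */
AES_TARGET static void aes_sketch_next(aes_sketch_rng *rng, uint8_t out[32]) {
    const __m128i constant = aes_sketch_constant();
    const __m128i old_state = rng->state;

    /* counterFunction: SIMD 2x64-bit add. */
    rng->state = _mm_add_epi64(old_state, constant);

    /* hash: one shared aesenc round, then aesenc and aesdec side by side.
       The last two rounds do not depend on each other, so they can overlap. */
    const __m128i round1 = _mm_aesenc_si128(old_state, constant);
    _mm_storeu_si128((__m128i *)out, _mm_aesenc_si128(round1, constant));
    _mm_storeu_si128((__m128i *)(out + 16), _mm_aesdec_si128(round1, constant));
}
```

<p>Usage would be something like: aes_sketch_rng rng; uint8_t buf[32]; aes_sketch_seed(&rng, 1, 2); aes_sketch_next(&rng, buf); Note that this sketch is tied to the x86 intrinsics, which is exactly the portability problem described above.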