I built one a few years ago. I never published it or anything but I figure it's worth a discussion point here.<p>1. Multiply is a primitive that seems to mix the most bits in the fewest clock ticks.<p>2. Multiply's weakness is that the bottom bits have almost no mixing at all. This is easily fixed with bit-reverse, which is implemented in one clock tick on both NVidia and AMD GPUs. X86 can use bswap instead. ARM has bit reverse.<p>3. Finding the best 32-bit constant to multiply is the hard part. Until you remember that GPUs can search the entire 32-bit space in less than a miliseconds. (4-Billion numbers across 2GHz and about 4000 shaders so like 8000x 32-bit space traversals per second).<p>4. The RNG is therefore defined as #3, what kind of searches and heuristics do you optimize for?<p>5. I chose to calculate the pop count(change of bits) and optimized for the largest number of popcnt==16 bitflips, which is a weird definition of the avalanche effect but hey I needed something. Popcnt is also a one-click tick instruction.<p>6. The remaining effort is an exercise left for the reader. But it took me like 2 days, it's not hard.<p>7. EDIT: Remember the lessons from XORShift. Have a simple counter (ex: state++) combined with a reversible hash-like function (ex: return hash(state++);). The above discussion is mostly about the hash. State++ function is anything that trivially cycles the 32-bit or 64-bit space for maximum cycle length. I find that state+=(odd number) is a superior mixer that still trivially traverses the whole space. Use above discussion to search for optimal constant.