TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Building a Fast, SIMD/GPU-Friendly Random Number Generator for Fun and Profit

10 点作者 klaussilveira22 天前

2 条评论

dragontamer22 天前
I built one a few years ago. I never published it or anything but I figure it&#x27;s worth a discussion point here.<p>1. Multiply is a primitive that seems to mix the most bits in the fewest clock ticks.<p>2. Multiply&#x27;s weakness is that the bottom bits have almost no mixing at all. This is easily fixed with bit-reverse, which is implemented in one clock tick on both NVidia and AMD GPUs. X86 can use bswap instead. ARM has bit reverse.<p>3. Finding the best 32-bit constant to multiply is the hard part. Until you remember that GPUs can search the entire 32-bit space in less than a miliseconds. (4-Billion numbers across 2GHz and about 4000 shaders so like 8000x 32-bit space traversals per second).<p>4. The RNG is therefore defined as #3, what kind of searches and heuristics do you optimize for?<p>5. I chose to calculate the pop count(change of bits) and optimized for the largest number of popcnt==16 bitflips, which is a weird definition of the avalanche effect but hey I needed something. Popcnt is also a one-click tick instruction.<p>6. The remaining effort is an exercise left for the reader. But it took me like 2 days, it&#x27;s not hard.<p>7. EDIT: Remember the lessons from XORShift. Have a simple counter (ex: state++) combined with a reversible hash-like function (ex: return hash(state++);). The above discussion is mostly about the hash. State++ function is anything that trivially cycles the 32-bit or 64-bit space for maximum cycle length. I find that state+=(odd number) is a superior mixer that still trivially traverses the whole space. Use above discussion to search for optimal constant.
dzaima22 天前
ARM NEON is kinda misrepresented here - it supports both bitshifts by non-constants, and proper 32×32→32-bit multiplies; AVX2 also supports both of those.