
You probably shouldn't use a lookup table (2022)

96 points · by segfaultbuserr · over 1 year ago

11 comments

Tommah · over 1 year ago

Eh, you should probably test it either way. Performance can be very unpredictable. I ran into a weird case years ago when I was writing a program that needed to calculate popcount (the number of 1 bits in a number). My method using a 256-entry lookup table turned out to be faster than using GCC's popcount intrinsic. (See [1] for a question from someone who encountered a similar situation. The solution provided there was to compile with -march=native, but I think I tried that without success.)

I ran into a similar problem in another program where I needed 128-bit integers. I implemented them myself as a C struct containing two 64-bit integers. Then I discovered that GCC has its own 128-bit integer type. But that type was somehow slower than my implementation!

[1] https://stackoverflow.com/questions/52161596/why-is-builtin-popcount-slower-than-my-own-bit-counting-function
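For concreteness, here is a minimal sketch of the kind of 256-entry table popcount the comment describes; the names (popcount_table, popcount64_lut) are invented, not the commenter's code. The natural comparison point is GCC's __builtin_popcountll, ideally compiled with -march=native so the compiler may emit the hardware POPCNT instruction:

```c
#include <stdint.h>

/* 256-entry table: popcount_table[b] = number of 1 bits in byte b. */
static uint8_t popcount_table[256];

static void init_popcount_table(void) {
    for (int i = 0; i < 256; i++) {
        int bits = 0;
        for (int v = i; v != 0; v >>= 1)
            bits += v & 1;
        popcount_table[i] = (uint8_t)bits;
    }
}

/* Table-driven popcount of a 64-bit word: one L1 load per byte. */
static int popcount64_lut(uint64_t x) {
    int n = 0;
    for (int i = 0; i < 8; i++) {   /* 8 byte-sized lookups */
        n += popcount_table[x & 0xFF];
        x >>= 8;
    }
    return n;
}
```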
fluoridation · over 1 year ago

This is very long for something that's quite simple: if your input space is small enough that it fits into an 8- or 16-bit LUT, try it. If it's faster, use the LUT; if it's not, don't. Really, that's all that needs to be said. The cases where LUTs this small crop up are so rare that you lose nothing by trying. There's no use in trying to predict the performance when you can just measure it.
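As a minimal sketch of that "just measure it" advice (all names here are hypothetical, and the function-pointer indirection is a deliberate simplification that penalizes both sides equally), one might time a table-driven byte transform against a computed one:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* A toy measurement harness. A serious benchmark would also pin the
   CPU, warm up the caches, and let the compiler inline the callee. */
static uint8_t lut[256];

static uint8_t transform_alu(uint8_t x) { return (uint8_t)(x * 167u + 13u); }
static uint8_t transform_lut(uint8_t x) { return lut[x]; }

static double time_one(uint8_t (*f)(uint8_t), long iters) {
    struct timespec t0, t1;
    volatile uint8_t sink = 0;      /* keeps the loop from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        sink = f((uint8_t)i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    for (int i = 0; i < 256; i++)   /* tabulate the same function */
        lut[i] = transform_alu((uint8_t)i);
    printf("lut: %.3fs  alu: %.3fs\n",
           time_one(transform_lut, 100000000L),
           time_one(transform_alu, 100000000L));
    return 0;
}
```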
rented_mule · over 1 year ago

I ran into CPU cache issues in a slightly different application 15+ years ago. I was working on indexing gigabytes of text on a mobile CPU (before smartphones caused massive investment in such CPUs). Word normalization logic (e.g., sky/skies/sky's -> sky) was very slow, so I used a word cache, which sped it up immensely.

The initial version of the word cache cleared the entire cache when it hit a certain number of entries, just as a placeholder. When I got around to adding some LRU eviction logic, it became faster on our desktop simulator but far slower on the embedded device (slower than with no word cache at all). The difference came down to the CPU cache size and strategy differences between the desktop and mobile CPUs.

We ended up shipping the "dumb" word cache logic because it was so much faster in practice. The eviction function was only two lines of code plus a large comment acknowledging that "yes, this looks dumb" and imploring that the speed of any future improvement be compared against the dumb version on the target device. Testing real data on the hardware used in practice is incredibly important.
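A hypothetical sketch of such a "dumb" cache, assuming an open-addressed hash table whose only eviction policy is wiping everything once it is half full (all names and sizes invented, not the shipped code):

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 4096            /* power of two: cheap index masking */
#define MAX_WORD    31

struct entry {
    char key[MAX_WORD + 1];         /* raw word; "" marks a free slot */
    char value[MAX_WORD + 1];       /* normalized form, e.g. "skies" -> "sky" */
};

static struct entry cache[CACHE_SLOTS];
static size_t cache_used;

static uint32_t hash_word(const char *s) {
    uint32_t h = 2166136261u;       /* FNV-1a */
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

const char *cache_get(const char *word) {
    uint32_t i = hash_word(word) & (CACHE_SLOTS - 1);
    while (cache[i].key[0]) {
        if (strcmp(cache[i].key, word) == 0)
            return cache[i].value;
        i = (i + 1) & (CACHE_SLOTS - 1);
    }
    return NULL;                    /* miss: caller normalizes, then puts */
}

void cache_put(const char *word, const char *normalized) {
    if (strlen(word) > MAX_WORD || strlen(normalized) > MAX_WORD)
        return;                     /* don't cache oversized words */
    if (cache_used > CACHE_SLOTS / 2) {
        memset(cache, 0, sizeof cache);   /* the entire eviction policy */
        cache_used = 0;
    }
    uint32_t i = hash_word(word) & (CACHE_SLOTS - 1);
    while (cache[i].key[0])
        i = (i + 1) & (CACHE_SLOTS - 1);
    strcpy(cache[i].key, word);
    strcpy(cache[i].value, normalized);
    cache_used++;
}
```

Compared with an LRU list, this does no bookkeeping writes on hits and no pointer chasing, which is plausibly why it behaved so much better on a small-cache mobile CPU.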
jameshart · over 1 year ago

It's not hugely convincing to put a toy model up against a microbenchmark and claim the toy model proves the microbenchmark wrong.

The correct way to compare these approaches is with real code.
dragontamer · over 1 year ago

This is a lot of words about things that probably don't matter.

IMO, the bottleneck of lookup tables is concisely described as follows: modern CPUs can do 2 or 3 lookups in L1 cache per cycle, and 10x to 500x fewer depending on how far away the data sits (L2, L3, near DDR, or remote DDR).

Meanwhile, modern CPU instructions like AES, multiply, XOR, ADD, and more can effectively execute 3, 4, or even more times per clock tick regardless of circumstances.

Don't even talk about cache hierarchies. You are already behind at the L1 cache level due to the relatively few load/store units in modern CPU (or GPU) cores, and hitting L2 or L3 cache gets exponentially worse.
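To illustrate the trade-off being described (a hypothetical example, not from the comment): byte parity computed through a table costs a load that competes for the core's few load ports, while an ALU version is a handful of shift/XOR instructions that need no memory traffic at all:

```c
#include <stdint.h>

static uint8_t parity_table[256];

static void init_parity_table(void) {
    for (int i = 0; i < 256; i++)
        parity_table[i] = (uint8_t)__builtin_parity(i);  /* GCC/Clang builtin */
}

/* One L1 load per call: throughput is bounded by the load ports. */
static inline int parity_lut(uint8_t b) { return parity_table[b]; }

/* Pure ALU: three XORs and a mask, no memory traffic. */
static inline int parity_alu(uint8_t b) {
    b ^= b >> 4;                    /* fold high nibble into low */
    b ^= b >> 2;
    b ^= b >> 1;
    return b & 1;
}
```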
thot_experiment · over 1 year ago

Interesting article, but it's very in the weeds about its use cases, and I wish the caveats about when it applies were stated more clearly. There are many situations where the computation being done is prohibitively expensive relative to the time available; LUTs are great for that sort of thing. IMO people should use LUTs MORE often.
coolhand2120 · over 1 year ago

In my work, LUTs are a reaction to really poor performance, not a matter of course. Every time I reach for one, my inner dialog says “you (probably) should use a lookup table”.
qingcharles · over 1 year ago

I always used to use lookup tables for sin/cos in the days of software 3D rendering.

The developers of Mario 64 did the same, but this video dives into those LUTs and calls them into question:

https://www.youtube.com/watch?v=xFKFoGiGlXQ
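A minimal sketch of the classic software-rendering approach (not the Mario 64 table itself), using 8-bit "binary degrees" so that a full turn is 256 steps and wraparound is free, with no range reduction needed:

```c
#include <math.h>
#include <stdint.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define ANGLE_STEPS 256             /* a full turn in "binary degrees" */

static float sin_table[ANGLE_STEPS];

static void init_sin_table(void) {
    for (int i = 0; i < ANGLE_STEPS; i++)
        sin_table[i] = sinf((float)(i * 2.0 * M_PI / ANGLE_STEPS));
}

static inline float fast_sin(uint8_t angle) { return sin_table[angle]; }

/* cos(x) = sin(x + 90 deg); 90 degrees is 64 of our 256 steps,
   and the uint8_t cast wraps the index for free. */
static inline float fast_cos(uint8_t angle) { return sin_table[(uint8_t)(angle + 64)]; }
```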
capitainenemo · over 1 year ago

A subset of (3) is the reason for Hedgewars' sine lookup table: "when you want the results to be very predictable." Deterministic lockstep has always had issues with floating-point variations across platforms, and I imagine that applies to other things as well...
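One way to see the determinism argument (a sketch of the idea, not Hedgewars' actual generator or format): generate an integer sine table offline once, commit the output to source, and every platform then reads identical values, with no dependence on libm or FPU rounding at runtime:

```c
/* gen_sin_table.c -- run once at build time; commit the printed table. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void) {
    puts("static const int32_t sin4096[256] = {");
    for (int i = 0; i < 256; i++) {
        /* sine scaled to 12-bit fixed point */
        long v = lrint(sin(i * 2.0 * M_PI / 256.0) * 4096.0);
        printf("%s%6ld,%s", (i % 8 == 0) ? "    " : " ", v,
               (i % 8 == 7) ? "\n" : "");
    }
    puts("};");
    return 0;
}
```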
zschoche · over 1 year ago

Well, it always depends on what you look up, i.e., how much time it takes to compute an entry of the given lookup table. If that takes O(2^n) time, where n is the input size, the payback comes quickly. It's good advice to be very skeptical when someone drops general statements like that. Looking for algorithms with the best worst-case time complexity is a good start; asymptotically, they will always win. Theory works. The question is only how big the input has to be before you see it, and that's where the fun starts, I guess.
gpderetta · over 1 year ago

For the most part I agree. This can be generalized with the observation that (a) it is sometimes faster to recompute a function than to cache it, and (b) cache is a limited resource, usually more so than CPU cycles.