
You probably shouldn't use a lookup table (2022)

96 points · by segfaultbuserr · over 1 year ago

11 comments

Tommah · over 1 year ago

Eh, you should probably test it either way. Performance can be very unpredictable. I ran into a weird case years ago when I was writing a program that needed to calculate popcount (the number of 1 bits in a number). My method using a 256-entry lookup table turned out to be faster than using GCC's popcount intrinsic. (See [1] for a question from someone who encountered a similar situation. The solution provided there was to compile with -march=native, but I think I tried that without success.)

I ran into a similar problem in another program where I needed 128-bit integers. I implemented them myself as a C struct containing two 64-bit integers. Then I discovered that GCC has its own 128-bit integer type. But that type was somehow slower than my implementation!

[1] https://stackoverflow.com/questions/52161596/why-is-builtin-popcount-slower-than-my-own-bit-counting-function
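For concreteness, here is a minimal sketch of the kind of 256-entry table popcount the comment describes; the names (popcount_table, popcount64_lut) are invented, not the commenter's code. The natural comparison point is GCC's __builtin_popcountll, ideally compiled with -march=native so the compiler may emit the hardware POPCNT instruction:

```c
#include <stdint.h>

/* 256-entry table: popcount_table[b] = number of 1 bits in byte b. */
static uint8_t popcount_table[256];

static void init_popcount_table(void) {
    for (int i = 0; i < 256; i++) {
        int bits = 0;
        for (int v = i; v != 0; v >>= 1)
            bits += v & 1;
        popcount_table[i] = (uint8_t)bits;
    }
}

/* Table-driven popcount of a 64-bit word: one L1 load per byte. */
static int popcount64_lut(uint64_t x) {
    int n = 0;
    for (int i = 0; i < 8; i++) {   /* 8 byte-sized lookups */
        n += popcount_table[x & 0xFF];
        x >>= 8;
    }
    return n;
}
```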
fluoridation · over 1 year ago

This is very long for something that's quite simple: if your input space is small enough that it fits into an 8- or 16-bit LUT, try it. If it's faster, use the LUT; if it's not, don't. Really, that's all that needs to be said. The cases where LUTs this small crop up are so rare that you lose nothing by trying. There's no use in trying to predict the performance when you can just measure it.
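As a minimal sketch of that "just measure it" advice (all names here are hypothetical, and the function-pointer indirection is a deliberate simplification that penalizes both sides equally), one might time a table-driven byte transform against a computed one:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* A toy measurement harness. A serious benchmark would also pin the
   CPU, warm up the caches, and let the compiler inline the callee. */
static uint8_t lut[256];

static uint8_t transform_alu(uint8_t x) { return (uint8_t)(x * 167u + 13u); }
static uint8_t transform_lut(uint8_t x) { return lut[x]; }

static double time_one(uint8_t (*f)(uint8_t), long iters) {
    struct timespec t0, t1;
    volatile uint8_t sink = 0;      /* keeps the loop from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        sink = f((uint8_t)i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    for (int i = 0; i < 256; i++)   /* tabulate the same function */
        lut[i] = transform_alu((uint8_t)i);
    printf("lut: %.3fs  alu: %.3fs\n",
           time_one(transform_lut, 100000000L),
           time_one(transform_alu, 100000000L));
    return 0;
}
```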
rented_mule · over 1 year ago

I ran into CPU cache issues in a slightly different application 15+ years ago. I was working on indexing gigabytes of text on a mobile CPU (before smartphones caused massive investment in such CPUs). Word normalization logic (e.g., sky/skies/sky's -> sky) was very slow, so I used a word cache, which sped it up immensely.

The initial version of the word cache cleared the entire cache when it hit a certain number of entries, just as a placeholder. When I got around to adding some LRU eviction logic, it became faster on our desktop simulator but far slower on the embedded device (slower than with no word cache at all). The difference came down to the CPU cache size and strategy differences between the desktop and mobile CPUs.

We ended up shipping the "dumb" word cache logic because it was so much faster in practice. The eviction function was only two lines of code plus a large comment acknowledging that "yes, this looks dumb" and imploring that the speed of any future improvement be compared against the dumb version on the target device. Testing real data on the hardware used in practice is incredibly important.
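A hypothetical sketch of such a "dumb" cache, assuming an open-addressed hash table whose only eviction policy is wiping everything once it is half full (all names and sizes invented, not the shipped code):

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 4096            /* power of two: cheap index masking */
#define MAX_WORD    31

struct entry {
    char key[MAX_WORD + 1];         /* raw word; "" marks a free slot */
    char value[MAX_WORD + 1];       /* normalized form, e.g. "skies" -> "sky" */
};

static struct entry cache[CACHE_SLOTS];
static size_t cache_used;

static uint32_t hash_word(const char *s) {
    uint32_t h = 2166136261u;       /* FNV-1a */
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

const char *cache_get(const char *word) {
    uint32_t i = hash_word(word) & (CACHE_SLOTS - 1);
    while (cache[i].key[0]) {
        if (strcmp(cache[i].key, word) == 0)
            return cache[i].value;
        i = (i + 1) & (CACHE_SLOTS - 1);
    }
    return NULL;                    /* miss: caller normalizes, then puts */
}

void cache_put(const char *word, const char *normalized) {
    if (strlen(word) > MAX_WORD || strlen(normalized) > MAX_WORD)
        return;                     /* don't cache oversized words */
    if (cache_used > CACHE_SLOTS / 2) {
        memset(cache, 0, sizeof cache);   /* the entire eviction policy */
        cache_used = 0;
    }
    uint32_t i = hash_word(word) & (CACHE_SLOTS - 1);
    while (cache[i].key[0])
        i = (i + 1) & (CACHE_SLOTS - 1);
    strcpy(cache[i].key, word);
    strcpy(cache[i].value, normalized);
    cache_used++;
}
```

Compared with an LRU list, this does no bookkeeping writes on hits and no pointer chasing, which is plausibly why it behaved so much better on a small-cache mobile CPU.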
jameshart · over 1 year ago

It's not hugely convincing to put a toy model up against a microbenchmark and claim the toy model proves the microbenchmark wrong.

The correct way to compare these approaches is with real code.
dragontamer · over 1 year ago

This is a lot of words about things that probably don't matter.

IMO, the bottleneck of lookup tables is concisely described as follows: modern CPUs can do 2 or 3 lookups in L1 cache per cycle, and 10x to 500x fewer depending on how far away the data sits (L2, L3, near DDR, or remote DDR).

Meanwhile, modern CPU instructions like AES, multiply, XOR, ADD, and more can effectively execute 3, 4, or even more times per clock tick regardless of circumstances.

Don't even talk about cache hierarchies. You are already behind at the L1 cache level due to the relatively few load/store units in modern CPU (or GPU) cores, and hitting L2 or L3 cache gets exponentially worse.
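To illustrate the trade-off being described (a hypothetical example, not from the comment): byte parity computed through a table costs a load that competes for the core's few load ports, while an ALU version is a handful of shift/XOR instructions that need no memory traffic at all:

```c
#include <stdint.h>

static uint8_t parity_table[256];

static void init_parity_table(void) {
    for (int i = 0; i < 256; i++)
        parity_table[i] = (uint8_t)__builtin_parity(i);  /* GCC/Clang builtin */
}

/* One L1 load per call: throughput is bounded by the load ports. */
static inline int parity_lut(uint8_t b) { return parity_table[b]; }

/* Pure ALU: three XORs and a mask, no memory traffic. */
static inline int parity_alu(uint8_t b) {
    b ^= b >> 4;                    /* fold high nibble into low */
    b ^= b >> 2;
    b ^= b >> 1;
    return b & 1;
}
```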
thot_experiment · over 1 year ago

Interesting article, but it's very in the weeds about its use cases, and I wish the caveats about when it applies were stated more clearly. There are many situations where the computation being done is prohibitively expensive relative to the time available; LUTs are great for that sort of thing. IMO people should use LUTs MORE often.
coolhand2120 · over 1 year ago

In my work, LUTs are a reaction to really poor performance, not a matter of course. Every time I reach for one, my inner dialog says “you (probably) should use a lookup table”.
qingcharles · over 1 year ago

I always used to use lookup tables for sin/cos in the days of software 3D rendering.

The developers of Mario 64 did the same, but this video dives into those LUTs and calls them into question:

https://www.youtube.com/watch?v=xFKFoGiGlXQ
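A minimal sketch of the classic software-rendering approach (not the Mario 64 table itself), using 8-bit "binary degrees" so that a full turn is 256 steps and wraparound is free, with no range reduction needed:

```c
#include <math.h>
#include <stdint.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define ANGLE_STEPS 256             /* a full turn in "binary degrees" */

static float sin_table[ANGLE_STEPS];

static void init_sin_table(void) {
    for (int i = 0; i < ANGLE_STEPS; i++)
        sin_table[i] = sinf((float)(i * 2.0 * M_PI / ANGLE_STEPS));
}

static inline float fast_sin(uint8_t angle) { return sin_table[angle]; }

/* cos(x) = sin(x + 90 deg); 90 degrees is 64 of our 256 steps,
   and the uint8_t cast wraps the index for free. */
static inline float fast_cos(uint8_t angle) { return sin_table[(uint8_t)(angle + 64)]; }
```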
capitainenemo · over 1 year ago

A subset of (3) is the reason for Hedgewars' sine lookup table: "when you want the results to be very predictable." Deterministic lockstep has always had issues with floating-point variations across platforms, and I imagine that applies to other things as well...
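One way to see the determinism argument (a sketch of the idea, not Hedgewars' actual generator or format): generate an integer sine table offline once, commit the output to source, and every platform then reads identical values, with no dependence on libm or FPU rounding at runtime:

```c
/* gen_sin_table.c -- run once at build time; commit the printed table. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void) {
    puts("static const int32_t sin4096[256] = {");
    for (int i = 0; i < 256; i++) {
        /* sine scaled to 12-bit fixed point */
        long v = lrint(sin(i * 2.0 * M_PI / 256.0) * 4096.0);
        printf("%s%6ld,%s", (i % 8 == 0) ? "    " : " ", v,
               (i % 8 == 7) ? "\n" : "");
    }
    puts("};");
    return 0;
}
```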
zschoche · over 1 year ago

Well, it always depends on what you look up, i.e., how much time it takes to compute an entry of the given lookup table. If that takes O(2^n) time, where n is the input size, the payback comes quickly. It's good advice to be very skeptical when someone drops general statements like that. Looking for algorithms with the best worst-case time complexity is a good start; asymptotically, they will always win. Theory works. The question is only how big the input has to be before you see it, and that's where the fun starts, I guess.
gpderetta · over 1 year ago

For the most part I agree. This can be generalized with the observation that (a) it is sometimes faster to recompute a function than to cache it, and (b) cache is a limited resource, usually more so than CPU cycles.