How I Got a 2x Speedup With One Line of Code

216 points by naftaliharris, over 11 years ago

22 comments

chrismorgan, over 11 years ago
I'm sad to see that the one line of code, with its magic number and comparatively inscrutable contents, is entirely undocumented. Please, if doing things like that: a good commit message is necessary but not sufficient. (Yes, `git blame` exists, but you're not as likely to consult it as you are a comment in the code.) A blog post is handy but not sufficient. A good inline comment is invaluable when returning to the code later if the situation changes, as it typically will in time for such precise, empirically derived performance tweaks.

    // This single line boosts performance by a factor of around 2 on GCC.
    // The optimal lookahead distance i+3 was chosen by experimentation.
    // See http://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/

susi22, over 11 years ago
You can get another 2x speedup if you choose the pivot better. Right now it seems you're doing median of 3.

When you're doing order statistics (i.e. selection/quickselect) with 1M elements you should be very, very close to the theoretical optimum (1.5N) for complexity. See my comment here if you're interested in choosing the best strategy:

https://news.ycombinator.com/item?id=6629117

HTH

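For readers wondering what a "better" pivot choice looks like in practice, one strategy such discussions usually point to is Tukey's ninther (the median of three medians of three), which samples nine elements instead of three. A minimal sketch over a plain double array; the names and sampling positions are illustrative assumptions, not taken from lazysort.c or from the linked comment:

    /* Return the index of the median of a[i], a[j], a[k]. */
    static size_t median3(const double *a, size_t i, size_t j, size_t k)
    {
        if (a[i] < a[j])
            return a[j] < a[k] ? j : (a[i] < a[k] ? k : i);
        else
            return a[i] < a[k] ? i : (a[j] < a[k] ? k : j);
    }

    /* Tukey's ninther over the half-open range [lo, hi). */
    static size_t ninther(const double *a, size_t lo, size_t hi)
    {
        size_t n = hi - lo, step = n / 8, mid = lo + n / 2;
        size_t m1 = median3(a, lo, lo + step, lo + 2 * step);
        size_t m2 = median3(a, mid - step, mid, mid + step);
        size_t m3 = median3(a, hi - 1 - 2 * step, hi - 1 - step, hi - 1);
        return median3(a, m1, m2, m3);
    }
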
akkartik, over 11 years ago
Great article. But it seems somehow unfair to characterize this as a 'one-line change'. He had to run various performance experiments that will be invisible to someone else reading the code, and the optimal prefetch distance might silently shift over time, say if each object grows larger. It's a little like the Carmack magic number in that respect: https://en.wikipedia.org/wiki/Fast_inverse_square_root

I've had this idea for some time that our *representation* of programs is fundamentally broken, because it excludes a detailed characterization of the *input* space. I took a first stab at a solution with http://akkartik.name/post/tracing-tests, which lets me write all my unit tests at the top level, showing how each piece of functionality augments the space of inputs my program can address. See, for example, https://github.com/akkartik/wart/blob/3039fe6d03/literate/023tokenize#L83. I suspect reading other people's 'code' could be an order of magnitude easier and more convenient if 'code' included the right information. I'm still working to flesh this idea out.

liyanchang, over 11 years ago
> No way, substantial speedups can really only come from algorithm changes.

I've often found the reverse to be true: implementation details matter most.

The best fixes tend to be altering my code so that the compiler/CPU can be smarter. In this case, the author hinted the cache. Other times, I've found that looping in a different order helps my cache hit rate. Other times, properly naming variables and not reusing the same name allows the compiler to do smarter things.

Other times, when I've tried a new algorithm, it's slower because of bad constants, or because I've accidentally messed up the implementation enough to negate all the benefits.

As a complete side note, you need to be careful when using "older" algorithms, i.e. those published pre-2000. I once implemented, based on a Hoare paper, a doubly linked list for sparse matrices, and it was significantly slower than the naive method, simply because CPU cache lines have gotten really good while predicting linked-list memory locations has not.

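To make the "looping in a different order" point concrete, here is a minimal, illustrative sketch (not from the article): summing a row-major C array column by column strides across cache lines on every access, while the row-by-row version streams through memory and uses each fetched line fully.

    #include <stddef.h>

    #define N 4096
    static double grid[N][N];   /* row-major: each row is contiguous */

    /* Cache-unfriendly: the inner loop jumps N * sizeof(double) bytes per
     * access, so nearly every read touches a fresh cache line. */
    double sum_column_major(void)
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += grid[i][j];
        return s;
    }

    /* Cache-friendly: same arithmetic, but the inner loop walks
     * consecutive addresses, so each cache line is fully used. */
    double sum_row_major(void)
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += grid[i][j];
        return s;
    }
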
nly, over 11 years ago
I don't think this gain came from prefetching anything. The compiler may simply have decided to treat the entire function, or even the entire file, differently because of that one line.

I compiled (x86, GCC 4.8) lazysort.c both with and without the prefetch intrinsic, and there were substantial differences in the code surrounding calls into Python, but no change at all to the guts of the partition or sorting routines.

Regardless of what is responsible, it's not good to leave something like this in your code without really understanding what it did. There may have been something else that occurred incidentally from this change that could have led you to a better solution.

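One low-effort way to reproduce this experiment is to gate the intrinsic behind a preprocessor flag, so the two builds differ by exactly one -D option, then diff the assembly that gcc -O2 -S emits for each. The macro and the loop below are a sketch under assumed names, not code from lazysort.c:

    /* Build once with -DUSE_PREFETCH and once without, then diff the .s files. */
    #ifdef USE_PREFETCH
    #  define MAYBE_PREFETCH(p) __builtin_prefetch(p)
    #else
    #  define MAYBE_PREFETCH(p) ((void)0)
    #endif

    static void partition_scan(void **items, size_t n)
    {
        for (size_t i = 0; i + 3 < n; i++) {
            MAYBE_PREFETCH(items[i + 3]);  /* hint the object we will touch soon */
            /* ... compare items[i] against the pivot, swap, etc. ... */
        }
    }
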
sp332, over 11 years ago
There are only two really difficult problems in computer science: naming things, cache invalidation, and off-by-one errors.

wffurr, over 11 years ago
>> The moral of the story, at least for me, is that you can occasionally get substantial improvements by understanding what is actually happening under the hood of your application, rather than fundamentally changing your application itself.

Also known as the rule of leaky abstractions, as in: all abstractions are leaky.

Scaevolus, over 11 years ago
Note that `ob_item[i+3]` could read past the end of the array.

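The prefetch hint itself does not fault on a bad address on x86, but the load of ob_item[i+3] that produces that address can indeed run past the end of the array, which is undefined behavior in C. A minimal guarded sketch, with names assumed from the article rather than copied from lazysort.c:

    for (Py_ssize_t i = lo; i < hi; i++) {
        if (i + 3 < hi)                        /* stay inside the array */
            __builtin_prefetch(ob_item[i + 3]);
        /* ... compare *ob_item[i] against the pivot, swap, etc. ... */
    }
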
AznHisoka, over 11 years ago
I got a 1000x speedup by removing a sleep(100000) line once. Not as impressive though.

Someone, over 11 years ago
I would expect that the magic number 3 is related to the size of a cache line relative to the size of an array element, and to the time spent in each loop iteration. You want to prefetch the next cache line so that it is just available by the time you start processing it (and, ideally, not a moment earlier, because that would push data out of the cache that could still be used before the requested data is needed).

A robust implementation should figure out the size of a cache line and the load delay, and derive the magic offset from that.

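A hedged sketch of what "derive the magic offset" might look like: ask the OS for the L1 data cache line size and turn it into an element count. The _SC_LEVEL1_DCACHE_LINESIZE name is a glibc extension, and the "one cache line ahead, at least one element" heuristic is purely an illustrative assumption; as this comment notes, the real optimum also depends on memory latency and on how long each loop iteration takes, so the result still needs measuring.

    #include <stddef.h>
    #include <unistd.h>

    /* Illustrative heuristic, not the article's rule: prefetch roughly one
     * cache line's worth of array slots ahead of the element in hand. */
    static long prefetch_distance(size_t elem_size)
    {
        long line = 64;                        /* fallback if sysconf can't say */
    #ifdef _SC_LEVEL1_DCACHE_LINESIZE
        long reported = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (reported > 0)
            line = reported;
    #endif
        long dist = line / (long)elem_size;
        return dist > 0 ? dist : 1;
    }
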
temujin, over 11 years ago
Er, are you aware of the C++ standard library's std::nth_element() function? (Even my "pure C" programs tend to be compiled with g++ to gain access to that and std::sort().)

Amadou, over 11 years ago
> Now, I didn't just "know" to prefetch the PyObject three indices forward in the array. I arrived at this value through experimentation.

There is a good chance that 3 is specific to the hardware you tested on. Different systems will have different memory latencies (and NUMA systems can make it even more complicated). It isn't likely that 3 will turn into a pathological case on any other hardware, but it may well end up being less optimal than, say, 2 or 4.

m_mueller, over 11 years ago
This sort of thing is why porting naive x86 code to the GPU programming model (scalar programs, aka 'kernels', streamed through the device using a launch configuration) often results in an amazing 15-20x speedup over a 6-core CPU, while the expected speedup is only 5-7x. x86 compilers often don't get the prefetching right in loops using only their heuristics, so you have to talk to them. Then you begin to doubt the compiler more and more, carving out a few percentage points here and there by adding vector intrinsics, loop unrolling and so on. And before you notice it, your codebase is bigger, more complex and more error-prone than if you had ported it to the GPU from the beginning.

Please note: of course this insight doesn't apply in your case, since your algorithm is meant for general-purpose usage, not HPC programming on systems where you have the hardware under your control. I'm simply urging people, where the latter applies, to think about whether their algorithms might be suitable for the GPU instead. Doing lots of prefetch instructions and loop unrolls is a sign of that.

gridspy, over 11 years ago
Nice article.

One thing that wasn't clear to me in your explanation is that you're sorting a list of pointers to objects. To sort that list you're comparing two of the objects themselves.

    // follows pointer to access *ob_item[i]
    IF_LESS_THAN(ob_item[i], pivot)

So the point of __builtin_prefetch is to dereference this pointer in advance and so avoid the latency of the 12 million reads scattered all over memory. Nice.

Another useful thing to do here is to see if you can find a "streaming operator" to dismiss *ob_item[i] and ob_item[i] from the cache after the comparison. They won't be needed again for a while.

Another good article on this optimisation (mentioned by OP):

http://scripts.mit.edu/~birge/blog/accelerating-code-using-gccs-prefetch-extension/

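One concrete way to express that "use it once, then let it go" intent with the same intrinsic: __builtin_prefetch accepts optional rw and locality arguments, and a locality of 0 tells the compiler the data has no temporal locality, i.e. it need not be kept in cache after the access. A sketch under assumed names (not the article's code); whether it actually helps is worth measuring, since quickselect does revisit elements on one side of the partition in later passes.

    for (Py_ssize_t i = lo; i + 3 < hi; i++) {
        /* rw = 0: we will only read it; locality = 0: don't keep it cached. */
        __builtin_prefetch(ob_item[i + 3], 0, 0);
        /* ... compare *ob_item[i] against the pivot ... */
    }
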
minimax, over 11 years ago
This is a great post. I guess the idea here is that you can start bringing future objects to be compared up through the cache hierarchy while you are doing the comparison on the current object. If that's the case, then I think the speedup from prefetching will depend on the speed of the comparison, which in turn will depend on the type of objects in the collection.

In your post it says you have a collection of 10MM elements but you didn't say what they were. Are they just ints or something else?

Nimi, over 11 years ago
Funny, I just experimented with __builtin_prefetch this week and got no speedup. Does anyone know whether the kernel's list.h always triggers hardware prefetching when going over a list?

BTW, my case wasn't the problem the kernel maintainers encountered with small lists, detailed here:

http://lwn.net/Articles/444336/

alextingle, over 11 years ago
I'm pretty sure you could have saved yourself a lot of work by just using profile-guided optimisation (PGO). It's mature and ready to go, in GCC at least.

(Integrating it into your build process is a little bit more challenging, I admit. I've set it up in a large dev environment used by multiple large projects, and it was well, well worth the effort.)

jheriko, over 11 years ago
It's one of those things... micro-optimisation becomes the big win once you've done enough algorithm optimisation. Even the humble memory copy becomes faster from prefetching, clever use of registers... and from actually making the algorithm theoretically worse. :P

Also, note that it is possible to remove the magic from your magic number with some thought about latency, the size and alignment of your data, etc. Fetching a cache line takes a fairly predictable number of cycles.

The opposite of the prefetch is the non-temporal read/write, where you don't pollute the cache, to prevent big one-off copies from damaging the performance of other code...

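A sketch of the non-temporal write side of that idea, using the SSE2 streaming-store intrinsic. This is illustrative of the technique only, not anything from the article, and it assumes 16-byte-aligned buffers and a length that is a multiple of 16:

    #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128 */
    #include <stddef.h>

    /* Copy without polluting the cache with the destination: streaming
     * stores go through write-combining buffers straight to memory rather
     * than filling cache lines, so a big one-off copy doesn't evict data
     * that other code is still using. */
    static void copy_nontemporal(void *dst, const void *src, size_t n)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n / 16; i++)
            _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();        /* order the streaming stores before returning */
    }
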
joosters, over 11 years ago
Interesting post! I wonder if the author experimented with different gcc command-line optimisation options as well? gcc *might* be able to insert some prefetches by itself with some settings?

mattholtom, over 11 years ago
Came expecting http://thedailywtf.com/Articles/The-Speedup-Loop.aspx, was disappointed...

jheriko, over 11 years ago
Also, read this: http://www.gamedev.net/page/resources/_/technical/graphics-programming-and-theory/graphics-programming-black-book-r1698

clintonc, over 11 years ago
I get at least that good with

    DEFINT A-Z