> No way, substantial speedups can really only come from algorithm changes.<p>I've often found the reverse to be true: implementation details matter most.<p>The best fixes tend to be altering my code so that the compiler/CPU can be smarter. In this case, the author hinted the cache. Other times, I've found that looping in a different order helps my cache hit rate. Other times, properly naming variables and not reusing the same one lets the compiler do smarter things.<p>Other times, when I've tried a new algorithm, it's been slower because of bad constants, or I've accidentally messed up the implementation badly enough to negate all the benefits.<p>As a complete side note, you need to be careful when using "older" algorithms, i.e. those published pre-2000. I once implemented, based on a Hoare paper, a doubly linked list for sparse matrices, and it was significantly slower than the naive method, simply because CPU cache lines and prefetchers have gotten really good at sequential access, while predicting linked list memory locations has not gotten any easier.
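To make the loop-order point concrete, here's a minimal sketch (my own toy example, not from the article): both functions below compute the same sum over the same row-major array, but the first walks memory sequentially while the second strides a full row's width per access, which tends to be several times slower on matrices too big for cache.

```c
#include <assert.h>
#include <stdlib.h>

#define N 1024  /* 1024x1024 doubles = 8 MiB, larger than most L2 caches */

/* Row-major traversal: the inner loop touches consecutive addresses,
   so each 64-byte cache line is fully used and the hardware prefetcher
   can stay ahead of the loop. */
double sum_row_major(const double *m) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i * N + j];   /* sequential access */
    return s;
}

/* Column-major traversal of the same row-major array: the inner loop
   jumps N * sizeof(double) bytes each iteration, so nearly every access
   pulls in a fresh cache line and uses only 8 of its 64 bytes. */
double sum_col_major(const double *m) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i * N + j];   /* strided, cache-hostile access */
    return s;
}
```

Same algorithm, same asymptotic complexity, same answer; the only change is iteration order, which is exactly the kind of "implementation detail" win I mean.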