Binary Search: A new implementation that is up to 25% faster

171 pointsby scandumalmost 5 years ago

8 comments

ghjalmost 5 years ago

The readme doesn't describe why it's faster. Looking at the code of the variant that is the fastest: <a href="https://github.com/scandum/binary_search/blob/master/binary-search.c#L201" rel="nofollow">https://github.com/scandum/binary_search/blob/master/binary-...</a>It seems like it's a quaternary search which seems like an arbitrary "magic number" of interior points. It's easy to understand what it does if you already know other variants like ternary search (cut search space into 3 to pick 2 interior points) or golden section search (same thing as ternary except in golden ratio). Here, quaternary search is just picking 3 interior points after dividing into 4 parts.So the speed up is the same as how b-trees get their speedup: increase branching factor which costs more comparisons but reduces the depth. I might be wrong, but instead of quaternary it could also be 5-ary or 8-ary or any B-ary and any of these variants can also have the potential to perform better.Just a tradeoff between: cost_of_divide * log_B(N) + cost_of_compare * (B - 1) * log_B(N)EDIT: Thinking about it more, the divison doesn't seem like it should be the most expensive operation (especially relative to compares/branching). Anyone have any better ideas on why you would prefer more compares? Is it some pipelining thing?

评论 #23894696 未加载

评论 #23895445 未加载

评论 #23895625 未加载

评论 #23906750 未加载

ravalmost 5 years ago

I checked the plot against an old school project where I investigated binary search, and I got similar numbers for the "standard" binary search (std::lower_bound in my project): For N=1000,10000,100000 it took 713,925,1264 us (compare to the author's 750,960,1256 us).In my project I didn't test "quaternary search", but instead the winner was the Eytzinger layout that others have discussed, with 570,808,1044 us (compare to author's quaternary search: 557,764,1009 us).It would be interesting to try a 4-way Eytzinger layout.My project report, if anyone's interested: <a href="https://dl.strova.dk/ksb-manuel-rav-algorithm-engineering-20200317.pdf" rel="nofollow">https://dl.strova.dk/ksb-manuel-rav-algorithm-engineering-20...</a>EDIT: I reuploaded the linked PDF since the first one I uploaded was missing the source code.

eloffalmost 5 years ago

Binary search can be implemented using conditional move instructions instead of branching. Given the branches are unpredictable, this yields a huge speedup. It can be implemented like this in pure C, if written in a way the compiler can recognize. In my experience this is the fastest binary search.[1] <a href="https://pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/" rel="nofollow">https://pvk.ca/Blog/2012/07/03/binary-search-star-eliminates...</a>

评论 #23894709 未加载

评论 #23894694 未加载

pbiggaralmost 5 years ago

This is really cool!About a decade ago I did research on sorting and searching, which was the same type of work going on here (<a href="https://github.com/pbiggar/sorting-branches-caching" rel="nofollow">https://github.com/pbiggar/sorting-branches-caching</a>). I found it was extremely difficult to accurately say whether one algorithm is faster than another in the general case. I found a bunch of speed improvements that probably only apply in processors with very long pipelines (like the P4)).Execution speed is probably the right metric here, but it makes it hard to understand _why_ it's faster.PS. Looking at "boundless interpolated search" (<a href="https://github.com/scandum/binary_search/blob/master/binary-search.c#L234" rel="nofollow">https://github.com/scandum/binary_search/blob/master/binary-...</a>), it seems it's missing a `++checks`. I initially thought this could be the cause of a misreporting it as being faster, but I see the benchmark is run-time based so that wouldn't cause it unless the incrementing itself is the bottleneck.

utopcellalmost 5 years ago

It is nice to see benchmarks of elementary algorithms, but none of these is new. Benchmarking would be more convincing if more distributions were used. For example, we know that interpolated search is O(lglgn) for uniform distributions that are used here, but can be linear in pathological cases.

评论 #23897492 未加载

评论 #23897395 未加载

diehundealmost 5 years ago

I've seen most real-world applications that use in-memory search, use balanced binary search trees. Are any of these improvements applicable to those data structures? AVL and Red-black trees come to mind.

评论 #23894623 未加载

dehrmannalmost 5 years ago

On the topic of binary searches, I was wondering if they can be done without any branching so you can avoid branch misses and possibly parallelize it with SIMD instructions. I think I was able to get it with some optimizer flags and GCC built-ins.<a href="https://github.com/ehrmann/branchless-binary-search" rel="nofollow">https://github.com/ehrmann/branchless-binary-search</a>

评论 #23898088 未加载

peter_d_shermanalmost 5 years ago

If the data to be binary searched is static, that is, if it doesn't change, and if it fits entirely in RAM, then what I would do is as follows:1) Pre-compute the first mid/center element, C1.2) Move the data for this mid/center element to the first item in a new array.3) Pre-compute the next two mid/center elements, that is, the ones between the start of the data and C1 (C2), and the one between C1 and the end of data (C3).4) Move the data from C2 and C3 positions to the next positions in our array, the 2nd and 3rd position.5) Keep repeating this pattern. For every iteration/split there are 2 times the amount of midpoints/data than the previous iteration. Order these linearly as the next items in our new array.When you're done, two things will occur.1) You will use a slightly different binary search algorithm.That is because you no longer need to compute a mid-point at every iteration, those are now pre-computed in the ordered array.2) Because the data is now ordered, it becomes possible to load the tip of that data into the CPU's L1, L2, and L3 cache. If let's say your binary search takes 16 iterations to complete, then you might get a good headstart of 5-8 iterations (or more, depending on cache size and data size) of that data being in cache RAM, which will make those iterations MUCH faster.Also, (and this is just me), but if your program has appropriate permissions to temporarily shut off interrupts (x86 cli sti -- or OS API equivalent), then this search can be that much faster (well, depends on what the overhead for cli/sti and API calls are, but test, test, test! (also, always shut off the network and other threads when you're testing, as they can skewer the results!) <g>)<a href="https://en.wikipedia.org/wiki/Memory_hierarchy" rel="nofollow">https://en.wikipedia.org/wiki/Memory_hierarchy</a>"Almost all programming can be viewed as an exercise in caching." --Terje Mathisen"Assume Nothing" --Mike AbrashAlso, there is no such thing as the fastest binary search algorithm... there's always a better way to do them...To paraphrase Bruce Lee:"There are no mountain tops, there is only an endless series of plateaus, and you must ever seek to go beyond them..."