Every time I see people trying to eliminate branches, I wonder, do we realize that having long pipelines where a branch misprediction stalls the pipeline is not actually a necessary part of architecture?<p>The pipeline is long because we do lots of analysis and translation on the fly, just in time, which could easily be done in most cases ahead of time, as it's not a very stateful algorithm.<p>This is how Transmeta Crusoe CPUs worked. Imagine NOT caring that you have a branch.<p>After all, if you think about it, all operations are branches. Down to bitwise operations and summing two numbers. It's branching in the ALU, carry bits and what not. You can't compute absolutely anything without looking at the state of one or more bits, and making a decision that changes the outcome. But the reason those branches don't hurt performance is that they're not branches on the main pipeline. These local branches have a very short (or non-existent) "pipeline". And the main pipeline is therefore not affected, even though the actual state of the system is.
> <i>If only there was a clean fast bare-metal language to write all this in..</i><p>The author includes footnotes for "BUT RUST.." and "BUT ZIG..", but how about Nim? Looks like there is a native library implementation of `lowerBound` <a href="https://github.com/nim-lang/Nim/blob/version-2-0/lib/pure/algorithm.nim#L221">https://github.com/nim-lang/Nim/blob/version-2-0/lib/pure/al...</a>
Yes, it's not a "bare-metal" language, but it compiles to one (or two), so it would be interesting to see what it compiles to here.<p>Also, what's wrong with C?
Is that still lower_bound? Maybe I am misreading the code but it looks like this returns any match, not the earliest match (when there are dupes).<p>It’s common to have multiple matches even in a unique list if the comparison function is say looking for a certain string prefix to do autocomplete, but we want the earliest in the list.
I wish all blog posts started the way his does: "You’re a busy person so I’ll first jump right to it. Here it is, the fastest general (and simple) binary search C++ implementation:"
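For readers following along, this is a sketch of the loop in question, reconstructed from the fragments quoted downthread (the original appears in the post itself); the iterator parameter is named `RandomIt` here because `last - first` and `first[length]` require random access:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Reconstructed sketch of the post's sb_lower_bound. The body of the if
// is intended to compile to a conditional move, so the only remaining
// branch is the highly predictable loop back edge.
template <class RandomIt, class T, class Compare>
constexpr RandomIt sb_lower_bound(RandomIt first, RandomIt last,
                                  const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto rem = length % 2;
        length /= 2;
        if (comp(first[length], value))  // cmov, not a conditional jump
            first += length + rem;
    }
    return first;
}
```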
The Zig stdlib does not call out to C++ for binary search. The binary search is currently here: <a href="https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009c9662ba4c0ea77d4/lib/std/sort.zig#L400">https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009...</a><p>(edit: fixed link to not decay, thanks)
I don't get it. The problem with binary search and branches is not the branches themselves, it's the fact that until you have done the comparison, you don't know which memory location in the array to fetch next. It doesn't matter if you use branches or anything else, the question is what do you want the processor to do?<p>There is a data dependency: until I read the middle index, I can't tell if I want to search the data in the upper section or the lower section. I can speculate and issue reads to both. That would fix the dependency, but create more memory traffic. Is this the right trade-off?<p>What do <i>you</i> want? Removing branches is not it.
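The "issue reads to both" idea the comment raises can be sketched concretely: prefetch both possible next midpoints before the comparison resolves. This is a hypothetical illustration (the function name and the GCC/Clang-specific `__builtin_prefetch` are not from the article), not the article's code:

```cpp
#include <cstddef>

// Hypothetical sketch: branchless lower_bound over a sorted int array that
// prefetches BOTH candidate next midpoints each iteration, so one of the
// loads is already in flight whichever way the comparison goes. This trades
// extra memory traffic for a shorter load-to-use dependency chain.
// __builtin_prefetch is a GCC/Clang builtin.
const int* prefetch_lower_bound(const int* base, std::size_t n, int value) {
    std::size_t length = n;
    if (length == 0) return base;
    while (length > 1) {
        std::size_t half = length / 2;
        __builtin_prefetch(&base[half / 2]);         // next mid if we stay low
        __builtin_prefetch(&base[half + half / 2]);  // next mid if we go high
        if (base[half - 1] < value)                  // compiles to a cmov
            base += half;
        length -= half;
    }
    return base + (*base < value);
}
```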
On my Cascade Lake processor "-mllvm -x86-cmov-converter=false" almost halves the performance of the binary search:<p><pre><code> | Benchmark | gcc | clang | clang -cmov |
|-----------|------|-------|-------------|
| slow u32 | 23.4 | 46.7 | 45.8 |
| fast u32 | 18.1 | 19.8 | 31.4 |
</code></pre>
The numbers are nanoseconds/bsearch on a 100 MB uint32 array. Seems to me that clang (15.0.7) is just much worse at optimizing this particular piece of code than gcc (13.2.1). You can see the assembly here <a href="https://godbolt.org/z/cbx5Kdjs6" rel="nofollow noreferrer">https://godbolt.org/z/cbx5Kdjs6</a>. The gcc assembly looks way cleaner to me.
Does anyone know where the "BUT RUST" link was <i>supposed</i> to lead? It seems to be already out of date due to being unversioned, I can't tell whether it's supposed to lead to the middle of the `starts_with` doc comment or not.
Interesting that the results don't hold up with a more complicated comp() function:<p>> For somewhat realistic scenarios of binary searching with a slower comp() function I’ve thought of searching through ids, phone numbers, accounts and keywords. I’ve thus settled on testing searching 8-byte strings.<p>> ...<p>> In this case std::lower_bound is very slightly but consistently faster than sb_lower_bound. To always get the best performance it is possible for libraries to use sb_lower_bound whenever directly working on primitive types and std::lower_bound otherwise.<p>I would like to see the analysis here.
It looks like the unpredictable attribute now influences the cmov conversion pass (as of June 1st, so probably in clang 17/18)<p><a href="https://reviews.llvm.org/D118118" rel="nofollow noreferrer">https://reviews.llvm.org/D118118</a>
>gcc has __builtin_expect_with_probability(cond, 0, 0.5) but it does nothing (tested v10). ↩<p>I wonder what could possibly be the use of this builtin. Branch prediction varies enough between different processors that it seems unlikely that anything useful could be done with a fine-grained probability estimate.
This is not a valid drop-in replacement for lower_bound. Accessing iterators as first[index] only works for random access iterators like vector has. The author may have realized this if they benchmarked on other containers. A forward iterator must be advanced and then dereferenced.
Every post like this makes me update my junior engineer watchlist. Then I have to go patiently explain why mucking about in the code to save one instruction because you read a blog post is a horrible idea. Easily half of all silly junior code is pointless optimization.<p>Having said that, I did enjoy the post.
Oh, C++, the language where templates are extremely powerful but their usability is so poor that code like this is kind of idiomatic and even compiles:<p><pre><code>  template <class ForwardIt, class T, class Compare>
  constexpr ForwardIt sb_lower_bound(
        ForwardIt first, ForwardIt last, const T& value, Compare comp) {
     auto length = last - first;
</code></pre><p>This looks generic, it compiles by itself, it contains a fancy word “ForwardIt”, and it works if you try it with a vector or an array. But you can’t actually subtract forward iterators, and binary searching a range of forward iterators is quite silly.<p>Rust would have required using the correct trait. Maybe C++ will improve the situation in the long run.
Pro tip, the footer link <a href="https://doc.rust-lang.org/src/core/slice/mod.rs.html#2520" rel="nofollow noreferrer">https://doc.rust-lang.org/src/core/slice/mod.rs.html#2520</a> is not a permalink because it doesn't include a commit ID. It looks like it's already linking to the wrong thing now (and the article is only two months old), and I don't know what the right thing was meant to be. I suggest using GitHub permalinks in future, pinned to specific commits.
Wouldn't it be faster to do length & 1 instead of length % 2 and also length >> 1 instead of length / 2? But maybe the compiler does this behind the scenes?
If the compiler doesn't optimize the if(cmp(...)) first += length + rem; perhaps a sign operation could be used when cmp returns a number. Something along the lines of:<p><pre><code> first += (length + rem) * ((sign(cmp(...)) + 1) / 2)
</code></pre>
Has anyone tried that? DDG isn't very helpful.
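The idea above can be sketched for a memcmp-style three-way comp(). Note the mapping has to be "advance when comp is negative" (probed element less than the target), so extracting the sign bit is simpler than the `(sign + 1) / 2` formula. Everything here is a hypothetical illustration, not code from the article:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch: branchless lower_bound over sorted C strings using
// a three-way comparator (strcmp). The sign of the result is mapped to
// 0 or 1 without a branch and used to scale the step.
const char* const* sign_lower_bound(const char* const* first,
                                    std::size_t length, const char* value) {
    while (length > 0) {
        std::size_t rem = length % 2;
        length /= 2;
        int c = std::strcmp(first[length], value);
        // Sign bit of c: 1 iff c < 0 (assumes 32-bit int). In practice
        // a plain (c < 0) compiles to setl, which is also branchless.
        std::size_t advance = static_cast<unsigned>(c) >> 31;
        first += (length + rem) * advance;
    }
    return first;
}
```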
Branches in the comparison function? Doesn't count, as simple functions end up as a conditional move. Branches on the loop back edge? Doesn't count, as there's a branch predictor.<p>I was hoping for a binary search without branches in it. Not really what we got.
I am confused.<p>> Same function interface as std::lower_bound, but 2x faster, and shorter. “branchless” because the if compiles down to a conditional move instruction rather than a branch/conditional jump.<p>Assembly programmers did "fastest branchless binary searches" using cmov decades ago and we didn't need to think about the compiler at all. I know, because I was one of them. Optimizing branches away was and still is a nice pastime. This is not new, or original, and I don't understand what's noteworthy about it.<p>"Elitist", you say? Well ... yeah! At least I know what I'm doing, unlike people who solely rely on and blindly trust the compiler to do their work for them.
Other fast binary searches <a href="https://github.com/sslotin/amh-code/tree/main/binsearch">https://github.com/sslotin/amh-code/tree/main/binsearch</a>
lower_bound and upper_bound are typically implemented in terms of partition_point, which is much more general than this version of lower_bound taking an element.