Every time I see people trying to eliminate branches, I wonder, do we realize that having long pipelines where a branch misprediction stalls the pipeline is not actually a necessary part of architecture?<p>The pipeline is long because we do lots of analysis and translation on the fly, just in time, which could easily be done in most cases ahead of time, as it's not a very stateful algorithm.<p>This is how Transmeta Crusoe CPUs worked. Imagine NOT caring that you have a branch.<p>After all, if you think about it, all operations are branches. Down to bitwise operations and summing two numbers. It's branching in the ALU, carry bits and what not. You can't compute absolutely anything without looking at the state of one or more bits, and making a decision that changes the outcome. But the reason those branches don't hurt performance is that they're not branches on the main pipeline. These local branches have a very short (or non-existent) "pipeline". And the main pipeline is therefore not affected, even though the actual state of the system is.
> <i>If only there was a clean fast bare-metal language to write all this in..</i><p>The author includes footnotes for "BUT RUST.." and "BUT ZIG..", but how about Nim? Looks like there is a native library implementation of `lowerBound` <a href="https://github.com/nim-lang/Nim/blob/version-2-0/lib/pure/algorithm.nim#L221">https://github.com/nim-lang/Nim/blob/version-2-0/lib/pure/al...</a>
Yes, it's not a "bare-metal" language, but it compiles to one (or two), so it would be interesting to see what it compiles to here.<p>Also, what's wrong with C?
Is that still lower_bound? Maybe I am misreading the code but it looks like this returns any match, not the earliest match (when there are dupes).<p>It’s common to have multiple matches even in a unique list if the comparison function is say looking for a certain string prefix to do autocomplete, but we want the earliest in the list.
I wish all blog posts started the way his does: "You’re a busy person so I’ll first jump right to it. Here it is, the fastest general (and simple) binary search C++ implementation:"
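For readers following along, this is a sketch of the loop in question, reconstructed from the fragments quoted downthread (the original appears in the post itself); the iterator parameter is named `RandomIt` here because `last - first` and `first[length]` require random access:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Reconstructed sketch of the post's sb_lower_bound. The body of the if
// is intended to compile to a conditional move, so the only remaining
// branch is the highly predictable loop back edge.
template <class RandomIt, class T, class Compare>
constexpr RandomIt sb_lower_bound(RandomIt first, RandomIt last,
                                  const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto rem = length % 2;
        length /= 2;
        if (comp(first[length], value))  // cmov, not a conditional jump
            first += length + rem;
    }
    return first;
}
```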
The Zig stdlib does not call out to C++ for binary search. The binary search is currently here: <a href="https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009c9662ba4c0ea77d4/lib/std/sort.zig#L400">https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009...</a><p>(edit: fixed link to not decay, thanks)
I don't get it. The problem with binary search and branches is not the branches themselves, it's the fact that until you have done the comparison, you don't know which memory location in the array to fetch next. It doesn't matter if you use branches or anything else, the question is what do you want the processor to do?<p>There is a data dependency: until I read the middle index, I can't tell if I want to search the data in the upper section or the lower section. I can speculate and issue reads to both. That would fix the dependency, but create more memory traffic. Is this the right trade-off?<p>What do <i>you</i> want? Removing branches is not it.
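The "issue reads to both" idea the comment raises can be sketched concretely: prefetch both possible next midpoints before the comparison resolves. This is a hypothetical illustration (the function name and the GCC/Clang-specific `__builtin_prefetch` are not from the article), not the article's code:

```cpp
#include <cstddef>

// Hypothetical sketch: branchless lower_bound over a sorted int array that
// prefetches BOTH candidate next midpoints each iteration, so one of the
// loads is already in flight whichever way the comparison goes. This trades
// extra memory traffic for a shorter load-to-use dependency chain.
// __builtin_prefetch is a GCC/Clang builtin.
const int* prefetch_lower_bound(const int* base, std::size_t n, int value) {
    std::size_t length = n;
    if (length == 0) return base;
    while (length > 1) {
        std::size_t half = length / 2;
        __builtin_prefetch(&base[half / 2]);         // next mid if we stay low
        __builtin_prefetch(&base[half + half / 2]);  // next mid if we go high
        if (base[half - 1] < value)                  // compiles to a cmov
            base += half;
        length -= half;
    }
    return base + (*base < value);
}
```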
On my Cascade Lake processor "-mllvm -x86-cmov-converter=false" almost halves the performance of the binary search:<p><pre><code> | Benchmark | gcc | clang | clang -cmov |
|-----------|------|-------|-------------|
| slow u32 | 23.4 | 46.7 | 45.8 |
| fast u32 | 18.1 | 19.8 | 31.4 |
</code></pre>
The numbers are nanoseconds/bsearch on a 100 MB uint32 array. Seems to me that clang (15.0.7) is just much worse at optimizing this particular piece of code than gcc (13.2.1). You can see the assembly here <a href="https://godbolt.org/z/cbx5Kdjs6" rel="nofollow noreferrer">https://godbolt.org/z/cbx5Kdjs6</a>. The gcc assembly looks way cleaner to me.
Does anyone know where the "BUT RUST" link was <i>supposed</i> to lead? It seems to be already out of date due to being unversioned, I can't tell whether it's supposed to lead to the middle of the `starts_with` doc comment or not.
Interesting that the results don't hold up with a more complicated comp() function:<p>> For somewhat realistic scenarios of binary searching with a slower comp() function I’ve thought of searching through ids, phone numbers, accounts and keywords. I’ve thus settled on testing searching 8-byte strings.<p>> ...<p>> In this case std::lower_bound is very slightly but consistently faster than sb_lower_bound. To always get the best performance it is possible for libraries to use sb_lower_bound whenever directly working on primitive types and std::lower_bound otherwise.<p>I would like to see the analysis here.
It looks like the unpredictable attribute now influences the cmov conversion pass (as of June 1st, so probably in clang 17/18)<p><a href="https://reviews.llvm.org/D118118" rel="nofollow noreferrer">https://reviews.llvm.org/D118118</a>
>gcc has __builtin_expect_with_probability(cond, 0, 0.5) but it does nothing (tested v10). ↩<p>I wonder what could possibly be the use of this builtin. Branch prediction varies enough between different processors that it seems unlikely that anything useful could be done with a fine-grained probability estimate.
This is not a valid drop-in replacement for lower_bound. Accessing iterators as first[index] only works for random access iterators like vector has. The author may have realized this if they benchmarked on other containers. A forward iterator must be advanced and then dereferenced.
Every post like this makes me update my junior engineer watchlist. Then I have to go patiently explain why mucking about in the code to save one instruction because you read a blog post is a horrible idea. Easily half of all silly junior code is pointless optimization.<p>Having said that, I did enjoy the post.
Oh, C++, the language where templates are extremely powerful but their usability is so poor that code like this is kind of idiomatic and even compiles:<p><pre><code>  template <class ForwardIt, class T, class Compare>
  constexpr ForwardIt sb_lower_bound(
        ForwardIt first, ForwardIt last, const T& value, Compare comp) {
     auto length = last - first;
</code></pre><p>This looks generic, it compiles by itself, it contains a fancy word “ForwardIt”, and it works if you try it with a vector or an array. But you can’t actually subtract forward iterators, and binary searching a range of forward iterators is quite silly.<p>Rust would have required using the correct trait. Maybe C++ will improve the situation in the long run.
Pro tip, the footer link <a href="https://doc.rust-lang.org/src/core/slice/mod.rs.html#2520" rel="nofollow noreferrer">https://doc.rust-lang.org/src/core/slice/mod.rs.html#2520</a> is not a permalink because it doesn't include a commit ID. It looks like it's already linking to the wrong thing now (and the article is only two months old), and I don't know what the right thing was meant to be. I suggest using GitHub permalinks in future, pinned to specific commits.
Wouldn't it be faster to do length & 1 instead of length % 2 and also length >> 1 instead of length / 2? But maybe the compiler does this behind the scenes?
If the compiler doesn't optimize the if(cmp(...)) first += length + rem; perhaps a sign operation could be used when cmp returns a number. Something along the lines of:<p><pre><code> first += (length + rem) * ((sign(cmp(...)) + 1) / 2)
</code></pre>
Has anyone tried that? DDG isn't very helpful.
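The idea above can be sketched for a memcmp-style three-way comp(). Note the mapping has to be "advance when comp is negative" (probed element less than the target), so extracting the sign bit is simpler than the `(sign + 1) / 2` formula. Everything here is a hypothetical illustration, not code from the article:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch: branchless lower_bound over sorted C strings using
// a three-way comparator (strcmp). The sign of the result is mapped to
// 0 or 1 without a branch and used to scale the step.
const char* const* sign_lower_bound(const char* const* first,
                                    std::size_t length, const char* value) {
    while (length > 0) {
        std::size_t rem = length % 2;
        length /= 2;
        int c = std::strcmp(first[length], value);
        // Sign bit of c: 1 iff c < 0 (assumes 32-bit int). In practice
        // a plain (c < 0) compiles to setl, which is also branchless.
        std::size_t advance = static_cast<unsigned>(c) >> 31;
        first += (length + rem) * advance;
    }
    return first;
}
```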
Branches in the comparison function? Doesn't count, as simple functions end up as a conditional move. Branches on the loop back edge? Doesn't count, as there's a branch predictor.<p>I was hoping for a binary search without branches in it. Not really what we got.
I am confused.<p>> Same function interface as std::lower_bound, but 2x faster, and shorter. “branchless” because the if compiles down to a conditional move instruction rather than a branch/conditional jump.<p>Assembly programmers did "fastest branchless binary searches" using cmov decades ago and we didn't need to think about the compiler at all. I know, because I was one of them. Optimizing branches away was and still is a nice pastime. This is not new, or original, and I don't understand what's noteworthy about it.<p>"Elitist", you say? Well ... yeah! At least I know what I'm doing, unlike people who solely rely on and blindly trust the compiler to do their work for them.
Other fast binary searches <a href="https://github.com/sslotin/amh-code/tree/main/binsearch">https://github.com/sslotin/amh-code/tree/main/binsearch</a>
lower_bound and upper_bound are typically implemented in terms of partition_point, which is much more general than this version of lower_bound taking an element.