Static search trees: faster than binary search

656 points, by atombender, 4 months ago

26 comments

skrebbel 4 months ago
Totally off topic, but I've started to notice that more and more algorithmic low-ish-level content assumes Rust by default. My entire life I've been used to anything algorithmic being written either in sciency pseudocode (invariably in LaTeX-generated PDFs) or in plain old C(++), the vulgar Latin of computer programming. I know C and I don't really know Rust, but nevertheless I love that this is changing! I'd never advise a new programmer interested in low-level/fast stuff to begin with C now; so much good content is written around Rust. Like the old C examples, the post doesn't even say "this is Rust". You're expected to expect it to be Rust.

And, well, the language choice doesn't really matter for the core subject here. I could follow this post fine even though I don't really know Rust (just like I assume a Rust programmer could follow a well-written algorithms deep dive with examples in C with a bit of dedication). EDIT: OK, OK, that was a lie; I could follow the first half of this post, ish, and then it got way too deep for me.

But anyway, standardize on Rust, I like it. I'd love it if a bit more Zig were thrown into the mix here, but either of them feels like a better default than C at this point to me. After decades of C being *The Standard* for this stuff, I love that this is finally changing.
orlp 4 months ago
One dimension that is not explored is partitioning the *queries* into batches. The primary cost is doing lookups in the out-of-cache table, so if you have a sufficiently large number of queries you can resolve a couple of layers of the tree in one step while those layers are in cache, grouping the queries based on where they land deeper in the tree, and then resolving all the queries that touch the same deeper part of the tree in one batch as well.

In theory, with an infinitely large number of input queries you never have to do an out-of-cache lookup; it can be completely amortized away into linear scans.

That said, you now end up with a bunch of results that need to be put back into the correct output order, which likely makes it not worth it. But if the operation can be fused into a reduction (e.g. find the *sum* of the binary search results), or the output order does not matter in some other way, then all of a sudden it might make sense.
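A minimal Rust sketch of the batching idea described above, using a flat sorted array with a single cached layer of split keys instead of the article's full S-tree; the two-pass structure, the chunk-size parameter, and all names are illustrative assumptions, not the article's implementation:

```rust
/// Group queries by which chunk of `sorted` they fall in (pass 1, using only
/// the small `splits` array that stays in cache), then answer each group with
/// an ordinary search confined to its chunk (pass 2). Returns, for each query,
/// the index of the first element >= that query (lower bound).
pub fn batched_lower_bound(sorted: &[u32], queries: &[u32], chunk_len: usize) -> Vec<usize> {
    assert!(chunk_len > 0);
    // The first key of every chunk after the first acts as the cached "top layer".
    let splits: Vec<u32> = sorted.chunks(chunk_len).skip(1).map(|c| c[0]).collect();

    // Pass 1: bucket queries by chunk while `splits` is hot.
    let mut buckets: Vec<Vec<(usize, u32)>> = vec![Vec::new(); splits.len() + 1];
    for (i, &q) in queries.iter().enumerate() {
        let b = splits.partition_point(|&s| s < q);
        buckets[b].push((i, q));
    }

    // Pass 2: all lookups of one bucket touch the same region of `sorted`,
    // so the cache misses for that region are paid once per bucket.
    let mut out = vec![0usize; queries.len()];
    for (b, bucket) in buckets.iter().enumerate() {
        let lo = b * chunk_len;
        let chunk = &sorted[lo..(lo + chunk_len).min(sorted.len())];
        for &(i, q) in bucket {
            out[i] = lo + chunk.partition_point(|&x| x < q);
        }
    }
    out
}
```

As the comment notes, the extra bookkeeping (`out[i]` scatter) is exactly what you would hope to avoid by fusing the results into a reduction instead.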
wolfgangK 4 months ago
Amazingly thorough! I love how the author leaves no stone unturned. I had no idea you could do this kind of low-level efficiency shaving in Rust. I wonder how a C++ implementation with https://github.com/jfalcou/eve would compare.
koverstreet 4 months ago
A nifty thing about Eytzinger trees is that there's a formula for converting a node index to its position in an in-order traversal.

This is useful if your primary keystore is a normal sorted set of keys (easier to work with): you then don't need to store explicit pointers.

Also, he didn't mention that with Eytzinger you can prefetch multiple loop iterations in advance; use 1-based indexing so that sibling nodes line up nicely on cachelines.
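A sketch of the prefetching trick on an Eytzinger-layout array, following the well-known branchless lower-bound loop (Khuong/Morin style): with 1-based indexing, the 16 descendants four levels below node k sit contiguously starting at index 16*k, so one prefetch covers the cacheline(s) needed a few iterations later. The index-to-in-order conversion formula mentioned above is omitted here, and all names are illustrative:

```rust
/// Branchless lower bound over an Eytzinger-layout array. `tree` is 1-based:
/// tree[0] is unused and tree[1..] holds the keys in Eytzinger order. Returns
/// the Eytzinger index of the first key >= needle, or 0 if all keys are smaller.
pub fn eytzinger_lower_bound(tree: &[u32], needle: u32) -> usize {
    let n = tree.len();
    let mut k = 1usize;
    while k < n {
        // Prefetch the block of descendants four levels down (starts at 16*k
        // thanks to the 1-based layout), clamped so the pointer stays in bounds.
        #[cfg(target_arch = "x86_64")]
        unsafe {
            use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
            let ahead = (16 * k).min(n - 1);
            _mm_prefetch::<_MM_HINT_T0>(tree.as_ptr().add(ahead) as *const i8);
        }
        // Go right if the current key is too small, left otherwise.
        k = 2 * k + usize::from(tree[k] < needle);
    }
    // Cancel the right turns taken at the end of the descent: strip the
    // trailing 1-bits plus one more bit to land on the lower-bound node.
    k >> (k.trailing_ones() + 1)
}
```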
sujayakar 4 months ago
This is unbelievably cool. ~27 ns overhead for searching for a u32 in a 4 GB set in memory is unreal.

It's interesting that the wins for batching start diminishing at 8. I'm curious, then, how the subsequent optimizations fare with batch size 8 (rather than 128).

Smaller batch sizes are nice, since the batch size determines how much request throughput we'd need to saturate this system. At batch size 8, we need 1 s / ~30 ns * 8 ≈ 266M searches per second to fully utilize this algorithm.

The multithreading results are also interesting: going from 1 to 6 threads only improves overhead by 4x. Curious how this fares on a much higher core-count machine.
gorset 4 months ago
The number of unique int32 values is not that great, and a full bitset would only be 512 MB. Hard to beat a dense bitset on performance.

As a general-purpose data structure, I would look at roaring bitmaps for this purpose, which have good trade-offs with great performance for most use cases.

If only simple lookups are needed over a static set, then it's worth looking at minimal perfect hash functions (https://sux.di.unimi.it).
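A sketch of the dense-bitset baseline being described: 2^32 bits is 512 MiB regardless of how many keys are stored, and membership is one word load plus a mask. It only answers "contains", not lower-bound/rank queries like the S-tree; the type and names are illustrative and a 64-bit target is assumed:

```rust
/// A flat bitset over the entire u32 key space (512 MiB allocated up front).
pub struct DenseBitset {
    words: Vec<u64>,
}

impl DenseBitset {
    pub fn new() -> Self {
        // 2^32 bits / 64 bits per word = 67,108,864 words = 512 MiB.
        Self { words: vec![0u64; (1usize << 32) / 64] }
    }

    pub fn insert(&mut self, key: u32) {
        self.words[(key >> 6) as usize] |= 1u64 << (key & 63);
    }

    pub fn contains(&self, key: u32) -> bool {
        (self.words[(key >> 6) as usize] >> (key & 63)) & 1 == 1
    }
}
```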
shawn_w 4 months ago
All this nattering about Rust and C++, and then there's me thinking about how to do it in Common Lisp, possibly keeping the SIMD bits.
openquery 4 months ago
This is really interesting and a thorough write-up. Thanks to the author for sharing their work.

Whenever I read about super low-level optimisation, my immediate feeling is gratitude to the author for spending so much time shaving off nanoseconds that the entire SE community gets to enjoy.

I wonder how much time humanity has collectively saved simply as a result of how all these optimisations stack up.
DerSaidin 4 months ago
I think there is an error in 1.7, figure 3: Layer 1 should have a 10 instead of a 12.

Edit: Also, 1.3, figure 1: should the y-axis label be "Inverse throughput" to match the later figures?
topbanana 4 months ago
Interesting. If you need to support occasional writes, you could have a large static search tree and a small writable tree that you check afterwards. Then, when you have accumulated enough changes, you could occasionally ship them into a new version of the static tree.
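A rough sketch of that read-mostly layout: a large immutable sorted array standing in for the static structure, a small mutable overlay for recent writes, and a periodic rebuild. The rebuild threshold and all names are assumptions for illustration:

```rust
use std::collections::BTreeSet;

/// Bulk of the keys in an immutable sorted array (stand-in for the static
/// search tree), recent writes in a small overlay, rebuilt once the overlay
/// grows past a threshold.
pub struct HybridSet {
    base: Vec<u32>,         // large, rebuilt rarely
    overlay: BTreeSet<u32>, // small, absorbs writes
    rebuild_at: usize,
}

impl HybridSet {
    pub fn new(mut keys: Vec<u32>, rebuild_at: usize) -> Self {
        keys.sort_unstable();
        keys.dedup();
        Self { base: keys, overlay: BTreeSet::new(), rebuild_at }
    }

    pub fn insert(&mut self, key: u32) {
        self.overlay.insert(key);
        if self.overlay.len() >= self.rebuild_at {
            // Ship the accumulated changes into a new version of the "static" part.
            self.base.extend(self.overlay.iter().copied());
            self.base.sort_unstable();
            self.base.dedup();
            self.overlay.clear();
        }
    }

    pub fn contains(&self, key: u32) -> bool {
        // Check the big static structure, then the small writable one.
        self.base.binary_search(&key).is_ok() || self.overlay.contains(&key)
    }
}
```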
westurner 4 months ago
suffix-array-searching/static-search-tree/src/s_tree.rs: https://github.com/RagnarGrootKoerkamp/suffix-array-searching/blob/master/static-search-tree/src/s_tree.rs

partitioned_s_tree.rs: https://github.com/RagnarGrootKoerkamp/suffix-array-searching/blob/master/static-search-tree/src/partitioned_s_tree.rs
estebarb 4 months ago
Recently I tried to optimize a set implementation and found that interpolation search works quite well, being competitive with HashSet in C# (both single-digit ns for my use case), with zero memory overhead. Unfortunately, it only works well with uniform data, so I guess I'll give static search trees a try. This explanation was clear and extremely well detailed.
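For context, a sketch of interpolation search over a sorted u32 slice as described in the comment: probe where the key "should" sit under a roughly uniform distribution, then narrow the range as in binary search. This is illustrative Rust, not the commenter's C# code:

```rust
/// Interpolation search: returns Some(index) of `key` in `sorted`, or None.
/// Degrades toward linear behaviour on badly skewed data, which is the
/// limitation the comment mentions.
pub fn interpolation_search(sorted: &[u32], key: u32) -> Option<usize> {
    let (mut lo, mut hi) = (0usize, sorted.len().checked_sub(1)?);
    while lo <= hi && key >= sorted[lo] && key <= sorted[hi] {
        // Linearly interpolate the probe position within [lo, hi].
        let span = (sorted[hi] - sorted[lo]) as u64;
        let mid = if span == 0 {
            lo
        } else {
            lo + (((key - sorted[lo]) as u64 * (hi - lo) as u64) / span) as usize
        };
        match sorted[mid].cmp(&key) {
            std::cmp::Ordering::Equal => return Some(mid),
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid.checked_sub(1)?,
        }
    }
    None
}
```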
amluto 4 months ago
Given that the keys are fixed-size integers with a not-too-terrible distribution, I would consider chasing the partitioning idea a bit harder, and possibly even removing all the fancy S-tree parts. Some ideas:

For 32 bits, using the first few or even a lot (16? 20?) of bits to index an array and then chasing down the remaining bits somehow (adaptively, depending on the size of that partition?) seems like it could work.

In this direction, and with really impressive asymptotic performance, there's the van Emde Boas tree and its variants.
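A sketch of the first suggestion: index an offsets array by the top 16 bits of the key, then finish with a short in-bucket search. The bit count, layout, and names are assumptions for illustration; the article's S-tree (and a van Emde Boas tree) are considerably more refined:

```rust
/// Bucket the sorted keys by their top 16 bits and remember where each bucket
/// starts; a lookup indexes the offsets array directly and only searches
/// inside one bucket.
pub struct RadixPartitioned {
    keys: Vec<u32>,     // sorted
    starts: Vec<usize>, // starts[b]..starts[b + 1] is the bucket for top bits b
}

impl RadixPartitioned {
    pub fn new(mut keys: Vec<u32>) -> Self {
        keys.sort_unstable();
        let mut starts = vec![0usize; (1 << 16) + 1];
        // Counting pass: size of each top-16-bit bucket, stored one slot late...
        for &k in &keys {
            starts[(k >> 16) as usize + 1] += 1;
        }
        // ...then a prefix sum turns the counts into starting offsets.
        for b in 1..starts.len() {
            starts[b] += starts[b - 1];
        }
        Self { keys, starts }
    }

    /// Index of the first key >= `needle` (lower bound).
    pub fn lower_bound(&self, needle: u32) -> usize {
        let b = (needle >> 16) as usize;
        let (lo, hi) = (self.starts[b], self.starts[b + 1]);
        lo + self.keys[lo..hi].partition_point(|&k| k < needle)
    }
}
```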
NooneAtAll3 4 months ago
I wish the graph did not show two values with the same blue color.
owenthejumper 4 months ago
Elastic Binary Trees: https://wtarreau.blogspot.com/2011/12/elastic-binary-trees-ebtree.html?m=1
jules 4 months ago
Nice post. Would the larger amount of code result in different performance in a scenario where other code is being run as well, or would the instruction cache be large enough to make this a non-issue?
nielsole 4 months ago
I've never had the use case for this amount of throughput. What are the use cases for this? From the website I presume this is gene-matching related?
vlovich123 4 months ago
Is binary search on such large integer data sets common? I guess maybe time-series data. But I would think that this isn't amenable to representing strings?
swiftcoder 4 months ago
Absolutely love the level of detail here! It's not often you see the process of optimising something spelled out in so much depth.
anamax 4 months ago
This looks like something that Patricia trees would be good for.

Then again, Eytzinger looks like binary heaps.
purple-leafy 4 months ago
Can't understand the comp-sci stuff yet, but I love the way this site looks.
jokoon 4 months ago
So, a tree for when the data cannot be updated?

What is the point?
astronautas 4 months ago
Cool! Thanks for the investigation.
xpe 4 months ago
I predict that more and more of these kinds of optimized algorithms will be designed / written / assisted by machines.

I wonder when the next such malicious algorithm(s) will pass under our noses, and what its hidden purpose might be.
ryao 4 months ago
This would have been easier for a larger number of people to read if it had been in C or Python.
colesantiago 4 months ago
This is a great new interview question for junior and senior software engineering candidates.

Definitely going to use this for our next round of interviews.