As a side note, if you can sort your data before searching it, consider a trie:<p><a href="https://en.wikipedia.org/wiki/Trie" rel="nofollow">https://en.wikipedia.org/wiki/Trie</a><p>This is best for data with very few updates/modifications and a lot of queries. One example would be points of interest on a map, like restaurants on Google Maps. Those restaurants don't get updated/modified every single day, but clients of your restaurant-finder app do query a lot of them within a given radius of their GPS coordinates.<p>I had such a project in the past. My client had acquired this data, around 3 billion points of interest residing in a PostgreSQL DB of around 900 GB. While the server was beefed up with 128 GB of RAM, that was not nearly enough to keep it all in memory, and responses took several seconds, way too much for young impatient users, and my client was seeing a decline in users. I reorganized the data in PostgreSQL to fit a trie, and a query was then solved in milliseconds. Accounting for all the other stuff on top, the app could now generate a response in under a second. Trie ftw.
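For anyone who hasn't used one, a minimal string-keyed trie in Python looks something like the sketch below; the keys and payloads are made up for illustration, and a geospatial layout like the one described above would be considerably more involved.

    # Minimal sketch of a trie keyed on strings (just the data structure,
    # nothing like the PostgreSQL layout described above): build it once
    # from mostly-static data, then answer many lookups by walking one
    # node per character.

    class TrieNode:
        __slots__ = ("children", "value")

        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.value = None    # payload stored when a key ends here

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, key, value):
            node = self.root
            for ch in key:
                node = node.children.setdefault(ch, TrieNode())
            node.value = value

        def get(self, key):
            node = self.root
            for ch in key:
                node = node.children.get(ch)
                if node is None:
                    return None
            return node.value

    # hypothetical keys/payloads, purely to show the API
    poi = Trie()
    poi.insert("restaurant/downtown/luigis", {"lat": 40.7, "lon": -74.0})
    print(poi.get("restaurant/downtown/luigis"))   # {'lat': 40.7, 'lon': -74.0}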
A related technique is generating an index using a machine learned model (well explained here: <a href="https://blog.acolyer.org/2018/01/08/the-case-for-learned-index-structures-part-i/" rel="nofollow">https://blog.acolyer.org/2018/01/08/the-case-for-learned-ind...</a>). In interpolation search, we just use a simple linear model. Kraska and co. use much more complex models to beat B-tree indexes. However, the Recursive Model Indexes (RMIs) that they published (<a href="https://github.com/learnedsystems/RMI" rel="nofollow">https://github.com/learnedsystems/RMI</a>) use Rust to generate a C++ model that is exported as a library, so it definitely doesn't seem ready for production yet!
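To make the basic trick concrete, here is a rough single-model sketch in Python (an illustration of the idea only, not the published RMI): fit key -> position with a linear model at build time, record the worst prediction error, and at query time search only the small window that error bound allows.

    # One-level "learned index" sketch: a least-squares line predicts the
    # position of a key, and the max residual seen at build time bounds how
    # far the prediction can be off, so lookups only search that window.
    import bisect

    def build(keys):                      # keys must be sorted
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        var = sum((k - mean_k) ** 2 for k in keys)
        a = cov / var if var else 0.0
        b = mean_p - a * mean_k
        err = max(abs(i - (a * k + b)) for i, k in enumerate(keys))
        return a, b, int(err) + 1

    def lookup(keys, model, target):
        a, b, err = model
        guess = int(a * target + b)
        lo = max(0, guess - err)
        hi = min(len(keys), guess + err + 1)
        i = bisect.bisect_left(keys, target, lo, hi)
        return i if i < len(keys) and keys[i] == target else -1

    keys = sorted(x * x for x in range(1000))   # deliberately non-uniform
    model = build(keys)
    print(lookup(keys, model, 123 * 123))       # 123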
I think Microsoft does something like this with B-trees:<p><a href="https://github.com/microsoft/ALEX" rel="nofollow">https://github.com/microsoft/ALEX</a><p>Of course they call it "ML", but it's just linear interpolation for faster tree search.
One good thing about binary search is the memory access pattern.<p>The element in the middle of the array is tested on every single search. Similarly, the elements at 25% and 75% of the array are tested very often, in 50% of searches each.<p>Modern computers have very slow memory compared to computation speed, and multi-level caches to compensate. Binary search's RAM access pattern, when repeated many times, is friendly towards these multi-level caches: the elements at positions like 25%, 50%, 75% will stay in L1D cache.
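One quick way to see this is to count probes across many searches; in the sketch below (an illustration, not a benchmark) the midpoint is touched on every search and the quarter points on roughly half of them.

    # Count how often each index is probed across many binary searches over
    # a sorted array: the midpoint appears in every search, the 25%/75%
    # points in about half, which is why those entries stay hot in cache.
    import random
    from collections import Counter

    a = list(range(1024))
    probes = Counter()

    def binary_search(a, target):
        lo, hi = 0, len(a) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            probes[mid] += 1
            if a[mid] < target:
                lo = mid + 1
            elif a[mid] > target:
                hi = mid - 1
            else:
                return mid
        return -1

    for _ in range(10_000):
        binary_search(a, random.randrange(1024))

    for idx, count in probes.most_common(3):
        print(idx, count)   # roughly: 511 10000, then 255 and 767 at ~5000 each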
In college I played with a variation of this where you fix the size of the collection to a power of 2 (sprinkling in zeroes/repeated values if needed): you make your initial guess for where to look just by taking the top bits of the searched-for value, then estimate how many slots off you were from the top bits of the _difference_ between the value you found and the one you want (clamping your next guess to the first/last item if you'd fall off the end otherwise). In the concrete example I was playing with, it worked to just stop there and do a linear search, but you could be cleverer about how long to iterate and when/how to bail out to something with a better worst case.<p>As Lemire says, the problem (a good "problem"!) is that hashtables and B-trees can be quite fast, robust against different distributions, and you have them right at hand. It'd almost have to be some weird situation, like you're handed data in a format that kinda fits the requirements and you're looking for how to work with it in place.
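A rough Python sketch of that scheme, as one possible reading of it (not the original code): it assumes 32-bit keys spread across their full range and a table whose size is a power of two.

    # Top-bits guess + one difference-based correction + linear scan.
    import random

    BITS = 32

    def lookup(a, target):
        # a: sorted, length 2**k, values assumed spread over [0, 2**BITS)
        n = len(a)
        k = n.bit_length() - 1
        shift = BITS - k

        # initial guess: the top k bits of the value itself
        i = target >> shift
        # one correction: top bits of the difference estimate how many slots off we are
        i += (target - a[i]) >> shift
        i = max(0, min(n - 1, i))        # clamp so we don't fall off either end

        # finish with a linear scan toward the target
        step = 1 if a[i] < target else -1
        while 0 <= i < n and a[i] != target:
            if (a[i] < target) != (step == 1):   # walked past it: not present
                return -1
            i += step
        return i if 0 <= i < n and a[i] == target else -1

    random.seed(1)
    a = sorted(random.randrange(2**32) for _ in range(2**16))
    x = a[12_345]
    print(lookup(a, x))   # 12345 (or the index of an equal duplicate)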
Neat trick. Possibly a bit brittle, since you have to maintain your knowledge of the distribution somewhere -- i.e. it's "state you have to update or things might break," which is always worrisome.<p>I wonder what the worst case performance is -- if your guess is completely wrong, will it still converge in no worse than O(log(N))? Do you guess only once, at the beginning, or do you keep guessing where to go based on the distribution?<p>If you only guess once, at the beginning, then there's probably no risk. But if you keep trying to aim for the wrong target every iteration, I wouldn't be surprised if you get pathological behavior.<p>It would be kind of fun to try to calculate or simulate how bad it can get if you keep making incorrect guesses.
I converted his code to Python:
<a href="https://gist.github.com/brianherman/7e58b48ddbb7060663139416ed4901fa" rel="nofollow">https://gist.github.com/brianherman/7e58b48ddbb7060663139416...</a>
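For reference, the textbook form of interpolation search looks roughly like this in Python (the generic algorithm, not necessarily what the gist contains):

    # Interpolation search: guess the position by assuming the values are
    # spread uniformly between a[lo] and a[hi], then shrink the bracket.
    def interpolation_search(a, target):
        lo, hi = 0, len(a) - 1
        while lo <= hi and a[lo] <= target <= a[hi]:
            if a[lo] == a[hi]:
                return lo if a[lo] == target else -1
            # linear interpolation between the two endpoints
            mid = lo + (target - a[lo]) * (hi - lo) // (a[hi] - a[lo])
            if a[mid] < target:
                lo = mid + 1
            elif a[mid] > target:
                hi = mid - 1
            else:
                return mid
        return -1

    a = list(range(0, 1000, 7))
    print(interpolation_search(a, 693))   # 99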
You don't want to pivot on the average, you want to pivot on the median. You will gain the most information with every step if 50% of the probability mass is above your guess and 50% is below.
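One way to act on that, if you have a model of the key distribution, is to interpolate in CDF space, so each probe lands where roughly half of the remaining probability mass sits on either side; in the sketch below the cdf() is just a stand-in exponential model, to be replaced by whatever fits your data.

    # Interpolation search through the CDF of an assumed key distribution,
    # so the probe splits the remaining probability mass roughly in half.
    import math
    import random

    def cdf(x):
        # stand-in model: keys drawn from an exponential distribution (rate 1)
        return 1.0 - math.exp(-x)

    def cdf_interpolation_search(a, target):
        lo, hi = 0, len(a) - 1
        while lo <= hi and a[lo] <= target <= a[hi]:
            f_lo, f_hi, f_t = cdf(a[lo]), cdf(a[hi]), cdf(target)
            if f_hi == f_lo:
                break
            # interpolate in CDF space instead of key space
            mid = lo + int((f_t - f_lo) / (f_hi - f_lo) * (hi - lo))
            mid = min(max(mid, lo), hi)
            if a[mid] < target:
                lo = mid + 1
            elif a[mid] > target:
                hi = mid - 1
            else:
                return mid
        # fall back to scanning whatever small range is left
        for i in range(lo, hi + 1):
            if a[i] == target:
                return i
        return -1

    random.seed(0)
    a = sorted(random.expovariate(1.0) for _ in range(100_000))
    x = a[70_000]
    print(cdf_interpolation_search(a, x))   # 70000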
The problem I see is the assumption of a uniform distribution.<p>The problem spaces I normally deal with involve clumps of data, which aren't really uniform except within each clump.