Krapivin made this breakthrough by being unaware of Yao's conjecture.

The developer of Balatro made an award-winning deck builder by not being aware of existing deck builders.

I'm beginning to think that the best way to approach a problem is to either not be aware of, or to disregard, most of the similar efforts that came before. This makes me kind of sad, because the current world is so interconnected that we rarely see this kind of novelty; people tend to fall into the rut of thought of those who came before them. The internet is great, but it also homogenizes the world of thought, and that kind of sucks.
Ok, big shout out to monort [0] for the link to the video [1].

This is just a quick overview from a single viewing of the video, but it's called "funnel hashing". The idea is to split the table into exponentially smaller sub-arrays, so the first chunk is n/m, the second is n/(m^2), etc., until you get down to a single element. Call them A0, A1, etc., so |A0| = n/m, |A1| = n/(m^2), etc., with k levels in total.

Try inserting into A0 c times. If that fails, try inserting into A1 c times. If that fails, keep going down the "funnel" until you find a free slot.

Call δ the fraction of slots that are empty (I'm unclear whether this is a parameter set at hash table creation or one that's dynamically updated). Setting c = log(1/δ) and k = log(1/δ) gives a worst-case complexity of O(log^2(1/δ)).

This circumvents Yao's result by not being greedy. Yao's result holds for greedy insertion and search policies, and the above is non-greedy, as it cascades down the funnels.

There are probably many little hairy details to work out, but that's the idea as far as I've been able to understand it. People should let me know if I'm way off base.

This very much reminds me of the "Distinct Elements in Streams" idea by Chakraborty, Vinodchandran and Meel [2].

[0] https://news.ycombinator.com/item?id=43007860

[1] https://www.youtube.com/watch?v=ArQNyOU1hyE

[2] https://arxiv.org/pdf/2301.10191
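To make that cascading description concrete, here's a minimal Python sketch of the idea as described above. It is not the paper's exact construction: the level sizes, the probe function, and the paper's special handling of the last level are all simplified, and the class name and the shrink/attempts parameters are made up for illustration.

    class FunnelHashSketch:
        """Toy sketch of the funnel idea: geometrically shrinking levels
        A0, A1, ..., with at most `attempts` hashed probes per level
        before falling through to the next. Not the paper's exact algorithm."""

        def __init__(self, n, shrink=2, attempts=8):
            self.levels = []
            size = n // shrink              # |A0| = n/m
            while size >= 1:
                self.levels.append([None] * size)
                size //= shrink             # |A1| = n/m^2, and so on
            self.attempts = attempts        # the 'c' tries per level

        def _probe(self, key, level_idx, attempt, level_size):
            # Stand-in probe function; any per-(level, attempt) hash
            # works for the sketch.
            return hash((key, level_idx, attempt)) % level_size

        def insert(self, key, value):
            for li, level in enumerate(self.levels):
                for a in range(self.attempts):
                    slot = self._probe(key, li, a, len(level))
                    if level[slot] is None or level[slot][0] == key:
                        level[slot] = (key, value)
                        return True
            return False    # every level exhausted; the sketch just gives up

        def lookup(self, key):
            # Retrace exactly the probes an insert would have made.
            for li, level in enumerate(self.levels):
                for a in range(self.attempts):
                    slot = self._probe(key, li, a, len(level))
                    entry = level[slot]
                    if entry is None:
                        return None     # insert would have stopped here (no deletions)
                    if entry[0] == key:
                        return entry[1]
            return None

    t = FunnelHashSketch(1 << 12)
    t.insert("hello", 42)
    assert t.lookup("hello") == 42

The point is that an insert never lingers in a crowded level: after c failed probes it drops down to the next, smaller sub-array, so the worst case is bounded by (attempts per level) x (number of levels) rather than by how full the table happens to be.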
Talk by the inventor: https://www.youtube.com/watch?v=ArQNyOU1hyE
Skimming the paper [1], the key difference is that their insertion algorithm will probe further than the first empty slot, instead of greedily filling the first empty slot it finds. They combine this with a clever probing sequence that provably finds empty slots efficiently, even when the table is very full.

This means insertions are slower while the hash table is less full, but you avoid the worst-case scenario where you're probing for the last (few) remaining open slot(s) without any idea where they are.

[1]: https://arxiv.org/pdf/2501.02305

---

An interesting theoretical result, but I would expect the current 'trick' of simply allocating a larger table than necessary to be the superior solution in practice. For example, Rust's hashbrown intentionally leaves 1/8th (12.5%) of the table empty, which does cost a bit more memory but makes insertions/lookups very fast with high probability.
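Not from the paper, but a quick back-of-the-envelope simulation (Python, with made-up names) shows why that headroom trick works so well in practice: with greedy uniform probing, the expected number of probes to find an empty slot is roughly 1/(1 - load), so capping the load at 87.5% keeps inserts cheap while a 99%-full table gets painful.

    import random

    def avg_probes(n=1 << 16, load=0.875, trials=20_000):
        """Average number of uniform random probes a greedy insert needs
        to find an empty slot when a fraction `load` of slots is occupied."""
        table = [False] * n
        for i in random.sample(range(n), int(n * load)):
            table[i] = True
        total = 0
        for _ in range(trials):
            probes = 1
            while table[random.randrange(n)]:
                probes += 1
            total += probes
        return total / trials

    # Roughly 1/(1 - load): ~2 probes at 50% full, ~8 at 87.5%
    # (hashbrown's headroom), ~100 at 99% full.
    for load in (0.5, 0.875, 0.99):
        print(f"load {load:.3f}: {avg_probes(load=load):.1f} probes on average")

The new scheme's O(log^2(1/δ)) worst case only matters in that nearly-full regime; leaving 12.5% of the table empty simply avoids ever entering it.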
This is neat! I always wondered if there would be a way to 'containerize' tables like this, i.e. a regular table is like a bulk carrier ship, with everything stuffed into it. If you could better organize it like a container ship, you could carry much more stuff more efficiently (and offload it faster too!)
The theoretical properties of hash tables always seemed so impressive to me that they bordered on magic (and this just extends them). What seemed crazy was how they could be so much better than trees, which to me were intuitively the most efficient way to store data.

What I realized is that the theory of hash tables involves a fixed-size collection of objects. For this fixed collection, you create a hash function, use it like a vector index, and store the collection in a (pre-allocated) vector. This gives a (fuzzy-lens'd) recipe for O(1) time insert, deletion and look-up. (The various tree structures, in contrast, don't assume a particular size.)

The two problems are that you have to decide the size beforehand, and that if your vector gets close to full, your insert etc. processes might bog down. So, scanning the article, it seems this is a solution to the bogging-down part: it allows quick insertion into a nearly-full table. It seems interesting and clever but actually not a great practical advance. In practice, rather than worrying about a clever way to fill the table, I'd assume you just increase your assumed size.

Edit: I'm posting partly to test my understanding, so feel free to correct me if I'm not getting something.
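For anyone who wants that "pre-allocated vector plus hash-as-index" recipe spelled out, here's a minimal Python sketch (assuming linear probing, no resizing and no deletion; the class name is made up for illustration):

    class FixedHashTable:
        """Fixed-capacity open-addressing table: hash the key to a slot
        in a pre-allocated list, then probe linearly until the key or an
        empty slot is found."""

        def __init__(self, capacity):
            self.slots = [None] * capacity      # size decided up front

        def _probes(self, key):
            n = len(self.slots)
            start = hash(key) % n
            return ((start + i) % n for i in range(n))   # linear probing

        def insert(self, key, value):
            for s in self._probes(key):
                if self.slots[s] is None or self.slots[s][0] == key:
                    self.slots[s] = (key, value)
                    return
            raise RuntimeError("table is full")

        def lookup(self, key):
            for s in self._probes(key):
                if self.slots[s] is None:
                    return None                 # key was never inserted
                if self.slots[s][0] == key:
                    return self.slots[s][1]
            return None

Insert and lookup touch O(1) slots on average while there's headroom, and the probe chains grow quickly as the table approaches full, which is exactly the bogging-down regime the new result targets.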
> And for this new hash table, the time required for worst-case queries and insertions is proportional to (log x)^2 — far faster than x.

> The team’s results may not lead to any immediate applications

I don't understand why it wouldn't lead to immediate applications. Is this a situation where analysis of real-world use cases allows you to tune your hash implementation better than what a purely mathematical approach would get you?
It looks like the result only matters in the case where the hash table is close to full. But couldn't one just deal with this case by making the table size 10% bigger? (Or, if it is resizeable, resizing earlier)
The intro picture about pointers in a drawer immediately reminded me of a talk I saw at FUN with Algorithms 2018 called Mind the Gap that gave me an aha moment about leaving space in data structures. Cool then to try to locate it, and see that it was by the same professor in the article, Martín Farach-Colton.

Not sure if it's viewable somewhere. But the conference itself was so fun. https://sites.google.com/view/fun2018/home

I'm not an academic and got my company to sponsor a trip to this Italian island to relax on the beach and watch fun talks, heh.
Neat, started on some implementation: https://kraftwerk.social/innovation-in-hash-tables/
For a different, perhaps more practical take on small pointers in hash tables, you might find this interesting: https://probablydance.com/2018/05/28/a-new-fast-hash-table-in-response-to-googles-new-fast-hash-table/ with contemporaneous discussion at https://news.ycombinator.com/item?id=17176713
For anyone looking for a PoC implementation, here's python:

https://github.com/sternma/optopenhash
I guess the most we could hope for here is that this leads to some other discovery down the road, either in hashtables or maybe one of the similar structures like bloom filters?
I would like to see this being applied practically. Is there a video demonstrating this or is it still too soon? Is the algorithm secret sauce or will it be open sourced?
Anyone else think this could be used with distributed hash tables to dramatically speed up searching or building them? Maybe more exotically, applied to LLMs and lookup tables. A clever algorithm like this should be applicable in a lot of more specialized data structures or applications.

It's likely a DHT would greatly benefit from this sort of algorithmic reduction in time and be less susceptible to constant-factor overheads (if there are any).
(2021) for the paper itself:

https://arxiv.org/abs/2111.12800
I unfortunately did not study well enough to understand the paper.

Can someone explain to me how this isn't some kind of Dewey Decimal Classification (https://en.wikipedia.org/wiki/Dewey_Decimal_Classification)?
Read this within my half-hour break and man, wow, what a story. I'm not a software guy, I'm a sys and net guy. Despite not caring or knowing about hash tables, that article's a great read! Thanks for sharing!
The paper is here: https://arxiv.org/pdf/2111.12800

Curiously, Andrew Krapivin, the genius undergrad in the article, is not one of the authors.
Step one: Be a genius

Step two: Try to solve hard problems

Step three: Avoid reading too much of other people's work in the area

Step four: (Maybe) Invent a brilliant new solution

But really, really don't skip step one.
The older a conjecture is, the more likely it is false.

That's why the conjecture resists proof -- there is a counterexample that people aren't seeing.
This is a good test because it’s recent. Let’s see if deep research can come up with this result without just copying this.

Edit: GPT-4, Gemini 2 and Claude had no luck. Human-driven computer science is still safe.
I bet this guy would still fail a first-round FAANG developer interview requiring a Hash Table solution to move on in the process.

"Yeah, sorry. You didn't use the right Hash Table"
"it is well known that a vital ingredient of success is not knowing that what you are attempting can’t be done."
— Terry Pratchett (Equal Rites)
Reading through this article is like reading a description of the Monty Hall problem. [0]

It's as though the conclusion seems to defy common sense, yet is provable. [1]

[0] - https://priceonomics.com/the-time-everyone-corrected-the-worlds-smartest/

[1] - 2nd to the last paragraph: "The fact that you can achieve a constant average query time, regardless of the hash table’s fullness, was wholly unexpected — even to the authors themselves."
i feel this article is missing some detail or incorrect in reporting the actual development here. either that or i am missing something myself...

hash tables are constant time on average for all insertion, lookup and deletion operations, and in some special cases, which i've seen used in practice very, very often, they have very small constant run-time just like a fixed-size array (exactly equivalent in fact).

this came up in an interview question i had in 2009 where i got judged poorly for deriding the structure as "not something i've often needed", and i've seen it in much older code.

i'm guessing maybe there are constraints at play here, like having to support unbounded growth, and some generic use case that i've not encountered in the wild...?
This is cool enough. But I find the "celebrification" style of the piece a bit off-putting. Did I really need to see multiple posed shots of this young man reposing in various university settings? It's like we need our own version of La La Land to glorify the survivors of computer success to motivate more to participate.
Now we have faster data structures we can fill that extra time by writing less efficient code, and loading more pointless libraries. This is the march of computer science.
As the villain in Scooby Doo always said:

"And I would have gotten away with it, if it hadn't been for those meddling kids!"
I read through this and I'm not sure if people have heard of dictionary trees for hash tables. Of course, quantamagazine.org has been known to sensationalize these types of things.