Krapivin made this breakthrough by being unaware of Yao's conjecture.

The developer of Balatro made an award-winning deck builder by not being aware of existing deck builders.

I'm beginning to think that the best way to approach a problem is to either not be aware of, or to disregard, most of the similar efforts that came before. This makes me kind of sad, because the current world is so interconnected that we rarely see this kind of novelty; people tend to fall into the rut of thought of those who came before them. The internet is great, but it also homogenizes the world of thought, and that kind of sucks.
Ok, big shout out to monort [0] for the link to the video [1].

This is just a quick overview from a single viewing of the video, but it's called "funnel hashing". The idea is to split the table into exponentially smaller sub-arrays, so the first chunk is n/m, the second is n/(m^2), etc., until you get down to a single element. Call them A0, A1, etc., so |A0| = n/m, |A1| = n/(m^2), etc., with k levels in total.

Try inserting into A0 c times. If that fails, try inserting into A1 c times. If that fails, keep going down the "funnel" until you find a free slot.

Call δ the fraction of slots that are empty (I'm unclear whether this is a parameter set at hash table creation or one that's dynamically updated). Setting c = log(1/δ) and k = log(1/δ) gives a worst-case complexity of O(log^2(1/δ)).

This circumvents Yao's result by not being greedy. Yao's result holds for greedy insertion and search policies, and the above is non-greedy, as it cascades down the funnels.

There are probably many little hairy details to work out, but that's the idea as far as I've been able to understand it. People should let me know if I'm way off base.

This very much reminds me of the "Distinct Elements in Streams" idea by Chakraborty, Vinodchandran and Meel [2].

[0] https://news.ycombinator.com/item?id=43007860

[1] https://www.youtube.com/watch?v=ArQNyOU1hyE

[2] https://arxiv.org/pdf/2301.10191
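To make that cascading description concrete, here's a minimal Python sketch of the idea as described above. It is not the paper's exact construction: the level sizes, the probe function, and the paper's special handling of the last level are all simplified, and the class name and the shrink/attempts parameters are made up for illustration.

    class FunnelHashSketch:
        """Toy sketch of the funnel idea: geometrically shrinking levels
        A0, A1, ..., with at most `attempts` hashed probes per level
        before falling through to the next. Not the paper's exact algorithm."""

        def __init__(self, n, shrink=2, attempts=8):
            self.levels = []
            size = n // shrink              # |A0| = n/m
            while size >= 1:
                self.levels.append([None] * size)
                size //= shrink             # |A1| = n/m^2, and so on
            self.attempts = attempts        # the 'c' tries per level

        def _probe(self, key, level_idx, attempt, level_size):
            # Stand-in probe function; any per-(level, attempt) hash
            # works for the sketch.
            return hash((key, level_idx, attempt)) % level_size

        def insert(self, key, value):
            for li, level in enumerate(self.levels):
                for a in range(self.attempts):
                    slot = self._probe(key, li, a, len(level))
                    if level[slot] is None or level[slot][0] == key:
                        level[slot] = (key, value)
                        return True
            return False    # every level exhausted; the sketch just gives up

        def lookup(self, key):
            # Retrace exactly the probes an insert would have made.
            for li, level in enumerate(self.levels):
                for a in range(self.attempts):
                    slot = self._probe(key, li, a, len(level))
                    entry = level[slot]
                    if entry is None:
                        return None     # insert would have stopped here (no deletions)
                    if entry[0] == key:
                        return entry[1]
            return None

    t = FunnelHashSketch(1 << 12)
    t.insert("hello", 42)
    assert t.lookup("hello") == 42

The point is that an insert never lingers in a crowded level: after c failed probes it drops down to the next, smaller sub-array, so the worst case is bounded by (attempts per level) x (number of levels) rather than by how full the table happens to be.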
Talk by the inventor: https://www.youtube.com/watch?v=ArQNyOU1hyE
Skimming the paper [1], the key difference is that their insertion algorithm will probe further than the first empty slot, instead of greedily filling the first empty slot it finds. They combine this with a clever probing sequence that provably finds empty slots efficiently, even when the table is very full.

This means insertions are slower while the hash table is less full, but you avoid the worst-case scenario where you're probing for the last (few) remaining open slot(s) without any idea where they are.

[1]: https://arxiv.org/pdf/2501.02305

---

An interesting theoretical result, but I would expect the current 'trick' of simply allocating a larger table than necessary to be the superior solution in practice. For example, Rust's hashbrown intentionally leaves 1/8th (12.5%) of the table empty, which does cost a bit more memory but makes insertions/lookups very fast with high probability.
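Not from the paper, but a quick back-of-the-envelope simulation (Python, with made-up names) shows why that headroom trick works so well in practice: with greedy uniform probing, the expected number of probes to find an empty slot is roughly 1/(1 - load), so capping the load at 87.5% keeps inserts cheap while a 99%-full table gets painful.

    import random

    def avg_probes(n=1 << 16, load=0.875, trials=20_000):
        """Average number of uniform random probes a greedy insert needs
        to find an empty slot when a fraction `load` of slots is occupied."""
        table = [False] * n
        for i in random.sample(range(n), int(n * load)):
            table[i] = True
        total = 0
        for _ in range(trials):
            probes = 1
            while table[random.randrange(n)]:
                probes += 1
            total += probes
        return total / trials

    # Roughly 1/(1 - load): ~2 probes at 50% full, ~8 at 87.5%
    # (hashbrown's headroom), ~100 at 99% full.
    for load in (0.5, 0.875, 0.99):
        print(f"load {load:.3f}: {avg_probes(load=load):.1f} probes on average")

The new scheme's O(log^2(1/δ)) worst case only matters in that nearly-full regime; leaving 12.5% of the table empty simply avoids ever entering it.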
This is neat! I always wondered if there would be a way to 'containerize' tables like this, i.e. a regular table is like a bulk carrier ship, with everything stuffed into it. If you could better organize it like a container ship, you could carry much more stuff more efficiently (and offload it faster too!)
The theoretical properties of hash tables always seemed so impressive to me that they bordered on magic (and this just extends them). What seemed crazy was how they could be so much better than trees, which to me were intuitively the most efficient way to store data.

What I realized is that the theory of hash tables involves a fixed-size collection of objects. For this fixed collection, you create a hash function, use it like a vector index, and store the collection in a (pre-allocated) vector. This gives a (fuzzy-lens'd) recipe for O(1) time insert, deletion and look-up. (The various tree structures, in contrast, don't assume a particular size.)

The two problems are that you have to decide the size beforehand, and that if your vector gets close to full, your insert etc. processes might bog down. So, scanning the article, it seems this is a solution to the bogging-down part: it allows quick insertion into a nearly-full table. It seems interesting and clever but actually not a great practical advance. In practice, rather than worrying about a clever way to fill the table, I'd assume you just increase your assumed size.

Edit: I'm posting partly to test my understanding, so feel free to correct me if I'm not getting something.
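For anyone who wants that "pre-allocated vector plus hash-as-index" recipe spelled out, here's a minimal Python sketch (assuming linear probing, no resizing and no deletion; the class name is made up for illustration):

    class FixedHashTable:
        """Fixed-capacity open-addressing table: hash the key to a slot
        in a pre-allocated list, then probe linearly until the key or an
        empty slot is found."""

        def __init__(self, capacity):
            self.slots = [None] * capacity      # size decided up front

        def _probes(self, key):
            n = len(self.slots)
            start = hash(key) % n
            return ((start + i) % n for i in range(n))   # linear probing

        def insert(self, key, value):
            for s in self._probes(key):
                if self.slots[s] is None or self.slots[s][0] == key:
                    self.slots[s] = (key, value)
                    return
            raise RuntimeError("table is full")

        def lookup(self, key):
            for s in self._probes(key):
                if self.slots[s] is None:
                    return None                 # key was never inserted
                if self.slots[s][0] == key:
                    return self.slots[s][1]
            return None

Insert and lookup touch O(1) slots on average while there's headroom, and the probe chains grow quickly as the table approaches full, which is exactly the bogging-down regime the new result targets.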
> And for this new hash table, the time required for worst-case queries and insertions is proportional to (log x)^2 — far faster than x.

> The team’s results may not lead to any immediate applications

I don't understand why it wouldn't lead to immediate applications. Is this a situation where analysis of real-world use cases allows you to tune your hash implementation better than what a purely mathematical approach would get you?
It looks like the result only matters in the case where the hash table is close to full. But couldn't one just deal with this case by making the table size 10% bigger? (Or, if it is resizeable, resizing earlier)
The intro picture about pointers in a drawer immediately reminded me of a talk I saw at FUN with Algorithms 2018 called Mind the Gap that gave me an aha moment about leaving space in data structures. Cool then to try to locate it, and see that it was by the same professor in the article, Martín Farach-Colton.

Not sure if it's viewable somewhere. But the conference itself was so fun. https://sites.google.com/view/fun2018/home

I'm not an academic and got my company to sponsor a trip to this Italian island to relax on the beach and watch fun talks, heh.
Neat, started on some implementation: https://kraftwerk.social/innovation-in-hash-tables/
For a different, perhaps more practical take on small pointers in hash tables, you might find this interesting: https://probablydance.com/2018/05/28/a-new-fast-hash-table-in-response-to-googles-new-fast-hash-table/ with contemporaneous discussion at https://news.ycombinator.com/item?id=17176713
For anyone looking for a PoC implementation, here's python:

https://github.com/sternma/optopenhash
I guess the most we could hope for here is that this leads to some other discovery down the road, either in hashtables or maybe one of the similar structures like bloom filters?
I would like to see this being applied practically. Is there a video demonstrating this or is it still too soon? Is the algorithm secret sauce or will it be open sourced?
Anyone else think this could be used with distributed hash tables to dramatically speed up searching or building them? Maybe more exotically, applied to LLMs and lookup tables. A clever algorithm like this should be applicable in a lot of more specialized data structures or applications.

It's likely a DHT would greatly benefit from this sort of algorithmic reduction in time and be less susceptible to constant-factor overheads (if there are any).
(2021) for the paper itself:

https://arxiv.org/abs/2111.12800
I unfortunately did not study well enough to understand the paper.

Can someone explain to me how this isn't some kind of Dewey Decimal Classification (https://en.wikipedia.org/wiki/Dewey_Decimal_Classification)?
Read this within my half-hour break and man, wow, what a story. I'm not a software guy, I'm a sys and net guy. Despite not caring or knowing about hash tables, that article's a great read! Thanks for sharing!
The paper is here: https://arxiv.org/pdf/2111.12800

Curiously, Andrew Krapivin, the genius undergrad in the article, is not one of the authors.
Step one: Be a genius

Step two: Try to solve hard problems

Step three: Avoid reading too much of other people's work in the area

Step four: (Maybe) Invent a brilliant new solution

But really, really don't skip step one.
The older a conjecture is, the more likely it is false.

That's why the conjecture resists proof -- there is a counterexample that people aren't seeing.
This is a good test because it’s recent. Let’s see if deep research can come up with this result without just copying this.

Edit: GPT-4, Gemini 2 and Claude had no luck. Human-driven computer science is still safe.
I bet this guy would still fail a first-round FAANG developer interview requiring a Hash Table solution to move on in the process.

"Yeah, sorry. You didn't use the right Hash Table"
"it is well known that a vital ingredient of success is not knowing that what you are attempting can’t be done."
— Terry Pratchett (Equal Rites)
Reading through this article is like reading a description of the Monty Hall problem. [0]

It's as though the conclusion seems to defy common sense, yet is provable. [1]

[0] - https://priceonomics.com/the-time-everyone-corrected-the-worlds-smartest/

[1] - 2nd to the last paragraph: "The fact that you can achieve a constant average query time, regardless of the hash table’s fullness, was wholly unexpected — even to the authors themselves."
i feel this article is missing some detail or incorrect in reporting the actual development here. either that or i am missing something myself...

hash tables are constant time on average for all insertion, lookup and deletion operations, and in some special cases, which i've seen used in practice very, very often, they have very small constant run-time just like a fixed-size array (exactly equivalent in fact).

this came up in an interview question i had in 2009 where i got judged poorly for deriding the structure as "not something i've often needed", and i've seen it in much older code.

i'm guessing maybe there are constraints at play here, like having to support unbounded growth, and some generic use case that i've not encountered in the wild...?
This is cool enough. But I find the "celebrification" style of the piece a bit off-putting. Did I really need to see multiple posed shots of this young man reposing in various university settings? It's like we need our own version of La La Land to glorify the survivors of computer success to motivate more to participate.
Now we have faster data structures we can fill that extra time by writing less efficient code, and loading more pointless libraries. This is the march of computer science.
As the villain in Scooby Doo always said:

"And I would have gotten away with it, if it hadn't been for those meddling kids!"
I read through this and I'm not sure if people have heard of dictionary trees for hash tables. Of course, quantamagazine.org has been known to sensationalize these types of things.