Myths about Hash Tables

86 pointsby ptrover 8 years ago

13 comments

Udoover 8 years ago

While the article is a bit polemic, I really wish more people knew how hash tables work under the hood.When I was looking for a job, I flunked out of one interview where I got into an argument with an interviewer about the runtime properties of hash tables - more specifically: the interviewer was adamant their access time was always O(1), wanted me to admit my "mistake" and move on, but I just couldn't ;) It was one of those rare cases where they give you feedback on why you're not being hired, too. The lady on the phone said my "technical knowledge" was not up to par with what they needed for the position. Yeah, I'm still salty about that.

评论 #12612168 未加载

评论 #12611737 未加载

评论 #12612268 未加载

评论 #12611919 未加载

seanwilsonover 8 years ago

> Engineers irrationally avoid hash tables because of the worst-case O(n) search time.Not sure about other people here but I haven't heard any developers I work with avoid hash tables for reasons like this. I find most coders treat hash tables as black boxes that offer instant lookups.Also, I found from giving interviews most developers can't explain at a basic level how a hash table works as well as other fundamental data structures like linked lists.

评论 #12611482 未加载

评论 #12611481 未加载

评论 #12612375 未加载

acidbaseextractover 8 years ago

A whole post about hash tables and he doesn't mention the real reason trees are better than hash tables: lower variance!> Hash tables become full, and bad things happenThe bad thing is not a crazy long probe or even rehashing to enlarge the table—it's that some random individual insert is stuck eating the entire cost of rebuilding the table. For many applications, I'm happy with a slightly higher cost per operation in return for predictable performance.> Hash functions are slowGreat, another unmentioned footgun - hash functions are hard. For example, the hash function he provides doesn't use a prime number of buckets. Oops, non-uniform input data in combination with unlucky bucket counts can generate high numbers of collisions.Writing a comparison function to put your data in a tree is pretty stupidproof!

评论 #12611757 未加载

评论 #12611516 未加载

emodendroketover 8 years ago

I'm all for learning for hash tables work, but is avoiding them really widespread? Especially in dynamic languages I feel like dynamic arrays and hash tables are like 90% of the data structures used.

评论 #12612467 未加载

saw-lauover 8 years ago

Interesting that the author offers the chained hash table as an alternative implementation - that was always how I was taught they worked back in school. (1986?)

评论 #12611946 未加载

评论 #12612226 未加载

josefxover 8 years ago

The worst case was once used to attack Java based servers, using colliding post parameters I think. Since then the Java HashMap sorts colliding entries if the entries implement Comparable.

评论 #12611335 未加载

评论 #12611318 未加载

评论 #12611401 未加载

评论 #12611601 未加载

评论 #12611641 未加载

Annatarover 8 years ago

Hash arrays are great; I use them all the time in AWK. All searches using hash indexed arrays in AWK are O(1).I believe that the biggest problem around using hashes as indices entails people wrapping their head around the concept that the array index is a string (for all intents and purposes), rather than a number of a field in an array (or an address of a region in memory, which it eventually is anyway, once it runs). For example, I used entire lines of text as the hash for an array. My colleague who I was showing this technique to was completely dumbfounded as to how that worked. Why was I only storing the value "1" into Array["ORA-1234: blablabla"]?<pre><code> if(Array[$0] != 1) { # # This record is different, so print the entire record. # print; } </code></pre> "How is that used to filter out lines which are identical the ones we have in the canonical file?" He seemed completely flabbergasted. He just couldn't wrap his head around this concept, and ended up re-implementing the entire thing using several lines of grep -v, sed, and if [ ...] then. I think there were even several temporary files in the game, whereas the hash array technique in AWK loaded both the canonical file and the input from stdin into memory, rather than using intermediate files.

krylonover 8 years ago

One interesting alternative to hash tables that appears to get too little love is the Judy tree: <a href="https://en.wikipedia.org/wiki/Judy_array" rel="nofollow">https://en.wikipedia.org/wiki/Judy_array</a>They are not quite as versatile, I think, but for simple cases where your keys are strings or ints, they work just as well and are quite fast, too. The API is very simple, too.I don't do much programming in C these days (pretty much none at all), but I used Judy in one toy project and have fond memories of it.

评论 #12612975 未加载

评论 #12612276 未加载

greg7mdpover 8 years ago

If you are interested in hash tables, check out my improved version of Google's already excellent sparsehash at <a href="https://github.com/greg7mdp/sparsepp" rel="nofollow">https://github.com/greg7mdp/sparsepp</a>. Excellent performance, very low memory usage, grows as needed of course.

personjerryover 8 years ago

This seems like Data Structures 101

stefsover 8 years ago

potential offtopic question here: i always wondered about multi-layered hashtables, i.e. instead of a linked list for resolving collisions use a second hashtable with a different hash function (and then maybe a linked list on the 3rd level).i guess this might bring some potential improvements in case of a hashtable with too many collisions but a prohibitively expensive overhead in all other cases. i.e. it might alleviate a hashtable that was already broken.

评论 #12614349 未加载

huevingover 8 years ago

He mentioned his own home grown hash with the bit shifts. How does it compare to common ones that use an AES instruction if the crypto chip is available?

评论 #12611710 未加载

stefsover 8 years ago

i think the article omits one interesting point - cache coherence. afaik this is the one reason otherwise inefficient data structures might result in better real world performance for small data sets.

评论 #12613701 未加载

13 comments

Udoover 8 years ago

评论 #12612168 未加载

评论 #12611737 未加载

评论 #12612268 未加载

评论 #12611919 未加载

seanwilsonover 8 years ago

评论 #12611482 未加载

评论 #12611481 未加载

评论 #12612375 未加载

acidbaseextractover 8 years ago

评论 #12611757 未加载

评论 #12611516 未加载

emodendroketover 8 years ago

I'm all for learning for hash tables work, but is avoiding them really widespread? Especially in dynamic languages I feel like dynamic arrays and hash tables are like 90% of the data structures used.

评论 #12612467 未加载

saw-lauover 8 years ago

Interesting that the author offers the chained hash table as an alternative implementation - that was always how I was taught they worked back in school. (1986?)

评论 #12611946 未加载

评论 #12612226 未加载

josefxover 8 years ago

The worst case was once used to attack Java based servers, using colliding post parameters I think. Since then the Java HashMap sorts colliding entries if the entries implement Comparable.

评论 #12611335 未加载

评论 #12611318 未加载

评论 #12611401 未加载

评论 #12611601 未加载

评论 #12611641 未加载

Annatarover 8 years ago

krylonover 8 years ago

评论 #12612975 未加载

评论 #12612276 未加载

greg7mdpover 8 years ago

personjerryover 8 years ago

This seems like Data Structures 101

stefsover 8 years ago

评论 #12614349 未加载

huevingover 8 years ago

He mentioned his own home grown hash with the bit shifts. How does it compare to common ones that use an AES instruction if the crypto chip is available?

评论 #12611710 未加载

stefsover 8 years ago

评论 #12613701 未加载