
Myths about Hash Tables

101 points by zengr, over 12 years ago

21 comments

antirez, over 12 years ago
Inside the Redis distribution you can find "dict.c", a very easy-to-understand hash table implementation that uses incremental rehashing: when it needs to switch to a bigger or smaller table, this happens incrementally in three ways:

1) At every operation you perform there is a rehashing step.

2) There is a function you can call to perform a single rehashing step.

3) There is a higher-level function where you can say "rehash for N milliseconds", in case you have some idle time to spend for a good cause inside your program.

Another uncommon feature of dict.c is that it supports the "pick a random element" operation with guaranteed O(1) behaviour.

EDIT: Now that I look at the source, while the code is easy to follow, a design section should be present in the dict.c top comment. I'll make sure to add this, but long story short:

* We use both chaining and incremental rehashing. This means there is no need to rehash ASAP when the hash table reaches the maximum percentage of elements allowed, and with incremental rehashing there is no need to block.

* We support safe and unsafe iterators: safe iterators block rehashing while they are active, so you have the feeling you are accessing the hash table in a read-only way. Unsafe iterators are fine for looping even while an incremental rehash is in progress, but it is not OK to touch the hash table while iterating with an unsafe iterator (otherwise the iterator no longer guarantees to return all the elements, or to avoid duplicated elements).

* The hash table is rehashed both ways: if it's too full, or if it's too empty. This guarantees that the table's fill ratio always stays between a minimum and a maximum, which in turn guarantees that our random-element function is O(1) in the average case.

* When we rehash, a second table is created, and entries are moved incrementally from one to the other. While there are two tables, all lookup and delete operations are run against both tables, since an element can be in either the old or the new one; insertions only go to the new table.

Well, these are the main ideas I think.
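For readers who have not opened dict.c, here is a minimal C sketch of the two-table incremental-rehash idea described above. It is not the actual Redis code; the `dict`/`table`/`entry` names and the `rehash_step` function are invented for illustration, and the bucket count is assumed to be a power of two.

```c
/* Sketch of incremental rehashing with two tables (not the real dict.c).
 * All names are illustrative; bucket counts are assumed to be powers of two. */
#include <stdlib.h>
#include <string.h>

typedef struct entry {
    char *key;
    void *val;
    struct entry *next;          /* chaining within a bucket */
} entry;

typedef struct {
    entry **buckets;
    size_t size;                 /* number of buckets (power of two) */
    size_t used;                 /* number of stored entries */
} table;

typedef struct {
    table ht[2];                 /* ht[0] = old table, ht[1] = new table */
    long rehash_idx;             /* -1 when not rehashing, else next old bucket to move */
} dict;

/* Move one bucket's worth of entries from the old table to the new one.
 * Calling this once per user operation spreads the rehash cost over time. */
static void rehash_step(dict *d, unsigned long (*hash)(const char *)) {
    if (d->rehash_idx < 0)
        return;                                     /* no rehash in progress */
    while ((size_t)d->rehash_idx < d->ht[0].size &&
           d->ht[0].buckets[d->rehash_idx] == NULL)
        d->rehash_idx++;                            /* skip empty buckets */
    if ((size_t)d->rehash_idx >= d->ht[0].size) {   /* finished: promote the new table */
        free(d->ht[0].buckets);
        d->ht[0] = d->ht[1];
        memset(&d->ht[1], 0, sizeof(table));
        d->rehash_idx = -1;
        return;
    }
    entry *e = d->ht[0].buckets[d->rehash_idx];
    d->ht[0].buckets[d->rehash_idx] = NULL;
    while (e) {                                     /* re-link each entry into the new table */
        entry *next = e->next;
        size_t b = hash(e->key) & (d->ht[1].size - 1);
        e->next = d->ht[1].buckets[b];
        d->ht[1].buckets[b] = e;
        d->ht[0].used--;
        d->ht[1].used++;
        e = next;
    }
    d->rehash_idx++;
}
```

While a rehash is in progress, lookups and deletes consult both `ht[0]` and `ht[1]`, and inserts go only to `ht[1]`, exactly as the comment describes.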
tptacek, over 12 years ago
A good article, but two quick thoughts:

1. With a good hash function, it's hard to accidentally stumble on the worst-case O(n) lookup cost. But against an adversary, it is usually very easy to have that problem. This has been a well-known security issue for over a decade, and over the last 4 years we're starting to see more exploitation of it.

2. The "little trick" at the end of the document that bumps matching items to the head of their bucket chain will usually provide only a marginal speedup (if many of your lookups hit chains longer than 1, your table is too small or your hash function is broken), and that optimization comes with a cost: lookup now modifies the underlying hash table, which is sometimes very undesirable.
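To make the second point concrete, here is a hedged sketch of the move-to-front trick in question, reusing the `entry`/`table` layout from the earlier sketch (the function name is invented). Note that a successful lookup now writes to the bucket array, which is what makes it awkward for shared or read-only tables.

```c
/* Lookup with move-to-front: a hit is unlinked from its position and
 * re-linked at the head of its chain.  Illustrative only; uses the
 * entry/table types from the earlier sketch. */
#include <string.h>

entry *lookup_move_to_front(table *t, const char *key,
                            unsigned long (*hash)(const char *)) {
    size_t b = hash(key) & (t->size - 1);
    entry **link = &t->buckets[b];
    for (entry *e = *link; e != NULL; link = &e->next, e = e->next) {
        if (strcmp(e->key, key) == 0) {
            *link = e->next;             /* unlink from current position */
            e->next = t->buckets[b];     /* push onto the head of the chain */
            t->buckets[b] = e;           /* the "read" has now mutated the table */
            return e;
        }
    }
    return NULL;
}
```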
kabdib, over 12 years ago
I love hash tables. They're my favorite data structure, maybe.

A story:

Back in the 70s it appeared that every major state university had to do their own version of PL/1. At the University of Maryland it was called PL/UM. I don't know why, maybe it was a testosterone thing ("IBM took a zillion man-years to write PL/1, but look what WE did with three starving grad students, a half-soused half-professor and a keypunch operator!") (Yeah, another version of PL/1, and that doesn't necessarily make the world a better place. Anyway --)

PLUM was notorious for being slow (I don't know about bugs, but they were probably there, too), and it was pretty common for students using PLUM to have to grovel for more seconds of CPU time.

Five or six years into PLUM's use as a student language, someone discovered that all of the symbols were being hashed into the same bucket.

PLUM got amazingly fast (well, faster, at least) after that. But by then anyone with a clue had managed to scam an account on one of the BSD Vaxen and write their stuff in C.

Don't take hash tables on faith. Measure, measure, measure.
recursive, over 12 years ago
The first two myths aren't really myths.

In "The worst case is terrible", the author goes on to explain that the worst case is unlikely. That may be true, but it's still O(n).

In "Hash tables become full, and bad things happen", if I'm reading it correctly, the author shows how you can make each hash table bucket a linked list so that each bucket never becomes full. But what happens when you have 1000 items in each bucket? If you keep adding elements and you want to keep reads fast, sooner or later you're going to have to add more buckets, at which time you need to re-hash every element in the table.

I'm a fan of hashtables, but there's some rather suspicious handwaving in this article.
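The "sooner or later you have to add more buckets" step the comment refers to looks roughly like the sketch below. It reuses the `entry`/`table` types from the first example; the growth trigger and the doubling policy are illustrative choices, not anything prescribed by the article.

```c
/* Grow the bucket array and re-hash every element once the load factor
 * (entries per bucket) reaches 1.0.  Both the threshold and the doubling
 * are arbitrary illustrative choices. */
#include <stdlib.h>

static int grow_if_needed(table *t, unsigned long (*hash)(const char *)) {
    if (t->used < t->size)
        return 0;                                  /* still under the threshold */
    size_t new_size = t->size * 2;                 /* stays a power of two */
    entry **nb = calloc(new_size, sizeof(entry *));
    if (nb == NULL)
        return -1;
    for (size_t i = 0; i < t->size; i++) {         /* every element is touched once */
        entry *e = t->buckets[i];
        while (e != NULL) {
            entry *next = e->next;
            size_t b = hash(e->key) & (new_size - 1);
            e->next = nb[b];
            nb[b] = e;
            e = next;
        }
    }
    free(t->buckets);
    t->buckets = nb;
    t->size = new_size;
    return 0;
}
```

This full pass is the O(n) pause that incremental rehashing (see the first comment) is designed to amortize away.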
bunderbunder, over 12 years ago
Just to throw it out there, here's a 6th myth to consider:

> Hash tables are fast

When you're dealing with lots of keys, they're usually the best option out there. But if you're dealing with a small number of keys, a simple array can often be faster. A hash table lookup involves a constant amount of overhead resulting from calculating the key, and possibly also added indirection depending on whether the implementation uses open addressing. If the amount of time that would be spent on a simple linear or binary search of the array is less than (or even not much more than) the time that would be spent on that overhead, then the array will be faster.

Of course, the point where the hash table overtakes the array depends on a lot of factors. As usual the only sure-fire way to know which option is better in a particular situation is to measure.
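As a tiny illustration of the point (the table, keys, and values here are made up): for a handful of fixed keys, a plain array plus a linear scan skips the hashing overhead entirely, and whether it actually wins is something you measure rather than assume.

```c
/* A handful of keys in a plain array: no hash to compute, no buckets,
 * just a short linear scan.  Keys and values are purely illustrative. */
#include <string.h>

struct kv { const char *key; int val; };

static const struct kv small_map[] = {
    { "GET", 1 }, { "PUT", 2 }, { "POST", 3 }, { "DELETE", 4 },
};

static int small_lookup(const char *key, int *out) {
    for (size_t i = 0; i < sizeof small_map / sizeof small_map[0]; i++) {
        if (strcmp(small_map[i].key, key) == 0) {
            *out = small_map[i].val;
            return 1;                    /* found */
        }
    }
    return 0;                            /* not found */
}
```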
CoolGuySteve, over 12 years ago
Hash tables are fantastic, but:

Depending on the hash and hash table implementation, large hash tables where elements are spread across several pages in memory can pathologically thrash your caching. You can get similar behaviour to randomly accessing elements in a large array. A cache miss from L2 to RAM is extremely slow, and it can be about a million times slower if you have to page fault all the way into swap.

This is a good example of a case where the standard model of a computer that's used in CS proofs starts to fall over, because it ignores hierarchical memory and cache locality.

Hand-wavily, accessing every element once in a large naive hash table will give you O(n) page faults on average, whereas accessing every element sequentially in a vector will give you O(n/d) page faults, where d is your page size. While asymptotically similar, in real life that constant d can be gigantic.

That being said, I'm not sure how the C++ unordered_map handles this; it's just something I experienced when running cachegrind on someone's hand-rolled implementation in the past. (I 'fixed' it by making the templated class store pointers, and then locality was greatly improved because of malloc's default behaviour.)
chojeen, over 12 years ago
> Engineers irrationally avoid hash tables...

Really? That seems about as likely as the statement 'carpenters irrationally avoid using screws'.
brey, over 12 years ago
Misses the point about B-trees.

They're not just 'good for inexact matches' - they're optimised for things like database indexes, to reduce the number of seeks over a block-based device (e.g. a filesystem) required for an index lookup.
jchonphoenix, over 12 years ago
I'm not really sure this article explains any myths about hashtables at all.

Aren't all of these statements obvious if you actually understand what a hashtable is?
fusiongyro, over 12 years ago
The reason functional programmers prefer trees is that it's much easier to make them persistent (in the functional sense).
afc, over 12 years ago
Seems somewhat vague; I think it could be a lot more accurate. I'm a bit surprised to see it here.

It claims that it's a myth that "trees are better" than hash tables and just describes trees as better for "finding inexact matches", as in the example "finding all names that begin with B". That's too vague. It's misleading because:

1. Trees can't just answer any random vague question, such as "find all names that have a P anywhere". That doesn't really describe the advantages of trees. I think a more accurate formulation would be "iterating over a range of values" or "finding values based on some ordering" (e.g. find the smallest element bigger than X), or walking the entire contents in some predefined order. Sure, this is not an article about trees, but it doesn't take more than two sentences to describe that in a much better manner.

2. For any of the former query patterns, trees *are* actually much better than hash tables. That's definitely not a myth, which this article would sorta have you believe.

> Chained hash tables are a very good idea. Problem solved.

Wrong. If you don't resize your hash table, the complexity of accesses *will* degrade linearly as it exceeds its size. If your hash uses, say, 100 buckets, each with a linked list, and stores N > 100 values, an access has complexity O(N) (sure, O(N/100) in the best case, assuming a perfect spread of hash keys, but that's really O(N)).

Finally, the "The worst case is terrible" myth part seems somewhat naive. I think it should at least mention the idea of hash table collision DoS attacks. "This doesn't happen in practice", unless there's some attacker out to get you by triggering it. There are plenty of examples of that happening in practice, of systems vulnerable to it. See an example presentation here: http://www.youtube.com/watch?v=R2Cq3CLI6H8&lr=1 Sure, that's not a reason to switch away from hash tables, but it's definitely something to consider when using them. I get a bad flavor from this simplification of "yeah, this doesn't happen in practice, at least as long as you know what your keys are".
16s, over 12 years ago
C++11 has hash backed containers now if you prefer them. std::unordered_set and std::unordered_map are identical to std::set and std::map except they use hash tables rather than red black trees.
ggchappell, over 12 years ago
There are some good ideas here, but I must take issue with a number of the statements the author makes.

To begin with, it is not a "myth" that hash tables have poor worst-case performance. It is true that this matters little for most applications, and not at all for some, but the worst case is still there. So I would avoid using hash tables, e.g., in safety-critical systems: medical, military, etc. (Or, as I tell my students: don't program a pacemaker in Python.)

And we should note that the excuse that datasets leading to poor performance are very rare went out 20 years ago. When the person providing your dataset is a teenager who's trying to break your system just for the fun of it, rarity does not mean quite what you might think.

Second, hash tables do indeed have performance problems when they fill up. True, there are ways around this, and any decent implementation of a mutable hash table will include a strategy for handling these situations.

And that suggests a myth the author of this article might believe: that any ol' programmer can & should write their own hash table. I don't agree. Don't write a hash table for use in production code unless you're sure you know what you're doing. And even if you do, someone else's polished & tested hash table is likely to be better than what you're going to come up with.

Lastly, balanced search trees do have better worst-case behavior. That mostly doesn't matter, but sometimes it does. (You have my permission to use a Red-Black Tree in your pacemaker code.)
zippie, over 12 years ago
I don't believe most engineers brush off hash tables because of the worst-case search, or even storage costs or implementation complexity, as the author suggests.

My reason for shrugging off hash tables for *most* requirements: the cost and complexity of inserts and deletes. Hash tables are amazing when you do offline index generation (circa Google "non-realtime" search).

However, if you ever need frequent updates, deletes, or inserts, you need to grow the hash table, and in doing so managing the hash table usurps any search gains. For some use cases that's fine; perhaps you only optimize for reads. Hash tables are awesome for read-heavy loads. Hash tables are sub-optimal for mixed read/write workloads.

On that note, I believe the red-black tree is far superior to hash tables for mixed workloads. RB trees tend to be more predictable, and when your application needs the insert/delete (i.e., for real-time updates) you don't have to worry about having to find a new data structure.

On a related note, XFS had to switch their hash over to an rbtree to fix a performance issue related to hashes falling over in the cost of inserts/deletes:

http://oss.sgi.com/archives/xfs/2010-09/msg00203.html

Yes, there are myths about hash tables, but this article doesn't touch on the real truths of why engineers avoid hash tables.
dspeyer, over 12 years ago
So we have two ways of dealing with overfullness: we can use linked-list buckets, which are expected O(n), or we can grow and then sometimes have to rehash the *entire table* as part of an add.
norswap, over 12 years ago
I'm jumping on the occasion to ask for a review of my own hash table implementation:

https://github.com/norswap/nucleotides/blob/master/src/map.c (think also to peruse the documentation and the tests!)

I'm open to any remarks and criticisms: performance, code style, etc. Since the implementation is meant to be simple and educational, I may not act on all of them, though.
artsrc, over 12 years ago
Hash Tables make iteration order hard to predict.<p>Hash functions rarely are meaningful outside of the context of the collection.<p>The tree interface leads to a simpler world.
vii, over 12 years ago
Hashtables are great until they grow to a significant fraction of available memory. Then there is no way to avoid the cost of rehashing or giving up constant time access - and if your program isn't using most of memory then you can run it on a smaller computer!
ishbits, over 12 years ago
As a developer I tend to use RB trees in the prototype phase then migrate to a hash table if the performance difference matters.
philluminati, over 12 years ago
http://global.sitesafety.trendmicro.com/
cmccabe, over 12 years ago
I really disagree with this article.

"The worst case is terrible" is not a myth, it's a reality -- and it's spawned several major security vulnerabilities recently. CVE-2012-1150, for example.

"Hash tables become full, and bad things happen." Yes. Yes they absolutely do. It depends on your implementation exactly what will happen, but in every case, you are in for some pain. If you choose the "linked list per element" route, congratulations -- your hash table will gradually approximate the lookup behavior of an unsorted linked list as the number of elements gets larger. If you choose rehashing, you'll have to deal with long pauses in your program during a rehash. Hello, latency spikes! Incremental rehashing could improve this, but only at the cost of degrading memory locality even further.

"Trees are better." For most applications, trees really ARE better. Trees are simple to use and they use less memory most of the time. You really should not be using hash tables unless you have a really good idea of how many elements you are going to have, AND you believe that they will not be accessed in any particular pattern.

"Hash functions are slow." I don't even know what this statement is supposed to mean. But I do know that there are a lot of bad hash functions out there, and it's easy to choose a bad one as a novice programmer. The right hash function will depend on your data -- something I note that this article fails to mention.

"Hash tables use too much memory." A wise programmer once told me, "fancy data structures are only good when N is large -- and N is usually small." That definitely applies here. Another way to look at this is that you need to keep the hash table half empty to keep the good properties that made you choose a hash table in the first place: O(1) lookup, O(1) insert, O(1) delete, etc. So you will just have to pay the memory price.

Hash tables can be very useful for certain applications. But in general, they're an optimization that you should make only if you know what you're doing. You should be very clear on why you're using them, how many elements you expect to have, and what your strategy is if you overflow. It's not for beginners. And if you are optimizing, you should also be aware that something like a radix tree or a B-Tree may lead to much better performance, due to its much better memory locality.

Just about the only really intriguing use for hash tables on a modern CPU is in lockless data structures, but if you're considering writing one of those -- you already know all the stuff I just wrote.