Why Java's Object.hashCode() is unsafe for use in distributed systems

72 点作者 martinkl将近 13 年前

10 条评论

This article is misguided, at least wrt Java. For "distributed systems" object identity needs to be created and ensured programmatically (e.g. by an object id). equals() and hashCode() must be derived from an object's immutable properties and they are as safe or unsafe as you implement them.

评论 #4128529 未加载

jdsullivan将近 13 年前

Please see javadoc for String.hashCode: <a href="http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#hashCode()" rel="nofollow">http://docs.oracle.com/javase/6/docs/api/java/lang/String.ht...</a>It's true that relying on hashCode to remain stable across arbitrary objects is brittle, but this is not the case for String.

评论 #4128542 未加载

andrewvc将近 13 年前

FWIW in JDK 8 hash map will most likely degrade into a lookup on a red-black tree rather than a linked list for buckets with more than 8 collisions.

评论 #4128204 未加载

评论 #4128180 未加载

politician将近 13 年前

So, this is true of the CLR as well. The moral of the story is that you shouldn't assume that hashes are one-size-fits-all. Instead, design your own scheme fit for your own requirements (and no, I'm not suggesting that you write your own hash algorithm).

b7kich将近 13 年前

Two wrongs don't make a right. The fact that "a".hash == "\0a" hash is an issue with the hash function. nothing else. Seeding the hash function so it is unique for each process does not solve the issue.

评论 #4128219 未加载

评论 #4128213 未加载

jcdavis将近 13 年前

Overall an excellent article, but its worth mentioning that as I understand it, String's hashCode() method pre-1.3 or 1.2 (forgot exactly which version they switched over) used a different function, so it can indeed change (as the hashCode() contract allows)

评论 #4127961 未加载

ddlatham将近 13 年前

In fact, in Oracle JVMs Java enum hashCode() values are almost never the same across JVM instances. This has bitten me in the past for Hadoop MapReduce jobs specifically.

ShabbyDoo将近 13 年前

For the system I've been working on for the past year, most domain-ish objects extend from a class called AbstractImmutableObject which implements both hashCode() and equals() as final methods. Then, subclasses are forced to implement a "collectObjects()" method which effectively creates an array of instance variable references (plus autoboxed primitives) for a deep implementation of equality and a guaranteed consistent hash code. Immutability currently is enforced by convention, but I think FindBugs could help out. Presuming that all the instance variables in the graph of AbstractImmutableObjects are primitives or JDK objects with consistent hash functions, the resultant hashcode will be consistent as well.[As an aside, I dislike this sort of utility base class, but it's a small price to pay for avoiding awful bugs related to hashcodes being inconsistent with equals, etc.]I also argue that computing a hash on a serialization of an object graph is problematic as well. What are the rules for computing the hash of a serialization of a set of Strings? Must one sort the Strings before serializing them to ensure that any two Sets with the same values have the same hashcode? Sets are not required to maintain an ordering and equals() on sets ignores order (but not on SortedSet), so I would expect my "distributed" hashcode to be the same for any Set containing the same Strings. Such a scheme couples my serialization scheme tightly with my rules for equivalence. As I like to use serialization schemes written by others when possible and I don't want to eat the cost of serialization whenever I want a consistent hashcode, I'd prefer to decouple these ideas.Recently, I've gone a step further with our domain objects and have written a generator of builders and instances using dynamic proxies (just the out-of-the-box JDK 1.3 era stuff). Developers just define interfaces, and unit tests enforce rules about what objects may be contained within the object graphs. This makes it really simple to implement things like round-trip JSON ser/deser. Because the realm of possibilities is so limited, it's easy to ensure correctness.[As an aside, I'd love to have a DSL for such domain objects and then write a bunch of dynamic pieces parts for serialization, meta-stuff (creating a Guava function in a literate sort of way for any path through a domain object, etc.), builders, etc. I guess this is one step down the rabbit hole of model driven development, but I'm looking for something a bit lighter than, say, AndroMDA).]

moondowner将近 13 年前

Isn't this why we very often override with custom hashCode and equals methods? What's the big deal..Here's from Hibernate's WIKI on these two methods: <a href="https://community.jboss.org/wiki/EqualsAndHashCode" rel="nofollow">https://community.jboss.org/wiki/EqualsAndHashCode</a>

评论 #4130742 未加载

cbsmith将近 13 年前

This article and this discussion is filled with too much "missing it", so I'm writing this wordy diatribe.tl;dr: before writing an article advising people about writing a distributed routing framework, write a distributed routing framework. Also: if your "distributed system" is also a "trusted system", realize that fusing the two together requires some genuine expertise.To paraphrase a comment about something I wrote the other day: "that article isn't even wrong".The protobuf reference was just ridiculous. Protobufs are not a distributed system. They're a serialization framework that is intrinsically copying (append the same Protobuf twice to the a repeated field, and when you read it back, you get two distinct objects).<pre><code> class Demo extends Object implements Cloneable { } boolean alwaysFalse(Demo a) { return a.equals(a.clone()); } </code></pre> No Java programmer anywhere near the level of proficiency where they are writing a distributed routing framework would be surprised by the above. Really, anyone who has bothered to read how equals(), hashCode() or HashSet/HashMap/Hashtable work in Java ought not to be surprised. Nobody expects you can throw copies of arbitrary subclasses of Object in to a hash and expect some notion of what is and is not an equivalent object to be magically discerned by the runtime (and if it isn't clear, learning about IdentityHashMap ought to bring it home).From the article:<pre><code> On strings, numbers and collection classes, hashCode() always returns a consistent value, apparently even across different JVM vendors. It’s like that despite the documentation for hashCode() explicitly not guaranteeing consistency across different processes... </code></pre> The article says "hashCode()" instead of "Object.hashCode()", which just shows the author is missing what is going on. String.hashCode(), Integer.hashCode() etc. are not the same as Object.hashCode and are all documented with semantics distinct from Object.hashCode(). This is by design. It's well established design that predates Java by a good two decades too.The article then goes on:<pre><code> The problem is that although the documentation says hashCode() doesn’t provide a consistency guarantee, the Java standard library behaves as if it did provide the guarantee. </code></pre> No. It behaves as though the users had read the actual documentation for the classes in question before using them and knows better than the library author about what the intended behaviour is.The only surprise in protobufs is that they don't generate their own equals()/hashCode() methods to provide more flexible equivalence relations; it becomes abundantly clear as to why once you try to answer the question "well just how would you do it without adding additional semantics (at the expense of complexity and performance) to the protocol buffer language?".The Java spec is very clear that Object.equals(Object) provides object identity as the default implementation of equality. Java 101 covers the difference in both concept and protocol between identity and equality, with the former immutable while the latter can be user defined using by overriding Object.equals(Object). Next you learn that if you do override, you need to also override hashCode() accordingly. You learn this when you use the relevant classes for the first time.It is outright weird for a Java developer to expect equivalence semantics (as opposed to identity) when invoking equals(Object) or hashCode() unless those semantics have been defined for an object's type.This article is coming at it assbackwards from how anyone who has written a distributed routing framework in Java (or for that matter most other languages) ends up coming at it.If you are writing a distributed routing framework, and long before you start thinking about hash algorithms, you need to define "equality and identity" semantics for your distributed objects. That invariably comes down for way to identify that two keys of some kind represent the same object. You might manage this entirely for the user, but typically you end up providing a user defined hooks because there are plenty of business rules for what is a unique identifier. In Java (as is the case in the afore mentioned Hadoop and Storm), that typically means your keys override Object.equals(Object), often with a framework provided type to tag tighter semantics.When you overload Object.equals(Object), now you have to override Object.hashCode(). Sometimes this is missed during Java 101, but you tend not to get far in the Java world without having learned it. So once you have setup your equivalence relationship, how you correctly tackle Object.hashCode() becomes very, very straight forward.In short: if you have reached the point where you are writing a distributed routing framework in Java that assumes Object.hashCode() is stable across the framework, you were almost certainly screwed long, long before you made that mistake.Eventually, you get to the point where you realize that users might define a really bad hash, and you don't want to fail horribly when that happens. So, you might want to defend against it. Turns out, the JDK actually does defend against poor hashes. HashSet/HashMap/etc. tweak the hash when selecting buckets to avoid hashes with entropy in the wrong places (or more accurately, a lack of entropy in the right places ;-).Then after that, you start to realize that not only are user defined hash functions likely bad in that sense, but if they have user defined data (in this case, "user" is not the developer, but the developer's "user"... so that means "incompetent" is no longer our scary adjective... it's now "untrustworthy") they are probably vulnerable to hash collision attacks. At that point, you are outside the scope of what Java's hashCode() is intended to address, so you really have to go a bit further.There are basically four approaches, and the article appears to miss all of them (or at least glosses over them like the author doesn't see how it all fits together).1) Don't use a hash lookup for user provided keys. Use a tree. Consider a critbit tree. Most assume this sucks until they have either reflected on the matter a fair bit, or have been schooled painfully by experience.2) Don't let the end user define the keys.3) Don't let the end user control/have full insight in to the hashCode() calculation. Typically this involves using an HMAC as your underlying hash function rather than a straight hash. For real security, you use an HMAC with a random "password" (kept from the user) for each hash table.4) Letting the hash go unmodified, but randomizing bucket assignment somehow (a shared secret or a secret per hash) and blocking a user who generates more than N distinct keys (key1.equals(keys2) is false) whose hashCode()'s are equivalent, where N is a fixed number or ideally, a ratio based on their total # of keys.What generally doesn't work is simply using a "better" hash. In some cases a simple secure (and by that I definitely don't mean MD5, and really also not SHA-1 ;-) hash can work. It works only if the user never, ever has a way to determine what the hash of a given key is (that means they can never see it or side effects of it... even in the network stack). It also means your hash function is so computationally expensive that the tree approach is likely far more efficient.What also doesn't work is using a serialization framework to serialize the bytes in a deterministic way (it isn't clear, but it reads like the author might think protobufs are guaranteed to be deterministic, which they aren't). Serialization frameworks are, by their very nature, a horrid way to add unpredictable entropy to your data (and determinism just makes it worse). That is tantamount to padding your data with a secret in a very non-cryptographically sound way. It depends not only on keeping said secret, but also keeping the use of the serialization framework secret (effectively the "cryptographic protocol"). If you think you can keep a secret from an attacker, then save yourself the trouble and just XOR your keys with it before you hash. ;-)In short: at no point along the way in the process should you be encountering problems where you get the wrong idea about how hashCode() works and how to use it properly in your distributed routing framework.