What strikes me about this investigation is that it clearly demonstrates the value of having access to the source code for your full stack, and also openness in the specification of how the protocols work and how everything fits together. Count the number of different software components from separate organisations that this investigation required information from:<p>* Apache ZooKeeper<p>* The Apache Curator ZooKeeper client library, originally from Netflix<p>* The Linux kernel<p>* Arguably the authors of RFC 3948, which specifies the protocol and expected behaviour of the networking components<p>* The Xen hypervisor<p>Now imagine that each of these components was closed, proprietary code from a separate organisation and you had to rely on support from each supplier to get to the bottom of the issue. It's unlikely that the customer would be able to successfully identify the issues without access to the source code. But at the same time it is unlikely that any individual supplier would be able to identify the problem as none of them can see the full picture either.
All kinds of badness here. Bug #2 really reduces the level of comfort I would have with using ZooKeeper as a tool.<p>First of all, the default Java policy of terminating the thread, instead of the process, when a runtime exception is not handled is fully boneheaded and the first thing you should always do in a server program is to set a default uncaught exception handler which kills the program. Much better to flame out spectacularly than to limp along with your fingers crossed hoping for the best, as this bug amply demonstrates.<p>On the heels of that, there's this: "Unfortunately, that means the heartbeat mechanisms would continue to run as well, deceiving the followers into thinking that the leader is healthy." Major rookie mistake here; the heartbeat should be generated by the same code (e.g. polling loop) which does the actual work, or should be conditioned on the progress of such work. There's no indication that ZooKeeper is bad enough to have a separate thread whose only responsibility is to periodically generate the heartbeat (a shockingly common implementation choice), but it is clearly not monitoring the health of the program effectively.<p>Suffering a kernel level bug is outside the control of a program, but this demonstrates a lack of diligence or experience in applying the appropriate safety mechanisms to construct a properly functioning component of a distributed system.
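For concreteness, here's a minimal sketch of that fail-fast setup (the class name, messages, and simulated worker are mine; the API calls are standard Java):<p><pre><code>  // Install a default handler so any uncaught Throwable in any thread
  // kills the whole process instead of just that thread.
  public final class FailFast {
      public static void install() {
          Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
              // Log what we can, then halt hard; halt() skips shutdown hooks,
              // so a half-broken process cannot keep heartbeating on the way out.
              System.err.println("Uncaught exception in thread " + thread.getName());
              error.printStackTrace();
              Runtime.getRuntime().halt(1);
          });
      }

      public static void main(String[] args) throws InterruptedException {
          install();
          // Simulate a worker thread dying unexpectedly.
          Thread worker = new Thread(() -> {
              throw new RuntimeException("simulated unhandled failure");
          }, "request-processor");
          worker.start();
          worker.join();
          // Never reached: the handler halts the JVM when the worker dies.
          System.out.println("limping along");
      }
  }
</code></pre>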
I'm impressed. Not by the first 90%; that's par for the course (and why I'm a big believer in open source). No, what impresses me is the sharing of their solution. A solution which, in their own words, is 'not a proper fix'. The engineer in me would be embarrassed by the fix, but you can't fault that it works. There's a workaround, so from a business perspective any further time spent fixing this could instead be spent on new features.<p>Still, I have a bunch of unanswered questions. Why not just upgrade all the hosts to Xen 4.4? Does recompiling the 3.0+ kernel without the bad 'if' in /net/ipv4/esp4.c make the problem go away? Does the problem happen if there's only one VM on a host? Of the seven AES-NI instructions, which one is faulting? How often does it fault? The final question, though, is what causes it to fault?
When I read reports like this on HN I am absolutely floored by the level of detail and the quality of the work put not only into writing them, but into getting to the bottom (well, almost!) of the problem. Fantastic work. How do you do it? My server-side team is ~5 engineers (and 1 devops), and we struggle just to keep up with the incoming feature requests, let alone work on improving the infrastructure, let alone maintain an engineering blog or do this kind of research. Is there a good way to foster a culture in which this kind of work is held as important?
>While checksumming is a great way to detect in-flight corruption, it can also be used as a tool to detect corruption during the formation of the packet. It is the latter point that was overlooked, and this optimization has come to bite us. Lack of validation here let our mystery corruption through the gate – giving ZooKeeper bad data which it reasonably believed was protected by TCP. We claim this is a bug – intentional or not.<p>This would only provide a false sense of security. It only tells you that there was a bug in the formation of the packet <i>after</i> the checksum was calculated. If your failure case assumes a system can't put together a packet, how can you assume that it even makes it to the checksum calculation step correctly?<p>Edit: There is also another downside to enabling TCP checksumming after decryption. It eliminates the possibility of hardware TCP checksum offloading, so you would be paying the performance cost of software checksumming on every packet. This is why the RFC was written that way to begin with...
AES-NI instructions need to use the XMM registers. My guess is that someone forgot they have to be saved/restored when AES-NI has been used.<p>There have been a few Xen bugs around saving/restoring state which leads to information disclosure (one VM can read values in registers from another), but another manifestation of this type of bug is corrupted registers.
There are a number of alternatives to ZooKeeper (etcd, Consul, etc). There are a number of systems which specifically require ZooKeeper (Kafka springs to mind).<p>How plausible would it be to replace hard dependencies on ZooKeeper with a dependency which could be fulfilled by any of the alternatives?<p>For example, could there be a standard protocol? Could someone implement the ZooKeeper protocol on top of Consul? Could we define a local client API with pluggable implementations for each alternative (like ODBC etc)?<p>Or are the semantics of the 'alternatives' just too different?
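To make the ODBC analogy concrete, a hypothetical (entirely made-up) client interface might look something like this, with each backend shipping its own adapter:<p><pre><code>  import java.util.List;

  // Hypothetical portable coordination API; every name here is invented.
  public interface CoordinationClient extends AutoCloseable {
      // Ephemeral entries vanish when the client's session/lease ends
      // (znode in ZooKeeper, lease in etcd, session in Consul).
      void createEphemeral(String path, byte[] data) throws Exception;

      // Notification when anything at or under the path changes.
      void watch(String path, Runnable onChange) throws Exception;

      List<String> children(String path) throws Exception;
  }
</code></pre>The hard part is exactly the last question above: ephemeral/session lifetimes, watch semantics, and ordering guarantees differ enough between the systems that a lowest-common-denominator API might not be able to promise very much.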
A fantastic investigation. These are always fun to read, but also scary in how bugs at so many layers of the stack combined to produce a problem. You might end up being the only one affected by something, and if you don't have the technical expertise to figure it out, you're in big trouble.
Zombie ZooKeeper nodes that appear as healthy members of the cluster after an OOM are something that can cause major problems.<p>There are two quick solutions on the operational side that can be deployed to prevent this:<p>- Run each ZK server node with the JVM OnOutOfMemoryError flag, e.g.: -XX:OnOutOfMemoryError="kill -9 %p"<p>- Have your monitoring detect an OOM in the zookeeper.out log, and use your supervisor to restart the failing ZK node.<p>ZooKeeper is designed to fail fast, and any OOM should cause an immediate process shutdown, of course followed by an automatic start of a new process by whatever is supervising it.
> Bug #3 – Obscure Behavior<p>> We claim this is a bug – intentional or not.<p>I like seeing this described as a bug. Bug #3 is documented behavior (there's a comment right there in the source calling it out), it does what it's intended to do, and the relevant RFC says that it should be doing that. It's only a "bug" in the sense that the behavior is nevertheless metaphysically incorrect.<p>I once wrote Java code doing bit shifts of arbitrary length as part of a programming contest at my school. It failed on a single test case for mysterious reasons. I eventually discovered that my algorithm was totally correct -- but I had been assuming that if you shift a 32-bit integer by 32 or more bits, you'll get zero. In fact, and this is <i>required</i> by the Java specification, you get something else entirely. More specifically, the Java standard mandates that "a << b" be silently compiled to "a << (b % 32)". So I had to rewrite code like this:<p><pre><code> bit_pattern = bit_pattern << potentially_large_value;
</code></pre>
into something more like this:<p><pre><code> // if you shift by x, Java will silently rewrite it as a shift by (x % 32)
  while( potentially_large_value > 31 ) {
      bit_pattern = bit_pattern << 31;
      potentially_large_value -= 31;
  }
  bit_pattern = bit_pattern << potentially_large_value;
</code></pre>
I can imagine no circumstance where this would be useful or helpful in any way. Even later, I found out that the JVM's int shift instructions only use the low five bits of the shift count, and I surmise that the standard is written the way it is to make it slightly (slightly!) simpler to compile a Java shift statement into a JVM bytecode instruction. I can't call this a bug in javac -- it's doing exactly what it's required to do! But the only way it makes sense to me to describe this is as a bug in the Java specification. If I say to do something 45 times, that should be the same thing as doing it 15 times, and then another 15 times, and then 15 more times. It shouldn't instead be the same as doing it 13 times.
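For anyone who wants to see the masking behaviour (JLS §15.19) for themselves, a quick self-contained demo:<p><pre><code>  public final class ShiftDemo {
      public static void main(String[] args) {
          int one = 1;
          System.out.println(one << 32); // prints 1, not 0: the shift is taken mod 32
          System.out.println(one << 45); // prints 8192, i.e. the same as 1 << 13
          System.out.println(one << 13); // prints 8192

          // For long operands the shift is taken mod 64 instead.
          long wide = 1L;
          System.out.println(wide << 64); // prints 1
      }
  }
</code></pre>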
There are alternatives to ZooKeeper. Here's our work on using a NewSQL DB (NDB) to do leader election. Instead of using Zab as the replication protocol, it uses transactions with low TransactionInactive timeouts (5 seconds), placing a ~5 second upper bound on leader election time. The upper bound increases dynamically if the number of nodes is >100. A rough sketch of the general idea follows after the link.
<a href="http://www.jimdowling.info/sites/default/files/leader_election_using_newsql_db.pdf" rel="nofollow">http://www.jimdowling.info/sites/default/files/leader_electi...</a>
Whenever I hear of deep mysterious bugs found in critical encryption systems, the conspiracy theorist in me can't help but wonder...<p>Did they stumble upon a very well hidden back door? This might not be the last we hear of this.
Isn't all that addressed in Zookeeper 2: Zookeepier?<p><a href="https://www.youtube.com/watch?v=_F-RyuDLR4o" rel="nofollow">https://www.youtube.com/watch?v=_F-RyuDLR4o</a>