Real System Failures

175 points by nz almost 8 years ago

11 comments

bsder almost 8 years ago

Quoting: "For more than 30 years, our design lab has seen that no IC greater than 16 pins (except memory) has worked according to its documentation."

That matches the experience of every single embedded engineer I have ever known.

TeMPOraL almost 8 years ago

This was an absolutely amazing read. A few lessons I took from it:

- "There is no such thing as digital circuitry. There is only analog circuitry driven to extremes." Digital is a pretty leaky abstraction. You can't safely ignore the physical world. In particular, be wary of your digital parts changing into other parts (or new "parts" appearing out of the blue) thanks to physics.
- There's so much that can go wrong. I'm in awe of people working on life-critical systems, and of the challenges they deal with.
- What the fuck is going on with IC durability? The presentation quotes a text from 2013, which says "Commercial semiconductor road maps show component reliability timescales are being reduced to 5–7 years, more closely aligning with commercial product life cycles of 2–3 years." I.e. if your device has modern electronics on board, it already won't last long, because the semiconductor devices themselves are expected to fail naturally after a few years. This makes me really sad about the state of our technological civilization.
- Don't ignore specs you don't fully, 100% honest-to-god understand! Slide 38 is a damning enough description by itself. I'd add that this also applies to bureaucracies and laws: just because you think some rule is stupid doesn't mean it is. The "move fast and break things" approach has no place where lives (or livelihoods) can be affected.
- Even adding a node to a linked list isn't a trivial thing, and has many places in which you can screw it up. This highlights just how much accumulated complexity we're dealing with here.
- Life always finds a way... to grow in your electronics and break it.

deathanatos almost 8 years ago

On slide 10, faulty hardware introduces a standing wave onto a bus. Two CPUs are at nodes and two at antinodes, causing a 2-2 disagreement in the state of the system.

Yet the slide goes on to argue this is a software problem? It was my impression that Byzantine-tolerant systems required agreement among ⅔ of the nodes; if the system is split 50/50, how can even a tolerant system not fail? (Or rather, is it the difference between failing gracefully and failing spectacularly, and the slide fails to elaborate on exactly how the system failed? But I don't see how we can expect this to succeed.)
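
To put rough numbers on the ⅔ intuition above: the classic bound for Byzantine agreement with unsigned messages is n ≥ 3f + 1 nodes to tolerate f faulty ones, so four CPUs can mask at most one. The sketch below is only an illustration of that arithmetic, not the system from the slides; the function names and the 2f + 1 quorum rule (a common convention in BFT protocols) are assumptions made for the example.

```python
from collections import Counter


def max_byzantine_faults(n: int) -> int:
    """Largest f satisfying n >= 3f + 1 (bound for unsigned messages)."""
    if n < 1:
        raise ValueError("need at least one node")
    return (n - 1) // 3


def agreed_value(votes):
    """Return the value reported by at least 2f + 1 nodes, or None.

    `votes` holds one reported value per node. The 2f + 1 quorum is an
    assumed convention for this sketch, not something taken from the slides.
    """
    n = len(votes)
    f = max_byzantine_faults(n)
    quorum = 2 * f + 1
    value, count = Counter(votes).most_common(1)[0]
    return value if count >= quorum else None
```

With four nodes the quorum is three, so `agreed_value(["A", "A", "B", "B"])` finds no agreed value: a 2-2 split is outside what a four-node system can mask by voting alone.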

mcguire almost 8 years ago

This is a fascinating presentation, including scenarios of real, honest-to-god Byzantine failures.

Also, it's a great example of the Edward-Tufte-hating, horribly, hilariously bad PowerPoint NASA presentation style.

idlewords almost 8 years ago

It's stuff like this that makes me wonder how big the gap is between disaster planning in large scale computer infrastructure (like AWS), and what will happen when there is an actual major disaster like a large earthquake.

The amount of confidence people have in their ability to plan for contingencies seems to go down in proportion to their exposure to hardware. Complex systems are endlessly inventive when it comes to finding ways to fail.

contingencies almost 8 years ago

All of these problems could have been found by formal analysis.

"If only we'd had the human, time, money and organizational support resources to plan ahead more accurately, we wouldn't have made this particular mistake!" That's called the benefit of hindsight, and it's the project manager's classic "told you so". To management it sounds like "give me more budget and a slacker timeline", and to engineering it sounds like "someone wants to use a different one-true-solves-all-problems-solution".

Experienced system designers know that the real art is knowing that out in the real world, things will fail no matter how careful you are, so anticipating and detecting both known and unknown failure modes and recovering from them is really the critical need.

For an accessible, real world study of how this can be achieved with arbitrarily complex software systems, I can highly recommend reading about Erlang, or alternatively deploying a nontrivial pacemaker/corosync cluster. Most engineers never build a system this resilient in their lifetime, but once you have, you can never look back.
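
The "detect and recover" idea above can be sketched very roughly as a supervisor that restarts a worker whenever it crashes, whatever the cause. This is only loosely inspired by Erlang-style supervision and is not the Erlang/OTP or pacemaker API; `supervise`, `worker`, and `max_restarts` are hypothetical names chosen for the example.

```python
def supervise(worker, max_restarts=3):
    """Run worker(attempt); on any exception, restart it.

    Returns (result, crashes), where crashes lists the failures seen
    before a successful run. Gives up after max_restarts restarts.
    """
    crashes = []
    for attempt in range(max_restarts + 1):
        try:
            return worker(attempt), crashes
        except Exception as exc:
            # Known and unknown failure modes are handled the same way:
            # record the crash and start the worker again from scratch.
            crashes.append(repr(exc))
    raise RuntimeError(f"worker failed {len(crashes)} times: {crashes}")


def flaky(attempt):
    """Hypothetical worker that fails on its first two attempts."""
    if attempt < 2:
        raise ValueError("transient glitch")
    return "ok"
```

Calling `supervise(flaky)` returns the worker's result together with the two recorded crashes. The design choice is that the supervisor does not need to understand a failure in order to recover from it.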

0xCMP almost 8 years ago

The interesting thing here is that he says they should have used more formal analysis to build failure- and fault-tolerant systems.

But how do you formally verify/analyze a system for fault and failure tolerance if the methods of detecting failures and other faults are themselves not enough?

E.g. the slide about COM/MON, which I admit I didn't fully understand, seems to say that the solution picked wasn't the very best possible one due to constraints, and that failures were not detected at the point where they were expected to be.

I guess you would at least know that those are failure/fault points which cannot be tolerated or handled somehow and should be watched.

ricksharp almost 8 years ago

Ok, I'm not a systems developer (I'm a full stack / cloud developer), so I don't usually work with systems that introduce analog faults (operations in software tend to either succeed or fail with an exception).

The only place I have encountered something like this was on an Arduino board where the use of a buzzer was causing a voltage drop that affected the logic of the code. (It appeared that a delay function returned immediately instead of taking 250ms, which sped up the loop.)

Question: How do you actually implement Byzantine Fault Tolerance?

I found this in Wikipedia:

"Byzantine fault tolerance mechanisms use components that repeat an incoming message (or just its signature) to other recipients of that incoming message. All these mechanisms make the assumption that the act of repeating a message blocks the propagation of Byzantine symptoms."

Is verifying the interpreted input value the primary way to design for Byzantine Fault Tolerance?
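
One textbook way to see the "repeat the incoming message" idea from the quote is the one-round oral-messages algorithm, OM(1), from the Byzantine Generals paper: a commander sends a value to three lieutenants, each lieutenant relays what it received to the others, and each takes a majority. The simulation below is a simplified sketch under stated assumptions, not a production protocol: the names (`om1`, `majority`, `DEFAULT`) are made up for the example, and a traitorous lieutenant here relays the same lie to everyone, whereas a real one could tell each peer something different.

```python
from collections import Counter

DEFAULT = "RETREAT"  # assumed fallback when no strict majority exists


def majority(values):
    """Strict majority of values, or DEFAULT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else DEFAULT


def om1(commander_sends, traitor_lieutenants=frozenset(), lie="RETREAT"):
    """Simulate OM(1) for one commander and its lieutenants.

    commander_sends: dict mapping lieutenant -> value the commander sent
        to it (a faulty commander may send different values).
    traitor_lieutenants: lieutenants that relay `lie` instead of the
        value they actually received.
    Returns a dict mapping each loyal lieutenant to its decision.
    """
    lieutenants = list(commander_sends)
    decisions = {}
    for me in lieutenants:
        if me in traitor_lieutenants:
            continue  # we only care what loyal lieutenants decide
        seen = [commander_sends[me]]  # value received directly
        for other in lieutenants:
            if other == me:
                continue
            # Each peer repeats the message it got from the commander.
            if other in traitor_lieutenants:
                seen.append(lie)
            else:
                seen.append(commander_sends[other])
        decisions[me] = majority(seen)
    return decisions
```

For example, a faulty commander that sends `{"L1": "ATTACK", "L2": "ATTACK", "L3": "RETREAT"}` still leaves all three lieutenants with the same three values to vote on, so they reach the same decision. The comparison of relayed copies, rather than trust in any single message, is what masks the single fault.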

sengork almost 8 years ago

As a side note, this was the first time in over a decade that I've seen Honeywell mentioned or HTML map tags in use.

Baeocystin almost 8 years ago

Here's a working link for the 'magic story' link from slide 29. It's a great bit of hacker lore, if you haven't yet read it.

http://catb.org/jargon/html/magic-story.html

wmu almost 8 years ago

Wow, one of the most interesting things I've seen recently.

It shows how lucky an average programmer is. We have to deal with relatively easy issues; we can modify code, recompile, debug and repeat until success. :)