> <i>We had built a system that generated thousands of lines of logs for every test, with lots of “failures” recorded in them. Things like “tried to initialize FOOBAR_CONTROLLER: FAILED!!!,” but we just ran that code on all machines, even ones without the FOOBAR_CONTROLLER hardware. So no one noticed when another 5 lines of errors popped up in a 2000-line log file.</i><p>This right there is a big red flag. The whole bendy business is bad enough, but here you're actively training people to ignore the wolf cries.<p>Don't allow false failures in tests. The entire test suite needs to be binary: either everything works, or it fails.
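One way to keep the suite binary is to skip hardware-specific tests outright on machines that lack the hardware, instead of running them and logging noise. A minimal pytest sketch (the device path, probe function, and init call are all made up for illustration):

    import os
    import pytest

    def has_foobar_controller():
        # Hypothetical probe: a stand-in for however the platform really
        # detects the FOOBAR_CONTROLLER hardware.
        return os.path.exists("/dev/foobar0")

    @pytest.mark.skipif(not has_foobar_controller(),
                        reason="no FOOBAR_CONTROLLER on this machine")
    def test_foobar_controller_init():
        # Runs only where the hardware exists; everywhere else it shows up
        # as SKIPPED in the summary rather than as a FAILED line everyone
        # has learned to ignore.
        assert init_foobar_controller()  # hypothetical init routine

A skip is reported as a skip, so any FAILED that does appear actually means something.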
The turn in the middle of the article about taking care of yourself is interesting. If you only skimmed the first part of the article, you might miss that positive message.
How speed limits are enforced in America always bothers me, because there is a great disconnect between planners and everybody else.<p>Planners think of a road's speed as intrinsic to its design: if people are going too fast on a road, you change the road by narrowing it, putting in bumps, or something similar.<p>The other view is that a road's speed should be based on what's around it: if there are a lot of houses along a road, people should go slower so they don't hit anyone, so you put in speed limits.<p>But the problem with limits is that when the road feels faster than the posted limit, people just go faster than the limit. And since changing the road is a lot more expensive than putting up a speed limit sign, that tool is used far less frequently.
There's a version of this in SRE where the performance your system delivers becomes the performance people expect. And then they build their systems to depend on that performance, regardless of what your actual SLA is. Paradoxically, delivering better than the performance you're actually capable of sustaining can set things up to break very badly when something fails "within SLA".
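One way teams act on this, in the spirit of not letting clients build on headroom you can't promise, is to pad unusually fast responses toward the published target. A rough asyncio sketch (the 200 ms target and helper name are invented, not anyone's actual SLO machinery):

    import asyncio
    import random
    import time

    TARGET_LATENCY_S = 0.200  # hypothetical published latency SLO

    async def serve_at_slo(coro):
        # Pad responses that come back much faster than the published
        # target, so callers never learn to depend on latency the service
        # does not actually promise to sustain.
        start = time.monotonic()
        result = await coro
        elapsed = time.monotonic() - start
        budget = TARGET_LATENCY_S * random.uniform(0.8, 1.0)  # keep some jitter
        if elapsed < budget:
            await asyncio.sleep(budget - elapsed)
        return result

Called as `await serve_at_slo(backend_request())`, it never slows a genuinely slow response; it only removes the free headroom clients would otherwise come to expect, which is the same logic as deliberately taking an over-reliable service down now and then.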
The article draws on the definitive text in this area by Diane Vaughan[0]. Read her work on the Challenger Launch Decision - it goes into the details of why the deviance was normalised, even down to the level of how important decision-making conference calls marginalised technical input.<p><a href="https://en.wikibooks.org/wiki/Professionalism/Diane_Vaughan_and_the_normalization_of_deviance" rel="nofollow">https://en.wikibooks.org/wiki/Professionalism/Diane_Vaughan_...</a>
> <i>They put their passwords in their wallet and in their phone.</i><p>The author is underplaying the problem here. There were tests that showed burns through the O-rings, and the reports rationalized the danger-- not by normalization of deviance but through deceptive language.<p>It's a lot more like having an audit that shows that no users were observed writing a password on a sheet and putting it in their wallet. And since extant password sheets stored in wallets don't match an idiosyncratic definition of "written down," they pass the audit.<p>That's not to say that normalization of deviance didn't happen. Obviously both it and a more direct type of corruption happened. But I get the sense the author here is trying to cram everything into the former to make a tractable problem out of a messy political situation.
Dan Luu has an article titled exactly the same: <a href="https://danluu.com/wat/" rel="nofollow">https://danluu.com/wat/</a>
Here is an excellent talk on this subject[0]. It's one of the few presentations I like to watch every once in a while.<p>[0] <a href="https://www.youtube.com/watch?v=Ljzj9Msli5o" rel="nofollow">https://www.youtube.com/watch?v=Ljzj9Msli5o</a>
> <i>The crew probably survived in the reinforced cabin until it struck the ocean.</i><p>I went cold reading that.<p>I assumed the explosion took the whole shuttle out instantly.
This unintelligible screed doesn't even get the Challenger disaster right. The Challenger disaster has almost nothing to do with engineering and everything to do with management and politics.