This is part of the Israeli Air Force safety culture: <a href="https://www.talentgrow.com/podcast/episode92" rel="nofollow">https://www.talentgrow.com/podcast/episode92</a><p>"By implementing a unique culture and methodology, the Israeli Air Force became one of the best in the world in terms of quality, safety, and training, effectively cutting accidents by 95%."<p>Near misses and mistakes are investigated, and there's a culture that supports reporting them; that has resulted in a huge change to the overall incident rate.<p>Software is a little like this as well. The observable quality issues are just the tip of the iceberg. Many bugs and architectural issues can lurk beneath the surface, and only some random subset of those will have impact. Focusing only on the impactful ones, without dealing with what's under the surface, may not have a material effect on quality.
For a mind-boggling near-miss account that no one cared about, see <a href="https://avherald.com/h?article=4b6eb830" rel="nofollow">https://avherald.com/h?article=4b6eb830</a>. Note how the FAA didn’t even have a record of it.<p>And that’s an extreme case. How many less extreme ones happen?<p>In the US alone, there’s research estimating the number of air fume events at around 2000 per year. The number of <i>reported</i> fume events is less than 10.<p>Mental degradation is insidious. “Just wear an oxygen mas-” what if you <i>forgot</i> about the oxygen mask? You forgot to drop the gear.<p>Extensive training (to the point of automation) and human resilience are perhaps the main reasons fumes do not seem to be causing many incidents, but resilience varies between individuals, and training cannot drill into pilots the correct intuitive response to every possible scenario. In addition, it’s unknown in how many incidents attributed to pilot error the mistake was in turn caused by partial mental incapacitation (of which perhaps not even the pilot was aware).
When deciding whether to do an incident investigation for a near miss, one aspect to consider is whether it was caught by a safety system as designed, or caught by a lucky accident. The latter should be top of the priority list.<p>E.g., Bad package deployed to production. Stopped because it didn’t have the “CI tested” flag: low pri. Stopped because someone happened to notice a high CPU alert before the load balancer switched over: high pri.
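A minimal sketch of that triage rule in Python (the field names and example events are illustrative, not from the comment above):
<pre><code>
from dataclasses import dataclass

@dataclass
class NearMiss:
    description: str
    caught_by_designed_control: bool  # True if a safety system stopped it as designed

def triage_priority(event: NearMiss) -> str:
    """Near misses caught only by luck jump to the top of the review queue."""
    if event.caught_by_designed_control:
        # The safety net worked as intended; review at the normal cadence.
        return "low"
    # Nothing by design stood between this event and an incident; investigate now.
    return "high"

print(triage_priority(NearMiss("deploy blocked by missing 'CI tested' flag", True)))           # low
print(triage_priority(NearMiss("high CPU alert noticed before load balancer switched", False)))  # high
</code></pre>
The useful part is forcing the question at review time: did a designed control catch this, or did we just get lucky?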
"Here's one they just made up: near miss. When two planes almost collide they call it a near miss. It's a near hit! A collision is a near miss."
~ George Carlin [0]<p>[0]: <a href="https://www.youtube.com/watch?v=zDKdvTecYAM" rel="nofollow">https://www.youtube.com/watch?v=zDKdvTecYAM</a>
In <i>The Field Guide to Human Error Investigations</i>, Dekker talks about how the “number of minor incidents” correlates inversely with the “number of fatal incidents” for airlines (scaled per flight hour or whatever). I have forgotten whether this was all airlines or Western ones only. I wonder if it still holds.<p>The rest of the book is also quite a good read, including a fun take on Murphy’s Law that goes “Everything that can go wrong will go right”, which is the basis for normalization of deviance: where a group of people (an org, whatever) slowly drifts away from its own standards as it “gets away with it”.<p>I wonder how modern organizations fight this. Most critically, I imagine warfighting ability can see massive multipliers based on adherence to design, but civilian performance too, to a lesser extent (the outcome is often less catastrophically binary).<p>Anyway, I got a lot of mileage out of the safety book wrt software engineering.
A simple decision rule in (personal) aviation is three strikes.<p>e.g., (1) nervous passenger; (2) clouds coming in; (3) running late -> abort.<p>It produces very conservative decisions and overcomes the drive to just try.<p>But the interesting part is that you then realize how often you are at two strikes. That in itself makes you more careful.<p>Two strikes I would call "noticeable". I wouldn't wait for near-miss events; counting two-strike days gives a measure of how on-edge we're running.<p>So at work, I just put a red dot on the calendar if it's a day with something urgent and visible to outsiders, or if we're having problems we don't see our way out of. It keeps us from tolerating long stretches of stress before taking a step back, and we usually also do attribution: if x is causing n>4 red days per month, it gets attention.<p>Obviously this varies with context: a high-achieving team would be almost always red internally, but rarely externally.
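A rough sketch of the red-day attribution part (the log entries and threshold here are made up for illustration):
<pre><code>
from collections import Counter
from datetime import date

# Hypothetical red-day log: (day, what made it red)
red_days = [
    (date(2024, 5, 2),  "flaky deploy pipeline"),
    (date(2024, 5, 6),  "flaky deploy pipeline"),
    (date(2024, 5, 9),  "customer escalation"),
    (date(2024, 5, 13), "flaky deploy pipeline"),
    (date(2024, 5, 20), "flaky deploy pipeline"),
    (date(2024, 5, 27), "flaky deploy pipeline"),
]

RED_DAYS_THRESHOLD = 4  # a cause behind n > 4 red days in a month gets attention

def causes_needing_attention(log, year, month):
    counts = Counter(cause for day, cause in log if (day.year, day.month) == (year, month))
    return [cause for cause, n in counts.items() if n > RED_DAYS_THRESHOLD]

print(causes_needing_attention(red_days, 2024, 5))  # ['flaky deploy pipeline']
</code></pre>
The point isn't the tooling; it's making the "we're on edge" state visible and attributable before anything actually goes wrong.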
Useful for software or engineering, of course, but also useful for everyday life - safety, relationships, cooking, etc. People (sometimes) learn from painful mistakes, but rarely learn from the painless ones!
This is the most important thing about riding a motorcycle, too. If you almost crash, you just got lucky. Consider the root causes, don't just be proud you escaped the situation.
I have 10+ year old projects on GitHub and virtually all of them are now riddled with security problems, or their dependencies are, in any case. The alert emails are comprehensive. This tells me that software is inherently unsafe; it’s just a matter of time until flaws are found.<p>OK, you could say that quality is improving and this is becoming less and less the case, but from experience at work I would say that’s wishful thinking, and if anything it’s the opposite.
> we should be treating our near misses as first-class entities, the way we do with incidents.<p>That's exactly like driving. You have to take your close calls seriously and reflect on them to improve your habits of observation.
Working in Big Tech: my colleagues aren't.<p>This can be taken with both admiration and distaste. I'll either be rewarded with a lesson or a knife.
The article is spot on. It's pretty much what happened when Maersk was hacked within an inch of bankruptcy.<p>The flaw was identified, flagged and acknowledged before it happened:<p>"In 2016, one group of IT executives had pushed for a preemptive security redesign of Maersk’s entire global network. They called attention to Maersk’s less-than-perfect software patching, outdated operating systems, and above all insufficient network segmentation. That last vulnerability in particular, they warned, could allow malware with access to one part of the network to spread wildly beyond its initial foothold, exactly as NotPetya would the next year."<p>But:<p>"The security revamp was green-lit and budgeted. But its success was never made a so-called key performance indicator for Maersk’s most senior IT overseers, so implementing it wouldn’t contribute to their bonuses."<p>Basically, a near miss that wasn't incentivised for anyone to fix.<p>If you're interested in this type of story, it's an absolute thriller to read:
<a href="https://archive.is/Gyu2T#selection-3563.0-3563.212" rel="nofollow">https://archive.is/Gyu2T#selection-3563.0-3563.212</a>
For better or worse, a near-miss has zero cost to the org as a whole and thus justifies zero org-level investment.<p>That is okay as long as someone is noticing! As stated in the article, these types of near misses are noticed within the team and mitigated at that level, so the org doesn’t need to respond.<p>That’s a cost-effective way to deal with them, so I would argue everything works the way it should.
Oh, well, surprisingly, it seems this article hadn't been posted here yet?<p>Do enjoy the discussion, and, whatever you do, <i>please</i> don't let the apparent incongruity of "near miss" when it's <i>clear</i> that it should be "near accident" derail the conversation... (insert-innocent-smiley-here)