This reminds me very much of Sidney Dekker's work, particularly The Field Guide to Understanding Human Error and Drift Into Failure.

The former focuses on evaluating the system as a whole: identifying the state of mind of the participants in an accident and evaluating what led them to believe they were making the correct decisions, with the understanding that nobody wants to crash a plane.

The latter book talks more about how multiple seemingly independent changes to complex, loosely coupled systems can introduce gaps in safety coverage that aren't immediately obvious, and how those gaps could be avoided.

I think the CAST approach looks appealing. It does seem to require a lot of analysis of failures and near-misses to be best utilized, and the hardest part of implementing it will undoubtedly be the people, who often take the "there wasn't a failure, why should we spend time and energy investigating a success?" mindset.
Like so many things from Google engineering, this will be toxic to your startup. SREs read stuff like this, get main character syndrome, and start redoing the technical designs of all the other teams, and not in a good way.

This phenomenon can occur in any "overlay" function; for example, the legal department will try to run the entire company if you don't have a good leader who keeps the team in its lane.
I wish this article were at most a quarter of its current length, preferably even shorter. There's so much self-congratulatory and empty talk that it's really hard to get to the main point.

I think the most important (and actually valuable) part is the mention of work done by someone else (STPA and CAST). That's all there is to the article. Read about Causal Analysis based on Systems Theory (CAST) and System-Theoretic Process Analysis (STPA), and do what the book says.
Couple thoughts here:
1. The “rightsizer” example mentioned might well have had the same outcome if the outage had been analyzed in a “traditional” sense. That said, the analysis is much easier and more actionable with this new approach.
2. I’ve always hated software testing because faults can occur external to the software being tested. It’s difficult to reason about those if you have a myopic view of just your component of the system. This line of thinking somewhat fixes that, or at least paves a path toward fixing it (rough sketch below).

Unfortunately, while this article says a lot, much of it is repetitive, and I wish there were more detail. For example: who all is involved in this process? Are there limits on what can be controlled? How (politically) does this all shake out with respect to the relationships between SREs and software engineers? Etc.
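Here's the rough sketch I mentioned: a minimal Python test that injects a fault originating outside the component under test. fetch_profile and the client are hypothetical names I made up for illustration, not anything from the article:

    # Sketch: verify a component degrades gracefully when an
    # external dependency (a downstream service client) fails.
    from unittest import mock
    import unittest

    def fetch_profile(client, user_id):
        try:
            return client.get(f"/profiles/{user_id}")
        except TimeoutError:
            # Degrade rather than crash when the dependency fails.
            return None

    class ExternalFaultTest(unittest.TestCase):
        def test_downstream_timeout_is_handled(self):
            client = mock.Mock()
            # Inject the external fault: the dependency times out.
            client.get.side_effect = TimeoutError
            self.assertIsNone(fetch_profile(client, "u123"))

    if __name__ == "__main__":
        unittest.main()

Of course this only covers the external faults you thought to inject; the systems view is about surfacing the ones you didn't.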
I've been reading about CAST (Causal Analysis based on Systems Theory) and noticed some interesting parallels with mechanistic interpretability work. Rather than searching for root causes, CAST provides frameworks for analyzing how system components interact and why they "believe" their decisions are correct - which seems relevant to understanding neural networks.

I'm curious if anyone has tried applying formal safety engineering frameworks to neural net analysis. The methods for tracing complex causal chains and system-level behaviors in CAST seem potentially useful, but I'd love to hear from people who understand both fields better than I do. Is this a meaningful connection or am I pattern-matching too aggressively?
The article describes Causal Analysis based on Systems Theory (CAST), which is akin to many-factor root cause analysis.

I am a big fan of CAST for software teams, and of MIT Prof. Nancy Leveson, who created CAST.

My CAST summary notes for tech teams:

https://github.com/joelparkerhenderson/causal-analysis-based-on-system-theory

MIT CAST Handbook:

http://sunnyday.mit.edu/CAST-Handbook.pdf
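To give a flavor of what a CAST analysis works over, here's a minimal sketch in Python of a safety control structure: controllers with process models (beliefs), control actions, and feedback. The rightsizer naming is borrowed from the article's example; everything else is illustrative, not from the handbook:

    # Sketch of the kind of model CAST reasons over: a controller,
    # its (possibly wrong) beliefs, its actions, and its feedback.
    from dataclasses import dataclass, field

    @dataclass
    class Controller:
        name: str
        process_model: dict                 # what the controller believes
        control_actions: list = field(default_factory=list)
        feedback: list = field(default_factory=list)

    # Hypothetical rightsizer, loosely modeled on the article's example:
    # it acts correctly given its beliefs, but the beliefs are stale.
    rightsizer = Controller(
        name="rightsizer",
        process_model={"task_needs_mb": 512},     # stale/wrong belief
        control_actions=["set_memory_limit"],
        feedback=["usage_metrics (delayed)"],
    )

    def analyze(controller: Controller, unsafe_action: str) -> list[str]:
        """Generate CAST-style questions instead of a single root cause."""
        return [
            f"Why did {controller.name}'s process model justify '{unsafe_action}'?",
            f"Which feedback ({controller.feedback}) was missing, delayed, or wrong?",
            f"What safety constraint should have bounded '{unsafe_action}'?",
        ]

    for question in analyze(rightsizer, "set_memory_limit"):
        print(question)

The point is that the output is a set of questions about the whole control structure rather than a single root cause to pin on one component.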
I wonder at what scale this very interesting approach starts yielding more value than it costs.

What I mean is: is it FAANG-only, like so many of the things they have seeded, or is it actually relevant at non-FAANG scale?

I tend to invest a lot in risk avoidance, so this is appealing to me, but I know that my risk-avoidance tendency is rarely shared by my peers/stakeholders.
From what I gathered skimming the article, and especially spending a bit of time on their example, the authors invented a complicated system into/onto which they try to fit the real world; when it invariably doesn't fit, they use band-aids, self-reported, to fix real-world problems. In their example, the rightsizer should never have set a wrong size, as the size should have been described or prescribed properly; thus they failed.

A quick connection I made is to when I was learning RDF and its various incantations and trying to describe the real world. I never did figure it out, but I did learn it's a very hard problem.
I think the single biggest thing about Google SREs (at least in the early years) was that if your team was going to launch a new product, you had to have an SRE to help launch and to maintain the service.

Google deliberately limited the number of SREs, so you had to prove your stuff worked and sell it to the SREs to even get a chance to launch.

Constraints help to make good ideas better...
It took Google more than 10 years after I showed them the problem with their approach to service management (which was much aligned with SRE) to reach this awareness of the need for service cognition, but here we are.

https://www.linkedin.com/posts/william-david-louth_devops-sre-servicecognition-activity-7281630811076931585-hjSE
SWEs: are SRE/devops folks part of your day to day?

I have never been in a SWE role where I didn't do my own "ops", even at FAANG (I haven't worked at Google). I know "SRE/devops" was/is buzzy in the industry, but in the vast majority of cases it has always seemed to be a half-assed rebrand of the old-school "operations" role -- hardly a revolution. In general, I think my teams have benefited from doing our own ops. The software certainly has. Am I living in a bubble?
I don't think Ben Treynor knows what SRE at Google is anymore. I've heard from multiple sources that he's checked out, retired, and chilling on his ranch.

I'm sure there's some team at Google that does this, but this reads like yet another "how Google works" book that nobody at Google recognises.
They're doing that thing that happened to DevOps. It started out as a guy who wanted a way for devs and sysadmins to talk about deploys together, so they didn't get dead-cat syndrome. It ended up as an entire branch of business management theory, consultants, and a whole lot of ignorant people who think it just means "a dev who does sysadmin tasks".

Abuse of a single word to mean too many things makes it meaningless. SRE now has that distinction. You've got SREs who (mostly) write software, SREs who (mostly) stare at graphs and deal with random system failures, and now SREs who use a framework to develop a complex risk model of multiple systems (which is more quality control than engineering).
> Looking at a data flow diagram with more than 100 nodes is overwhelming—where do you even begin to search for flaws?

Yeah, so maybe try not to build anything that complex to start with.
These days I wonder if Google is really the example to follow. There was a time 10 or 15 years ago when Google seemed to be leading the industry in everything, and I feel like a lot of people still think they do when it comes to engineering culture. These days I tend to see Google as a bit of a red flag on a resume, and I have a set of questions I ask to make sure the candidate didn't drink too much of the koolaid.

Perhaps more importantly, when I look at Google from the outside these days, I see that their products have really gone downhill in terms of quality: Google Search riddled with spam, Gemini struggling to keep up with OpenAI, Google Chat trying to keep up with Slack but missing the mark, Nest stagnating; I could go on and on. All this to say that I don't think Google is the North Star it used to be for guiding engineering culture throughout the industry.
SRE == Site Reliability Engineering.

Quoting Wikipedia:

Site Reliability Engineering (SRE) is a discipline in the field of Software Engineering that monitors and improves the availability and performance of deployed software systems, often large software services that are expected to deliver reliable response times across events such as new software deployments, hardware failures, and cybersecurity attacks. There is typically a focus on automation and an Infrastructure as code methodology. SRE uses elements of software engineering, IT infrastructure, web development, and operations to assist with reliability. It is similar to DevOps as they both aim to improve the reliability and availability of deployed software systems.

https://en.wikipedia.org/wiki/Site_reliability_engineering