
The O-Ring Theory of DevOps

70 points by r4um over 9 years ago

6 comments

exelius over 9 years ago
I don't know if this is O-Ring theory as much as a "six sigma" style process. The quality of each output depends on the quality of the inputs, and imperfections grow exponentially when things fall out of tolerance.

The whole point of DevOps is really to get your software developers to share in the pain that certain design/development choices can cause. In the absence of some DevOps-like process, dev teams will naturally shovel all their tech debt onto the operations team. Rather than build their software to handle multi-tenancy, they'll just say "Eh, deploy another VM with a different configuration." That's not necessarily the wrong answer, but because it's less work for them, they don't take into account how hard it will be for the operations team to manage. If the dev team is responsible for building the software, the tools used to manage the deployment, and incident resolution, they'll make decisions that are better for the company and its customers.

That's my take, anyway.
memracom over 9 years ago
This is what people mean when they talk about "technical debt". Many of us have realized that when mistakes are not fixed early, we end up doing 10x, 20x, and even 200x the amount of work in dealing with the consequences of those mistakes.

In software development, good tools like an IDE with static code analysis, and good practices like unit testing + integration testing + functional testing + automated acceptance tests, help us to avoid that 200x downside.

DevOps is not all software development, but part of it is, and so it should respond in the same way to good tools and good test practices. For the rest of it, there are lots of old practices that do make a difference, such as having regular backups of everything, having live-live standby servers, hiring a DBA to clean up the mess in your data models, and NEVER patching the database except using code that has been unit tested and which generates a script to undo everything that it has patched, along with a hash check of the database to guarantee that the undo step worked. And so on.

Maybe the cute moniker "O-ring economics" will help to explain it to management types better, but the jury is still out on that, IMHO.

But what manager would NOT want everybody to work as a team, to learn from their mistakes, and to constantly up their game, as a team? Call it Agile or call it something else, but this is what has been proven over the last 70 years or so (perhaps even longer).
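The reversible-patch discipline described above can be sketched in miniature, e.g. with SQLite; the table, column, and values here are hypothetical, and a real migration tool would be far more thorough:

```python
import sqlite3, hashlib

def db_fingerprint(conn):
    """Hash a canonical dump of the database so any change is detectable."""
    return hashlib.sha256("\n".join(conn.iterdump()).encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'old@example.com')")
conn.commit()

before = db_fingerprint(conn)

# Patch: update a row, recording the undo statement as we go.
undo = []
old = conn.execute("SELECT email FROM users WHERE id = 1").fetchone()[0]
conn.execute("UPDATE users SET email = 'new@example.com' WHERE id = 1")
undo.append(("UPDATE users SET email = ? WHERE id = 1", (old,)))
conn.commit()

# Roll back using the generated undo script, then verify via the hash
# that the undo really restored the database to its prior state.
for sql, params in reversed(undo):
    conn.execute(sql, params)
conn.commit()

print(db_fingerprint(conn) == before)  # True
```

The hash check is the key step: it guarantees the undo worked, rather than merely assuming it did.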
jjuhl over 9 years ago
Excellence in development and excellence in operations do not necessarily overlap. IMHO the whole "devops" term is bogus, since it's just the worst of both worlds.

You want experts at both ends of the spectrum, not a developer who can do some sysadmin tasks but not too well, or an operational engineer who can write some code but not too well. Acknowledge that these are different fields and you need experts in both. Forget about aiming for a hybrid - it doesn't work. You'll just get incompetents who are able to pose as being good at both sides.
powera over 9 years ago
Uh, that isn't the "O-Ring Theory" I would have expected.

The Challenger failed because, while the O-Rings were within tolerances, they shouldn't have been varying *at all*. So because they were varying unexpectedly, they inevitably failed.

I would claim the corresponding devops theory is "measure everything, know what is measurement noise and what is user activity, and eliminate everything else before it causes an outage".

(from Wikipedia: "In one example, early tests resulted in some of the booster rocket's O-rings burning a third of the way through. These O-rings provided the gas-tight seal needed between the vertically stacked cylindrical sections that made up the solid fuel booster. NASA managers recorded this result as demonstrating that the O-rings had a 'safety factor' of 3. Feynman incredulously explains the magnitude of this error: a 'safety factor' refers to the practice of building an object to be capable of withstanding more force than the force to which it will conceivably be subjected. To paraphrase Feynman's example, if engineers built a bridge that could bear 3,000 pounds without any damage, even though it was never expected to bear more than 1,000 pounds in practice, the safety factor would be 3. If a 1,000 pound truck drove across the bridge and it cracked at all, even just a third of the way through a beam, the safety factor is now zero: the bridge is defective.")
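Feynman's bridge arithmetic from the quote above can be put in a few lines (numbers taken from the quote; the function name is ours):

```python
def safety_factor(load_withstood_undamaged, expected_load):
    """Ratio of the load a structure withstands *without any damage*
    to the maximum load it is expected to carry in practice."""
    return load_withstood_undamaged / expected_load

# A bridge built to bear 3,000 lbs undamaged, expected to carry at most 1,000 lbs:
print(safety_factor(3000, 1000))  # 3.0

# But if the bridge cracks at all under its expected 1,000 lb load, the load
# it withstands *undamaged* is below 1,000 lbs -- the real factor drops below 1:
print(safety_factor(900, 1000) < 1)  # True
```

Feynman's point is exactly this: partial O-ring burn-through under normal operation means the "undamaged" numerator was never 3x the load to begin with.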
bmh100 over 9 years ago
This is very true in the world of business intelligence and analytics. A single breakdown in the data pipeline from source to report/dashboard affects the entire system's value. "Garbage in, garbage out" is a maxim. If the ETL pipeline does not clean up the data and correct errors, it can make dashboards nearly worthless.
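Kremer's O-ring production function, which the article's title alludes to, makes this multiplicative fragility concrete: overall output quality is the product of per-stage qualities. A minimal sketch with made-up stage qualities:

```python
from math import prod

def pipeline_quality(stage_qualities):
    """O-ring production function: overall quality is the product of
    per-stage qualities, each a value in [0, 1]."""
    return prod(stage_qualities)

# Five ETL stages, each operating at 95% quality:
print(round(pipeline_quality([0.95] * 5), 3))          # 0.774
# One stage degrading to 50% drags the whole pipeline down:
print(round(pipeline_quality([0.95] * 4 + [0.5]), 3))  # 0.407
```

The multiplicative form is why a single weak stage dominates: no amount of excellence in the other four stages can compensate for it.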
franzpeterfolz over 9 years ago
Let's assume the lead times in the pipeline differ dramatically: l_1 ... l_n-1 take just minutes and l_n takes some days. Maybe it's a huge simulation or the training of a neural network.

In this theory the lead times are normalized and multiplied. But in this pipeline, process n outweighs all the other processes, so that the lead times of the other processes are almost irrelevant compared to the lead time of process n.

a. Let's assume the lead time of process 1 is missed by 50%, so it takes an additional minute.

b. Compare that to process n missing its lead time by 50%, taking an additional week.

The expected quality would be the same for a and b, but the overall lead times will differ by a week.
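The asymmetry described above can be shown numerically; a sketch with assumed lead times (2 minutes for process 1, two weeks for process n - both numbers are hypothetical):

```python
# Planned lead times in minutes (hypothetical).
planned = {"process_1": 2, "process_n": 2 * 7 * 24 * 60}  # 2 min vs. two weeks

def normalized_quality(planned_minutes, actual_minutes):
    """Per-process 'quality' as the theory normalizes it:
    planned lead time divided by actual lead time."""
    return planned_minutes / actual_minutes

# Case a: process 1 misses its lead time by 50% (one extra minute).
q_a = normalized_quality(planned["process_1"], planned["process_1"] * 1.5)
# Case b: process n misses its lead time by 50% (one extra week).
q_b = normalized_quality(planned["process_n"], planned["process_n"] * 1.5)

print(q_a == q_b)  # True: identical normalized quality...

extra_a_minutes = planned["process_1"] * 0.5               # 1 minute
extra_b_weeks = planned["process_n"] * 0.5 / (7 * 24 * 60)  # 1 week
print(extra_a_minutes, extra_b_weeks)  # ...but wildly different absolute delay
```

This is the comment's point: normalizing lead times before multiplying erases the absolute scale, so two "equally bad" misses can differ by a week of real time.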