As I get older, I find myself enjoying these types of stories less and less. My issue comes from the fact that nobody seems comfortable having a conversation about facts and data, instead resorting to childish analogies about turning knobs.<p>That’s not how our jobs work. We don’t “adjust a carefulness meter.” We make conscious choices day to day based on the work we’re doing and past experience. As an EM, I’d be very disappointed if the incident post mortem was reduced to “your team needs to be more careful.”<p>What I want from a post mortem is to know how we could prevent, detect or mitigate similar incidents in future and to make those changes to code or process. We then need to lean on data and experience to weigh what the trade-offs of those changes would be. Adding a test? Go for it. Adding extra layers of approval before shipping? I’ll need to see some very strong reasons for that.
I’ve been in this industry a long time. I’ve read How to Lie with Statistics, and a bunch of Tufte. I don’t think it would be too much hyperbole to say I’ve spent almost half a year of cumulative professional time (2-3 hours a month) arguing with people about bad graphs. And it’s always about the same half dozen things or variants on them.<p>The line in your carefulness graph starts out with no slope. Which means you’re basically telling X that we can turn carefulness to 6 with no real change in delivery date. Are you sure that’s the message you’re trying to send?<p>Managers go through the five stages of grief every time they ask for a pony and you counteroffer with a donkey. And the charts often offer them a pony instead of a donkey. Doing the denial, anger and bargaining in a room full of people becomes toxic over time. It’s an own goal, except you’re bouncing it off the other team’s head first. Don’t do that.
I like the idea of having an actual 'carefulness knob' prop and making the manager asking for faster delivery/more checks actually turn the knob themselves, to emphasise that they're the one responsible for the decision.
It’s not the right approach. Structural engineers shouldn’t let management fiddle with their safety standards to increase speed. They will still blame you when things fail. In software, you can’t just throw in yolo projects with much lower “carefulness” than the rest of the product; everything has maintenance. The TL in this case needs to establish a certain set of standards and practices. That’s not a choice you give away to another team on a per-feature basis.<p>It’s also a ridiculously low bar for <i>engineering</i> managers to not even understand the most fundamental of tradeoffs in software. Of course they want things done faster, but then they can go escalate to the common boss/director and argue about <i>prioritization</i> against other things on the agenda. Not just “work faster”. Then they can go manage those whose work output is proportional to stress, not programmers.
This is not how I'd expect any professional engineering team to do risk mitigation.<p>This is not hard.<p>Enumerate risks. List them. Talk about them.<p>If you want to turn this into something prioritisable, quantify each one: on a scale of 1 to 10, what's the likelihood? On a scale of 1 to 10, what's the impact? Multiply the numbers. Communicate these numbers and see if others agree with your assessment. As a team, if the product is more than 15, spend some time thinking about mitigation work you can do to reduce either likelihood or impact or both. The higher the number, the more important it is to put mitigations into your backlog or "definition of done". Below 15? Confirm with the team that you're going to ignore this.<p>Mitigations are extra work. They add time. They slow down delivery. That's fine: you add them to your backlog as dependent tasks, and your completion estimates move out. Need to hit a deadline? Look at descoping, and include in that descoping a conversation about removing some of the risk mitigations and accepting the risk likelihood and impact.<p>Having been EM, TL and X in this story (and the TPM, PM, CTO and other roles), I don't want a "knob" that people are turning in their heads about their subjective measure of "careful".<p>I want enumerated risks with quantified impact and likelihood and adult conversations about appropriate mitigations that lead to clear decisions.
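A minimal sketch of that scoring scheme, in Python (the 1-10 scales and the 15 threshold come from the comment above; the dataclass, the example risks, and the wording of the actions are purely illustrative):

```python
from dataclasses import dataclass

MITIGATION_THRESHOLD = 15  # score above this: plan mitigation work

@dataclass
class Risk:
    name: str
    likelihood: int  # 1-10, agreed as a team
    impact: int      # 1-10, agreed as a team

    @property
    def score(self) -> int:
        return self.likelihood * self.impact

def triage(risks: list[Risk]) -> None:
    # Highest scores first, so mitigation effort goes where it matters most.
    for risk in sorted(risks, key=lambda r: r.score, reverse=True):
        if risk.score > MITIGATION_THRESHOLD:
            print(f"{risk.score:>3} {risk.name}: add mitigations to the backlog / definition of done")
        else:
            print(f"{risk.score:>3} {risk.name}: confirm with the team that we accept this risk")

# Hypothetical example risks:
triage([
    Risk("schema migration locks the users table", likelihood=4, impact=8),
    Risk("third-party API rate limits during launch", likelihood=6, impact=2),
])
```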
Fwiw, in a real-world scenario it'd be more helpful to hear "the timeline has risks" alongside a statement of a concrete process you might not be doing given that timeline. Everyone already knows about diminishing returns; we don't need a lesson on that.
A personal anecdote:<p>One of my guys made a mistake while deploying some config changes to Production and caused a short outage for a Client.<p>There's a post-incident meeting and the client asks "what are we going to do to prevent this from happening in the future?" - probably wanting to tick some meeting boxes.<p>My response: "Nothing. We're not going to do anything."<p>The entire room (incl. my side) looks at me. What do I mean, "Nothing?!?".<p>I said something like "Look, people make mistakes. This is the first time that this kind of mistake has happened. I could tell people to double-check everything, but then everything will be done twice as slowly. Inventing new policies based on a one-off like this feels like an overreaction to me. For now I'd prefer to close this one as human error - wontfix. If we see a pattern of mistakes being made then we can talk about taking steps to prevent them."<p>In the end they conceded that yeah, the outage wasn't so bad and what I said made sense. Felt a bit proud for pushing back :)
This feels like a really good starting point to me, but I just want to point out that there's a very low ceiling on the effectiveness of "carefulness". I can spend 8 hours scrutinizing code looking for problems, or I can spend 1 hour writing some tests for it. I can spend 30 minutes per PR checking it for style issues, or I can spend 2 hours adding a linter step to CI.<p>The key here is automating your "carefulness" processes. This is how you push that effectiveness curve to the right. And, the corollary here is that a lack of IC carefulness is not to blame when things break. It is always, always, always process.<p>And to reemphasize the main point of TFA, things breaking is often <i>totally fine</i>. The optimal position on the curve is almost never "things never break". The gulf between "things never break" and "things only break .0001% of the time" is a gulf of gazillions of dollars, if you can even find engineers motivated enough and processes effective enough to get you anywhere close to there. This is what SLAs are for: don't give your stakeholders the false impression that everything will always work because you're the smartest and most dedicated amongst all your competitors. All I want is an SLA and a compensation policy. That's professional; that's engineering.
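For a sense of scale on that gulf, some back-of-the-envelope arithmetic (the specific availability targets below are illustrative, not from the comment):

```python
# Allowed downtime per year at a few availability targets.
# Each extra "nine" cuts the error budget by 10x, which is roughly where
# the gazillions of dollars go.
HOURS_PER_YEAR = 365 * 24

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability * 100:g}% uptime -> {downtime_hours:.2f} hours of downtime per year")
```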
<i>LT is a member of the leadership team.</i><p>LT: Get it done quick, and don't break anything either, or else we're all out of a job.<p>EM: Got it, yes sir, good idea!<p>[EM surreptitiously turns the 'panic' dial to 10, which reduces a corresponding 'illusion of agency' dial down to 'normal']
In general, I've found that when I've told people to be careful on that code path (because it has bitten me before) I don't get the sense that it is a welcome warning.<p>It's almost as if I'm questioning their skill as an engineer.<p>I don't know about you, but when I'm driving a road and there is black ice around the corner, a warning from a fellow driver is welcome.
I did a lot of the work in my 40-year software career as an individual, which meant it was on me to estimate the time of the task. My first estimate was almost always an "If nothing goes wrong" estimate. I would attempt to make a more accurate estimate by asking myself "is there a 50% chance I could finish early?". I considered that a 'true' estimate, and could rarely bring myself to offer that estimate 'up the chain' (I'm a wimp ...). When I hear "it's going to be tight for Q2", in the contexts I worked in, that meant "there's no hope". None of this invalidates the notion of a carefulness knob, but I do kinda laugh at the tenor of the imagined conversations that attribute a lot more accuracy to the original estimate than I ever found in reality in my career. Retired 5 years now; maybe some magic has happened while I wasn't looking.
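A toy illustration of why the "if nothing goes wrong" figure and the 50/50 estimate diverge (the lognormal distribution and its parameters here are entirely assumptions for illustration, not the commenter's data):

```python
import random

# Assume (for illustration only) task durations follow a right-skewed
# lognormal distribution: the "if nothing goes wrong" figure sits near the
# optimistic tail, while the honest 50/50 estimate is the median.
random.seed(0)
samples = sorted(random.lognormvariate(2.0, 0.6) for _ in range(100_000))

best_case = samples[int(0.05 * len(samples))]  # ~5th percentile
median = samples[len(samples) // 2]            # 50% chance of finishing early

print(f"best case: ~{best_case:.1f} days, 50/50 estimate: ~{median:.1f} days")
```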
> TL: If we did that, we’d just be YOLO’ing our changes, not doing validation. Which means we’d increase the probability of incidents significantly, which end up taking a lot of time to deal with. I don’t think we’d actually end up delivering any faster if we chose to be less careful than we normally are.<p>This is a really critical property that doesn't get highlighted nearly often enough, and I'm glad to see it reinforced here. Slow is smooth, smooth is fast. And <i>predictable</i>.
Lorin is always on point, and I appreciate the academic backing he brings to the subject. But for how many years do we need to tell MBAs that "running with scissors is bad" before it becomes common knowledge? (Too damn many.)
The dominant model in project management is "divide a project into a set of tasks and analyze the tasks independently". You'd imagine you could estimate the work requirement for a big project by estimating the tasks and adding them up, but you run into various problems.<p>Some tasks are hard to estimate because they have an element of experimentation or research. Here a working model is the "run-break-fix" model, where you expect to require an unknown number of attempts to solve the problem. In that case there are two variables you can control: (1) be able to solve the problem in fewer tries, and (2) take less time per try.<p>The RBF model points out various problems with carelessness as an ideology. First of all, being careless can cause you to require more tries. Being careless can cause you to ship something that doesn't work. Secondly, and more importantly, the royal road to (2) is automation and the realization that <i>slow development tools cause slow development</i>.<p>That is, careless people don't care if they have a 20-minute build. It's a very fast way to make your project super-late.<p>I worked at a place that organized a 'Hackathon' where we were supposed to implement something with our project in two hours. I told them, "that's alright, but it takes 20 minutes for us to build our system, so if we are maximally efficient we get 6 tries at this". The eng manager says "it doesn't take 20 minutes to build!" (he also says we "write unit tests" and we don't, he says we "handle errors with Either in Scala" which we usually don't, and says "we do code reviews", which I don't believe). I set my stopwatch; it takes 18 minutes. (It is creating numerous Docker images for various parts of the system that all need to get booted up.)<p>That organization was struggling with challenging requirements from multiple blue chip customers -- it's not quite true that turning that 20-minute build into a 2-minute build will accelerate development 10x, but putting some care in this area should pay for itself.<p>[1] <a href="https://www.amazon.com/Have-Fun-at-Work-Livingston/dp/0937063053" rel="nofollow">https://www.amazon.com/Have-Fun-at-Work-Livingston/dp/093706...</a>
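The arithmetic behind that, as a tiny sketch (the 2-hour timebox and the 20-minute build are from the comment; the 2-minute build is the hypothetical improvement mentioned at the end):

```python
# How many run-break-fix attempts fit in a fixed timebox when every
# attempt has to pay for a full build?
def attempts(timebox_minutes: int, build_minutes: int) -> int:
    return timebox_minutes // build_minutes

print(attempts(120, 20))  # 6 tries in the two-hour hackathon, as estimated
print(attempts(120, 2))   # 60 tries if the build took 2 minutes instead
```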
I like the idea of imagining that we can arbitrarily adjust the carefulness knob, but I don't think it works like that in reality. You can certainly spend more time writing tests, but a lot of the unforeseen problems I've hit over the years weren't caused by lack of testing--they were caused by unknown things that we couldn't have known regardless of how careful we were. It doesn't make for a very satisfying post mortem.
In real life you both can't afford the reputation risk and need to ship anyway. If you have an incident, guess who's going to be liable – the manager or the name on the commit?<p>Stop negotiating quality; negotiate scope and set a realistic timeline. Shipping a lot of crap faster is actually slower. 99% of the companies out there can't focus on doing _one_ thing _well_; that's how you beat the odds.
All I see here is an all-too-common organizational issue: that something like this has to be explained to someone in a management role at all. They should know these things. And they should know them well.<p>If your company needs to have conversations like this more than rarely—let alone experiences the actual issue being discussed—then that's a fundamental problem with leadership.
I am probably missing an essential point here, but my first reaction was "this is literally the quality part of the classic scope/cost/time trade-off triangle?"<p>Has that become forgotten lore? (It might well be. It's old, and our profession doesn't do well with knowledge transmission.)
> I mean, in some sense, isn’t every incident in some sense a misjudgment of risk? How many times do we really say, “Hoo boy, this thing I’m doing is really risky, we’re probably going to have an incident!” Not many.<p>Yeah, sure, that never happens. That's why "I told you so" is not at all a common phrase amongst folks working on reliability-related topics ;)