> If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.<p>It's funny that the engineering at Google is so great at recognizing things, while the product/"human" teams (like whoever came up with the account reviews and other parts) seem to suck so much.<p>If YouTube applied the same view of what should/shouldn't be automated, they could solve the problem of people's YouTube channels being locked in front of them, even if they don't break any ToS's.
You should probably not automate <i>all</i> toil. You should only automate toil when the toil cost/effect is more burdensome than the cost of automating it. All automation has a cost, and may or may not create value. Automation should have a positive and timely return on investment. If the ROI is 10 years down the road, you probably shouldn't automate it (yet). If there is a cheaper way to deal with the toil, explore that avenue first.<p>Several times in my career I worked on projects to reduce toil. Sometimes the project would fail because the time it took to work on it went well past the cost saving estimation. Sometimes they would be completed, but the value created was far less than their cost. And sometimes automation wasn't even the solution, and we just needed to change our process or system, or do some other manual thing that reduced the toil cost. Sometimes we chose to automate toil because we were afraid to take on a larger project we knew would make the toil unnecessary, so we paid for the automation and then later for rebuilding everything. Or toil was used as an excuse to justify a project that didn't really have to do with toil.<p>One of my biggest mistakes as an engineer was making assumptions about my work that ended up creating more waste than value. Talk to an outsider about your plans and why you're pursuing them, and take their advice seriously. And if your automation is optional, make sure you have buy-in before you start working on it; I've sunk months on things that nobody ended up using.<p>A great way to automate toil is incrementally. Typically you have a runbook with step-by-step instructions, and over time you automate one step, then another, etc. The investment is minimal and gradual, it can change over time, and you can target the costliest parts of the toil, optimizing value.
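A rough sketch of the break-even reasoning in the comment above, not anyone's actual method. The function name, the parameters, and the assumption that costs can be measured in engineer-hours are all hypothetical, chosen only for illustration.

```python
def break_even_months(build_hours, upkeep_hours_per_month, toil_hours_per_month):
    """Months until automation pays for itself, or None if it never does.

    All three inputs are illustrative estimates in engineer-hours:
    build_hours            -- one-off cost of writing the automation
    upkeep_hours_per_month -- ongoing cost of maintaining the automation
    toil_hours_per_month   -- manual effort the automation would replace
    """
    net_saving = toil_hours_per_month - upkeep_hours_per_month
    if net_saving <= 0:
        # The automation costs at least as much to maintain as the toil it removes.
        return None
    return build_hours / net_saving


# 120 hours to build, 2 hours/month upkeep, replacing 12 hours/month of toil.
print(break_even_months(120, 2, 12))  # 12.0

# Upkeep exceeds the toil it replaces, so there is no payback.
print(break_even_months(40, 5, 4))    # None
```

If the result is many years out, or None, that is the "don't automate it (yet)" case; automating one runbook step at a time keeps `build_hours` small for each decision.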
> Among the many reasons why too much toil is bad<p>They missed the big one: human error is a common point of failure. Some of the big outages on GCP were due to ops configuration changes. GitLab wiped their prod DB one time. Knight Capital suffered death by config error, etc.
Most of my work for SRE was the opposite; I did things manually because the automated systems were guaranteed to mess up some fraction of things. At some point my managers wanted me to automate a hardware management process. I checked, and it would take 6 months to deploy the code to prod. Instead, I identified all the broken machines and filed tickets manually, getting things fixed far more quickly without a high rate of false positives and churn (Google's hardware repair system churns a lot).<p>Many of the automated systems at Google were developed by geniuses. Others, not so much, and they ended up making a lot of work for other people.
While I'm not against eliminating toil, this article does not seem to consider the negative aspects of automation, such as the deskilling that happens naturally.<p>"This plant basically runs itself, but we do have a human present for if something goes wrong"... 50 years down the line, something goes wrong and nobody has the kind of insight and familiarity with the system that they'd have had if it had been manually operated.
In some organizations, including mine, toil is sometimes "reduced" by saying "not my problem" and pushing it to other teams. It sucks to be on the receiving end of it.
It naturally applies not only to SRE. Toil is a great equalizer: if 80% of, say, development work is basically toil, then a 10x developer is not really that distinguishable from, nor much more useful than, a 1x one (Amdahl's law, so to speak :). The amount of toil (and not, say, failed projects etc.) seems to be one of the main factors separating the companies with revenues of $2M/year/head like Google from the ones with a mere $300K/year/head, and one of the best things a mid/low-performing company can do is to reduce toil, though in practice any such attempt usually means something like MBA-style "efficiency improvement" measures and processes which add even more toil.
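To put a number on the Amdahl's law remark above, here is a minimal sketch. It assumes, purely for illustration, that toil takes the same time no matter how skilled the developer is, and that skill speeds up only the non-toil share of the work; the function and parameter names are made up.

```python
def overall_speedup(toil_fraction, skill_multiplier):
    """Amdahl-style speedup when only the non-toil share of work gets faster.

    toil_fraction    -- share of total work that is toil (0.0 to 1.0), assumed
                        to take the same time regardless of skill
    skill_multiplier -- how much faster the developer is on the non-toil share
    """
    non_toil_fraction = 1.0 - toil_fraction
    return 1.0 / (toil_fraction + non_toil_fraction / skill_multiplier)


# A "10x" developer whose work is 80% toil: 1 / (0.8 + 0.2 / 10) = 1 / 0.82
print(round(overall_speedup(0.8, 10), 2))  # 1.22

# The same developer with toil cut to 20%: 1 / (0.2 + 0.8 / 10) = 1 / 0.28
print(round(overall_speedup(0.2, 10), 2))  # 3.57
```

Under these assumptions the 10x developer ends up roughly 1.2x as productive overall when 80% of the job is toil, which is the equalizing effect the comment describes.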
I'm so glad that toil has been automated.<p>Now all I need to do is learn this new Domain Specific Language and find all of the exact configuration parameters to express my specific needs. Oh except this tool has leaky abstractions under it, and those tools also have their own DSLs and configuration parameters. And the tools under those do, too. It's all turtles, all the way down.
Good read, good reasoning.<p>Just a bit sad that <i>someone</i> at Google seems to have read this and focused on the "Automatable" part going "but that includes basically everything we do!"<p>cf youtube/contentId, cf account blocking, cf customer "support", ...
This reads like a positive framing of Jacques Ellul's critique of technique:<p>> The characteristics of the technical phenomenon are Autonomy, Unity, Universality, Totalization. Technique obeys a specific rationality. The characteristics of technical progress are self-augmentation, automization, absence of limits, casual progression, a tendency toward acceleration, disparity, and ambivalence. [1]<p>Supposing the harm Google does (e.g. ambivalence towards individuals harmed by algorithms) is a direct result of this totalizing impulse, maybe it's time to question some of the fundamental assumptions present within.<p>1. <a href="https://ellul.org/themes/ellul-and-technique/" rel="nofollow">https://ellul.org/themes/ellul-and-technique/</a>
This article was nice. I wonder if it can be generalized to careers in general?<p>Long, satisfying careers often involve a proactive, design-oriented approach rather than a purely reactive one.<p>The only way to make grunt work an entire career would be if you’re constantly doing something for the first or second time, e.g. artists, novelists.<p>Even scientists can initially discover something significant, but if they keep repeating the work on the same topic without more depth or breadth, the work will become toil.
Knew I recognized some of this writing before. This book is quoted in an annual letter[0] from Zack Kanter which is also worth a read:<p>> <i>Eliminating toil allows people to focus on the inherent complexity of the difficult, interesting problems at hand, rather than the incidental complexity caused by choices made along the way.</i><p>> <i>Toil can be eliminated...by drawing the system boundary a bit differently. When we use an external service instead of an external library, we’re moving the code outside of our system – thereby outsourcing the entropy-fighting toil to some third party. Not our entropy, not our problem</i><p>[0] <a href="https://www.stedi.com/blog/excerpts-from-the-annual-letter" rel="nofollow">https://www.stedi.com/blog/excerpts-from-the-annual-letter</a>
Engineers automating themselves. This is why we should be kinda scared of software innovation stagnating. If we don't work on innovation we don't really have a purpose and a job.
Coming from a software engineering perspective there is a certain amount of toil which is impossible to automate away. CI break-fix issues often depend on the surface area of your software as it interfaces with third parties, including the CI system itself. In some cases that surface area can be large and break-fix takes up a considerable amount of time, but that toil is not _repetitive_ and is _necessary_ table stakes based on the system.<p>And this is after having someone who is extremely aggressive with automation and empowered to do whatever they like to reduce that surface area working on the system. I've taken codebases and hacked out 60% of the lines of code in order to remove brittle external surface area along with unnecessary requirements and contain the project better within its own boundaries and stop repetitive issues. I've taken clever ideas that someone had 5+ years ago out behind the barn and shot them in order to reduce total surface area.<p>But people can walk into an area with a lot of toil going on and go "oh, I know all the strategies on how to reduce this, I will explain to these people who clearly aren't as clever as me how to do it" without realizing that there's often a minimum level of toil for a project which you can't effectively reduce. There's a nonzero vacuum expectation value of toil in any project, and in some cases it can be quite large. Inherently.<p>I don't know how many managers I went through who would come and decide to document all the different failures we were having and spreadsheet them and look for the patterns to address them. And every week there would be 2-3 that would come up and they'd struggle with the fact that there was really no pattern, other than that the project inherently touched many different third parties, because it really HAD to, and that those third parties would change, which would then force interrupt driven toil.<p>There's some point where you just have to hire more people and spread it out. 
There's no magical incantation to manage your way out of additional headcount.<p>And I don't think the OP article even touched on re-engineering to reduce surface area and brittleness. Automation isn't the only answer to toil. You can automate restarting a service if it crashes, but it's always better to just fix the bug (which may involve fixing architectural issues) and make it stop crashing in the first place.
Consider the amount of time needed for a process requiring a number of code reviews, plus approvals for code style and architecture.<p>Eliminating toil costs a lot of time from every engineer.