Why is DevOps so hard at other companies yet so easy at Netflix? I mean, the entire monitoring stack of Netflix was automated by one person and managed by four. Two people at Netflix used to manage their entire messaging pipeline, including Suro, Kafka, ZooKeeper, and some Druid stuff. Their Hadoop ecosystem used to have a handful of people, and their end-to-end ingestion pipeline, which aggregates and demuxes data into Hive with a maximum delay of 15 minutes, was written and maintained by a single person. Their key manager, well before Amazon KMS existed, was implemented and maintained by a single person. Their engineers in the cloud platform team were on call 24x7, yet they only occasionally got paged. Oh, their cloud platform team had fewer than 20 people too. Their entire Cassandra team used to have fewer than 20 people. They predictively autoscaled their clusters with a system created and maintained by merely 2 or 3 people. According to their engineers, they did a few things on top of the excellent foundation of AWS, and they <i>just did them without any fuss</i>:<p>- API-based full transparency. If there's a function, there is an API. Anything you can do is explicit in the API.<p>- Powerful monitoring support, so "instrumentation to death" is not just a slogan.<p>- Decentralized control. For instance, each team has the freedom to decide how to load balance their traffic and how to handle unresponsive nodes.<p>- Assuming everything can fail and building mechanisms to account for that assumption. Chaos engineering. Autoscaling from the get-go (again, thanks to Amazon for such a powerful feature from day one in EC2).<p>- A shared culture: don't tell me to learn your shit. So, no friction from handling shit like Puppet, like Chef, like Terraform, like HCL, like whatever YAML or DSL that folks on HN passionately advocate. 
Like learning how to embed a Jinja template in a job description with 5000 lines of Ruby code plus some system-specific half-assed DSL just to update an environment variable? Never gonna happen. Don't get me wrong: they are probably nice technologies. They are just too damn low-level and irrelevant to most engineers.<p>- Instant gratification. If I make a config change, I want to see it in production in seconds, safely. If I make a change in my code? The change will deploy with all the guardrails in seconds. So, a Puppet change that takes on average 15 minutes to materialize? That's just garbage.<p>- No surprises. So, a Chef script goes behind my back to update my OS and screws up my production service? Never happens. Immutable infrastructure was implemented from day one.<p>So the question is, why are those things hard at other companies? Do engineers enjoy getting paged at 2:00am? Do they enjoy spending at least 1/3 of their time handling so-called operations? Or do they enjoy writing rants like "DevOps is a failure"?