<i>Outages are primarily proportional to change.</i><p>Problem one: Increasing scale<p>If a company is growing, increasing scale forces change. It forces change to core systems, like upsharding database clusters. It pushes the limits of various systems in ways that require architecture change. If a company is not growing, a major motivator of change disappears, and with it a whole class of outages that would otherwise have happened.<p>Problem two: Adding features<p>If a company stops adding new features, new code doesn't really need to be pushed all that often. Bad code pushes are <i>by far</i> the number one cause of outages, although these outages generally don't have the blast radius that architecture changes do.<p>Problem three: Rot/maintenance/upkeep<p>Now we get to the crux of the issue, which is something on the order of 3 machine failures per 1000 machines per day (my empirical estimate based on experience). Hard drives fail, circuits fry, network interfaces become finicky, disks fill up. A good portion of this can be resolved via blind auto-remediation. There's a problem on a machine? Wipe it clean and reconfigure it for its task. Assuming there are functioning auto-remediation systems and no SPOFs, and that database systems can handle master failures and the like, the most major "people need to handle this" problem that remains is hardware failure. There must be someone actively procuring new hardware and replacing old hardware.<p>Systems can run at up to 70% of peak capacity, so that's likely on the order of 100 days of unaddressed machine rot before consequences are seen, depending on how capacity is allocated.<p>Problem four: Context change<p>While most change is done by the company itself, the company exists in a certain context. Governments can come down on companies via regulation like GDPR, which will definitely require the company to make changes. Security problems can require major or minor changes to be made.
When the context a company exists within changes, the company must adapt, and these forced changes can result in outages. Depending on the change, the level of expertise of the remaining employees would likely dictate the severity of any resulting outage.<p>So, attempting a concrete estimate, I would guess something on the order of months, maybe 3-6 months, with the caveat of good auto-remediation and no SPOFs.
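<p>The rot arithmetic behind the "~100 days" figure above can be sketched as a quick back-of-envelope calculation. The numbers are the ones from the text (3 failures per 1000 machines per day, 70% of peak capacity); the function name and the linear, no-replacement failure model are my own illustrative assumptions:

```python
# Back-of-envelope for the "~100 days of unaddressed rot" estimate.
# Assumptions (mine, not from the text): failures accumulate linearly
# and no failed machine is ever replaced.

def days_of_headroom(fleet_size: int, failures_per_day: float,
                     utilization: float) -> float:
    """Days until dead machines consume the fleet's spare capacity."""
    spare_machines = fleet_size * (1.0 - utilization)  # e.g. 300 of 1000
    return spare_machines / failures_per_day

# 1000 machines, ~3 failures/day, running at 70% of peak:
print(days_of_headroom(1000, 3, 0.70))  # 100.0
```

As the text notes, the real figure depends on how capacity is actually allocated; correlated failures or uneven load would shorten that window considerably.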