This to me shows Google hasn't gotten in place sufficient monitoring to know the <i>scale</i> of problems and the correct scale of response.<p>For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, perhaps taking 15 minutes. (On top of diagnosis and response times)<p>Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.<p>Obviously the latter is a really large load on all surrounding infrastructure, so needs to be tested properly. But doing so can reduce a 25 minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback)