My team's systems play a critical role for several $100M of sales per day, such that if our systems go down for long enough, these sales will be lost. Long enough means at least several hours and in this time frame we can get things back to a good state, often without much external impact.<p>We too have manual processes in place, but for any manual process we document the rollback steps (before starting) and monitor the deployment. We also separate deployment of code with deployment of features (which is done gradually behind feature flags). We insist that any new features (or modification of code) requires a new feature flag; while this is painful and slow, it has helped us avoid risky situations and panic and alleviated our ops and on-call burden considerably.<p>For something to go horribly wrong, it would have to fail many "filters" of defects: 1. code review--accidentally introducing a behavioral change without a feature flag (this can happen, e.g. updating dependencies), 2. manual and devo testing (which is hit or miss), 3. something in our deployment fails (luckily this is mostly automated, though as with all distributed systems there are edge cases), 4. Rollback fails or is done incorrectly 5. Missing monitoring to alert us that issue still hasn't been fixed. 5. Fail to escalate the issue in time to higher-levels. 6. Enough time passes that we miss out on ability to meet our SLA, etc.<p>For any riskier manual changes we can also require two people to make the change (one points out what's being changed over a video call, the other verifies).<p>If you're dealing with a system where your SLA is in minutes, and changes are irreversible, you need to know how to practically monitor and rollback within minutes, and if you're doing something new and manually, you need to quadruple check everything and have someone else watching you make the change, or its only a matter of time before enough things go wrong in a row and you can't fix it. It doesn't matter how good or smart you are, mistakes will always happen when people have to manually make or initiate a change, and that chance of making mistakes needs to be built into your change management process.