> We only merge code that is ready to go live

Cool story, but you don't _know_ if it's ready until after.

Look, staging environments are not great, for the reasons described. But just killing staging and having done with it isn't the answer either. You need to _know_ when your service is fucked or not performing correctly.

The only way this kind of deployment is practical _at scale_ is to have comprehensive end-to-end testing constantly running against prod. That was the only real way we could be sure our service was working within acceptable parameters: we replayed captured real-life queries constantly, in a random order, at random times (caching can give you a false sense of security; go on, ask me how I know). There's a rough sketch of that loop at the end of this comment.

At no point is monitoring strategy discussed.

Unless you know how your service is supposed to behave, and you can describe that state using metrics, your system isn't monitored. Logging is too shit, too slow and too expensive to give you meaningful near-realtime results. Some companies spend billions taming logs into metrics. Don't do that; make metrics first (sketch below).

> You’ll reduce cost and complexity in your infrastructure

I mean possibly, but you'll need to spend a lot more on making sure your backups work. I've had a rule for a while that no instance in prod may be older than a month (sketch of the age check below). That means you must be able to rebuild _from scratch_ all instances _and datastores_. Instances are trivial to rebuild; databases should be too, but often aren't. If you're going to fuck around and find out in prod, then you need good, well-practised recovery procedures.

> If we ever have an issue in production, we always roll forward.

I mean, that's cute and all, but not being able to back out means you're fucked. You might not think you're fucked, but that's only because you haven't been fucked yet.

It's like the old adage: there are two kinds of sysadmin, those who are about to have data loss and those who have had data loss.
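
To make the replay idea concrete, here's a minimal sketch. Everything specific in it is my assumption, not from the article: the endpoint, the capture file format (one JSON object per line with a path and an expected status), the latency budget, and the alert hook. The shape is what matters: shuffle the captured queries, sleep a random amount between them so a warm cache can't flatter you, and flag anything outside the envelope.

```python
import json
import random
import time

import requests  # third-party: pip install requests

ENDPOINT = "https://prod.example.com"  # assumed prod base URL
LATENCY_BUDGET_S = 0.5                 # assumed per-request budget


def load_captured_queries(path="captured_queries.jsonl"):
    """One JSON object per line, e.g. {"path": "/search?q=x", "expected_status": 200}."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]


def alert(query, status, elapsed):
    # Placeholder: wire this into your paging / metrics system.
    print(f"BAD: {query['path']} -> {status} in {elapsed:.2f}s")


def replay_forever(queries):
    while True:
        random.shuffle(queries)                # random order on every pass
        for q in queries:
            time.sleep(random.uniform(1, 30))  # random spacing defeats warm caches
            started = time.monotonic()
            resp = requests.get(ENDPOINT + q["path"], timeout=10)
            elapsed = time.monotonic() - started
            if resp.status_code != q["expected_status"] or elapsed > LATENCY_BUDGET_S:
                alert(q, resp.status_code, elapsed)


if __name__ == "__main__":
    replay_forever(load_captured_queries())
```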
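
On "make metrics first": a sketch of what that looks like in code, assuming Prometheus-style instrumentation via the prometheus_client library. The metric names and the fake request handler are mine; the point is the service publishes its own counts and latencies at the source, instead of someone mining them out of logs afterwards.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server  # pip install prometheus-client

# Assumed metric names -- pick ones that describe how *your* service should behave.
REQUESTS = Counter("myservice_requests_total", "Requests handled", ["outcome"])
LATENCY = Histogram("myservice_request_seconds", "Request latency in seconds")


def handle_request():
    with LATENCY.time():                       # latency recorded where the work happens
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        ok = random.random() > 0.05            # stand-in for success/failure
    REQUESTS.labels(outcome="ok" if ok else "error").inc()


if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for the scraper to pull
    while True:
        handle_request()
```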
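
And the "nothing in prod older than a month" rule is easy to enforce mechanically. A sketch, assuming AWS, boto3 and an env=prod tag (all assumptions; swap in whatever marks your fleet):

```python
from datetime import datetime, timedelta, timezone

import boto3  # pip install boto3; needs AWS credentials configured

MAX_AGE = timedelta(days=30)


def stale_instances():
    """Return (instance_id, launch_time) for prod instances older than MAX_AGE."""
    ec2 = boto3.client("ec2")
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    stale = []
    paginator = ec2.get_paginator("describe_instances")
    # Assumed tagging scheme: prod instances carry the tag env=prod.
    for page in paginator.paginate(Filters=[{"Name": "tag:env", "Values": ["prod"]}]):
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                if inst["LaunchTime"] < cutoff:
                    stale.append((inst["InstanceId"], inst["LaunchTime"]))
    return stale


if __name__ == "__main__":
    for instance_id, launched in stale_instances():
        print(f"{instance_id} launched {launched:%Y-%m-%d} -- rebuild it")
```

If that ever prints anything, your rebuild-from-scratch story is already rotting.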