I don't think the whole idea of (<i>reliably</i>) deploying and rolling back without downtime gets nearly enough meme-worthy attention on HN. It's quite complicated and depends entirely on a number of variables (specifically, how you do <i>everything</i>). I once wrote an internal paper, probably 30 pages long, just to explain why we couldn't do automatic rollbacks.<p>The most important parts of such a system (the ones mentioned in this post, anyway) don't get nearly enough explanation:<p>- "centrally driven migrations": In any distributed service architecture, there are always too many interdependent pieces. You can't reliably touch thing A without also touching things B, C, D, etc. If you want any chance of automating this or responding to failure without downtime, you must have a system which is aware of the changing state of everything and can change all the parts at a whim.<p>- "database migrations": This is again very complicated and depends on how your code and database are architected. You literally can't do migrations if your code and schema aren't set up right, or if you don't make the right kind of changes. How do you do this? Time to write a book...<p>- "wrap the old library": I can't remember what this is called, but it has a name. Anyway, the idea is that hiding any change behind what is effectively a feature flag wrapper lets you deploy the change without it being enabled, use the feature flag to test the change in production (on only one request, on a percentage of requests, on one whole node/pod, etc.), and then eventually delete the old code. This isn't just for features; you can replace entire interfaces, software stacks, or whole systems this way, either piecemeal or entirely. Very powerful, but again, it requires a specific approach not only in implementation but in use.<p>- "use automated rollback checks": What kind of checks? Checking what? In what way? At what time/stage? What happens when one fails? Do you do them in series or parallel? 
<i>Can you</i> do them in series or parallel? Etc.<p>- "deploy least critical services first": With enough interdependent services, you're going to hit cases where you <i>have to</i> upgrade parts B and C effectively simultaneously before you can upgrade A, etc. So for "no downtime", it will take a lot of coordination, and very explicit linkage and checking of specific new services, etc. There are ways to do this, but they're specific to your implementation and services. This is another example of how you have to know exactly what's going on, and then set up the deployment to account for your specific dependency tree and how the services in it react when they're run.<p>So many people I've run into don't think about any of these things. They literally say things like "automated rollbacks are easy, we did it at XYZ place", as if none of the above matters at all. They stick their heads in the sand because they <i>want to believe</i> that it should be easy. But any engineer worth their salt will tell you that doing it correctly and reliably is bloody complicated.
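<p>To make the "wrap the old library" point a bit more concrete, here's a rough sketch of the feature flag wrapper idea in Python. Everything in it is made up for illustration: <i>FlaggedCall</i>, <i>in_rollout</i>, <i>render_old</i> and <i>render_new</i> are hypothetical names, and it assumes every request carries some stable ID you can hash. A real system would read the percentage from a central config service rather than an attribute, so treat this as one possible shape, not the way to do it.

```python
import hashlib
from typing import Any, Callable


def in_rollout(request_id: str, percent: int) -> bool:
    """Deterministically decide whether a request falls inside the rollout.

    The request ID is hashed into one of 100 buckets, so the same ID always
    gets the same answer for a given percentage.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


class FlaggedCall:
    """Hypothetical wrapper that routes calls to the old or new implementation."""

    def __init__(
        self,
        old_impl: Callable[..., Any],
        new_impl: Callable[..., Any],
        percent: int = 0,
    ) -> None:
        self.old_impl = old_impl
        self.new_impl = new_impl
        # 0 means the new code is deployed but never called.
        self.percent = percent

    def __call__(self, request_id: str, *args: Any, **kwargs: Any) -> Any:
        use_new = in_rollout(request_id, self.percent)
        impl = self.new_impl if use_new else self.old_impl
        return impl(*args, **kwargs)


# Hypothetical old and new code paths standing in for "the old library"
# and its replacement.
def render_old(name: str) -> str:
    return "Hello, " + name


def render_new(name: str) -> str:
    return f"Hello, {name}!"


render = FlaggedCall(render_old, render_new, percent=0)
render("req-1", "world")   # always the old path while percent is 0
render.percent = 10        # send roughly 10% of request IDs to the new path
render.percent = 0         # "rollback" is a flag change, not a redeploy
```

Hashing the request ID instead of calling a random number generator means a given request lands on the same side every time, which makes failures reproducible. Note what this sketch does not cover: who is allowed to change the percentage, how the change propagates to every node, and what happens to state the new path already wrote. Those are exactly the parts that depend on how you do everything else.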