Many of the problems with staging environments are preserved if you remove staging environments:<p>How do I test with realistic data but <i>never</i> risk my customers' data in the process? Either way I'm going to make a copy of real user data. In one case it's much harder to have shared references to live data than in the other.<p>There are lots of failure modes for rehearsals that can take out the cluster. If you're doing it live, how do you know you don't have a request loop, or haven't just introduced an n+1 fanout in a previously well-behaved service? One that can take out said service and parts of its dependency graph? What if I write a query that crashes processes?<p>On the other hand, it's 'common sense' for staging to use similar hardware to production so that discrepancies don't catch you by surprise. But common sense isn't common, and people can rationalize the hell out of not spending money on precautions, so staging and production are never equivalent if they are partitioned. That also makes things like perf analysis very low fidelity.<p>More likely nobody is right, everything is a tradeoff (which sucks), and the real question is which pain points your organization can collectively stomach.