I learned this in a circuits class I took back in college, around 1984. Specifically, it was about amplifier circuits and op-amp design. Such designs are a puzzle of tradeoffs, and the teacher emphasized that often the "optimal" design is an inferior design in light of real-world constraints.<p>The globally optimal point on whatever thing we were optimizing might indeed be the highest peak of the graph, but if it is a sharp peak, any deviation in the voltage, temperature, or the real-world values of the components would put the operating point far down the slope of that peak.<p>It was much better to find a reasonable operating point that had low sensitivity to voltage/temperature/component values but still had acceptable behavior (gain, noise, whatever was important).<p>The surprising thing I learned from that class is that even though the resistor and capacitor values and the gain of individual transistors in IC op-amps are an order of magnitude worse than in discrete designs, the matching of those terrible components is an order of magnitude better than for discrete components. Designers came up with many clever ways to take advantage of that to wring terrific performance from terrible components.<p>For example, the nominal value of a given resistor in the design might be 4K ohms; in the discrete design it might be 2%, 1%, or 0.5% off (the ones with tighter tolerance get ever more expensive), while in the monolithic design the tolerance might be +/- 20%. But <i>all</i> the resistors would be off by the same amount and would match each other to a fraction of a percent, even across temperature and voltage variations.<p>The other funny effect is that when you buy a discrete 2% tolerance resistor, the distribution isn't Gaussian around the mean. That is because the manufacturers have measured all of them: the ones within 0.5% get marked up and put in the 0.5% bin, and the remaining ones within 1% tolerance get marked up less and put in the 1% bin. As a result, the distribution is bimodal, on either side of the "hole" in the middle.
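To put rough numbers on the matching effect (the +/- 20% process shift, 1% discrete tolerance, and 0.1% on-chip mismatch below are illustrative figures, not from any datasheet), here is a quick Monte Carlo sketch of a gain set by a resistor ratio:<p><pre><code>import random

# Closed-loop gain of an ideal inverting op-amp stage is set by a ratio: Rf / Rin.
NOMINAL_RF, NOMINAL_RIN = 40_000.0, 4_000.0   # nominal gain of 10

def discrete_gain(tol=0.01):
    # Discrete parts: each resistor is off independently, within its tolerance.
    rf  = NOMINAL_RF  * (1 + random.uniform(-tol, tol))
    rin = NOMINAL_RIN * (1 + random.uniform(-tol, tol))
    return rf / rin

def monolithic_gain(process=0.20, mismatch=0.001):
    # Monolithic parts: a big shared process error, but tiny relative mismatch.
    common = 1 + random.uniform(-process, process)
    rf  = NOMINAL_RF  * common * (1 + random.uniform(-mismatch, mismatch))
    rin = NOMINAL_RIN * common * (1 + random.uniform(-mismatch, mismatch))
    return rf / rin   # the shared error cancels in the ratio

def spread(samples):
    return max(samples) - min(samples)

discrete   = [discrete_gain()   for _ in range(100_000)]
monolithic = [monolithic_gain() for _ in range(100_000)]
print(f"discrete 1% parts, gain spread:    {spread(discrete):.3f}")    # roughly 0.4
print(f"monolithic 20% parts, gain spread: {spread(monolithic):.3f}")  # roughly 0.04
# The matched "bad" parts hold the ratio about 10x tighter around the nominal gain of 10.
</code></pre>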
This is a good evaluation of the way risk posture can inform design decisions, but I think it sort of ignores the elephant in the room: a short-term strategy wins more on average in a competitive context, especially when existential risk is on the table for entities that lose a competition. Most solutions to this problem involve solving hard coordination problems to change the balance of incentives or lower the lethality of competition. Figuring out systems that work for your level of risk tolerance is very achievable and can be incrementally improved, but designing for robustness needs fractal alignment at the higher meta-layers of incentives to be sustainable.
It's an argument against things like HTTP/3, which yields a slight increase in performance (maybe) in exchange for a large increase in complexity.
Classic issue in military and industrial equipment, where you often accept somewhat less than maximum possible performance in exchange for robustness.
Mechanical designers think about this a lot, because their enemies are wear, vibration, and fragility.
It all depends on your definition of the 'loss' function. One can actually include robustness/sensitivity in the goal and optimize for that.
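For instance, a toy sketch of optimizing the worst case over the expected drift instead of the nominal peak (the curve and the 0.2 drift are made up for illustration):<p><pre><code>import numpy as np

def performance(x):
    # Toy performance curve: a tall, narrow peak near x=2 and a
    # broad, slightly lower plateau near x=6.
    return np.exp(-((x - 2.0) / 0.05) ** 2) + 0.9 * np.exp(-((x - 6.0) / 1.0) ** 2)

def worst_case(x, spread=0.2):
    # Robust objective: the worst performance anywhere within the expected
    # drift of the operating point (voltage, temperature, part values).
    offsets = np.linspace(-spread, spread, 41)
    return performance(x + offsets).min()

xs = np.linspace(0.0, 10.0, 2001)
nominal_best = xs[np.argmax(performance(xs))]              # picks the sharp peak
robust_best  = xs[np.argmax([worst_case(x) for x in xs])]  # picks the plateau

print(f"nominal optimum x={nominal_best:.2f}, worst-case performance {worst_case(nominal_best):.3f}")
print(f"robust optimum  x={robust_best:.2f}, worst-case performance {worst_case(robust_best):.3f}")
</code></pre>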
Curiously, this is a big facet in our dev/ops re-organization.<p>For example, we in infra-operations are responsible for storing the data customers upload into our systems. This data has to be considered irreproducible, especially once it's older than a few days. If we lose it, we lose it for good, and then people are disappointed and turn angry.<p>As such, large-scale data wipes are handled very carefully, with manual approvals from several different teams. The full deletion of a customer goes through us, account management, contracts, and us again. And this is fine. Even with the GDPR and such, it is entirely fine that deleting a customer takes 1-2 weeks. Especially because the process has caught errors in other internal processes, and errors in our customers' processes. Suddenly you're the hero vendor when the customer goes "Oh fuck, noooooo".<p>On the other hand, stateless code updates without persistence changes are supposed to be able to move as fast as the build server allows. If something goes wrong, just deploy a fix with the next build or roll back. And sure, you can construct situations in which code changes cause big, persistent, stateful issues, but these tend to be rare with a decent dev-team.<p>We as infra-ops and central services need to be robust and reliable and are fine shedding speed (outside of standard requests) for this. A dev-team with a good understanding of stateful and stateless changes should totally be able to run into a wall at full speed, since they can stand back up just as quickly. We're easily looking at hours of backup restore for a hosed database. And no, there is no way to speed that up without hardware changes.
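To make the first half concrete, here is a minimal sketch of what a gated customer wipe can look like (the team names and the multi-sign-off rule are illustrative, not our actual tooling):<p><pre><code>from dataclasses import dataclass, field

# Illustrative approval chain; real team names and ordering will differ.
REQUIRED_SIGNOFFS = ["infra-ops", "account-management", "contracts", "infra-ops-final"]

@dataclass
class CustomerDeletionRequest:
    customer_id: str
    signoffs: set = field(default_factory=set)

    def approve(self, team: str) -> None:
        if team not in REQUIRED_SIGNOFFS:
            raise ValueError(f"unknown approving team: {team}")
        self.signoffs.add(team)

    def execute(self) -> None:
        missing = [t for t in REQUIRED_SIGNOFFS if t not in self.signoffs]
        if missing:
            # Deliberately slow and loud: an irreversible wipe needs every sign-off.
            raise PermissionError(f"deletion blocked, missing sign-offs: {missing}")
        print(f"wiping all data for customer {self.customer_id}")

req = CustomerDeletionRequest("example-customer")
req.approve("infra-ops")
req.approve("account-management")
# req.execute() would still raise here: contracts and the final infra-ops
# check have not signed off yet, so nothing gets wiped.
</code></pre>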