A note on the name: "fail-safe" in engineering doesn't mean that a system <i>cannot</i> fail, but rather that when it does, it does so in the safest manner possible.<p>The term originated with (or is strongly associated with) the Westinghouse railroad brake system. These are the pressurised air brakes on trains, in which air pressure holds the brake shoes <i>open</i> against spring pressure. Should the integrity of the brake line be lost, the brakes will fail in the activated position, slowing and stopping the train (or keeping a stopped train stopped).<p><a href="https://en.m.wikipedia.org/wiki/Railway_air_brake" rel="nofollow">https://en.m.wikipedia.org/wiki/Railway_air_brake</a><p>Fail-safe designs and practices can lead to some counterintuitive concepts. Aircraft landing on carrier decks, where they are arrested by cables, apply full engine power and afterburner on landing. The idea is that should the arresting cable or hook fail, the aircraft can safely take off again.<p><a href="https://en.m.wikipedia.org/wiki/Fail-safe" rel="nofollow">https://en.m.wikipedia.org/wiki/Fail-safe</a><p>Upshot: "fail-safe" doesn't mean "test all your failure conditions exhaustively". It may well mean aborting on any failure mode (see djb's software for examples). The most important criterion is that whatever the failure mode, the outcome be as safe as possible, and that this almost always rest on a very simple and robust design, mechanism, logic, or system.<p>From the description of this project, it strikes me that it may well be failing (unsafely?) to implement these concepts. Charles Perrow, scholar of accidents and risks, notes that it's often the safety and monitoring systems themselves which play a key role in accidents and failures.
Very cool. Consistent and clear retry, backoff, and failure behaviors are an important part of designing robust systems, so it's disappointing how uncommon they are. If I were starting a new Java project today I would almost certainly want to use this library instead of the various threads and timers I had to hack together years ago.
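To make "consistent and clear retry, backoff, and failure behaviors" concrete, here is a minimal hand-rolled sketch of a bounded retry with exponential backoff. It is not the library's API: the class name `BoundedRetry`, the method `call`, and its parameters are all made up for illustration, and it blocks the calling thread rather than scheduling anything.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not taken from any library.
final class BoundedRetry {

    // Runs the task up to maxAttempts times, doubling the delay between
    // attempts (capped at maxDelay). Rethrows the last failure once the
    // attempt budget is spent, so the failure behavior is explicit.
    static <T> T call(Callable<T> task, int maxAttempts,
                      Duration initialDelay, Duration maxDelay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelay.toMillis();
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // budget exhausted: surface the last failure
                }
                Thread.sleep(delayMs); // back off before the next attempt
                delayMs = Math.min(delayMs * 2, maxDelay.toMillis());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated flaky operation: fails twice, then succeeds.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        }, 5, Duration.ofMillis(10), Duration.ofMillis(100));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point of the hard cap on attempts and on delay is that the caller always gets an answer or an exception in bounded time, which is the part that ad hoc threads and timers tend to get wrong.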
How is this distinct from Hystrix (<a href="https://github.com/Netflix/Hystrix" rel="nofollow">https://github.com/Netflix/Hystrix</a>)? Why should I use one over the other?
Some of these patterns for the .NET/Azure/C# stack are described here: <a href="https://msdn.microsoft.com/en-us/library/dn568099.aspx" rel="nofollow">https://msdn.microsoft.com/en-us/library/dn568099.aspx</a>
Beware of runaway retries: <a href="https://blogs.msdn.microsoft.com/oldnewthing/20051107-20/?p=33433" rel="nofollow">https://blogs.msdn.microsoft.com/oldnewthing/20051107-20/?p=...</a><p>Personally, I'd rather systems fail quickly, with retries only at the highest (application) and lowest (TCP) levels.
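One way retries can run away, assuming the concern is retry logic stacked at several layers: if every layer retries independently, the attempts multiply rather than add. The sketch below simulates a permanently failing bottom-level call wrapped in nested layers that each make a fixed number of attempts, and counts how often the bottom is hit. The class and method names are hypothetical.

```java
// Hypothetical simulation for illustration only.
final class RetryAmplification {

    // Simulates a call through `layers` nested layers, each making up to
    // `attemptsPerLayer` attempts. The bottom-level operation always fails.
    // counter[0] accumulates how many times the bottom is actually invoked.
    static boolean invoke(int layers, int attemptsPerLayer, long[] counter) {
        if (layers == 0) {
            counter[0]++;   // one real call to the failing resource
            return false;   // permanent failure
        }
        for (int i = 0; i < attemptsPerLayer; i++) {
            if (invoke(layers - 1, attemptsPerLayer, counter)) {
                return true;
            }
        }
        return false;       // this layer gives up too
    }

    // Convenience wrapper returning the number of bottom-level calls.
    static long bottomCalls(int layers, int attemptsPerLayer) {
        long[] counter = {0};
        invoke(layers, attemptsPerLayer, counter);
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(bottomCalls(1, 3)); // prints 3
        System.out.println(bottomCalls(4, 3)); // prints 81 (3^4)
    }
}
```

With retries confined to one or two layers the worst case stays small; with four layers of three attempts each, a single failed request turns into 81 calls against a resource that is already in trouble.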