A good summary, with one exception: monitoring, instrumentation, and logging don't get enough attention. Failure is the norm, so first and foremost you want to know when a failure occurs, and then you need to be able to investigate what went wrong. You should literally monitor/instrument everything: every API call, every DB access, every page render should include code to record latency, result codes, payload size, etc. Every error (even a benign one) should be logged, preferably with a stack trace. Any unexpected condition should be checked and logged. All the instrumentation data should be graphed and stored for a long period so you can analyze the impact of your code changes on system performance and correlate it with system failures.
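To make "instrument everything" concrete, here's a minimal Python sketch of a wrapper that records latency and payload size on every call and logs every error with a stack trace; the decorator and function names are hypothetical, not from any particular library:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("instrumentation")

def instrumented(name):
    """Wrap a call so latency, outcome, and errors are always recorded."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                elapsed_ms = (time.perf_counter() - start) * 1000
                # Record latency and payload size on every call,
                # not just on failures.
                log.info("%s ok latency_ms=%.2f payload_bytes=%d",
                         name, elapsed_ms, len(str(result)))
                return result
            except Exception:
                elapsed_ms = (time.perf_counter() - start) * 1000
                # Log every error, benign or not, with a stack trace.
                log.exception("%s error latency_ms=%.2f", name, elapsed_ms)
                raise
        return wrapper
    return decorator

@instrumented("fetch_user")
def fetch_user(user_id):
    # Stand-in for a real API/DB call.
    return {"id": user_id, "name": "alice"}

fetch_user(42)
```

In practice you'd ship these records to a metrics store rather than a plain logger, so they can be graphed and kept long enough to correlate with code changes.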
I don't like the 'no single point of failure' maxim because I think it leads people to make strange or incorrect decisions, and to neglect things in order to serve the maxim instead of doing what's best.<p>Being 'fail safe' is much more important than being redundant. That is, you need to design your product's failure. How well it works and how rarely it fails are important, but not nearly as important as how well it fails.<p>This means monitoring so you know when it fails, auditing so you know how it failed after the fact, backups for recovering after the fact, and most importantly (and hardest to define), predicting what can fail and how, and designing your product's behavior after that failure.
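One common way to "design the failure" is a circuit breaker: after repeated failures you stop hammering the broken dependency and serve a deliberately chosen fallback instead. A minimal Python sketch, with invented names and thresholds:

```python
import time

class CircuitBreaker:
    """Trip after repeated failures so the product degrades predictably."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While the breaker is open, serve the designed fallback
        # instead of retrying a dependency known to be failing.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            # Half-open: let one real call through to probe recovery.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

breaker = CircuitBreaker(max_failures=2)

def flaky_backend():
    raise RuntimeError("backend down")

# The product's behavior after failure is an explicit design choice here:
# stale cached data instead of an error page.
print(breaker.call(flaky_backend, lambda: "stale cached page"))
```

The point isn't the mechanism so much as the decision it forces: you choose, in advance, what the user sees when the dependency is gone.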