Practices of Reliable Software Design

231 points by fagnerbrack 7 months ago

11 comments

There is a bunch of good advice here, but it misses the most useful principle in my experience, probably because the motivating example is too small in scope:

The way to build reliable software systems is to have multiple independent paths to success.

This is the Erlang "let it crash" strategy restated, but I've also found it embodied in things like the architecture of Google Search, Tandem Computers, Ethereum, RAID 5, the Space Shuttle, etc. Basically, you achieve reliability through redundancy. For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways. If the answers agree, great, you're done. If not, have some consensus mechanism to determine the true answer. If you can't compute the answer in parallel, or you still don't get one back, retry.

The reason for this is simply math. If you have n different events that must all go right to achieve success, the chance of this happening is x1 * x2 * ... * xn. This product goes to zero very quickly - if you have 20 components connected in series that are all 98% reliable, the chance of success is only about 2/3. If instead you have n different events where any one can go right to achieve success, the chance of success is 1 - (1 - y1) * (1 - y2) * ... * (1 - yn). This value actually increases, and fast, as the number of alternate pathways to success goes up. If you have 3 alternatives, each of which has just an 80% chance of success, but any of the 3 will work, then doing them all in parallel has a 99% chance of success (1 - 0.2^3 = 0.992).

This is why complex software systems that must stay up are built with redundancy, replicas, failover, retries, and other similar mechanisms in place. And the presence of those mechanisms usually trumps anything you can do to increase the reliability of individual components, simply because you get diminishing returns to carefulness. You might spend 100x more resources to go from 90% reliability to 99% reliability, but if you can identify a system boundary and correctness check, you can get that 99% reliability simply by having 2 teams each build a subsystem that is 90% reliable and checking that their answers agree.
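A minimal sketch of the series-versus-parallel arithmetic, assuming every component fails independently of the others (real systems often share failure modes, so treat the results as upper bounds). The function names are made up for illustration.

```python
from math import prod

def series_reliability(parts):
    """All components must succeed: multiply their success probabilities."""
    return prod(parts)

def parallel_reliability(paths):
    """Any one path may succeed: 1 minus the chance that every path fails."""
    return 1 - prod(1 - p for p in paths)

# 20 components in series, each 98% reliable
print(round(series_reliability([0.98] * 20), 3))   # 0.668

# 3 independent alternatives, each 80% reliable
print(round(parallel_reliability([0.80] * 3), 3))  # 0.992
```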

bruce511 7 months ago

The first point is one that resonates strongly with me. Counter-intuitively, the first instinct of a programmer should be "buy that, don't write it".

Of course, as a programmer, this is by far not my first instinct. I am a programmer, my function is programming, not purchasing.

Of course buying something is always cheaper (compared to the cost of my time) and will be orders of magnitude cheaper once the costs to maintain written-by-me code are added in.

Things that are bought -tend- to last longer too. If I leave my job I leave behind a bunch of custom code nobody wants to work on. If I leave Redis behind, well, the next guy just carries on running Redis.

I know all this. I advocate for all this. But I'm a programmer, and coders gotta code :) so it's not like we buy everything. I'm still there, still writing.

Hopefully though my emphasis is on adding value. Build things that others will take over one day. Keep designs clean, and code cleaner.

And if I add one 'practice' to the list: Don't Be Clever. Clever code is hard to read, hard to understand, hard to maintain. Keep all code as simple as it can be. Reliable software is software that mostly isn't trying to be too clever.

taeric 7 months ago

This misses one of the key things I have seen that really drives reliable software. Actually rely on the software.

It sucks, because nobody likes the idea of the "squeaky wheel getting the grease." At the same time, nobody is surprised that the yard equipment that they haven't used in a year or so is going to need effort to get back to working. The longer it has been since it was relied on to work, the more likely that it won't work.

To that end, I'm not arguing that all things should be on the critical path. But the more code you have that isn't regularly exercised, the more likely it will be broken if anything around it changes.

l5870uoo9y 7 months ago

I would add a ninth practice: throw errors. Errors that are thrown get found and fixed, as opposed to errors that go silently unnoticed in the code base.
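A small sketch of the difference, using a made-up config loader (the `port` field and both function names are hypothetical, chosen only for illustration). The silent version hides a typo behind a default; the strict version raises with enough context to fix the input.

```python
import json

def load_port_silent(raw):
    """Anti-pattern: swallow every error and fall back to a default."""
    try:
        return int(json.loads(raw)["port"])
    except Exception:
        return 8080  # a typo in the config now goes unnoticed

def load_port_strict(raw):
    """Raise with context so the bad input gets found and fixed."""
    try:
        return int(json.loads(raw)["port"])
    except (KeyError, TypeError, ValueError) as exc:
        # json.JSONDecodeError is a subclass of ValueError
        raise ValueError(f"invalid config {raw!r}: {exc!r}") from exc

bad = '{"prt": 9000}'          # misspelled key
print(load_port_silent(bad))   # 8080, and nobody learns about the typo
try:
    load_port_strict(bad)
except ValueError as err:
    print("caught:", err)
```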

throwawayha 7 months ago

Great points. But why do we invest so much complexity into outputting HTML/JS/CSS?

SomewhatLikely 7 months ago

My first thought upon seeing the prompt:

> If you would build an in-memory cache, how would you do it? It should have good performance and be able to hold many entries. Reads are more common than writes. I know how I would do it already, but I’m curious about your approach.

was to add this requirement, since it comes up so often:

> Let's assume that keys accessed follow a power law, so some keys get accessed very frequently and we would like them to have the fastest retrieval of all.

I'm not sure if there are any efficient tweaks to hash tables or B-trees that might help with this additional requirement. Obviously we could make a hash table take way more space than needed to reduce collisions, but with a decent load factor, is the answer to just swap frequently accessed keys to the beginning of their probe chain? How do we know a key is frequently accessed? Count-Min sketch?

Even with that tweak, the hottest keys will still be scattered around memory. Wouldn't it be best if their entries could fit into fewer pages? So, maybe a much smaller "hot" table containing, say, the 1,000 most accessed keys. We still want a high load factor to maximize the use of cache pages, so perhaps perfect hashing?
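For the "how do we know it's frequently accessed" part, here is a rough sketch of a Count-Min sketch. The class name, the default `width` and `depth`, and the choice of BLAKE2b with a row prefix as the hash family are all assumptions made for illustration. The estimate can overcount when keys collide, but it never undercounts.

```python
import hashlib

class CountMinSketch:
    """Approximate per-key access counts in fixed memory."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        # One hash per row, derived by prefixing the row number to the key.
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key):
        for row, col in self._slots(key):
            self.rows[row][col] += 1

    def estimate(self, key):
        # The smallest counter is the one least inflated by collisions.
        return min(self.rows[row][col] for row, col in self._slots(key))

sketch = CountMinSketch()
for _ in range(100):
    sketch.add("hot-key")
sketch.add("cold-key")
print(sketch.estimate("hot-key") >= 100)  # True: estimates never undercount
```

A cache could promote a key into the small "hot" table once its estimate crosses some threshold; picking that threshold and aging the counters are left out here.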

uzerfcwn 7 months ago

It seems like the author had some very specific read and write pattern in mind when they designed for performance, but it's never explicitly stated. The problem setting only stated that "reads are more common than writes", but that's not really saying much when discussing performance. For example, an HTML server commonly has a small set of items that are most frequently read, and successive reads are not very strongly dependent. On the other hand, a PIM system may often get iterative reads correlated on some fuzzy search filter, which will be slow and thrash the cache pretty badly if the system is optimized for different access patterns.

When designing software, you first need to nail down the requirements, which I didn't really find in TFA.

hamdouni 7 months ago

My takeaways, from a more general point of view:

1. Make or buy
2. Release an MVP
3. Keep it simple
4. Prepare for the worst
5. Make it easy to test
7. Benchmark, monitor, log...

BillLucky 7 months ago

Simple but elegant design principles. Recommended.

u8_friedrich 7 months ago

> It is much easier to add features to reliable software, than it is to add reliability to featureful software.

Not sure about this tbh. In a lot of cases, yeah, maybe. But when you are dealing with complicated business logic where a lot of bells and whistles are required, building a simple reliable version can lead you into a naive implementation that might be reliable but is very hard to extend, while making an unstable complicated thing can help you understand the pitfalls, and you can work back from there into something more reliable. So I think this depends very much on the context.

ActionHank 7 months ago

Quick mental exercise on this.

If someone posed this question to you in an interview and you used these principles, would you get the job?

Probably not.