Operating a large distributed system in a reliable way: practices I learned

378 points by gregdoesit almost 6 years ago

12 comments

LaserToy almost 6 years ago
I would add: you should start your monitoring with business metrics. Monitoring low-level things is good to have, but putting the whole emphasis on it misses the point. You should be able to answer at any point in time whether users are having a problem, what the problem is, how many users are affected, and what they are doing to work around it.

In other words, when a person is in the ER, doctors look at heartbeat, temperature, and so on, not at some low-level metric like how many grams of oxygen a specific cell consumes.
bcoates almost 6 years ago
This is a good guide. One thing I'd add:

While you're monitoring for traffic/errors/latency, throw in minimum success rate. Make a good estimate of how many successful operations the monitored systems will do per minute in the slowest hour of the year, and put in an alert if the throughput drops *below* that. You'd be surprised how many 0 errors/0 successes faults happen in a complex system, and a minimum throughput alarm will catch them.
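
A rough sketch of the minimum-throughput alarm described above, in Python. Everything named here (`MinuteStats`, `check_minute`, the alert names, the thresholds) is made up for illustration and not taken from the article or any particular monitoring tool. It only shows why an error-rate check stays silent on a "0 errors / 0 successes" minute while a throughput floor does not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinuteStats:
    """Counts for one monitored operation over a one-minute window (hypothetical shape)."""
    successes: int
    errors: int


def check_minute(stats: MinuteStats, min_successes: int, max_error_rate: float) -> List[str]:
    """Return the names of the alerts that should fire for this window."""
    alerts = []
    total = stats.successes + stats.errors
    # Classic error-rate alert: it cannot fire when there is no traffic at all.
    if total > 0 and stats.errors / total > max_error_rate:
        alerts.append("error_rate_high")
    # Minimum-throughput alert: also fires on the "0 errors / 0 successes" case.
    if stats.successes < min_successes:
        alerts.append("throughput_below_floor")
    return alerts
```

With a floor of 50 successes per minute (assumed to be the estimate for the slowest hour of the year), `check_minute(MinuteStats(0, 0), 50, 0.05)` raises only the throughput alert, which is exactly the fault an error-rate alert would miss.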
cpursley almost 6 years ago
Can anyone recommend MOOCs and/or university courses (open syllabus) covering Distributed Systems?
NewsAware almost 6 years ago
Nice article. The amount of in-house procedures/tooling developed by the backend seems impressive (maybe some not-invented-here syndrome, but I can't really judge). What astonishes me, though, is that the backend part of Uber seems so professional while the Uber Android app feels like it was built by 2 junior outsourced devs. I have rarely used an app that felt so buggy and awkward (aside from regular crashes, e.g. when registering for Jump bikes in Germany inside the app, I had to restart the app to have the corresponding menu item appear).
toolslive almost 6 years ago
You need to have a strategy for backward (and forward) compatibility for your components. If the environment is large enough, you don't know exactly which component is running which version of the code, since some part of your system is constantly being upgraded (or held back from an upgrade). This includes extra RPC calls (and extra parameters to them), data type evolution, and schema evolution. Without a decent strategy you'll be in over your head quickly. (Tip: a version number for your API as part of the API, like v0.0.1, ain't gonna be enough.)
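
One common piece of such a strategy is a tolerant decoder: ignore fields you don't know, and give every newly added field a default. A minimal Python sketch follows, assuming payloads arrive as plain dictionaries. `RideRequest`, its fields, and `decode_ride_request` are hypothetical names invented for this example, not anything from the article.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass
class RideRequest:
    """Hypothetical RPC payload; the fields are invented for this example."""
    rider_id: str
    pickup: str
    # Added in a later release, so it needs a default for older senders.
    vehicle_class: str = "standard"


def decode_ride_request(payload: Mapping[str, Any]) -> RideRequest:
    """Decode a payload from a sender that may be older or newer than this code."""
    known = {f.name for f in fields(RideRequest)}
    # Forward compatibility: drop fields that a newer sender added and we do not know.
    kept = {k: v for k, v in payload.items() if k in known}
    # Backward compatibility: fields missing from older senders fall back to defaults.
    return RideRequest(**kept)
```

This only covers additive changes. Renaming a field, changing its type, or changing its meaning still needs a coordinated migration, which is presumably why a bare version number is not enough on its own.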
techie128 almost 6 years ago
Interesting, although it is on the light side. For example, it doesn't talk about chaos testing, defining effective and comprehensive metrics (KPIs), alert noise, or running services like databases in an active-active (hot-hot) mode.
jwilliams almost 6 years ago
Good read. There are a few things that I'd throw on top as important:

- Active monitoring
- Chaos testing
- Cold start testing
joshgel almost 6 years ago
> I like to think of the effort to operate a distributed system being similar to operating a large organization, like a hospital.

Clearly never worked for a hospital. Hospitals need good engineers (and often don’t have them). Our ‘nines’ are embarrassing...
ggregoire almost 6 years ago
Most of that advice applies to small non-distributed systems too.
drdrey almost 6 years ago
I find it problematic that this recommends the Five Whys to get to "the root cause". Haven't we collectively moved past that?
VincentEvans almost 6 years ago
Did you do all these things by yourself?

Really great content, but I was really taken aback by the “I” used everywhere. Maybe it’s a new thing that I am not hip to and ought to try: “I built and ran transaction processing software for Bloomberg! This is what I learned!”

But perhaps you really did do all that by yourself; in that case, sorry that I doubted you. Looks like it’s a lot.
learnfromstory almost 6 years ago
Don't really agree that this list could have come about through discussions with engineers at Google, Facebook, etc. The more computers you have, the less important it becomes to monitor junk like CPU and memory utilization of individual machines. Host-level CPU usage alerting can't possibly be a "must-have" if there are extremely large distributed systems operating without it.

If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem, and no amount of alerting can help you.