
Using load shedding to survive a success disaster - CRE life lessons

51 points · by fhoffa · over 8 years ago

6 comments

stuckagain · over 8 years ago
I have a set of rules for load shedding that I urge you to consider. First, and foremost, whenever you read (or are about to say) "load shedding," just mentally substitute the correct terminology, which is "intentionally serving errors." This will put you in the right frame of mind to properly ponder the outcomes.

Secondly, the error path on your backend must be strictly cheaper than the success path, or the whole scheme doesn't work. Particularly bad actions on error are, for example, logging an error at such a high severity that the log files need to be flushed and synced, which is likely to be tremendously expensive. Another example is taking a mutex and incrementing some error counter that normally wouldn't be incremented on the serving path. If this tends to synchronize all your serving threads, your server will collapse.

Third, load shedding can only be implemented correctly if you control the client and the server, end-to-end. Perhaps you want to avoid hot spots by serving a soft error from an overloaded shard. If your client is guaranteed to try another shard (or just give up), this is a good approach. If the client might retry on the same shard, then it's not helpful: you just "shed load" in such a way that you had to serve the same request twice.
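A minimal Go sketch of the second rule, assuming an ordinary HTTP server (the threshold and handler names are illustrative, not from the comment): the shed path does nothing but two atomic adds and one small write, with no mutexes and no synchronous logging, so serving the error stays strictly cheaper than serving the request.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// maxInFlight is an illustrative threshold, not a recommendation.
const maxInFlight = 1000

var inFlight int64

// shed wraps a handler and intentionally serves errors when the server
// is over the threshold. The error path is deliberately cheap: no locks,
// no synchronous logging, just a tiny 503 response.
func shed(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inFlight, 1) > maxInFlight {
			atomic.AddInt64(&inFlight, -1)
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&inFlight, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", shed(ok))
}
```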
drdrey · over 8 years ago
Some additional things that can be done:

* Soft-shedding: instead of dropping a request (which might just incur a retry storm), sometimes it is appropriate to send back a cheap response so that the client sees a successful response instead of an error.

* Route critical requests and non-critical requests to separate clusters that can be scaled and configured independently. The blog post mentions doing that using DNS, but it also works for mid-tier services.

* Build back-pressure into the client. Instead of a timeout or error, a well-conforming client can enter "polite" mode when it receives a signal that the backend is overwhelmed (a rough sketch of such a client follows below).
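A rough Go sketch of the back-pressure idea, assuming the backend signals overload with HTTP 503 and optionally a Retry-After header (the endpoint, retry limit, and backoff values are made-up assumptions): the client slows itself down instead of hammering an overwhelmed server.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// politeGet enters "polite" mode when the backend signals overload,
// honoring a Retry-After header if the server supplies one and
// otherwise backing off exponentially.
func politeGet(url string) (*http.Response, error) {
	backoff := 100 * time.Millisecond
	for attempt := 0; attempt < 5; attempt++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusServiceUnavailable {
			return resp, nil
		}
		resp.Body.Close()
		// Prefer the server's hint; otherwise grow the delay ourselves.
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				backoff = time.Duration(secs) * time.Second
			}
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return nil, fmt.Errorf("giving up: %s still overloaded", url)
}

func main() {
	resp, err := politeGet("http://localhost:8080/")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```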
ChuckMcM · over 8 years ago
We did this very successfully at Blekko (search engine) to keep the system from getting overloaded. The frontend engineer Bryn designed a really useful way of monitoring nginx connections to the backend and shedding load when they exceeded a threshold, and Greg designed a "geoknob" that would let us turn off traffic to regions of the world that were unlikely to be our primary customer base.

Also, anomalous load shedding is a great indication of a traffic anomaly. Big scrapers sometimes appeared that way first, even when their attack was coming from a wide number of IPs.
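A toy version of the "geoknob" idea in Go, assuming a reverse-proxy-style frontend (the region lookup, region names, and switch mechanism are all placeholders; the comment gives no implementation details): while shedding is switched on, requests from deprioritized regions get a cheap error before touching the backend.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// sheddingOn would be flipped by an operator or by load monitoring.
var sheddingOn atomic.Bool

// deprioritized lists regions to shed first; the entries are placeholders.
var deprioritized = map[string]bool{"REGION-A": true}

// regionOf stands in for a real GeoIP lookup on the client address.
func regionOf(r *http.Request) string {
	return "REGION-A" // e.g. look up r.RemoteAddr in a GeoIP database
}

// geoknob rejects traffic from deprioritized regions, but only while
// shedding is switched on, so normal operation is unaffected.
func geoknob(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if sheddingOn.Load() && deprioritized[regionOf(r)] {
			http.Error(w, "try again later", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	sheddingOn.Store(true)
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", geoknob(ok))
}
```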
okreallywtf · over 8 years ago
Can anyone comment on at what level of scale this kind of issue might arise? It seems like it would be fairly costly to implement and test. Is it safe to assume that by the time you attempt this, it is either infeasible or too costly to continue scaling up to service the peak load? If you were running on bare metal and simply could not add more instances/databases/caches/etc. fast enough, I can understand that you might be able to deploy a software solution like this faster than increasing capacity. I could also understand being capped by the cost of continuing to scale, but I can't imagine putting the development effort into this kind of solution unless there were no other options.

Would these kinds of techniques ever be a worthwhile exercise for an (early) startup or a small company that is hosting in the cloud, or is it a last resort after you have already gotten quite large?
ozgune · over 8 years ago
For anyone who's interested, the following paper on load shedding is also a good read: https://pdos.csail.mit.edu/6.828/2010/readings/mogul96usenix.pdf

The paper basically identifies the problem as a "livelock": you have a system that receives so many requests that, instead of making any real progress, it just tries to move those requests through different queues (Section 6).

If you're building a distributed system (say, an SOA), I find that load shedding also has the nice property that it gives the system's clients immediate feedback, rather than having the client wait for a long time and make guesses.
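One way to make the livelock point concrete, as a sketch (the queue capacity and handler names are arbitrary assumptions, not from the paper): admit work through a bounded queue and fail fast when it is full, so the server spends its time doing real work rather than shuffling a backlog it will never drain, and clients get that immediate feedback.

```go
package main

import "net/http"

// work is a bounded admission queue; the capacity is arbitrary.
var work = make(chan struct{}, 512)

// admit fails fast when the queue is full, instead of letting requests
// pile up in queues that never drain (the livelock the paper describes).
func admit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case work <- struct{}{}:
			defer func() { <-work }()
			next.ServeHTTP(w, r)
		default:
			// Immediate feedback: the client learns about overload now,
			// rather than timing out and guessing.
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", admit(ok))
}
```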
intr1nsic · over 8 years ago
This reads like an ideal use case for an object store service. My guess is that, given the traffic patterns of mobile clients, this was a necessity. Good read.