I think the article is missing one big reason why we care about 99.99% or 99.9% latency metrics: we can see high latency spikes even at low utilization.

The majority of computer systems do not run at high utilization. As has been pointed out many times, computers are really fast these days, and many businesses could run for their entire lifetime on a single machine if the underlying software used the hardware efficiently. And yet, even at low utilization, we still see occasional high latency, often enough to frustrate users. Why? Because a lot of software is built on designs that intersperse low-latency operations with occasional high-latency ones. This shows up everywhere: garbage collection, disk and memory fragmentation, growable arrays, eventual consistency, soft deletions followed by later hard deletions, and so on.

What this article advocates is essentially an amortized analysis of throughput and latency, and under that analysis you do get a nice, steady relationship between utilization and latency. But for a system that may never come close to fully utilizing its hardware (which describes a large fraction of software running on modern hardware), the amortized view is not very valuable: even at very low utilization, you can see very different latency distributions depending on the software design choices above and how you tune them.

This is why many software systems don't care about the median or average latency but do care about the 99th or 99.9th percentile: there is a utilization-independent component to the statistical distribution of latency over time, and for the many systems that run at low hardware utilization, that component, not utilization, is the main determinant of the overall latency profile.
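
To make the growable-array case concrete, here is a minimal sketch (my own illustration, not from the article): it times individual appends to a Python list, a growable array whose appends are cheap on average but occasionally trigger a reallocation and copy, then compares the median against the tail percentiles.

    # Sketch: measure per-append latency of a Python list (a growable array).
    # Most appends are O(1); occasionally one triggers a resize and copy.
    import time

    def append_latencies(n=1_000_000):
        data, latencies = [], []
        for i in range(n):
            start = time.perf_counter_ns()
            data.append(i)  # usually cheap; sometimes reallocates the backing array
            latencies.append(time.perf_counter_ns() - start)
        return latencies

    def percentile(sorted_values, p):
        # Nearest-rank percentile, good enough for this illustration.
        idx = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
        return sorted_values[idx]

    if __name__ == "__main__":
        lat = sorted(append_latencies())
        for p in (50, 99, 99.9, 99.99):
            print(f"p{p}: {percentile(lat, p)} ns")

On a typical run the median append is tiny while the high percentiles are orders of magnitude larger (reallocations plus some OS scheduling noise), even though the machine is nearly idle: the latency profile is shaped by the data structure's design, not by how busy the hardware is.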