TechEcho

Tail latency might matter more than you think

91 points by timf about 4 years ago

10 comments

cassianoleal about 4 years ago
The talk "How NOT to Measure Latency" [0] taught me, in a very well presented way, all I needed to know to start worrying about tail latency.

[0] https://www.infoq.com/presentations/latency-response-time/
jkire about 4 years ago
> A common pattern in these systems is that there's some frontend, which could be a service or some Javascript or an app, which calls a number of backend services to do what it needs to do.

I think an important idea here is that you should be trying to measure the experience of a user (or as close to it as possible). If there is a slow service somewhere in your stack but it has no impact on user experience, then who cares? Conversely, if users are complaining that the app feels sluggish, then it doesn't matter if all your graphs say that everything is OK.

I find it helpful to split graphs/monitoring into two categories: 1) if these graphs look fine, then the service is probably fine, and 2) if problems are being reported, then these graphs might give an insight into *why* things are going wibbly. In general, we alert on the former and diagnose with the latter. Of course, it's nigh-on impossible to get perfect metrics that track actual user experience, but we've definitely found it worthwhile to try to get as close as possible.

---

Another fun problem with summary statistics is that they can easily "lie" when the API does a variable amount of work. For example, if you have a "get updates" API that is called regularly to fetch updates since the last call, you end up with two modes: 1) a small amount of time between calls, so the call is super fast, and 2) a large amount of time between calls, so the call is slow. In any given time period the *vast* majority of calls are super quick, but *every* user hits the slow case when they open the app for the first time that day. The result is summary statistics that all but ignore those slow API calls.
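The bimodal "get updates" effect is easy to demonstrate with a toy simulation (all numbers and the 50-calls-per-user shape are hypothetical): every user hits one slow cold-sync call, yet the percentile summaries barely register it.

```python
import random

random.seed(0)

# Hypothetical workload: each of 1000 users makes one slow "cold" call
# (a day's worth of updates, ~800 ms) and 49 fast incremental calls (~20 ms).
latencies = []
for _ in range(1000):
    latencies.append(random.gauss(800, 100))               # cold call
    latencies += [random.gauss(20, 5) for _ in range(49)]  # warm calls

latencies.sort()
p50 = latencies[len(latencies) // 2]
p90 = latencies[int(len(latencies) * 0.9)]
print(f"p50 = {p50:.0f} ms, p90 = {p90:.0f} ms")
# Only 2% of calls are slow, so p50 and p90 sit in the 20-30 ms range even
# though 100% of users experienced an ~800 ms call that day.
```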
kqr about 4 years ago
Another point often missed is the diagnostic value of tail measurements. One of the first things I do at any job is replace the 90th percentile with the maximum in all plots.

Sure, it gets messier, and definitely less visually appealing, but the reaction from others has uniformly been "Did we have this data available all along and just never showed it?!"

It's also worth mentioning that even in a system where tail latencies technically aren't a big problem, psychologically they are. If you visit a site 20 times and just one of those visits is slow, you're likely to file it mentally under "slow site" rather than "fast site".
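A toy version of the same exercise (numbers made up): one multi-second stall in a day of per-minute samples never moves a p90 plot, but a max plot makes it impossible to miss.

```python
import random

random.seed(1)

# One day of per-minute latency samples, normally ~30 ms, with a single
# hypothetical 4.2 s stall (GC pause, lock contention, ...) at noon.
samples = [random.gauss(30, 5) for _ in range(1440)]
samples[720] = 4200.0

ordered = sorted(samples)
p90 = ordered[int(len(ordered) * 0.9)]
worst = max(ordered)
print(f"p90 = {p90:.0f} ms, max = {worst:.0f} ms")
# p90 stays in the mid-30s; only the max reveals the 4200 ms stall.
```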
FriedrichN about 4 years ago
This is why I don't like require.js: one script requires this script, which requires that script. One hiccup anywhere down the chain and the whole page has to wait. One of my clients had their website built and wondered why it was so slow (the designers said it needed a faster server), but I found it was requesting hundreds of .js files in roughly 10 waves, causing the whole page to take up to 10 seconds to load completely.
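The waterfall math can be sketched like this (all figures hypothetical): each wave of scripts can only start after the previous one finishes, and each wave is as slow as its slowest request, so even rare per-request hiccups compound across waves.

```python
import random

random.seed(2)

# Hypothetical waterfall: 10 dependency "waves" of ~30 scripts each; a wave
# can't start until the previous one finishes, and each request takes ~50 ms
# with a 2% chance of a 1 s hiccup.
def request_ms():
    return 1000.0 if random.random() < 0.02 else random.gauss(50, 10)

total = 0.0
for _ in range(10):
    wave = [request_ms() for _ in range(30)]
    total += max(wave)  # the wave is as slow as its slowest script

print(f"total page load ≈ {total:.0f} ms")
# Across 300 chained requests, several waves almost always contain a hiccup,
# so the page routinely takes seconds despite a 50 ms typical request.
```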
cratermoon about 4 years ago
A lovely thing about tail latency in the chains the post talks about is how one slow service can cascade. Especially in serial chains, when one component is slow the rest are waiting, meanwhile tying up resources like memory, sockets, and CPU cycles that could be used to service other requests. In the worst cases, those other services start responding slowly to their own requests, resulting in further degradation.

Having circuit breakers and carefully tuned timeouts can help.
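A minimal sketch of the circuit-breaker idea (a hypothetical API, not any particular library): after a few consecutive failures, fail fast for a cooldown period instead of letting callers queue up on a degraded dependency.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: open after `max_failures` consecutive failures,
    fail fast until `reset_after` seconds pass, then allow one trial call."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Failing fast turns a slow failure (a thread parked on a timeout, holding memory and sockets) into an immediate one, which is exactly what interrupts the cascade described above.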
klodolph about 4 years ago
> Choosing Summary Statistics

This is the #1 thing that people get wrong. It's something that otherwise smart software engineers get wrong because they don't have enough of a background in data analytics.

The problem has two parts. One part is that once you reduce an observation to summary statistics, you can't go back. The other part is that web services usually generate *too much data* if you don't summarize.
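One common middle ground between keeping raw observations and keeping lossy percentiles is fixed-bucket histograms: unlike percentiles, they merge losslessly across hosts and time windows, and any quantile can be re-read from the merged counts. A sketch (the bucket boundaries in ms are an arbitrary choice for illustration):

```python
# Upper bounds (ms) of each latency bucket; quantiles read from the
# histogram are rounded up to a bucket boundary, which is the only loss.
BUCKETS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, float("inf")]

def to_histogram(samples):
    counts = [0] * len(BUCKETS)
    for s in samples:
        for i, upper in enumerate(BUCKETS):
            if s <= upper:
                counts[i] += 1
                break
    return counts

def merge(h1, h2):
    # Merging is just elementwise addition -- something you can never
    # do with two precomputed p99s.
    return [a + b for a, b in zip(h1, h2)]

def percentile(hist, q):
    target = q * sum(hist)
    seen = 0
    for upper, count in zip(BUCKETS, hist):
        seen += count
        if seen >= target:
            return upper  # upper bound of the bucket holding the quantile
    return BUCKETS[-1]

# Two hosts summarize locally; the aggregator merges and re-reads quantiles:
host_a = to_histogram([3, 4, 4, 5, 120])
host_b = to_histogram([2, 3, 900])
print(percentile(merge(host_a, host_b), 0.99))  # → 1000
```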
oavdeev about 4 years ago
I once built an interactive calculator [1] for this exact problem; maybe someone else will find it useful too.

[1] https://observablehq.com/@oavdeev/parallel-task-latency-calculator
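The effect such a calculator explores reduces to one line of probability: if each of N parallel backends independently responds slowly with probability p, the frontend (which waits for all of them) is slow whenever at least one backend is.

```python
# P(at least one of n parallel calls is slow), assuming independence and a
# per-backend slow probability p.
def frontend_slow_probability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# With just 1% slow responses per backend:
for n in (1, 10, 100):
    print(n, round(frontend_slow_probability(0.01, n), 3))
# A 1-in-100 per-backend event hits roughly 1 in 10 requests at 10-way
# fan-out and about 2 in 3 requests at 100-way fan-out.
```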
PaulHoule about 4 years ago
"Tail Latency Matters More Than You Think" is more like it.

When you see a beach ball or other indicator of delay on your computer, you are quite likely experiencing tail latency.
timf about 4 years ago
This reminded me of the similar cascading effect you get with availability across an aggregate of many services, as discussed in "The Calculus of Service Availability": https://queue.acm.org/detail.cfm?id=3096459
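The availability analogue is the same multiplication: a service that serially depends on N components is at best as available as the product of their availabilities, so even four-nines dependencies erode quickly in aggregate.

```python
# Best-case availability of a service with n serial dependencies, each
# independently available with probability `per_component`.
def aggregate_availability(per_component: float, n: int) -> float:
    return per_component ** n

# 30 four-nines (99.99%) dependencies leave roughly 99.7% overall:
print(round(aggregate_availability(0.9999, 30), 4))  # → 0.997
```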
nostrebored about 4 years ago
One complaint here is that serial/parallel is not the way to think about most modern architectures. In modern architectures you are typically working with decoupled event buses, which invert the relationship with dependencies. In that case you become resilient to many of the negative impacts of tail latency, because you're inherently eventually consistent.