How percentile approximation works and why it's more useful than averages

687 points by od0 over 3 years ago

38 comments

bqe over 3 years ago

A while ago I wrote a Python library called LiveStats [1] that computed any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm I found in an old paper [2] called P^2, which uses a polynomial to find good approximations.

The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed-size sorted buffer, randomly evict items from it, and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.

[1]: https://github.com/cxxr/LiveStats

[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
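A minimal sketch of that "nudge the estimate a little for each new point" idea (not the full P^2 algorithm; the fixed `step` size is an assumed tuning constant):

```python
import random

def streaming_median_estimate(stream, step=0.1):
    """Estimate the running median of a stream without storing it.

    Nudges the estimate up or down by a small fixed step for each new
    value -- the simple version of the insight described above, not the
    full P^2 algorithm.
    """
    estimate = None
    for x in stream:
        if estimate is None:
            estimate = float(x)  # seed with the first observation
        elif x > estimate:
            estimate += step
        elif x < estimate:
            estimate -= step
    return estimate

if __name__ == "__main__":
    data = [random.gauss(100, 15) for _ in range(100_000)]
    print(streaming_median_estimate(data))  # close to 100
    print(sorted(data)[len(data) // 2])     # exact median, for comparison
```
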
abnry over 3 years ago

One way to think about why we tend to use averages instead of medians is that it is related to a really deep theorem in probability: the Central Limit Theorem.

But I think if we twist our heads we can see that, in a way, this is backwards. Mathematically, the mean is much easier to work with because it is linear and we can do algebra with it. That's how we got the Central Limit Theorem. Percentiles and the median, except for symmetric distributions, are not as easy to work with: they involve solving for the inverse of the cumulative distribution function.

But in many ways, the median and percentiles are a more relevant and intuitive number to think about. Especially in contexts where linearity is inappropriate!
rahimnathwani over 3 years ago

For some things, you can't even sensibly measure the mean. For example, if you're measuring the mean response time for a service, a single failure/timeout makes the mean response time infinite (because 100 years from now the response still hasn't been received).

"Why Averages Suck and Percentiles are Great": https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/
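A toy illustration of that failure mode, using made-up lognormal latencies with one request that never completes:

```python
import numpy as np

# 999 completed requests plus one that never finished (treated as infinite).
latencies = np.concatenate([np.random.lognormal(mean=-1.0, sigma=0.5, size=999),
                            [np.inf]])

print(np.mean(latencies))            # inf -- one timeout ruins the mean
print(np.percentile(latencies, 95))  # still a finite, meaningful number
```
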
hnuser123456 over 3 years ago

Gamers have an intuitive sense of this. Your average framerate can be arbitrarily high, but if you have a big stutter every second between the smooth moments, then a lower but more consistent framerate may be preferable. This is typically expressed as the 1% and 0.1% slowest frames, which at a relatively typical 100 fps represent the slowest frame every second and every 10 seconds.
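A rough sketch of how 1% / 0.1% lows could be computed from frame times (exact definitions vary between benchmarking tools; the synthetic frame data and stutter rate are made up):

```python
import numpy as np

def percent_lows(frame_times_ms, fraction):
    """Average framerate over the slowest `fraction` of frames."""
    times = np.sort(frame_times_ms)[::-1]   # slowest frames first
    n = max(1, int(len(times) * fraction))
    return 1000.0 / times[:n].mean()        # ms per frame -> fps

# Mostly-smooth 100 fps (10 ms frames) with an occasional 100 ms stutter.
frames = np.where(np.random.rand(10_000) < 0.01, 100.0, 10.0)
print(1000.0 / frames.mean())        # average fps still looks decent
print(percent_lows(frames, 0.01))    # 1% lows reveal the stutter
print(percent_lows(frames, 0.001))   # 0.1% lows
```
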
michaelmdresser over 3 years ago

Looking particularly at latency measurements, I found the "How NOT to Measure Latency" talk [1] very illuminating. It goes quite deep into discussing how percentiles can be used and abused for measurement.

[1]: https://www.infoq.com/presentations/latency-response-time/
madars over 3 years ago

Good opportunity to plug https://en.wikipedia.org/wiki/Anscombe%27s_quartet : if you don't know much about the underlying distribution, simple statistics don't describe it well.

From the Wikipedia description: Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it, and the effect of outliers and other influential observations on statistical properties.
tunesmith over 3 years ago

I just recently tried giving a presentation to my department (they're developers, I'm the architect) about this stuff and they all just sort of blinked at me. It brought in Little's Law and Kingman's Formula, in an attempt to underscore why we need to limit variation in the response times of our requests.

There are a bunch of queuing theory formulas that are really cool but don't exactly apply if your responses vary a lot, like the article describes. I think the usual assumption is that response time distributions are exponential, which I don't think is a good assumption (is an Erlang distribution an exponential distribution?). Nevertheless, hooking the equations up to some models is a good way to get directional intuition. I didn't realize how steep the performance drop-off is for server utilization until I started moving sliders around.

Our ops team doesn't really follow this either. We're not a huge department, though. Is this the kind of stuff that an SRE team usually pays attention to?
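For reference, Kingman's approximation for the mean wait in a G/G/1 queue makes that utilization cliff easy to see; a small sketch with illustrative numbers:

```python
def kingman_wait(utilization, mean_service_time, ca, cs):
    """Kingman's approximation for mean queueing delay in a G/G/1 queue.

    W ~= (rho / (1 - rho)) * ((ca^2 + cs^2) / 2) * tau
    where rho is utilization, tau the mean service time, and ca/cs the
    coefficients of variation of interarrival and service times.
    """
    rho = utilization
    return (rho / (1.0 - rho)) * ((ca**2 + cs**2) / 2.0) * mean_service_time

# Illustrative numbers only: 50 ms mean service time, moderately variable traffic.
for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(rho, round(kingman_wait(rho, 50.0, ca=1.0, cs=1.5), 1), "ms expected wait")
```

The expected wait roughly doubles between 90% and 95% utilization and explodes near 99%, which is the drop-off the sliders reveal.
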
bluesmoon over 3 years ago

This is exactly the algorithm we developed at LogNormal (now part of Akamai) 10 years ago for doing fast, low-memory percentiles on large datasets.

It's implemented in this Node library: https://github.com/bluesmoon/node-faststats

Side note: I wish everyone would stop using the term "average" to refer to the arithmetic mean. "Average" just means some statistic used to summarize a dataset. It could be the arithmetic mean, median, mode(s), geometric mean, harmonic mean, or any of a bunch of other statistics. We're stuck with AVG because that's the function used by early databases and Lotus 1-2-3.
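In that broader sense of "average", Python's statistics module already ships several of them; a quick comparison on a small, made-up skewed sample:

```python
import statistics

data = [1, 1, 2, 2, 2, 3, 4, 5, 8, 40]

print(statistics.mean(data))            # arithmetic mean, pulled up by the 40
print(statistics.median(data))          # median
print(statistics.mode(data))            # mode
print(statistics.geometric_mean(data))  # geometric mean (Python 3.8+)
print(statistics.harmonic_mean(data))   # harmonic mean
```
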
louisnow over 3 years ago

```To calculate the 10th percentile, let's say we have 10,000 values. We take all of the values, order them from largest to smallest, and identify the 1001st value (where 1000 or 10% of the values are below it), which will be our 10th percentile.```

Isn't this contradictory? If we order the values from largest to smallest and take the 1001st value, then 10% of the values are above/larger, not below/smaller. I believe it should say order from smallest to largest.
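A quick sanity check of the smallest-to-largest reading, using numpy's percentile as the reference (interpolation details aside):

```python
import numpy as np

values = np.random.normal(100, 15, size=10_000)

# Ordered smallest to largest, the value at index 1000 has ~10% of the data
# below it -- the 10th percentile (ignoring interpolation details).
by_hand = np.sort(values)[1000]
print(by_hand, np.percentile(values, 10))  # the two agree closely
```
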
tdiff over 3 years ago

One of the difficulties when dealing with percentiles is estimating the error of the estimated percentile values, which is not necessarily trivial compared to estimating the error of a mean approximation (e.g. see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6294150/).
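One common way to put an error bar on an estimated percentile is the bootstrap; a rough sketch, with illustrative defaults:

```python
import numpy as np

def bootstrap_percentile_ci(data, q, n_boot=2000, alpha=0.05, rng=None):
    """Rough bootstrap confidence interval for the q-th percentile.

    One common (not the only) approach; n_boot and alpha are
    illustrative defaults.
    """
    rng = rng or np.random.default_rng()
    estimates = [np.percentile(rng.choice(data, size=len(data), replace=True), q)
                 for _ in range(n_boot)]
    return np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])

sample = np.random.lognormal(mean=0.0, sigma=1.0, size=5_000)
print(np.percentile(sample, 99), bootstrap_percentile_ci(sample, 99))
```
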
satvikpendem over 3 years ago

I often see HN articles crop up soon after a related post, in this case this Ask HN poster [0] being driven crazy by people averaging percentiles and not seeing that it's a big deal. It's pretty funny to see such tuples of posts appearing.

[0]: https://news.ycombinator.com/item?id=28518795
ChuckMcM over 3 years ago

This is a really important topic if you're doing web services, especially if you're doing parallel processing services. When I was at Blekko, the "interesting" queries were the ones above the 95th percentile because they always indicated "something" that hadn't worked according to plan. Sometimes it was a disk going bad on one of the bucket servers, sometimes it was a network port dropping packets, and sometimes it was a corrupted index file. But it was always something that needed to be looked at and then (usually) fixed.

It also always separated the 'good' ad networks from the 'bad' ones, as the bad ones would take too long to respond.
aidenn0 over 3 years ago

Be careful translating percentiles of requests to percentiles of users; if less than 10% of your requests take over 1 second, but a typical user makes 10 requests, it's possible that the majority of your users are seeing a request take over 1 second.
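A back-of-the-envelope version of that point, assuming slow requests are independent and evenly spread across users:

```python
# If each request independently has a 10% chance of taking over a second,
# a user making 10 requests sees at least one slow request with probability:
p_slow_request = 0.10
requests_per_user = 10
p_user_sees_slow = 1 - (1 - p_slow_request) ** requests_per_user
print(p_user_sees_slow)  # ~0.65 -- roughly two thirds of users
```
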
axpy906 over 3 years ago

There's something called a five-number summary in statistics: the minimum, 25th percentile, median, 75th percentile, and maximum.

The bonus is that the 75th percentile minus the 25th gives you the interquartile range.

The mean is not a robust measure, and as such you need to look at the spread to truly understand your data.
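A quick way to compute that summary (and the IQR) with numpy, on made-up skewed data:

```python
import numpy as np

def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q0, q1, q2, q3, q4 = np.percentile(data, [0, 25, 50, 75, 100])
    return {"min": q0, "q1": q1, "median": q2, "q3": q3, "max": q4,
            "iqr": q3 - q1}

sample = np.random.lognormal(mean=0.0, sigma=0.75, size=10_000)  # skewed toy data
print(five_number_summary(sample))
print(sample.mean())  # for skewed data the mean sits well above the median
```
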
jonnydubowsky over 3 years ago

I really enjoyed this post! The author also wrote an interactive demonstration of the concepts (using Desmos). It's super helpful.

https://www.desmos.com/calculator/ty3jt8ftgs
Lightbody over 3 years ago

Whenever this topic comes up, I always encourage folks to watch this classic 15-minute talk from the 2011 Velocity conference by John Rauser:

https://www.youtube.com/watch?v=coNDCIMH8bk
varelaz over 3 years ago

Percentiles are certainly better than the average if you want to explore a distribution: there are several percentiles, compared to a single average. However, the median is harder to use if you want to do calculations based on the metric. For example, the distribution of the sample median can be a problem if you want to derive a confidence interval for it.
achenatx over 3 years ago

I've been trying to get the marketing team to always include a standard deviation with averages. An average alone is simply not useful; the standard deviation is a simple way to essentially include percentiles.

They regularly compare experiments to the mean but don't use a t-test to ensure the results are actually different from the mean.
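A minimal sketch of what that check could look like with scipy, on made-up experiment data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=100, scale=15, size=200)  # made-up baseline metric
variant = rng.normal(loc=103, scale=15, size=200)  # made-up experiment arm

print(control.mean(), control.std(ddof=1))  # report the spread, not just the mean
print(variant.mean(), variant.std(ddof=1))

# Welch's t-test; the p-value says how plausibly the difference is just noise.
t_stat, p_value = stats.ttest_ind(control, variant, equal_var=False)
print(t_stat, p_value)
```
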
fbinky over 3 years ago

Thanks for the refresher! I am using the timescaledb-toolkit with much success. LTTB, for example. Excellent.
motohagiography over 3 years ago

The recovering product manager in me sees 90th-percentile queries with outlier-high latency and starts asking, instead of how to reduce it, how we can spin it out into a dedicated premium query feature: if they're willing to wait, they're probably also willing to pay.

Highly recommend modelling your solution with queueing theory using this: https://queueing-tool.readthedocs.io/en/latest/

As an exercise in discovering the economics of how your platform works, even just thinking about it in these terms can save a great deal of time.
zeteo over 3 years ago

If you want to calculate exact percentiles there's a simple in-place algorithm that runs in expected O(n) time. You basically do quicksort but ignore the "wrong" partition in each recursive step. For instance, if your pivot lands at the 25th percentile and you're looking for the 10th percentile, you only recurse on the "left" partition at that point. It's pretty easy to implement. (And rather straightforward to change to a loop, if necessary.)
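A compact version of that selection idea (written out-of-place for clarity rather than in-place, as the comment describes):

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time.

    Partition around a random pivot and recurse only into the side that
    contains the k-th element.
    """
    pivot = random.choice(values)
    below = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    above = [v for v in values if v > pivot]
    if k < len(below):
        return quickselect(below, k)
    if k < len(below) + len(equal):
        return pivot
    return quickselect(above, k - len(below) - len(equal))

data = [random.random() for _ in range(100_001)]
print(quickselect(data, len(data) // 2))  # exact median
print(sorted(data)[len(data) // 2])       # check against the sort-based answer
```
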
solumos over 3 years ago

Looking at a histogram is probably the best thing you can do to understand how data is distributed.
123pie123 over 3 years ago

I had to explain the data usage of an interface that looked extremely busy from the standard graphs.

I did a percentile graph of the usage: the data was only typically using 5% of the maximum throughput. No one could really understand the graph, though.

So I did a zoomed-in version of the normal data usage graph and it looked like a blip lasting 1/20 of the time. Everyone got that: it was peaking every few seconds and then doing nothing for ages.
datavirtue over 3 years ago

Can someone boil this down to a sentence? I think I understand what they are saying. Having been faced with using these metrics, and having been tempted to use average response times, I recognized that the average is not a good baseline since it moves in relation to the outliers (which are usually the problem requests, and there can be many or few).

How do you calculate these percentiles and use them as a trigger for alerts?
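In one sentence: a percentile is the value below which a given fraction of observations fall, so p95/p99 describe the slow tail directly instead of letting outliers drag a single number around. A hypothetical alerting sketch (the threshold and synthetic data are made up; real setups would compute this over a sliding window of recent requests):

```python
import numpy as np

P99_THRESHOLD_MS = 500.0  # illustrative SLO-style threshold

def check_alert(recent_latencies_ms):
    """Compute p50/p95/p99 over recent latencies and flag a p99 breach."""
    p50, p95, p99 = np.percentile(recent_latencies_ms, [50, 95, 99])
    if p99 > P99_THRESHOLD_MS:
        return f"ALERT: p99={p99:.0f}ms > {P99_THRESHOLD_MS:.0f}ms (p50={p50:.0f}, p95={p95:.0f})"
    return "ok"

window = np.random.lognormal(mean=4.5, sigma=0.6, size=10_000)  # synthetic latencies in ms
print(check_alert(window))
```
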
lewispb over 3 years ago
Side note, but I love the animations, code snippet design and typography in this blog post.<p>Will think about how I can improve my own blog with these ideas.
ableal over 3 years ago

Just as a note, same topic: https://news.ycombinator.com/item?id=19552112

"Averages Can Be Misleading: Try a Percentile (2014) (elastic.co), 199 points by donbox on April 2, 2019 | 55 comments"
jpgvm over 3 years ago

If you are doing this sort of work I highly recommend the DataSketches library. It's used in Druid, which is a database specifically designed for these sorts of aggregations over ridiculously large datasets.
jaygreco over 3 years ago

FYI, typo:

> "In the below graph, half of the data is to the left (shaded in blue), and a half is to the right (shaded in purple), with the 50th percentile directly in the center."

But the half on the right is actually shaded yellow.
airstrike over 3 years ago

Sorry for the minor nitpick, but I find it a bit unusual (disappointing?) that there's an image of a candlestick chart at the very top, but the article only uses API response times as examples...
Bostonian over 3 years ago

If you think the data is normal-ish but want to account for skew and kurtosis, you can try fitting a distribution such as the skewed Student's t; there are R packages for this.
camel_gopher over 3 years ago
Toss the percentiles to the curb and just use a histogram.
k__ over 3 years ago

Shouldn't averages & variance be enough for a start?
pachico over 3 years ago

Surprisingly, many software engineers I know have never used percentiles and keep using the mean. True story.
ekianjo over 3 years ago

Is there anyone who still uses averages?
tiffanyh over 3 years ago
Statistics 101. Mean, median and mode.
deft over 3 years ago
Looks like timescale did a big marketing push this morning only for their whole service to go down minutes later. lol.
mherdeg over 3 years ago

I've skimmed some of the literature here when I've spent time trying to help people with their bucket boundaries for Prometheus-style instrumentation of things denominated in "seconds", such as processing time and freshness.

My use case is a little different from what's described here or in a lot of the literature. Some of the differences:

(1) You have to pre-decide on bucket values, often hardcoded or stored in code-like places, and realistically won't bother to update them often unless the data look unusably noisy.

(2) Your maximum number of buckets is pretty small -- like, no more than 10 or 15 histogram buckets probably. This is because my metrics are very high cardinality (my times get recorded alongside other dimensions that may have 5-100 distinct values, things like server instance number, method name, client name, or response status).

(3) I think I know what percentiles I care about -- I'm particularly interested in minimizing error for, say, p50, p95, p99, p999 values and don't care too much about others.

(4) I think I know what values I care about knowing precisely! Sometimes people call my metrics "SLIs" and sometimes they even set an "SLO" which says, say, I want no more than 0.1% of interactions to take more than 500ms. (Yes, those people say, we have accepted that this means that 0.1% of people may have an unbounded bad experience.) So, okay, fine, let's force a bucket boundary at 500ms and then we'll always be measuring that SLO with no error.

(5) I know that the test data I use as input don't always reflect how the system will behave over time. For example I might feed my bucket-designing algorithm yesterday's freshness data, and that might have been a day when our async data processing pipeline was never more than 10 minutes backlogged. But in fact in the real world every few months we get a >8 hour backlog, and it turns out we'd like to be able to accurately measure the p99 age of processed messages even if they are very old... So despite our very limited bucket budget we probably do want some buckets at 1, 2, 4, 8, 16 hours, even if at design time they seem useless.

I have always ended up hand-writing my own error approximation function which takes as input:

(1) sample data - a representative subset of the actual times observed in my system yesterday

(2) proposed buckets - a bundle of, say, 15 bucket boundaries

(3) percentiles I care about

and then returns as output info about how far off (percentage error) each estimated percentile is from the actual value for my sample data.

Last time I looked at this I tried using libraries that purport to compute very good bucket boundaries, but they give me, like, 1500 buckets with very nice tiny error and no clear way to make a real-world choice about collapsing this into a much smaller set of buckets with comparatively huge, but manageable, error.

I ended up just advising people to:

* set bucket boundaries at SLO boundaries, and be sure to update them when the SLO does

* actually look at your data and understand the data's shape

* minimize error for the data set you have now; logarithmic bucket sizes with extra buckets near the distribution's current median value seem to work well

* minimize worst-case error if the things you're measuring grow very small or very large and you care about being able to observe that (add extra buckets)
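A rough sketch of the kind of hand-written error-approximation helper described above (the within-bucket linear interpolation rule is an assumption, loosely modeled on how Prometheus-style histogram quantile estimation interpolates; bucket bounds and sample data are made up):

```python
import numpy as np

def bucket_percentile_errors(samples, bucket_bounds, percentiles):
    """Percentage error a histogram with `bucket_bounds` introduces for
    each percentile of interest, relative to the exact sample percentile.

    Estimates each percentile by finding the bucket it falls into and
    interpolating linearly within that bucket (clamping above the last bound).
    """
    samples = np.sort(np.asarray(samples, dtype=float))
    bounds = np.asarray(sorted(bucket_bounds), dtype=float)
    counts = np.searchsorted(samples, bounds, side="right")  # cumulative counts per bound
    results = {}
    for p in percentiles:
        true_val = np.percentile(samples, p)
        target = len(samples) * p / 100.0
        i = int(np.searchsorted(counts, target))          # first bucket covering the target rank
        lo = bounds[i - 1] if i > 0 else 0.0
        hi = bounds[min(i, len(bounds) - 1)]
        prev = counts[i - 1] if i > 0 else 0
        frac = (target - prev) / max(counts[min(i, len(bounds) - 1)] - prev, 1)
        est = lo + frac * (hi - lo)
        results[p] = 100.0 * (est - true_val) / true_val   # % error of the bucketed estimate
    return results

latencies = np.random.lognormal(mean=-1.6, sigma=0.8, size=50_000)  # synthetic seconds
buckets = [0.05, 0.1, 0.25, 0.5, 1, 2, 4, 8]                        # illustrative bounds
print(bucket_percentile_errors(latencies, buckets, [50, 95, 99, 99.9]))
```
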
ruchin_k over 3 years ago

I spent several years in venture capital investing, and averages were always misleading. As Nassim Taleb says, "Never cross a river that is on average 4 feet deep."