Quantile Digest (q-digest) or something similar is what I believe is desired here.

From what I understand, it's a fixed-size data structure that represents quantiles as a tree of histogram bands, pruning the nodes whose densities differ least from their parents' to trade error for size. Q-digests also have the property that you can merge them together and re-compress, to turn per-second data into per-minute data, or to shrink accurate (large) archival digests into smaller ones and, say, support stable streaming of a varying number of metrics over a link of varying bandwidth by sacrificing quality.

They're pretty simple because they're designed for sensor networks, but I think you could design similar structures with a dynamic instead of fixed value range, and a variable size (prune nodes based on an error threshold instead of, or in addition to, a desired size). A rough sketch of the basic structure is below.

If anyone knows of a time-series system using something like this, I'd love to learn about it.
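To make that concrete, here's a rough Python sketch of the classic q-digest (values restricted to a fixed power-of-two range, as in the sensor-network setting; the single-pass compression and all names are my own simplification, not a reference implementation):

    from collections import defaultdict

    class QDigest:
        """Rough q-digest sketch over integers in [0, universe), universe a
        power of two. Nodes live in an implicit binary tree: root is 1,
        children of i are 2i and 2i+1, leaves are ids [universe, 2*universe)."""

        def __init__(self, universe, k):
            self.universe = universe        # size of the value range
            self.k = k                      # compression factor (~k nodes survive)
            self.n = 0                      # total observations seen
            self.counts = defaultdict(int)  # node id -> count

        def add(self, value, count=1):
            self.counts[self.universe + value] += count
            self.n += count

        def compress(self):
            """Prune: merge families whose combined count is small into the parent."""
            threshold = self.n // self.k
            for node in sorted(self.counts, reverse=True):  # deepest levels first
                if node <= 1 or node not in self.counts:
                    continue
                parent, sibling = node // 2, node ^ 1
                family = (self.counts.get(node, 0) + self.counts.get(sibling, 0)
                          + self.counts.get(parent, 0))
                if family <= threshold:
                    self.counts[parent] = family
                    self.counts.pop(node, None)
                    self.counts.pop(sibling, None)
            # (a full implementation repeats until no merge fires)

        def merge(self, other):
            """Merging two digests is node-wise addition plus a re-compress --
            this is what lets you roll per-second data up into per-minute data."""
            for node, c in other.counts.items():
                self.counts[node] += c
            self.n += other.n
            self.compress()

        def quantile(self, q):
            """Walk nodes ordered by the upper end of their value range,
            accumulating counts until we pass q*n."""
            def hi(node):
                while node < self.universe:  # descend to the rightmost leaf
                    node = 2 * node + 1
                return node - self.universe
            acc = 0
            for node in sorted(self.counts, key=hi):
                acc += self.counts[node]
                if acc >= q * self.n:
                    return hi(node)
            return self.universe - 1

You'd call compress() every so often rather than per insert, and quantile(0.99) then reads an approximate p99 off the surviving nodes; merge() is the roll-up operation I mentioned.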
You know what I've always found really useful? Entire distributions. Forget means, medians, and percentiles: give me the complete distribution (along with the sample size) so I can understand all of the nuances of the data.

(Better yet, just give me the raw data so I can analyze it myself. I find it hard to blindly trust someone else's conclusions, considering all of the p-hacking going on nowadays.)
I've recently written a page about "averaging" percentiles correctly by approximating the combined histogram of two distributions. It's demonstrated in a live plot with a logarithmic time axis here:

http://www.siegfried-kettlitz.de/blog/posts/2015/11/28/linlog_plot_quantiles/

If you have questions or comments, feel free to reply or email.
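The general idea, as a minimal sketch (the shared bucket edges and the two input distributions are made up, and this is the textbook bucket-merge rather than anything specific from the page):

    import numpy as np

    rng = np.random.default_rng(0)

    # two per-minute latency histograms sharing the same bucket edges (made-up data)
    edges = np.logspace(0, 4, 41)                 # 40 log-spaced buckets, 1 ms .. 10 s
    h1, _ = np.histogram(rng.lognormal(3.0, 0.8, 5_000), bins=edges)
    h2, _ = np.histogram(rng.lognormal(3.5, 1.0, 2_000), bins=edges)

    # histograms combine exactly: sum the bucket counts, then read off the quantile
    combined = h1 + h2
    cdf = np.cumsum(combined) / combined.sum()
    p99 = edges[1:][np.searchsorted(cdf, 0.99)]   # upper edge of the bucket holding the 99th pct
    print(p99)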
Great article; I ran into this exact problem at work today. You can also hit a very similar problem if the time-series aggregation system you're using does any pre-aggregation before you calculate the percentile. For example, if you sample your servers every 90 seconds, then any latency number each server reports is likely already averaged over the requests it received during that period, so your 99th-percentile number is really the latency of the 99th-percentile server, not the 99th-percentile request. Using latency buckets solves this problem as well, as in the sketch below.
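For illustration, a toy simulation of the two paths (the server count, request counts, and lognormal latencies are all invented):

    import numpy as np

    rng = np.random.default_rng(1)

    # hypothetical setup: 50 servers, each seeing ~2000 requests per 90 s window
    per_server = [rng.lognormal(3.0, 1.0, size=2_000) for _ in range(50)]
    all_requests = np.concatenate(per_server)

    # pre-aggregated path: each server reports only its mean latency
    server_means = np.array([s.mean() for s in per_server])
    p99_of_servers = np.percentile(server_means, 99)     # p99 *server*, not p99 request

    # bucketed path: each server reports counts per latency bucket; counts sum exactly
    edges = np.logspace(0, 4, 81)                        # 80 log-spaced buckets, 1 ms .. 10 s
    counts = sum(np.histogram(s, bins=edges)[0] for s in per_server)
    cdf = np.cumsum(counts) / counts.sum()
    p99_bucketed = edges[1:][np.searchsorted(cdf, 0.99)]

    print(p99_of_servers, np.percentile(all_requests, 99), p99_bucketed)

On runs like this the per-server figure badly understates the request-level tail, while the bucketed estimate lands within one bucket width of it.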
The article is useful because it outlines many different ways to monitor the performance of a system, many of which are better than just looking at the mean and the P99. However, the main thesis that "an average of a percentile is meaningless" is just plain wrong. If the distribution is fixed, then averaging different P99 measurements will give you the best possible estimate of the P99 of the population (as opposed to your sample). If the distribution is moving (because you're making performance improvements, or your user base is growing), then a moving average of a percentile will move with it.
Gave the post an upvote because it is interesting from a theoretical perspective, but I have a hard time imagining a real-life scenario where averaging a 99th percentile will lead you to the wrong conclusion.

Perhaps I'm wrong, but whenever I'm looking at the tail of a distribution, it's usually just to understand the order of magnitude, not to reach a precise number.
The problem with percentiles is that you are discarding data points: measuring the 98th percentile means ignoring the top 2% of the data.

The trouble is that the top 2% you're discarding might correspond to your top 2% of customers, and you're literally throwing their data away by using percentiles. Not good.

My recommendation is to pick two aggregation types: maybe percentile and maximum, maximum and mean, or percentile and mean. You can't really go wrong with that approach (see the sketch below).
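Something as simple as this per-window aggregate covers both the bulk and the extremes (the function and field names are just placeholders):

    import numpy as np

    def window_stats(latencies):
        """Aggregate one window with two complementary views: a percentile
        for the bulk, plus the maximum so the extreme tail is never discarded."""
        a = np.asarray(latencies)
        return {"n": a.size, "mean": a.mean(),
                "p98": np.percentile(a, 98), "max": a.max()}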
Does anyone know the math behind exactly how wrong the averaged percentiles are? My dim understanding of stats makes me think the central limit theorem is at play here; the averaged p99 values will tend towards a normal distribution, which is obviously wrong. Would love to be schooled on it.
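Not the math, but for intuition here's a simulation sketch one could run (the lognormal workload and the equal window sizes are pure assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # assumed workload: one minute of lognormal request latencies, in 60 equal windows
    latencies = rng.lognormal(3.0, 1.0, 60_000)
    windows = latencies.reshape(60, 1_000)

    pooled = np.percentile(latencies, 99)                 # p99 over the raw data
    averaged = np.percentile(windows, 99, axis=1).mean()  # mean of per-window p99s

    print(f"pooled p99: {pooled:.1f}   averaged p99: {averaged:.1f}")

With equal, fairly large windows the two land close together; the gap grows as windows shrink, and if windows carry different request counts, a plain mean also weights them incorrectly.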
I think the lesson is: don't just blindly calculate some numbers/metrics. Have a look at your data (visually!) and see if your choices make sense, for instance whether the 99th / 95th / 90th percentile is the right one to use. Something as quick as the sketch below goes a long way.
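For example, a quick look could be as simple as (made-up data standing in for real measurements):

    import numpy as np
    import matplotlib.pyplot as plt

    latencies = np.random.lognormal(3.0, 1.0, 10_000)  # stand-in for real measurements

    plt.hist(latencies, bins=np.logspace(0, 4, 80))
    for q in (90, 95, 99):
        plt.axvline(np.percentile(latencies, q), linestyle="--", label=f"p{q}")
    plt.xscale("log")
    plt.xlabel("latency (ms)")
    plt.legend()
    plt.show()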
A day late, but I wanted to add: at New Relic (where I work) we ended up just deciding to store all the data points. We literally store every transaction and page view for our customers and then go back and read every data point when they are queried.