We lived in a rural area when I was a kid. My dad told me once that his buddy had to measure the ptarmigan[1] population in the mountains each year as part of his job.<p>He did this by hiking a fixed route and, at fixed intervals, scaring the birds so they would fly up and could be counted.<p>The total count was submitted to some office, which used it to estimate the population.<p>One year he had to travel abroad when the counting had to be done, so he recruited a friend and explained in detail how to do it.<p>However, when the day of the counting arrived, his friend forgot, and it was a huge hassle anyway, so he just submitted a number he figured was about right, and that was that.<p>Then one day the following year, the local newspaper ran a front-page headline stating "record increase in ptarmigan population".<p>The reason it was big news was that the population estimate was used to set the hunting quotas, something his friend had not considered...<p>[1]: <a href="https://en.wikipedia.org/wiki/Rock_ptarmigan" rel="nofollow">https://en.wikipedia.org/wiki/Rock_ptarmigan</a>
Hello! o/<p>I’m the author of this post. Happy to answer any questions, and I’d love to get feedback.<p>The code for all of my posts can be found at <a href="https://github.com/samwho/visualisations">https://github.com/samwho/visualisations</a> and is MIT licensed, so you’re welcome to use it :)
This is a really nicely written and illustrated post.<p>An advanced extension to this is that there are algorithms that calculate the number of records to skip, rather than doing a trial per record. This post has a good write-up of them: <a href="https://richardstartin.github.io/posts/reservoir-sampling" rel="nofollow">https://richardstartin.github.io/posts/reservoir-sampling</a>
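For anyone curious, here is a minimal Python sketch of that skip-based idea (roughly Li's "Algorithm L"; the function and variable names are my own, not from the linked write-up):<p><pre><code>import math
import random

def reservoir_sample_skipping(stream, k):
    """Reservoir sample of size k that jumps ahead a computed number of
    records instead of rolling a die for every record (Algorithm L)."""
    it = iter(stream)
    reservoir = []
    for item in it:
        reservoir.append(item)
        if len(reservoir) == k:
            break
    if len(reservoir) < k:
        return reservoir  # the stream had fewer than k items

    w = math.exp(math.log(random.random()) / k)
    while True:
        # Number of records to skip before the next replacement.
        skip = math.floor(math.log(random.random()) / math.log(1 - w))
        try:
            for _ in range(skip):
                next(it)          # skipped records are never inspected
            item = next(it)
        except StopIteration:
            return reservoir
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)
</code></pre>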
Great article and explanation.<p>On a practical level though, this would be the last thing I would use for log collection. I understand that when there is a spike, something has to be dropped. What should this something be?<p>I don't see the point of being "fair" about what is dropped.<p>I would use fairness as a last resort, after trying other things:<p>Drop lower priority logs: If your log messages have levels (debug, info, warning, error), prioritize higher-severity events, discarding the verbose/debug ones first.<p>Contextual grouping: Treat a sequence of logs as parts of an activity. For a successful activity, maybe record only the start and end events (or key state changes) and leave out repetitive in-between logs.<p>Aggregation and summarization: Instead of storing every log line during a spike, aggregate similar or redundant messages into a summarized entry. This not only reduces volume but also highlights trends.
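To make the first of those strategies concrete, here is a rough sketch of severity-based shedding; the level names and keep rates are made up for illustration, not taken from the article:<p><pre><code>import random

# Illustrative per-level keep probabilities, applied only when the
# pipeline is over capacity; the levels and rates are invented.
KEEP_PROBABILITY = {
    "error":   1.0,   # never drop
    "warning": 1.0,
    "info":    0.25,
    "debug":   0.05,  # shed the verbose stuff first
}

def should_keep(record):
    """Keep a log record with a probability based on its severity."""
    return random.random() < KEEP_PROBABILITY.get(record["level"], 1.0)
</code></pre>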
This is a great post, very approachable, with excellent visualizations.<p>We use a variation of this sort of thing at $WORK to solve a related problem, where you want to estimate some percentile from a running stream, with the constraints that the percentile you want to choose changes from time to time but is generally static for a trillion or more iterations (and the underlying data is quasi-stationary). If you back the process by a splay tree, you can get amortized O(1) percentile estimates (higher error bars for a given RAM consumption than a number of other techniques, but very fast).<p>You can also play with the replacement probability to, e.g., have a "data half-life" (denominated either in time or discrete counts) and bias the estimate toward recent events, which is more suitable for some problems.
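If it helps anyone, here is a toy sketch of the "half-life in discrete counts" idea, under my own assumption about how the replacement probability gets chosen (likely not exactly the scheme the parent is describing):<p><pre><code>import random

def recency_biased_sample(stream, k, half_life):
    """Fixed-size sample biased toward recent items: each arrival
    overwrites a uniformly random slot with a fixed probability, chosen
    so a sampled item has a 50% chance of surviving `half_life` more
    arrivals. Assumes half_life is much larger than k."""
    p_replace = min(1.0, k * (1 - 0.5 ** (1.0 / half_life)))
    reservoir = []
    for item in stream:
        if len(reservoir) < k:
            reservoir.append(item)
        elif random.random() < p_replace:
            reservoir[random.randrange(k)] = item
    return reservoir
</code></pre>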
Great article and nice explanation. I believe this is “Algorithm R” from this paper by Vitter, who was probably the first to describe it: <a href="https://www.cs.umd.edu/~samir/498/vitter.pdf" rel="nofollow">https://www.cs.umd.edu/~samir/498/vitter.pdf</a>
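For reference, Algorithm R itself fits in a few lines; this is just a sketch, with 1-based item indices as in the paper:<p><pre><code>import random

def algorithm_r(stream, k):
    """Keep the first k items; thereafter item i (1-indexed) replaces a
    uniformly random reservoir slot with probability k / i."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = random.randrange(i)  # uniform over 0..i-1
            if j < k:                # happens with probability k / i
                reservoir[j] = item
    return reservoir
</code></pre>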
From a data science perspective, the volume of the data also encodes really valuable information, so it’s good to also log the number of data points each kept record represents. For example, if the sampling rate comes out to be 10%, have a field that encodes 10. This way you can rebuild and estimate most statistics like count, sum, average, etc.
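Concretely (the field names here are hypothetical), if each kept record carries the inverse of its sampling rate as a weight, you can estimate the originals back:<p><pre><code># Each kept record stands in for `weight` original events (weight = 1 / sample_rate).
sampled = [
    {"latency_ms": 120, "weight": 10},
    {"latency_ms": 95,  "weight": 10},
    {"latency_ms": 310, "weight": 10},
]

est_count = sum(r["weight"] for r in sampled)                    # ~30 events
est_sum   = sum(r["latency_ms"] * r["weight"] for r in sampled)  # total latency
est_mean  = est_sum / est_count                                  # weighted average
</code></pre>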
Very well put together. If you are curious about the weighted version, I tried to explain it some here: <a href="https://gregable.com/2007/10/reservoir-sampling.html" rel="nofollow">https://gregable.com/2007/10/reservoir-sampling.html</a><p>There's also a distributed version, easy with a map-reduce.<p>Or the very simple algorithm: generate a random number paired with each item in the stream and keep the top N ordered by that random number.
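That simple version is only a few lines with a min-heap. A sketch (I added an index purely as a tie-breaker so items never get compared directly):<p><pre><code>import heapq
import random

def sample_by_random_tags(stream, n):
    """Tag every item with a uniform random number and keep the n items
    with the largest tags; the survivors are a uniform sample of size n."""
    heap = []  # min-heap of (tag, tiebreaker, item)
    for i, item in enumerate(stream):
        tag = random.random()
        if len(heap) < n:
            heapq.heappush(heap, (tag, i, item))
        elif tag > heap[0][0]:
            heapq.heapreplace(heap, (tag, i, item))
    return [item for _, _, item in heap]
</code></pre><p>This is also roughly why the map-reduce version is easy: each worker keeps its local top N by tag, and the merge step keeps the global top N.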
This reminds me that I need to spend more time thinking about the method the Allies used to estimate German tank production from serial numbers. The people in the field estimated about 5x as many tanks as were actually produced, but the serial-number trick was over 90% accurate.
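If memory serves, the standard frequentist estimator is short enough to jot down (assuming serials run 1..N and are sampled without replacement; treat this as a sketch):<p><pre><code>def estimate_total(serials):
    """German tank problem: estimate N from serial numbers assumed to be
    drawn without replacement from 1..N."""
    m = max(serials)   # largest serial observed
    k = len(serials)   # number of observations
    return m + m / k - 1
</code></pre>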
This is a great post that also illustrates the tradeoffs inherent in telemetry collection (traces, logs, metrics) for analysis. It's a capital-H Hard space to operate in that a lot of developers either don't know about, or take for granted.
Used in Dropwizard Metrics, among other places. <a href="https://metrics.dropwizard.io/4.2.0/" rel="nofollow">https://metrics.dropwizard.io/4.2.0/</a>
I remember this turning up in a Google interview back in the day. The interviewer was really expecting me not to know the algorithm and to flounder about trying to solve the problem from first principles. It was fun to just shortcut things by knowing the answer that time.
I discovered this in one of those coding quizzes they give you to get a job. I was reviewing questions and one of them was this exact thing. I had no idea how to do it until I read the answer, and then it was obvious.