We lived in a rural area when I was a kid. My dad told me once that his buddy had to measure the ptarmigan[1] population in the mountains each year as part of his job.

He did this by hiking a fixed route and, at fixed intervals, scaring the birds into flight so he could count them.

The total count was submitted to some office, which used it to estimate the population.

One year he had to travel abroad when the counting had to be done, so he recruited a friend and explained in detail how to do it.

However, when the day of the counting arrived, his friend forgot. It was a huge hassle anyway, so he just submitted a number he figured was about right, and that was that.

Then one day the following year, the local newspaper ran a front-page headline: "record increase in ptarmigan population".

The reason it was big news was that the population estimate was used to set the hunting quotas, something his friend had not considered...

[1]: https://en.wikipedia.org/wiki/Rock_ptarmigan
Hello! o/

I’m the author of this post. Happy to answer any questions, and I’d love to get feedback.

The code for all of my posts can be found at https://github.com/samwho/visualisations and is MIT licensed, so you’re welcome to use it :)
This is a really nicely written and illustrated post.

An advanced extension to this is that there are algorithms which calculate the number of records to skip rather than doing a trial per record. This page has a good write-up of them: https://richardstartin.github.io/posts/reservoir-sampling
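For concreteness, here is a minimal sketch of that skip-based approach (roughly "Algorithm L"; the function name and structure are mine, not taken from the linked write-up):

```python
import math
import random

def reservoir_sample_with_skips(stream, k):
    """Reservoir sampling that computes how many records to skip
    between replacements, instead of drawing a random number per record."""
    reservoir = []
    it = iter(stream)

    # Fill the reservoir with the first k items.
    for item in it:
        reservoir.append(item)
        if len(reservoir) == k:
            break
    if len(reservoir) < k:
        return reservoir  # stream was shorter than k

    w = math.exp(math.log(random.random()) / k)
    while True:
        # How many records to skip before the next replacement.
        skip = math.floor(math.log(random.random()) / math.log(1 - w))
        try:
            for _ in range(skip):
                next(it)      # skipped records need no random draws at all
            item = next(it)   # this record enters the reservoir
        except StopIteration:
            break
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)
    return reservoir
```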
Great article and explanation.

On a practical level though, this would be the last thing I would use for log collection. I understand that when there is a spike, something has to be dropped. What should this something be?

I don't see the point of being "fair" about what is dropped.

I would use fairness as a last resort, after trying other things:

Drop lower-priority logs: If your log messages have levels (debug, info, warning, error), prioritize higher-severity events, discarding the verbose/debug ones first.

Contextual grouping: Treat a sequence of logs as parts of an activity. For a successful activity, maybe record only the start and end events (or key state changes) and leave out repetitive in-between logs.

Aggregation and summarization: Instead of storing every log line during a spike, aggregate similar or redundant messages into a summarized entry. This not only reduces volume but also highlights trends.
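As a rough illustration of the first idea only (the levels and keep-probabilities below are made up, not from the article), a severity-aware drop policy can be as simple as:

```python
import random

# Illustrative numbers only; in practice these would be tuned to the
# spike you're trying to survive, or adjusted dynamically.
KEEP_PROBABILITY = {
    "error": 1.0,    # never drop errors
    "warning": 1.0,
    "info": 0.25,
    "debug": 0.01,   # verbose logs go first
}

def should_keep(record):
    p = KEEP_PROBABILITY.get(record["level"], 1.0)
    return random.random() < p
```

Fairness (reservoir sampling) can then be applied within each level rather than across the whole stream.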
Great article and nice explanation. I believe this describes “Algorithm R” in this paper from Vitter, who was probably the first to describe it: https://www.cs.umd.edu/~samir/498/vitter.pdf
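For reference, Algorithm R is short enough to sketch in a few lines (variable names are mine):

```python
import random

def algorithm_r(stream, k):
    """One-pass uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1), replacing a random slot.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```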
I discovered this in one of those coding quizzes they give you to get a job. I was reviewing questions and one of them was this exact thing. I had no idea how to do it until I read the answer, and then it was obvious.
Very well put together. If you are curious about the weighted version, I tried to explain it some here: https://gregable.com/2007/10/reservoir-sampling.html

There's also a distributed version, which is easy with a map reduce.

Or the very simple algorithm: generate a random number paired with each item in the stream and keep the top N items ordered by that random number.
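A quick sketch of that last approach (names are mine; the weighted key is the standard u ** (1/w) trick, not necessarily the exact formulation in the linked post):

```python
import heapq
import random

def sample_by_random_keys(stream, n, weight=None):
    """Keep the n items with the largest random keys.

    With weight=None this is a plain uniform reservoir sample; with a
    weight function, key = u ** (1 / w) gives a weighted sample."""
    heap = []  # min-heap of (key, index, item); smallest key is evicted first
    for i, item in enumerate(stream):
        u = random.random()
        key = u if weight is None else u ** (1.0 / weight(item))
        if len(heap) < n:
            heapq.heappush(heap, (key, i, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, item))
    return [item for _, _, item in heap]
```

This also distributes naturally: each partition keeps its own top N by key, and the merge step keeps the top N of the union.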
This reminds me that I need to spend more time thinking about the method the Allies used to estimate German tank production from serial numbers. The people in the field estimated about 5x as many tanks as were actually produced, but the serial-number trick was over 90% accurate.
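For anyone curious, the usual frequentist estimator for that problem is surprisingly simple (a sketch, assuming serial numbers are sequential starting at 1 and the captured sample is uniform):

```python
def estimate_total(serials):
    """German tank problem: estimate the total number produced from a
    sample of observed serial numbers.

    N_hat = m + m/k - 1, where m is the largest serial observed and
    k is the sample size."""
    k = len(serials)
    m = max(serials)
    return m + m / k - 1

# e.g. observing serials [19, 40, 42, 60] gives 60 + 60/4 - 1 = 74
```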
This is a great post that also illustrates the tradeoffs inherent in telemetry collection (traces, logs, metrics) for analysis. It's a capital-H Hard space to operate in that a lot of developers either don't know about, or take for granted.
I remember this turning up in a Google interview back in the day. The interviewer was really expecting me not to know the algorithm and to flounder about trying to solve the problem from first principles. It was fun to just shortcut things by knowing the answer that time.