TechEcho
A tech news platform built with Next.js, providing global tech news and discussions.

Reservoir Sampling

252 points by chrisdemarco, about 7 hours ago

16 comments

magicalhippo, about 6 hours ago
We lived in a rural area when I was a kid. My dad told me once that his buddy had to measure the ptarmigan [1] population in the mountains each year as part of his job.

He did this by hiking a fixed route and, at fixed intervals, scaring the birds so they would fly up and could be counted.

The total count was submitted to some office, which used it to estimate the population.

One year he had to travel abroad when the counting had to be done, so he recruited a friend and explained in detail how to do it.

However, when the day of the counting arrived, his friend forgot. It was a huge hassle anyway, so he just submitted a number he figured was about right, and that was that.

Then one day the following year, the local newspaper ran a front-page headline stating "record increase in ptarmigan population".

The reason it was big news was that the population estimate was used to set the hunting quotas, something his friend had not considered...

[1]: https://en.wikipedia.org/wiki/Rock_ptarmigan
hansvm, 8 minutes ago
This is a great post, very approachable, with excellent visualizations.

We use a variation of this sort of thing at $WORK to solve a related problem, where you want to estimate some percentile from a running stream, with the constraints that the percentile you want changes from time to time but is generally static for a trillion or more iterations (and the underlying data is quasi-stationary). If you back the process by a splay tree, you can get amortized O(1) percentile estimates (higher error bars for a given RAM consumption than a number of other techniques, but very fast).

You can also play with the replacement probability to, e.g., have a "data half-life" (denominated either in time or discrete counts) and bias the estimate toward recent events, which is more suitable for some problems.
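The replacement-probability idea in that last paragraph can be sketched as follows. This is a toy illustration, not the commenter's production system: instead of Algorithm R's shrinking probability k/n, a fixed per-item replacement probability p makes old items decay out geometrically, and the half-life formula in the comment is derived from the per-slot survival probability (1 - p/k) per arriving item.

```python
import random

def biased_reservoir(stream, k, p):
    """Reservoir of size k with a fixed replacement probability p per
    arriving item. Each slot survives an arrival with probability
    1 - p/k, so the sample has a "data half-life" of roughly
    k * ln(2) / p items and is biased toward recent events."""
    reservoir = []
    for item in stream:
        if len(reservoir) < k:
            reservoir.append(item)
        elif random.random() < p:
            # Overwrite a uniformly chosen slot with the new item
            reservoir[random.randrange(k)] = item
    return reservoir
```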
samwho, about 7 hours ago
Hello! o/

I'm the author of this post. Happy to answer any questions, and I'd love to get feedback.

The code for all of my posts can be found at https://github.com/samwho/visualisations and is MIT licensed, so you're welcome to use it :)
sadiq, about 6 hours ago
This is a really nicely written and illustrated post.

An advanced extension to this is that there are algorithms which calculate the number of records to skip rather than doing a trial per record. This has a good write-up of them: https://richardstartin.github.io/posts/reservoir-sampling
justanotheratom, about 4 hours ago
Great article and explanation.

On a practical level, though, this would be the last thing I would use for log collection. I understand that when there is a spike, something has to be dropped. What should this something be?

I don't see the point of being "fair" about what is dropped.

I would use fairness as a last resort, after trying other things:

Drop lower-priority logs: if your log messages have levels (debug, info, warning, error), prioritize higher-severity events, discarding the verbose/debug ones first.

Contextual grouping: treat a sequence of logs as parts of an activity. For a successful activity, maybe record only the start and end events (or key state changes) and leave out repetitive in-between logs.

Aggregation and summarization: instead of storing every log line during a spike, aggregate similar or redundant messages into a summarized entry. This not only reduces volume but also highlights trends.
stygiansonic, about 6 hours ago
Great article and a nice explanation. I believe this describes "Algorithm R" from this paper by Vitter, who was probably the first to describe it: https://www.cs.umd.edu/~samir/498/vitter.pdf
tanvach, about 2 hours ago
From a data science perspective, the volume of the data also encodes really valuable information, so it's good to also log the number of data points each sampled record represents. For example, if the sampling rate comes out to 10%, have a field that encodes 10. This way you can rebuild and estimate most statistics like count, sum, average, etc.
gregable, about 3 hours ago
Very well put together. If you are curious about the weighted version, I tried to explain it here: https://gregable.com/2007/10/reservoir-sampling.html

There's also a distributed version, which is easy with a map-reduce.

Or the very simple algorithm: generate a random value paired with each item in the stream and keep the top N ordered by that random value.
jbellis, about 1 hour ago
Used in Dropwizard Metrics, among other places. https://metrics.dropwizard.io/4.2.0/
hinkley, about 5 hours ago
This reminds me that I need to spend more time thinking about the method the Allies used to estimate German tank production from serial numbers. The people in the field estimated about 5x as many tanks as were actually produced, but the serial-number trick was over 90% accurate.
phillipcarter, about 6 hours ago
This is a great post that also illustrates the tradeoffs inherent in telemetry collection (traces, logs, metrics) for analysis. It's a capital-H Hard space to operate in that a lot of developers either don't know about, or take for granted.
lordnacho, about 3 hours ago
I discovered this in one of those coding quizzes they give you to get a job. I was reviewing questions and one of them was this exact thing. I had no idea how to do it until I read the answer, and then it was obvious.
wood_spirit, about 7 hours ago
I remember this turning up in a Google interview back in the day. The interviewer was really expecting me not to know the algorithm and to flounder about trying to solve the problem from first principles. It was fun to just shortcut things by knowing the answer that time.
pixelbeat, about 5 hours ago
FWIW, GNU coreutils' shuf uses reservoir sampling for larger inputs to give bounded-memory operation.
foxbee, about 5 hours ago
Wonderful illustrations and writing. A really interesting read.
t55, about 6 hours ago
Great article!