We lived in a rural area when I was a kid. My dad told me once that his buddy had to measure the ptarmigan[1] population in the mountains each year as part of his job.

He did this by hiking a fixed route and, at fixed intervals, scaring the birds into flight so he could count them.

The total count was submitted to some office, which used it to estimate the population.

One year he had to travel abroad when the counting had to be done, so he recruited a friend and explained in detail how to do it.

However, when the day of the counting arrived, his friend forgot. It was a huge hassle anyway, so he just submitted a number he figured was about right, and that was that.

Then one day the following year, the local newspaper ran a front-page headline: "record increase in ptarmigan population".

The reason it was big news was that the population estimate was used to set the hunting quotas, something his friend had not considered...

[1]: https://en.wikipedia.org/wiki/Rock_ptarmigan
Hello! o/

I’m the author of this post. Happy to answer any questions, and I’d love to get feedback.

The code for all of my posts can be found at https://github.com/samwho/visualisations and is MIT licensed, so you’re welcome to use it :)
This is a really nicely written and illustrated post.

An advanced extension to this is that there are algorithms which calculate the number of records to skip rather than doing a trial per record. This page has a good write-up of them: https://richardstartin.github.io/posts/reservoir-sampling
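For concreteness, here is a minimal sketch of that skip-based approach (roughly "Algorithm L"; the function name and structure are mine, not taken from the linked write-up):

```python
import math
import random

def reservoir_sample_with_skips(stream, k):
    """Reservoir sampling that computes how many records to skip
    between replacements, instead of drawing a random number per record."""
    reservoir = []
    it = iter(stream)

    # Fill the reservoir with the first k items.
    for item in it:
        reservoir.append(item)
        if len(reservoir) == k:
            break
    if len(reservoir) < k:
        return reservoir  # stream was shorter than k

    w = math.exp(math.log(random.random()) / k)
    while True:
        # How many records to skip before the next replacement.
        skip = math.floor(math.log(random.random()) / math.log(1 - w))
        try:
            for _ in range(skip):
                next(it)      # skipped records need no random draws at all
            item = next(it)   # this record enters the reservoir
        except StopIteration:
            break
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)
    return reservoir
```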
Great article and explanation.

On a practical level though, this would be the last thing I would use for log collection. I understand that when there is a spike, something has to be dropped. What should this something be?

I don't see the point of being "fair" about what is dropped.

I would use fairness as a last resort, after trying other things:

Drop lower-priority logs: If your log messages have levels (debug, info, warning, error), prioritize higher-severity events, discarding the verbose/debug ones first.

Contextual grouping: Treat a sequence of logs as parts of an activity. For a successful activity, maybe record only the start and end events (or key state changes) and leave out repetitive in-between logs.

Aggregation and summarization: Instead of storing every log line during a spike, aggregate similar or redundant messages into a summarized entry. This not only reduces volume but also highlights trends.
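As a rough illustration of the first idea only (the levels and keep-probabilities below are made up, not from the article), a severity-aware drop policy can be as simple as:

```python
import random

# Illustrative numbers only; in practice these would be tuned to the
# spike you're trying to survive, or adjusted dynamically.
KEEP_PROBABILITY = {
    "error": 1.0,    # never drop errors
    "warning": 1.0,
    "info": 0.25,
    "debug": 0.01,   # verbose logs go first
}

def should_keep(record):
    p = KEEP_PROBABILITY.get(record["level"], 1.0)
    return random.random() < p
```

Fairness (reservoir sampling) can then be applied within each level rather than across the whole stream.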
Great article and nice explanation. I believe this describes “Algorithm R” in this paper from Vitter, who was probably the first to describe it: https://www.cs.umd.edu/~samir/498/vitter.pdf
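For reference, Algorithm R is short enough to sketch in a few lines (variable names are mine):

```python
import random

def algorithm_r(stream, k):
    """One-pass uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1), replacing a random slot.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```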
I discovered this in one of those coding quizzes they give you to get a job. I was reviewing questions and one of them was this exact thing. I had no idea how to do it until I read the answer, and then it was obvious.
Very well put together. If you are curious about the weighted version, I tried to explain it some here: https://gregable.com/2007/10/reservoir-sampling.html

There's also a distributed version, which is easy with a map reduce.

Or the very simple algorithm: generate a random number paired with each item in the stream and keep the top N items ordered by that random number.
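A quick sketch of that last approach (names are mine; the weighted key is the standard u ** (1/w) trick, not necessarily the exact formulation in the linked post):

```python
import heapq
import random

def sample_by_random_keys(stream, n, weight=None):
    """Keep the n items with the largest random keys.

    With weight=None this is a plain uniform reservoir sample; with a
    weight function, key = u ** (1 / w) gives a weighted sample."""
    heap = []  # min-heap of (key, index, item); smallest key is evicted first
    for i, item in enumerate(stream):
        u = random.random()
        key = u if weight is None else u ** (1.0 / weight(item))
        if len(heap) < n:
            heapq.heappush(heap, (key, i, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, item))
    return [item for _, _, item in heap]
```

This also distributes naturally: each partition keeps its own top N by key, and the merge step keeps the top N of the union.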
This reminds me that I need to spend more time thinking about the method the Allies used to estimate German tank production from serial numbers. The people in the field estimated about 5x as many tanks as were actually produced, but the serial-number trick was over 90% accurate.
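For anyone curious, the usual frequentist estimator for that problem is surprisingly simple (a sketch, assuming serial numbers are sequential starting at 1 and the captured sample is uniform):

```python
def estimate_total(serials):
    """German tank problem: estimate the total number produced from a
    sample of observed serial numbers.

    N_hat = m + m/k - 1, where m is the largest serial observed and
    k is the sample size."""
    k = len(serials)
    m = max(serials)
    return m + m / k - 1

# e.g. observing serials [19, 40, 42, 60] gives 60 + 60/4 - 1 = 74
```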
This is a great post that also illustrates the tradeoffs inherent in telemetry collection (traces, logs, metrics) for analysis. It's a capital-H Hard space to operate in that a lot of developers either don't know about, or take for granted.
I remember this turning up in a Google interview back in the day. The interviewer was really expecting me not to know the algorithm and to flounder about trying to solve the problem from first principles. It was fun to just shortcut things by knowing the answer that time.