科技回声

9 条评论

tadkar超过 4 年前

Related to this, one of my favourite articles [1] suggests that it’s sufficient to only use one or two pieces of memory to get good estimates. Here’s a pretty amazing result from the paper. To estimate the median (loosely) on a stream, first set the median estimate m to 0 prior to seeing any data. Then as you observe the stream, increase the median estimate m by 1 if the current element s is bigger than m. Do nothing if the current element is the same as m. Decrease the estimate by 1 if the element is less than m. Then (given the right conditions) this process converges to the median. You can even extend this to quantiles by flipping a biased coin and then updating based on the result of the flip as well as the element comparison.The unfortunately now deceased neustar research blog had an interactive widget as well an outstanding write up[2][1]<a href="https://arxiv.org/abs/1407.1121" rel="nofollow">https://arxiv.org/abs/1407.1121</a>[2] <a href="http://content.research.neustar.biz/blog/frugal.html" rel="nofollow">http://content.research.neustar.biz/blog/frugal.html</a>

评论 #25202548 未加载

评论 #25202692 未加载

评论 #25203172 未加载

评论 #25203404 未加载

评论 #25203491 未加载

superyesh超过 4 年前

Great article! One of my favorite data-structures lately is the <a href="https://github.com/tdunning/t-digest" rel="nofollow">https://github.com/tdunning/t-digest</a>. Would be great to see how it compares and what additional accuracy we get but storing some of the data.

codezero超过 4 年前

One of my favorite problems is solving median based on a stream of data that you can replay, but only using, something like 3 variables of the same type of the data (which I think needs to be integer) from the book Numerical Recipes in FORTRAN.It’s to optimize finding the median from a tape drive in the 70s, but the technique is pretty cute.You make a guess at the median, then basically sum how many are over, and under that value, then guess again until you have an even split.It’s also a problem I like to use to nerd snipe people, because why not.Check my work, because it’s been a long time since I worked with FORTRAN and I could have just remembered incorrectly :)<a href="https://github.com/jonlighthall/nrf77/blob/master/mdian2.f" rel="nofollow">https://github.com/jonlighthall/nrf77/blob/master/mdian2.f</a>

_delirium超过 4 年前

I don't know much about quantile estimation, so possibly dumb question: why, in the experiments plotted here, does every estimator on every tested distribution underestimate the true median? I expected some errors, but not all systematically in the same direction.

mike_steph超过 4 年前

Hi all, in case interesting/useful: some research I've been involved in takes a different approach to online quantile estimation (our work is based on Hermite series estimators) and compares favorably to online quantile estimation approaches based on stochastic approximation for example. We've recently released the code:<a href="https://github.com/MikeJaredS/hermiter" rel="nofollow">https://github.com/MikeJaredS/hermiter</a>I've done some comparisons versus the P^2 algorithm (using the OnlineStats implementation in Julia) and the Hermite series based algorithm appears to have comparable accuracy in the tests conducted. The Hermite based approach has the advantage that it estimates the full quantile function though, so arbitrary quantiles can be obtained at any point in time.

convolvatron超过 4 年前

<a href="http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf" rel="nofollow">http://infolab.stanford.edu/~datar/courses/cs361a/papers/qua...</a>I've found Greenwald-Khanna to be a lot more accurate, and also quite suitable for turning into a visual PDF representation

评论 #25206571 未加载

yandie超过 4 年前

Have you done a comparison with <a href="https://datasketches.apache.org/'s" rel="nofollow">https://datasketches.apache.org/'s</a> KLL algorithm?

评论 #25207476 未加载

CamperBob2超过 4 年前

Interesting problem. Access to the last n values isn't usually a problem in my own statistical applications, since they are fully RAM-resident to begin with, but it's annoying to have to sort N recent values to find the median in an N-wide window. Anyone aware of good shortcuts for that?

zjs超过 4 年前

I always appreciate alternatives to using an arithmetic mean.

9 条评论

tadkar超过 4 年前

评论 #25202548 未加载

评论 #25202692 未加载

评论 #25203172 未加载

评论 #25203404 未加载

评论 #25203491 未加载

superyesh超过 4 年前

codezero超过 4 年前

_delirium超过 4 年前

mike_steph超过 4 年前

convolvatron超过 4 年前

评论 #25206571 未加载

yandie超过 4 年前

Have you done a comparison with <a href="https://datasketches.apache.org/'s" rel="nofollow">https://datasketches.apache.org/'s</a> KLL algorithm?

评论 #25207476 未加载

CamperBob2超过 4 年前

zjs超过 4 年前

I always appreciate alternatives to using an arithmetic mean.

P² quantile estimator – estimating the median without storing values

9 条评论

P² quantile estimator – estimating the median without storing values

9 条评论