My Favorite Algorithm: Linear Time Median Finding (2018)

371 points by skanderbm · 10 months ago

33 comments

danlark · 10 months ago

Around 4 years ago I compared lots of different median algorithms and the article turned out to be much longer than I anticipated :)

https://danlark.org/2020/11/11/miniselect-practical-and-generic-selection-algorithms/
rented_mule · 10 months ago

10-15 years ago, I found myself needing to regularly find the median of many billions of values, each parsed out of a multi-kilobyte log entry. MapReduce was what we were using for processing large amounts of data at the time. With MapReduce over that much data, you don't just want linear time, but ideally single pass, distributed across machines. Subsequent passes over much smaller amounts of data are fine.

It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.

As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.

The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.
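A minimal single-machine sketch of the capped histogram described above (the 999 ms cap matches the comment; the function name is illustrative):

    from collections import Counter

    def histogram_median(timings_ms, cap=999):
        # One pass: count each integer timing, clamping outliers into the
        # top bucket (safe because the median is known to sit far below it).
        counts = Counter(min(t, cap) for t in timings_ms)
        total = sum(counts.values())

        # Walk the buckets in order until we cross the midpoint of the data.
        seen = 0
        for value in range(cap + 1):
            seen += counts.get(value, 0)
            if seen > total // 2:
                return value  # upper median when the count is even

In the distributed version, each mapper would emit its own bounded histogram and the reducer would sum them before the final walk.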
xinok · 10 months ago

> P.S: In 2017 a new paper came out that actually makes the median-of-medians approach competitive with other selection algorithms. Thanks to the paper's author, Andrei Alexandrescu for bringing it to my attention!

He also gave a talk about his algorithm in 2016. He's an entertaining presenter, I highly recommend it!

There's Treasure Everywhere - Andrei Alexandrescu

https://www.youtube.com/watch?v=fd1_Miy1Clg
mabbo · 10 months ago

I learned about the median-of-medians quickselect algorithm when I was an undergrad and was really impressed by it. I implemented it, and it was terribly slow. Its runtime grew linearly, but that only really mattered if you had at least a few billion items in your list.

I was chatting about this with a grad student friend who casually said something like "Sure, it's slow, but what really matters is that it proves that it's *possible* to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible. Now that we do, we know there might be an even faster linear algorithm." Really got into the philosophy of what Computer Science is about in the first place.

The lesson was so simple yet so profound that I nearly applied to grad school because of it. I have no idea if they even recall the conversation, but it was a pivotal moment of my education.
kwantam · 10 months ago

One of the fun things about the median-of-medians algorithm is its completely star-studded author list.

Manuel Blum - Turing award winner in 1995

Robert Floyd - Turing award winner in 1978

Ron Rivest - Turing award winner in 2002

Bob Tarjan - Turing award winner in 1986 (oh and also the inaugural Nevanlinna prizewinner in 1982)

Vaughan Pratt - oh no, the only non-Turing award winner in the list. Oh right but he's emeritus faculty at Stanford, directed the SUN project before it became Sun Microsystems, was instrumental in Sun's early days (director of research and designer of the Sun logo!), and is responsible for all kinds of other awesome stuff (near and dear to me: Pratt certificates of primality).

Four independent Turing awards! SPARCstations! This paper has it all.
someplaceguy · 10 months ago

    return l[len(l) / 2]

I'm not a Python expert, but doesn't the `/` operator return a float in Python? Why would you use a float as an array index instead of doing integer division (with `//`)?

I know this probably won't matter until you have extremely large arrays, but this is still quite a code smell.

Perhaps this could be forgiven if you're a Python novice and hadn't realized that the two different operators exist, but this is not the case here, as the article contains this even more baffling code which uses integer division in one branch but float division in the other:

    def quickselect_median(l, pivot_fn=random.choice):
        if len(l) % 2 == 1:
            return quickselect(l, len(l) // 2, pivot_fn)
        else:
            return 0.5 * (quickselect(l, len(l) / 2 - 1, pivot_fn) +
                          quickselect(l, len(l) / 2, pivot_fn))

That we're 50 comments in and nobody seems to have noticed this only serves to reinforce my existing prejudice against the average Python code quality.
TacticalCoder · 10 months ago

I really enjoyed TFA but this:

> Technically, you could get extremely unlucky: at each step, you could pick the largest element as your pivot. Each step would only remove one element from the list and you'd actually have O(n²) performance instead of O(n)

If adversarial input is a concern, doing a O(n) shuffle of the data first guarantees this cannot happen. If the data is really too big to shuffle, then only shuffle once a bucket is small enough to be shuffled.

If you do shuffle, probabilities are here to guarantee that the worst case cannot happen. If anyone says that "technically" it can happen, I'll answer that then "technically" an attacker could also guess correctly every bit of your 256-bit private key.

Our world is built on probabilities: all our private keys are protected by the mathematical improbability that someone shall guess them correctly.

From what I read, a shuffle followed by quickselect is O(n) for all practical purposes.
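A sketch of that wrapper, assuming the quickselect and pivot_fn definitions from the article:

    import random

    def shuffled_median(l):
        # A Fisher-Yates shuffle is O(n); afterwards the pivot sequence is
        # driven by our own randomness, not by whoever ordered the input.
        random.shuffle(l)
        return quickselect(l, len(l) // 2, pivot_fn=random.choice)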
furstenheim · 10 months ago

Floyd-Rivest also does the job. A bit more efficient IIRC.

However I never managed to understand how it works.

https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorithm
throwaway294531 · 10 months ago

If you're selecting the n:th element, where n is very small (or large), using median-of-medians may not be the best choice.

Instead, you can use a biased pivot as in [1] or something I call "j:th of k:th". Floyd-Rivest can also speed things up. I have a hobby project that gets 1.2-2.0x throughput when compared to a well implemented quickselect, see: https://github.com/koskinev/turboselect

If anyone has pointers to fast generic & in-place selection algorithms, I'm interested.

[1] https://doi.org/10.4230/LIPIcs.SEA.2017.24
mgaunard · 10 months ago
You could also use one of the streaming algorithms which allow you to compute approximations for arbitrary quantiles without ever needing to store the whole data in memory.
anonymoushn · 10 months ago

One commonly sees the implication that radix sort cannot be used for data types other than integers, or for composite data types, or for large data types. For example, TFA says you could use radix sort if your input is 32-bit integers. But you can use it on anything. You can use radix sort to sort strings in O(n) time.
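For instance, a least-significant-digit radix sort over fixed-width byte strings is linear in the total number of bytes; a hedged sketch (the function name and the equal-width assumption are illustrative):

    def radix_sort_strings(strings, width):
        # LSD radix sort for byte strings of equal length `width`.
        # Each pass is a counting sort on one byte position, so the whole
        # run costs O(width * (n + 256)) -- linear in the total input size.
        for pos in reversed(range(width)):
            buckets = [[] for _ in range(256)]
            for s in strings:
                buckets[s[pos]].append(s)
            strings = [s for bucket in buckets for s in bucket]
        return strings

    radix_sort_strings([b"bca", b"abc", b"cab"], 3)  # [b'abc', b'bca', b'cab']

Variable-length strings need padding or an MSD variant, but the idea is the same.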
ncruces · 10 months ago

An implementation in Go, that's (hopefully) simple enough to be understandable, yet minimally practical:

https://github.com/ncruces/sort/blob/main/quick/quick.go
Xcelerate · 10 months ago

I received a variant of this problem as an interview question a few months ago. Except the linear time approach would not have worked here, since the list contains trillions of numbers, you only have sequential read access, and the list cannot be loaded into memory. 30 minutes, go.

First I asked if anything could be assumed about the statistics on the distribution of the numbers. Nope, could be anything, except the numbers are 32-bit ints. After fiddling around for a bit I finally decided on a scheme that creates a bounding interval for the unknown median value (one variable contains the upper bound and one contains the lower bound based on 2^32 possible values) and then adjusts this interval on each successive pass through the data. The last step is to average the upper and lower bound in case there are an odd number of integers. Worst case, this approach requires O(log n) passes through the data, so even for trillions of numbers it's fairly quick.

I wrapped up the solution right at the time limit, and my code ran fine on the test cases. Was decently proud of myself for getting a solution in the allotted time.

Well, the interview feedback arrived, and it turns out my solution was rejected for being suboptimal. Apparently there is a more efficient approach that utilizes priority heaps. After looking up and reading about the priority heap approach, all I can say is that I didn't realize the interview task was to re-implement someone's PhD thesis in 30 minutes...

I had never used leetcode before because I never had difficulty with prior coding interviews (my last job search was many years before the 2022 layoffs), but after this interview, I immediately signed up for a subscription. And of course the "median file integer" question I received is one of the most asked questions on the list of "hard" problems.
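A hedged reconstruction of the bracketing scheme described above (the helper name and the re-readable stream are illustrative). It binary-searches the 32-bit value range rather than the data, so the bracket halves on every sequential pass, at most 32 of them:

    def streamed_median(read_stream, total_count):
        # read_stream() yields the numbers again on each call (sequential
        # read access only); total_count is the known number of values.
        lo, hi = -2**31, 2**31 - 1
        while lo < hi:
            mid = (lo + hi) // 2
            at_most_mid = sum(1 for x in read_stream() if x <= mid)
            if at_most_mid > total_count // 2:
                hi = mid       # enough mass at or below mid: median <= mid
            else:
                lo = mid + 1
        return lo  # upper median; for an even count, also locate the lower one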
jagged-chisel · 10 months ago

It's quicksort with a modification to select the median during the process. I feel like this is a good way to approach lots of "find $THING in list" questions.
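For reference, the partition-and-recurse core looks like this (a sketch along the lines of the article's Python version):

    import random

    def quickselect(l, k, pivot_fn=random.choice):
        # Return the k-th smallest element (0-indexed) of l.
        if len(l) == 1:
            return l[0]
        pivot = pivot_fn(l)
        lows = [x for x in l if x < pivot]
        highs = [x for x in l if x > pivot]
        pivots = [x for x in l if x == pivot]
        if k < len(lows):
            return quickselect(lows, k, pivot_fn)
        elif k < len(lows) + len(pivots):
            return pivot  # the k-th element equals the pivot
        else:
            return quickselect(highs, k - len(lows) - len(pivots), pivot_fn)

Unlike quicksort, only one side is ever recursed into, which is what brings the expected cost down from O(n log n) to O(n).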
someplaceguy · 10 months ago

I found this part of the code quite funny:

    # If there are < 5 items, just return the median
    if len(l) < 5:
        # In this case, we fall back on the first median function we wrote.
        # Since we only run this on a list of 5 or fewer items, it doesn't
        # depend on the length of the input and can be considered constant
        # time.
        return nlogn_median(l)

Hell, why not just use 2^140 instead of 5 as the cut-off point, then? This way you'd have *constant time* median finding for all arrays that can be represented in any real-world computer! :) [1]

[1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/
ignoramous · 10 months ago

If an approximation is enough, the P² quantile estimator (O(1) memory) is pretty neat: https://news.ycombinator.com/item?id=25201093
saagarjha · 10 months ago

This is hinted at in the post but if you're using C++ you will typically have access to quickselect via std::nth_element. I've replaced many a sort with that in code review :) (Well, not many. But at least a handful.)
chpatrick · 10 months ago

Another nice one is O(1) weighted sampling (after O(n) preprocessing).

https://en.wikipedia.org/wiki/Alias_method
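A compact sketch of the alias method (Vose's variant; the function names are illustrative):

    import random

    def build_alias_table(weights):
        # O(n) preprocessing: every slot i gets a threshold prob[i] and a
        # fallback alias[i], so each draw costs one index and one coin flip.
        n = len(weights)
        total = sum(weights)
        scaled = [w * n / total for w in weights]
        small = [i for i, s in enumerate(scaled) if s < 1.0]
        large = [i for i, s in enumerate(scaled) if s >= 1.0]
        prob, alias = [1.0] * n, list(range(n))
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l
            scaled[l] -= 1.0 - scaled[s]  # donate mass to fill slot s
            (small if scaled[l] < 1.0 else large).append(l)
        return prob, alias

    def alias_sample(prob, alias):
        # O(1) per sample: pick a slot, then keep it or take its alias.
        i = random.randrange(len(prob))
        return i if random.random() < prob[i] else alias[i]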
melonmouse · 10 months ago

The linked proof that median of medians is O(n) feels counterintuitive to me. Here's a (simpler?) alternative.

    T(0) = 0
    T(1) = 1
    T(n) = n + T(n/5) + T(7/10*n)

We want to prove that:

    T(n) ≤ C*n

It is intuitive that T(a+b) ≥ T(a) + T(b), or in other words, T is superadditive. That can be shown by induction:

Induction base: it holds for all a+b < 1, the only case being a=0, b=0:

    T(0+0) = 0 + T(0) + T(0) ≥ T(0) + T(0)

Induction step: suppose it holds for all a+b < k. Let a+b = k.

    T(a+b) = T(k)
           = k + T(k/5) + T(7/10*k)
           ≥ k + T(a/5) + T(b/5) + T(7/10*a) + T(7/10*b)
           = [a + T(a/5) + T(7/10*a)] + [b + T(b/5) + T(7/10*b)]
           = T(a) + T(b)

Because T is superadditive:

    T(n) = n + T(n/5) + T(7/10*n) ≤ n + T(n/5 + 7/10*n) = n + T(9/10*n)

Now we can apply the master theorem. Or to write out the proof (using a geometric series):

    T(n) ≤ n + T(9/10*n)
         ≤ n * Σ_{i=0..∞} (9/10)^i
         = n * 1/(1 - 9/10)
         = 10*n

So, we have shown the algorithm is O(n) with C=10 (or less).
hammeiam · 10 months ago

The "Split the array into subarrays of length 5, now sorting all of the arrays is O(n) instead of O(n log n)" feels like cheating to me
Someone · 10 months ago

FTA:

> "Proof of Average O(n)
>
> On average, the pivot will split the list into 2 approximately equal-sized pieces. Therefore, each subsequent recursion operates on 1/2 the data of the previous step."

That "therefore" doesn't follow, so this is more an intuition than a proof. The problem with it is that the median is more likely to end up in the larger of the two pieces, so you more likely have to recurse on the larger part than on the smaller part.

What saves you is that O(n) doesn't say anything about constants.

Also, I would think you can improve things a bit for real world data by, on subsequent iterations, using the average of the set as pivot (You can compute that for both pieces on the fly while doing the splitting. The average may not be in the set of items, but that doesn't matter for this algorithm). Is that true?
sfpotter · 10 months ago

A nice way to approximate the median: https://www.stat.berkeley.edu/~ryantibs/papers/median.pdf
RcouF1uZ4gsC · 10 months ago

> The C++ standard library uses an algorithm called introselect which utilizes a combination of heapselect and quickselect and has an O(n log n) bound.

Introselect is a combination of Quickselect and Median of Medians and is O(n) worst case.
Tarean · 10 months ago

Love this algorithm. It feels like magic, and then it feels obvious and basically like binary search.

Similar to the algorithm to parallelize the merge step of merge sort. Split the two sorted sequences into four sequences so that `merge(left[0:leftSplit], right[0:rightSplit]) + merge(left[leftSplit:], right[rightSplit:])` is sorted. leftSplit+rightSplit should be half the total length, and the elements in the left partition must be <= the elements in the right partition.

Seems impossible, and then you think about it and it's just binary search.
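A sketch of that split search (names are illustrative; both inputs must already be sorted):

    def merge_split(left, right):
        # Find (i, j) with i + j == half so that every element of left[:i]
        # and right[:j] is <= every element of left[i:] and right[j:].
        # Binary search over i; j is then determined.
        half = (len(left) + len(right)) // 2
        lo, hi = max(0, half - len(right)), min(half, len(left))
        while lo < hi:
            i = (lo + hi) // 2
            j = half - i
            if left[i] < right[j - 1]:  # left[:i] ends too low; move split right
                lo = i + 1
            else:
                hi = i
        return lo, half - lo

    left, right = [1, 3, 5, 7], [2, 4, 6, 8]
    i, j = merge_split(left, right)  # (2, 2): the two halves can now be merged in parallel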
teo_zero · 10 months ago

> On average, the pivot will split the list into 2 approximately equal-sized pieces.

Where does this come from?

Even assuming a perfect random function, this would be true only for distributions that show some symmetry. But if the input is all 10s and one 5, each step will generate quite different-sized pieces!
runiq · 10 months ago

Why is it okay to drop not-full chunks? The article doesn't explain that and I'm stupid.

Edit: I just realized that the function where non-full chunks are dropped is just the one for finding the pivot, not the one for finding the median. I understand now.
ValleZ · 10 months ago
I was asked to invent this algorithm on a whiteboard in 30 minutes. Loved it.
beyondCritics · 10 months ago

> It's not straightforward to prove why this is O(n).

Replace T(n/5) with T(floor(n/5)) and T(7n/10) with T(floor(7n/10)) and show by induction that T(n) <= 10n for all n.
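Written out, the induction step is one line (a sketch; base cases for small n are checked directly):

    \begin{aligned}
    T(n) &= n + T(\lfloor n/5 \rfloor) + T(\lfloor 7n/10 \rfloor) \\
         &\le n + 10 \cdot \tfrac{n}{5} + 10 \cdot \tfrac{7n}{10}
            && \text{(induction hypothesis, } \lfloor x \rfloor \le x \text{)} \\
         &= n + 2n + 7n = 10n.
    \end{aligned}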
kccqzy · 10 months ago

> Quickselect gets us linear performance, but only in the average case. What if we aren't happy to be average, but instead want to guarantee that our algorithm is linear time, no matter what?

I don't agree with the need for this guarantee. Note that the article already says the selection of the pivot is by random. You can simply choose a very good random function to avoid an attacker crafting an input that needs quadratic time. You'll never be unlucky enough for this to be a problem. This is basically the same kind of mindset that leads people into thinking: what if I use SHA256 to hash these two different strings and get the same hash?
zelphirkalt · 10 months ago

You can simply pass once over the data, and while you do that, count occurrences of the elements, memorizing the last maximum. Whenever an element is counted, you check if that count is now higher than the previous maximum. If it is, you memorize the element and its count as the maximum, of course. Very simple approach and linear in time, with minimal bookkeeping on the way (only the current element and its count (previous max)).

I don't find it surprising or special at all that finding the median works in linear time, since even this ad-hoc way is in linear time.

EDIT: Ah right, I mixed up mode and median. My bad.
vismit2000 · 10 months ago

This is covered in section 9.3 in the CLRS book - Medians and Order Statistics
SkiFire13 · 10 months ago

I wonder what's the reason for picking groups of 5 elements instead of 2 or 8.
nilslindemann · 10 months ago

"ns" instead of "l" and "n" instead of "el" would have been my choice (seen in Haskell code).