Beating TimSort at Merging

282 pointsby andrewstetsenkoalmost 4 years ago

15 comments

tim-petersalmost 4 years ago

To get more gonzo, in CPython's list.sort(), the C code that actually merges two lists (which happen, in context, to be two contiguous array slices) is in listobject.c's `merge_at(()` function. That brings in the world of "galloping" (exponential search) optimizations, much faster than one-pair-at-a-time compares in many real-world cases.So that's a whole lot of additional complication, but it's the heart of what _really_ makes timsort shine in its "jaw dropping" cases.Tim (of "timsort" - which was an inside joke name at the start, because I never expected anything outside of CPython would use it ;-) ).

评论 #27826413 未加载

评论 #27826277 未加载

评论 #27827213 未加载

评论 #27827392 未加载

jhokansonalmost 4 years ago

As someone that rarely works with Python lists, as opposed to numpy arrays, I was pleasantly surprised to see numpy does what I would expect in providing a mergesort option. I'm surprised Python doesn't, other than via heapq and only implemented in Python (according to my reading of the post and a very quick Google search).Oops, just for fun the numpy documentation currently states: "The datatype determines which of ‘mergesort’ or ‘timsort’ is actually used, even if ‘mergesort’ is specified. User selection at a finer scale is not currently available." Awesome ...Also, apparently mergesort may also be done using radix sort for integers "‘mergesort’ and ‘stable’ are mapped to radix sort for integer data types."

评论 #27824199 未加载

评论 #27824018 未加载

评论 #27824385 未加载

CGamesPlayalmost 4 years ago

I definitely used this to my advantage in ACM ICPC contests. Most of the students and professors wrote the problems and benchmarked against Java. However, on several problems, it boiled down to "write this specific algorithm for this class of problem, or use C." I once netted my team an extra problem after we timed out in Java and I just rewrote the teammate's solution in C++.

评论 #27831502 未加载

评论 #27826025 未加载

dangalmost 4 years ago

Past TimSort threads, for anyone interested:Timsort, the Python sorting algorithm - <a href="https://news.ycombinator.com/item?id=21196555" rel="nofollow">https://news.ycombinator.com/item?id=21196555</a> - Oct 2019 (131 comments)On the Worst-Case Complexity of TimSort - <a href="https://news.ycombinator.com/item?id=17883461" rel="nofollow">https://news.ycombinator.com/item?id=17883461</a> - Aug 2018 (74 comments)Timsort is a sorting algorithm that is efficient for real-world data - <a href="https://news.ycombinator.com/item?id=17436591" rel="nofollow">https://news.ycombinator.com/item?id=17436591</a> - July 2018 (77 comments)Functional verification with mechanical proofs of TimSort [pdf] - <a href="https://news.ycombinator.com/item?id=9778243" rel="nofollow">https://news.ycombinator.com/item?id=9778243</a> - June 2015 (1 comment)Timsort - <a href="https://news.ycombinator.com/item?id=3214527" rel="nofollow">https://news.ycombinator.com/item?id=3214527</a> - Nov 2011 (27 comments)Java has switched from Mergesort to TimSort - <a href="https://news.ycombinator.com/item?id=752677" rel="nofollow">https://news.ycombinator.com/item?id=752677</a> - Aug 2009 (13 comments)

评论 #27830701 未加载

评论 #27827798 未加载

CogitoCogitoalmost 4 years ago

It seems like the built in heapq.merge() should be as fast as the implementation here. I mean if it's really just merging sorted lists, why wouldn't it be as fast as this implementation?edit: Okay one reason is that heapq.merge is written in python:<a href="https://github.com/python/cpython/blob/6252670732c68420c2a8b3f0259559e45eed7610/Lib/heapq.py#L314-L392" rel="nofollow">https://github.com/python/cpython/blob/6252670732c68420c2a8b...</a>It might also be since the python version has more options (like arbitrarily many iterables), but I presume it's the fact that the extension is written in C.

评论 #27823985 未加载

评论 #27823991 未加载

评论 #27823895 未加载

agbellalmost 4 years ago

TimSort is cool. In it's worst case it looks no better than merge sort, but give it some partially order data, or small amounts of data, where it uses insertion sort, and it really shines.

lyxellalmost 4 years ago

Interesting read. This post seems to be written by Adam Gordon Bell who is also the host of the excellect CoRecursive podcast. Recently featured at HN in The Untold Story of SQLite [0].[0]: <a href="https://news.ycombinator.com/item?id=27718701" rel="nofollow">https://news.ycombinator.com/item?id=27718701</a>

评论 #27824150 未加载

danbrucalmost 4 years ago

The first Python implementation is bad - removing the first element in each iteration is O(n), the C implementation gets this right by maintaining one index into each list instead of modifying the lists.

评论 #27825025 未加载

评论 #27825115 未加载

thomas536almost 4 years ago

Are there similar relatively simple methods or simd tricks for multi-way merges?

评论 #27823986 未加载

pg_botalmost 4 years ago

I was thinking about how to make sorting faster the other day and was toying around with the idea of a permutation sort. If you think about a list of items generally (at least) one of the permutations of the list is guaranteed to be the correct sort order. Theoretically you should be able to do a binary search to find a correct permutation since there should always be a comparison you can do to reduce the search space by half. You should be able to go faster if there is duplicated data since there are multiple permutations of the list that would work.

评论 #27830707 未加载

评论 #27828157 未加载

kzrdudealmost 4 years ago

Could Cython be used for writing the fast module here instead?

rkallaalmost 4 years ago

Excellent read.

sam1ralmost 4 years ago

This is awesome. Thanks!

benatkinalmost 4 years ago

Doesn't explain why Earthly.dev is under the BSL.

评论 #27826507 未加载

forrestthewoodsalmost 4 years ago

Great post. I almost did a spit take when I read:> Timsort overcomes this disadvantage by being written in C rather than Python.Yeah, Python is SLOW. Thankfully the author dug into C. That was nice to see.Edit: lol I have no idea why this is being downvoted. Y’all weird.

评论 #27825096 未加载