I'm the author of this paper. I didn't realize this paper had caused such an intense discussion here until I heard from my friends. I really appreciate your advice, and I would like to address some questions about the paper.

First, this method is not comparison-based; I am well aware that the lower bound for comparison sorting is NlogN. The idea of this algorithm is that when we sort big data, the data may have statistical properties precisely because it is so large, and if we take a sample of it, the sub-dataset should hold the same properties. This is the basic assumption. I know this assumption is strong, but it is reasonable for big data. As described in the paper, we transform the sorting problem into a mapping problem: we learn the cumulative distribution and predict the position of each element in the future sorted sequence (a rough sketch of this idea in code is at the end of this comment).

Second, I completely agree that this algorithm depends heavily on the quality of the distribution-function fit. That is why I chose the General Vector Machine (arXiv:1704.06885), a 3-layer neural network trained without backpropagation. It is based on a Monte Carlo method, and you can read that paper if you are interested. GVM is well suited to function fitting, as shown in Figs. 2, 5, and 6 of arXiv:1704.06885, and since it is only a 3-layer network, it keeps the time complexity of the ML step low. If the distribution has regions with sharp features, we can apply the ML fit piecewise.

Third, in the paper I compare the relation between M and logN. The theoretical lower bound of M is 1, which corresponds to knowing the cumulative distribution in advance. Unfortunately, the number of neurons M is usually larger than 1 (we take M=100 in this paper), because we cannot get a good approximation of the distribution with too few neurons. I think this is worth further study. Also, by presenting this O(NM) sorting algorithm, we do not claim it is faster than NlogN quicksort in all cases. If we can apply GPU/TPU acceleration, it may show great potential for sorting large data. Current parallel sorting algorithms spend a large proportion of their time on networking, but with our method very little networking is needed, so we can greatly reduce the networking cost.

If you would like to discuss our paper further, you are welcome to email me at phzhq@mail.ustc.edu.cn
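
For readers who want the general idea in code, here is a minimal sketch of CDF-based sorting. This is not the GVM method from the paper: a sample-based empirical CDF stands in for the fitted neural network, and the names (cdf_sort, sample_size, n_buckets) are purely illustrative.

    import random
    from bisect import bisect_right

    def cdf_sort(data, sample_size=1000, n_buckets=None):
        # Sketch only: a sorted random sample approximates the CDF that the
        # paper fits with a 3-layer GVM network.
        n = len(data)
        if n_buckets is None:
            n_buckets = n  # roughly one output slot per element

        # 1. Estimate the cumulative distribution from a small sample.
        sample = sorted(random.sample(data, min(sample_size, n)))

        def est_cdf(x):
            # Fraction of the sample <= x (monotone in x).
            return bisect_right(sample, x) / len(sample)

        # 2. Map each element to a predicted position (bucket) in the output.
        buckets = [[] for _ in range(n_buckets)]
        for x in data:
            idx = min(int(est_cdf(x) * n_buckets), n_buckets - 1)
            buckets[idx].append(x)

        # 3. Fix the small local errors: sort each (ideally tiny) bucket.
        out = []
        for b in buckets:
            out.extend(sorted(b))
        return out

    if __name__ == "__main__":
        data = [random.gauss(0.0, 1.0) for _ in range(100000)]
        assert cdf_sort(data) == sorted(data)

Because the estimated CDF is monotone, elements land in buckets that already respect the final order, so only small local sorts are needed at the end; the main cost is one CDF evaluation per element, which is where the O(NM) term comes from when the CDF is an M-neuron network.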