If you use radix sort, the sorted-array approach becomes O(N). And since the actual order is irrelevant, radix sort works for any fixed-size data structure (or even objects of varying size, if you first sort and split the array by size).

And if you're worried that for objects of size k radix sort runs in O(n k) rather than O(n): n*k = N is simply the size of the data in memory, so it is still linear in the size of the input, i.e. truly O(N).

For string-like objects, a radix tree might be better, although the result is pretty similar.
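A minimal sketch of that idea for the article's 64-bit case: a byte-wise LSD radix sort followed by a linear scan (function names are mine, not from the article):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // LSD radix sort on 64-bit keys: eight stable counting-sort passes, one byte each.
    void radix_sort(std::vector<uint64_t>& v) {
        std::vector<uint64_t> buf(v.size());
        for (int shift = 0; shift < 64; shift += 8) {
            size_t count[257] = {0};
            for (uint64_t x : v) ++count[((x >> shift) & 0xFF) + 1];
            for (int i = 0; i < 256; ++i) count[i + 1] += count[i];   // bucket offsets
            for (uint64_t x : v) buf[count[(x >> shift) & 0xFF]++] = x;
            v.swap(buf);
        }
    }

    // Count distinct values: radix sort, then one linear scan over the sorted data.
    size_t count_distinct(std::vector<uint64_t> v) {
        if (v.empty()) return 0;
        radix_sort(v);
        size_t distinct = 1;
        for (size_t i = 1; i < v.size(); ++i)
            if (v[i] != v[i - 1]) ++distinct;
        return distinct;
    }

Each of the eight passes is a stable counting sort on one byte, so the whole thing stays linear in the input size.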
> I choose 64-bit integers, but strings would do fine as well.

No, they would not.

Sorting 64-bit integers in a vector is obviously fast, since they are stored contiguously in memory, so you benefit nicely from all the caching layers (L1, L2, L3, TLB).

Strings, being of unknown size, store their actual content somewhere on the heap and reference it via a pointer [1]. If you have a vector of strings, then to compare string i and string j you have to dereference those two pointers, which jumps to two random memory locations.

With the hash table you already pay for one indirection, which is really what is slow here. With a vector<string> you now also pay for one indirection, so the performance compared to a vector<int> would be very different.

[1] https://stackoverflow.com/questions/42049778/where-is-a-stdstring-allocated-in-memory
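A quick way to see the effect, as a rough benchmark sketch (my own code, not the parent's or the article's; exact numbers will vary by machine and standard library):

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <string>
    #include <vector>

    int main() {
        const size_t n = 10'000'000;
        std::mt19937_64 rng(42);

        std::vector<uint64_t> ints(n);
        for (auto& x : ints) x = rng();

        // The same values as (mostly 19-20 digit) decimal strings; on implementations
        // with a 15-char small-string buffer (e.g. libstdc++) these spill to the heap,
        // so every comparison during the sort chases a pointer.
        std::vector<std::string> strs(n);
        for (size_t i = 0; i < n; ++i) strs[i] = std::to_string(ints[i]);

        auto seconds = [](auto&& f) {
            auto t0 = std::chrono::steady_clock::now();
            f();
            return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        };

        std::printf("sort 10M uint64: %.2fs\n", seconds([&] { std::sort(ints.begin(), ints.end()); }));
        std::printf("sort 10M string: %.2fs\n", seconds([&] { std::sort(strs.begin(), strs.end()); }));
    }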
I'd expect a better write-up from Daniel. std::unordered_set is implemented with chained buckets, which incurs extra memory references. A cursory Google search suggests an open-addressing hash set would be about 3x faster, which would handily beat the sort version.
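For reference, the open-addressing idea in its simplest form: linear probing over one flat array, with 0 reserved as the empty-slot marker. This is a sketch under those assumptions, not a drop-in std::unordered_set replacement:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Minimal linear-probing set for non-zero 64-bit keys; 0 marks an empty slot.
    // Assumes at most `expected` distinct keys, so the half-full table never fills up.
    class OpenSet {
    public:
        explicit OpenSet(size_t expected)
            : mask_(next_pow2(2 * expected) - 1), slots_(mask_ + 1, 0) {}

        // Returns true if the key was newly inserted.
        bool insert(uint64_t key) {
            size_t i = mix(key) & mask_;
            while (slots_[i] != 0) {
                if (slots_[i] == key) return false;   // already present
                i = (i + 1) & mask_;                  // probe the next slot
            }
            slots_[i] = key;
            ++size_;
            return true;
        }

        size_t size() const { return size_; }

    private:
        static size_t next_pow2(size_t n) {           // round up to a power of two
            size_t p = 1;
            while (p < n) p <<= 1;
            return p;
        }
        static uint64_t mix(uint64_t x) {             // MurmurHash3-style finalizer
            x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
            x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
            x ^= x >> 33;
            return x;
        }

        size_t mask_;
        std::vector<uint64_t> slots_;
        size_t size_ = 0;
    };

A lookup then touches one contiguous run of slots instead of following a bucket's chain, which is where the cache advantage over chaining comes from.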
Just weeks ago I had to pay out a $5000 bounty on my Cuckoo Cycle proof-of-work scheme [1], because I had wrongly assumed that hash sets were faster, given that the hash set reduces to a simple bitmap.

Where the article considers up to 10M elements, Cuckoo Cycle deals with about a billion elements, thus answering the question of what happens when you crank up the data size. It turns out that despite using 32x to 64x more memory than the bitmap, sorting is about 4x faster.

Blog entry [2] explains how Cuckoo Cycle reduces to a counting problem.

[1] https://github.com/tromp/cuckoo

[2] http://cryptorials.io/beyond-hashcash-proof-work-theres-mining-hashing
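For anyone unfamiliar with the bitmap variant of "have I seen this key?", the general shape is roughly this (a generic sketch, not the actual Cuckoo Cycle code):

    #include <cstdint>
    #include <vector>

    // One bit per possible key: works when keys are dense in [0, key_space).
    // Total memory is key_space / 8 bytes, versus 8 bytes per element for a
    // vector of 64-bit keys that you sort.
    class SeenBitmap {
    public:
        explicit SeenBitmap(uint64_t key_space) : bits_((key_space + 63) / 64, 0) {}

        // Marks the key as seen; returns whether it had been seen before.
        bool test_and_set(uint64_t key) {
            uint64_t word = key / 64, bit = uint64_t(1) << (key % 64);
            bool seen = bits_[word] & bit;
            bits_[word] |= bit;
            return seen;
        }

    private:
        std::vector<uint64_t> bits_;
    };

Presumably the reason sorting still wins at that scale is that with random keys nearly every bitmap access is a cache/TLB miss, while sorting streams through memory.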
The article isn't concerned with estimates, but if you have large amounts of data, approximate numbers may be sufficient. I am a fan of HyperLogLog. It can be used as an online algorithm, so you don't have to keep all the values in memory.

It could be useful if you want to do something like estimate the number of unique links posted to Twitter in a week (if you could get that data).

https://en.wikipedia.org/wiki/HyperLogLog
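To show how little state that takes, here is a bare-bones HyperLogLog (my own simplification: fixed register count, a GCC/Clang builtin for leading zeros, and none of the small/large-range corrections a real implementation would add):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Minimal HyperLogLog with 2^p one-byte registers (p = 14 -> 16 KB of state).
    class HyperLogLog {
    public:
        explicit HyperLogLog(int p = 14) : p_(p), reg_(size_t(1) << p, 0) {}

        // Feed in a 64-bit hash of each element, not the element itself.
        void add(uint64_t hash) {
            size_t idx = hash >> (64 - p_);                  // first p bits pick a register
            uint64_t rest = hash << p_;                      // the remaining 64-p bits
            uint8_t rank = rest ? __builtin_clzll(rest) + 1  // GCC/Clang builtin
                                : uint8_t(64 - p_ + 1);
            if (rank > reg_[idx]) reg_[idx] = rank;          // keep the max leading-zero run
        }

        double estimate() const {
            double m = double(reg_.size());
            double sum = 0;
            for (uint8_t r : reg_) sum += std::ldexp(1.0, -r);   // adds 2^-r
            double alpha = 0.7213 / (1 + 1.079 / m);             // bias constant for large m
            return alpha * m * m / sum;                          // raw HLL estimate
        }

    private:
        int p_;
        std::vector<uint8_t> reg_;
    };

You feed it one 64-bit hash per element and keep only the 16K one-byte registers, which is what makes it usable as an online/streaming estimator.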
Nice article. There's been a lot of ink spilled about the word histogram problem and the Unix/McIlroy solution vs. Knuth's solution, e.g.:

http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

I always liked McIlroy's solution because it's shorter. But it bugged me that in theory it was O(n log n) and not O(n).

But this post gives some good evidence that this doesn't matter for problems of most practical sizes. And I've actually used McIlroy's solution in practice and I've never found speed to be an issue.

Of course it's sorting variable-length strings rather than fixed-length records, so there's probably some subtlety. But I like this article because it tests your assumptions.
Judy arrays might be good for this. For the 64-bit int example you could use Judy1:

http://judy.sourceforge.net/doc/index.html
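If I remember the Judy1 macro API from those docs correctly, counting distinct 64-bit values would look roughly like this (untested sketch; each value is treated as an index into the Judy1 bit set):

    #include <Judy.h>      // libjudy; link with -lJudy
    #include <cstddef>
    #include <vector>

    // Count distinct values by setting one bit per value in a Judy1 array.
    size_t count_distinct(const std::vector<Word_t>& values) {
        Pvoid_t set = (Pvoid_t) NULL;
        int rc;
        for (Word_t v : values)
            J1S(rc, set, v);           // Judy1Set: rc is 1 if the bit was newly set
        Word_t count, bytes;
        J1C(count, set, 0, -1);        // Judy1Count over the whole index range
        J1FA(bytes, set);              // Judy1FreeArray
        return count;
    }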
It's an interesting use-case where the data is all known up-front.

I'd be interested to see an incrementally-increasing benchmark. I'd imagine the cache-miss penalty from the HashSet is countered by the continuous re-sorting/copying/moving from the Array solution.

But that's why we measure, right?
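A sketch of what such a benchmark could look like (my own guess at the two incremental strategies: per-element unordered_set::insert versus keeping a vector sorted on every arrival, where the copying/moving cost shows up):

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <unordered_set>
    #include <vector>

    int main() {
        std::mt19937_64 rng(1);
        for (size_t n : {10'000, 100'000, 1'000'000}) {
            std::vector<uint64_t> input(n);
            for (auto& x : input) x = rng() % (n / 2);   // force some duplicates

            // Incremental hash set: one insert per arriving element.
            auto t0 = std::chrono::steady_clock::now();
            std::unordered_set<uint64_t> seen;
            for (uint64_t x : input) seen.insert(x);
            auto t1 = std::chrono::steady_clock::now();

            // Incremental array: keep a vector sorted after every arrival.
            // Each insertion shifts everything after it, so this is O(n^2) overall.
            std::vector<uint64_t> sorted;
            for (uint64_t x : input) {
                auto it = std::lower_bound(sorted.begin(), sorted.end(), x);
                if (it == sorted.end() || *it != x) sorted.insert(it, x);
            }
            auto t2 = std::chrono::steady_clock::now();

            std::printf("n=%zu  hash_set=%.3fs  sorted_vector=%.3fs\n", n,
                        std::chrono::duration<double>(t1 - t0).count(),
                        std::chrono::duration<double>(t2 - t1).count());
        }
    }

Of course a real contender would buffer arrivals and merge in batches rather than insert one at a time, which is exactly the kind of thing the measurement would have to explore.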
10 REM Using GW-BASIC: use each value itself as an index into a counting array
15 N = 10000000
16 DIM A(N)
20 FOR X = 1 TO N
25 INPUT V: REM read the next value, assumed to be in 1..N
30 A(V) = A(V) + 1
40 NEXT X
50 REM Done counting, print the result
60 FOR X = 1 TO N
70 PRINT "Value "; X; " has "; A(X); " repetitions"
80 NEXT X