If you use radix sort, the sorted-array approach becomes O(N). And since the actual order is irrelevant, radix sort works for any fixed-size data structure (or even objects of varying size, if you first sort and split the array by size).

And if you're worried that for objects of size k radix sort runs in O(n k) rather than O(n): n*k = N is simply the size of the data in memory, so it is still linear in the size of the input, i.e. truly O(N).

For string-like objects, a radix tree might be better, although the result is pretty similar.
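A minimal sketch of that idea for the article's 64-bit case: a byte-wise LSD radix sort followed by a linear scan (function names are mine, not from the article):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // LSD radix sort on 64-bit keys: eight stable counting-sort passes, one byte each.
    void radix_sort(std::vector<uint64_t>& v) {
        std::vector<uint64_t> buf(v.size());
        for (int shift = 0; shift < 64; shift += 8) {
            size_t count[257] = {0};
            for (uint64_t x : v) ++count[((x >> shift) & 0xFF) + 1];
            for (int i = 0; i < 256; ++i) count[i + 1] += count[i];   // bucket offsets
            for (uint64_t x : v) buf[count[(x >> shift) & 0xFF]++] = x;
            v.swap(buf);
        }
    }

    // Count distinct values: radix sort, then one linear scan over the sorted data.
    size_t count_distinct(std::vector<uint64_t> v) {
        if (v.empty()) return 0;
        radix_sort(v);
        size_t distinct = 1;
        for (size_t i = 1; i < v.size(); ++i)
            if (v[i] != v[i - 1]) ++distinct;
        return distinct;
    }

Each of the eight passes is a stable counting sort on one byte, so the whole thing stays linear in the input size.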
> I choose 64-bit integers, but strings would do fine as well.

No, they would not.

Sorting 64-bit integers in a vector is obviously fast, since they are stored contiguously in memory, so you benefit nicely from all the caching layers (L1, L2, L3, TLB).

Strings, being of unknown size, store their actual content somewhere on the heap and reference it via a pointer [1]. If you have a vector of strings, then to compare string i and string j you have to dereference those two pointers, which jumps to two random memory locations.

With the hash table you already pay for one indirection, which is really what is slow here. With a vector<string> you now also pay for one indirection, so the performance compared to a vector<int> would be very different.

[1] https://stackoverflow.com/questions/42049778/where-is-a-stdstring-allocated-in-memory
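A quick way to see the effect, as a rough benchmark sketch (my own code, not the parent's or the article's; exact numbers will vary by machine and standard library):

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <string>
    #include <vector>

    int main() {
        const size_t n = 10'000'000;
        std::mt19937_64 rng(42);

        std::vector<uint64_t> ints(n);
        for (auto& x : ints) x = rng();

        // The same values as (mostly 19-20 digit) decimal strings; on implementations
        // with a 15-char small-string buffer (e.g. libstdc++) these spill to the heap,
        // so every comparison during the sort chases a pointer.
        std::vector<std::string> strs(n);
        for (size_t i = 0; i < n; ++i) strs[i] = std::to_string(ints[i]);

        auto seconds = [](auto&& f) {
            auto t0 = std::chrono::steady_clock::now();
            f();
            return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        };

        std::printf("sort 10M uint64: %.2fs\n", seconds([&] { std::sort(ints.begin(), ints.end()); }));
        std::printf("sort 10M string: %.2fs\n", seconds([&] { std::sort(strs.begin(), strs.end()); }));
    }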
I'd expect a better write-up from Daniel. std::unordered_set is implemented with chained buckets, which incurs extra memory references. A cursory Google search suggests an open-addressing hash set would be about 3x faster, which would handily beat the sort version.
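For reference, the open-addressing idea in its simplest form: linear probing over one flat array, with 0 reserved as the empty-slot marker. This is a sketch under those assumptions, not a drop-in std::unordered_set replacement:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Minimal linear-probing set for non-zero 64-bit keys; 0 marks an empty slot.
    // Assumes at most `expected` distinct keys, so the half-full table never fills up.
    class OpenSet {
    public:
        explicit OpenSet(size_t expected)
            : mask_(next_pow2(2 * expected) - 1), slots_(mask_ + 1, 0) {}

        // Returns true if the key was newly inserted.
        bool insert(uint64_t key) {
            size_t i = mix(key) & mask_;
            while (slots_[i] != 0) {
                if (slots_[i] == key) return false;   // already present
                i = (i + 1) & mask_;                  // probe the next slot
            }
            slots_[i] = key;
            ++size_;
            return true;
        }

        size_t size() const { return size_; }

    private:
        static size_t next_pow2(size_t n) {           // round up to a power of two
            size_t p = 1;
            while (p < n) p <<= 1;
            return p;
        }
        static uint64_t mix(uint64_t x) {             // MurmurHash3-style finalizer
            x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
            x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
            x ^= x >> 33;
            return x;
        }

        size_t mask_;
        std::vector<uint64_t> slots_;
        size_t size_ = 0;
    };

A lookup then touches one contiguous run of slots instead of following a bucket's chain, which is where the cache advantage over chaining comes from.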
Just weeks ago I had to pay out a $5000 bounty on my Cuckoo Cycle proof-of-work scheme [1], because I had wrongly assumed that hash sets were faster, given that the hash set reduces to a simple bitmap.

Where the article considers up to 10M elements, Cuckoo Cycle deals with about a billion elements, thus answering the question of what happens when you crank up the data size. It turns out that despite using 32x to 64x more memory than the bitmap, sorting is about 4x faster.

Blog entry [2] explains how Cuckoo Cycle reduces to a counting problem.

[1] https://github.com/tromp/cuckoo

[2] http://cryptorials.io/beyond-hashcash-proof-work-theres-mining-hashing
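For anyone unfamiliar with the bitmap variant of "have I seen this key?", the general shape is roughly this (a generic sketch, not the actual Cuckoo Cycle code):

    #include <cstdint>
    #include <vector>

    // One bit per possible key: works when keys are dense in [0, key_space).
    // Total memory is key_space / 8 bytes, versus 8 bytes per element for a
    // vector of 64-bit keys that you sort.
    class SeenBitmap {
    public:
        explicit SeenBitmap(uint64_t key_space) : bits_((key_space + 63) / 64, 0) {}

        // Marks the key as seen; returns whether it had been seen before.
        bool test_and_set(uint64_t key) {
            uint64_t word = key / 64, bit = uint64_t(1) << (key % 64);
            bool seen = bits_[word] & bit;
            bits_[word] |= bit;
            return seen;
        }

    private:
        std::vector<uint64_t> bits_;
    };

Presumably the reason sorting still wins at that scale is that with random keys nearly every bitmap access is a cache/TLB miss, while sorting streams through memory.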
The article isn't concerned with estimates, but if you have large amounts of data, approximate numbers may be sufficient. I am a fan of HyperLogLog. It can be used as an online algorithm, so you don't have to keep all the values in memory.

It could be useful if you want to do something like estimate the number of unique links posted to Twitter in a week (if you could get that data).

https://en.wikipedia.org/wiki/HyperLogLog
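To show how little state that takes, here is a bare-bones HyperLogLog (my own simplification: fixed register count, a GCC/Clang builtin for leading zeros, and none of the small/large-range corrections a real implementation would add):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Minimal HyperLogLog with 2^p one-byte registers (p = 14 -> 16 KB of state).
    class HyperLogLog {
    public:
        explicit HyperLogLog(int p = 14) : p_(p), reg_(size_t(1) << p, 0) {}

        // Feed in a 64-bit hash of each element, not the element itself.
        void add(uint64_t hash) {
            size_t idx = hash >> (64 - p_);                  // first p bits pick a register
            uint64_t rest = hash << p_;                      // the remaining 64-p bits
            uint8_t rank = rest ? __builtin_clzll(rest) + 1  // GCC/Clang builtin
                                : uint8_t(64 - p_ + 1);
            if (rank > reg_[idx]) reg_[idx] = rank;          // keep the max leading-zero run
        }

        double estimate() const {
            double m = double(reg_.size());
            double sum = 0;
            for (uint8_t r : reg_) sum += std::ldexp(1.0, -r);   // adds 2^-r
            double alpha = 0.7213 / (1 + 1.079 / m);             // bias constant for large m
            return alpha * m * m / sum;                          // raw HLL estimate
        }

    private:
        int p_;
        std::vector<uint8_t> reg_;
    };

You feed it one 64-bit hash per element and keep only the 16K one-byte registers, which is what makes it usable as an online/streaming estimator.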
Nice article. There's been a lot of ink spilled about the word histogram problem and the Unix/McIlroy solution vs. Knuth's solution, e.g.:

http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

I always liked McIlroy's solution because it's shorter. But it bugged me that in theory it was O(n log n) and not O(n).

But this post gives some good evidence that this doesn't matter for problems of most practical sizes. And I've actually used McIlroy's solution in practice and I've never found speed to be an issue.

Of course it's sorting variable-length strings rather than fixed-length records, so there's probably some subtlety. But I like this article because it tests your assumptions.
Judy arrays might be good for this. For the 64-bit int example you could use Judy1:

http://judy.sourceforge.net/doc/index.html
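If I remember the Judy1 macro API from those docs correctly, counting distinct 64-bit values would look roughly like this (untested sketch; each value is treated as an index into the Judy1 bit set):

    #include <Judy.h>      // libjudy; link with -lJudy
    #include <cstddef>
    #include <vector>

    // Count distinct values by setting one bit per value in a Judy1 array.
    size_t count_distinct(const std::vector<Word_t>& values) {
        Pvoid_t set = (Pvoid_t) NULL;
        int rc;
        for (Word_t v : values)
            J1S(rc, set, v);           // Judy1Set: rc is 1 if the bit was newly set
        Word_t count, bytes;
        J1C(count, set, 0, -1);        // Judy1Count over the whole index range
        J1FA(bytes, set);              // Judy1FreeArray
        return count;
    }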
It's an interesting use-case where the data is all known up-front.

I'd be interested to see an incrementally-increasing benchmark. I'd imagine the cache-miss penalty from the HashSet is countered by the continuous re-sorting/copying/moving from the Array solution.

But that's why we measure, right?
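A sketch of what such a benchmark could look like (my own guess at the two incremental strategies: per-element unordered_set::insert versus keeping a vector sorted on every arrival, where the copying/moving cost shows up):

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <unordered_set>
    #include <vector>

    int main() {
        std::mt19937_64 rng(1);
        for (size_t n : {10'000, 100'000, 1'000'000}) {
            std::vector<uint64_t> input(n);
            for (auto& x : input) x = rng() % (n / 2);   // force some duplicates

            // Incremental hash set: one insert per arriving element.
            auto t0 = std::chrono::steady_clock::now();
            std::unordered_set<uint64_t> seen;
            for (uint64_t x : input) seen.insert(x);
            auto t1 = std::chrono::steady_clock::now();

            // Incremental array: keep a vector sorted after every arrival.
            // Each insertion shifts everything after it, so this is O(n^2) overall.
            std::vector<uint64_t> sorted;
            for (uint64_t x : input) {
                auto it = std::lower_bound(sorted.begin(), sorted.end(), x);
                if (it == sorted.end() || *it != x) sorted.insert(it, x);
            }
            auto t2 = std::chrono::steady_clock::now();

            std::printf("n=%zu  hash_set=%.3fs  sorted_vector=%.3fs\n", n,
                        std::chrono::duration<double>(t1 - t0).count(),
                        std::chrono::duration<double>(t2 - t1).count());
        }
    }

Of course a real contender would buffer arrivals and merge in batches rather than insert one at a time, which is exactly the kind of thing the measurement would have to explore.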
10 REM Using GW-BASIC: use each value itself as an index into a counting array
15 N = 10000000
16 DIM A(N)
20 FOR X = 1 TO N
25 INPUT V: REM read the next value, assumed to be in 1..N
30 A(V) = A(V) + 1
40 NEXT X
50 REM Done counting, print the result
60 FOR X = 1 TO N
70 PRINT "Value "; X; " has "; A(X); " repetitions"
80 NEXT X