
Counting exactly the number of distinct elements: sorted arrays vs. hash sets?

46 points by deafcalculus about 8 years ago

11 comments

contravariant about 8 years ago
If you use radix sort, the sorted-array approach becomes O(N). And since the actual order is irrelevant, radix sort will work for any fixed-size data structure (or even objects of varying size, if you just sort and split the array according to size first).

And if you're worried that for objects of size k, radix sort runs in O(n k) rather than O(n), I'd like to point out that n*k = N is simply the size in memory, so it is in fact still linear in the size of the input, so it truly is O(N).

For string-like objects, using a radix tree might be better, although the result is pretty similar.
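
A minimal sketch of the LSD radix approach described above, assuming 64-bit keys as in the article; the function name and the 8-bit digit width are illustrative choices:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Count distinct values: LSD radix sort (eight passes of 8-bit digits),
    // then one linear scan. Both phases are linear in the size of the input.
    std::size_t distinct_radix(std::vector<std::uint64_t> v) {
        std::vector<std::uint64_t> buf(v.size());
        for (int pass = 0; pass < 8; ++pass) {
            const int shift = pass * 8;
            std::size_t count[256] = {0};
            for (std::uint64_t x : v) ++count[(x >> shift) & 0xFF];
            std::size_t offset[256];
            std::size_t sum = 0;
            for (int d = 0; d < 256; ++d) { offset[d] = sum; sum += count[d]; }
            for (std::uint64_t x : v) buf[offset[(x >> shift) & 0xFF]++] = x;  // stable distribution
            v.swap(buf);
        }
        std::size_t distinct = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            if (i == 0 || v[i] != v[i - 1]) ++distinct;
        return distinct;
    }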
vmarsy about 8 years ago
> I choose 64-bit integers, but strings would do fine as well.

No, it would not.

Sorting 64-bit integers in a vector is obviously fast, since they are stored contiguously in memory, so you benefit nicely from all the caching layers (L1, L2, L3, TLB).

Since strings are of unknown size, the actual string content is stored somewhere on the heap and referenced via a pointer [1]. If you have a vector of strings, then to compare string i and string j you need to dereference those 2 pointers, which jumps to 2 random memory locations.

With the hash table you already pay for 1 indirection, which is really what is slow here. With a vector<string> you now also need to pay for 1 indirection, so the performance compared to a vector<int> would be very different.

[1] https://stackoverflow.com/questions/42049778/where-is-a-stdstring-allocated-in-memory
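
For reference, the sort-then-scan idiom the article benchmarks looks the same for both element types; only the memory behaviour differs (a sketch, not the article's exact code):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Contiguous 64-bit keys: every comparison stays inside one flat array.
    std::size_t distinct_ints(std::vector<std::uint64_t> v) {
        std::sort(v.begin(), v.end());
        return static_cast<std::size_t>(std::unique(v.begin(), v.end()) - v.begin());
    }

    // Strings: each comparison may dereference two heap pointers
    // (modulo the small-string optimization for short strings).
    std::size_t distinct_strings(std::vector<std::string> v) {
        std::sort(v.begin(), v.end());
        return static_cast<std::size_t>(std::unique(v.begin(), v.end()) - v.begin());
    }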
vicaya about 8 years ago
I'd expect a better write-up from Daniel. std::unordered_set is implemented with chained buckets, which incurs extra memory references. A cursory Google search revealed that an open-addressing hash set would be about 3x faster, which would handily beat the sort version.
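
A rough sketch of what an open-addressing (linear probing) set for 64-bit keys can look like; it is sized once for a known element count, never resizes, and supports no deletion. The names and the 2x sizing factor are arbitrary choices rather than any particular library's design:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Flat linear-probing set: one contiguous slot array, so a probe usually
    // touches a single cache line, unlike chained buckets.
    class FlatSet {
    public:
        explicit FlatSet(std::size_t expected)          // expected = upper bound on inserts
            : mask_(round_up_pow2(expected * 2) - 1),
              slots_(mask_ + 1), used_(mask_ + 1, 0) {}

        bool insert(std::uint64_t key) {                // true if the key was new
            std::size_t i = mix(key) & mask_;
            while (used_[i]) {
                if (slots_[i] == key) return false;
                i = (i + 1) & mask_;                    // linear probe
            }
            used_[i] = 1;
            slots_[i] = key;
            ++size_;
            return true;
        }

        std::size_t size() const { return size_; }      // number of distinct keys seen

    private:
        static std::size_t round_up_pow2(std::size_t n) {
            std::size_t p = 1;
            while (p < n) p <<= 1;
            return p;
        }
        static std::uint64_t mix(std::uint64_t x) {     // splitmix64-style finalizer
            x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
            x ^= x >> 27; x *= 0x94d049bb133111ebULL;
            return x ^ (x >> 31);
        }
        std::size_t mask_;
        std::size_t size_ = 0;
        std::vector<std::uint64_t> slots_;
        std::vector<char> used_;
    };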
tromp about 8 years ago
Just weeks ago, I had to pay out a $5000 bounty on my Cuckoo Cycle proof of work scheme [1], because I had wrongly assumed that hash sets were faster, given that the hash set reduces to a simple bitmap.

Where the article considers up to 10M elements, Cuckoo Cycle deals with about a billion elements, thus answering the question of what happens when cranking up the data size. It turns out that despite using 32x to 64x more memory than the bitmap, sorting is about 4x faster.

Blog entry [2] explains how Cuckoo Cycle reduces to a counting problem.

[1] https://github.com/tromp/cuckoo
[2] http://cryptorials.io/beyond-hashcash-proof-work-theres-mining-hashing
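
The "hash set reduces to a simple bitmap" case is the one where keys are dense in a known range; a minimal sketch of that degenerate counting strategy (not Cuckoo Cycle itself) might look like this:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Distinct count for keys assumed to lie in [0, n): one bit per possible key,
    // no hashing and no probing.
    std::size_t distinct_bitmap(const std::vector<std::uint32_t>& keys, std::size_t n) {
        std::vector<std::uint64_t> bits((n + 63) / 64, 0);
        std::size_t distinct = 0;
        for (std::uint32_t k : keys) {
            std::uint64_t& word = bits[k >> 6];
            const std::uint64_t bit = 1ULL << (k & 63);
            if (!(word & bit)) { word |= bit; ++distinct; }
        }
        return distinct;
    }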
Paul-ish about 8 years ago
The article isn't concerned with estimates, but if you have large amounts of data, approximate numbers may be sufficient. I am a fan of HyperLogLog. It can be used as an online algorithm, so you don't have to keep all values in memory.

It could be useful if you want to do something like estimate the number of unique links posted to Twitter in a week (if you could get that data).

https://en.wikipedia.org/wiki/HyperLogLog
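
A toy HyperLogLog sketch to make the idea concrete: 2^b registers, each remembering the largest "leading-zero rank" routed to it, using only the raw estimator (no small- or large-range corrections) and a splitmix64-style finalizer standing in for a real hash:

    #include <cmath>
    #include <cstdint>
    #include <vector>

    class HyperLogLog {
    public:
        explicit HyperLogLog(int b = 12) : b_(b), reg_(std::size_t(1) << b, 0) {}

        void add(std::uint64_t x) {
            const std::uint64_t h = mix(x);
            const std::size_t idx = h >> (64 - b_);        // first b bits pick a register
            const std::uint64_t rest = h << b_;            // remaining 64-b bits
            const std::uint8_t rank =
                rest ? std::uint8_t(__builtin_clzll(rest) + 1)   // std::countl_zero in C++20
                     : std::uint8_t(64 - b_ + 1);
            if (rank > reg_[idx]) reg_[idx] = rank;
        }

        double estimate() const {                          // raw HLL estimate
            const double m = double(reg_.size());
            double sum = 0.0;
            for (std::uint8_t r : reg_) sum += std::ldexp(1.0, -int(r));  // 2^-r
            const double alpha = 0.7213 / (1.0 + 1.079 / m);
            return alpha * m * m / sum;
        }

    private:
        static std::uint64_t mix(std::uint64_t x) {
            x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
            x ^= x >> 27; x *= 0x94d049bb133111ebULL;
            return x ^ (x >> 31);
        }
        int b_;
        std::vector<std::uint8_t> reg_;
    };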
TwoBit about 8 years ago
How do we know the observed behaviour isn't mostly due to the hash set node allocator? If you want speed, you don't use global malloc.
chubot about 8 years ago
Nice article. There's been a lot of ink spilled about the word histogram problem and the Unix/McIlroy solution vs. Knuth's solution, e.g.:

http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

I always liked McIlroy's solution because it's shorter. But it bugged me that in theory it was O(n log n) and not O(n).

But this post gives some good evidence that this doesn't matter for problems of most practical sizes. And I've actually used McIlroy's solution in practice and I've never found speed to be an issue.

Of course it's sorting variable-length strings rather than fixed-length records, so there's probably some subtlety. But I like this article because it tests your assumptions.
hoytech about 8 years ago
Judy arrays might be good for this. For the 64-bit int example you could use Judy1:

http://judy.sourceforge.net/doc/index.html
LgWoodenBadger about 8 years ago
It's an interesting use-case where the data is all known up-front.

I'd be interested to see an incrementally-increasing benchmark. I'd imagine the cache-miss penalty from the HashSet is countered by the continuous re-sorting/copying/moving from the Array solution.

But that's why we measure, right?
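
One way such an incremental comparison could be set up: keep the running distinct count in a hash set on one side and in a kept-sorted vector on the other (a sketch; a real benchmark would likely batch re-sorts instead of inserting one element at a time):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // Hash-set side: one insert (and possibly one cache miss) per arriving value.
    struct OnlineHashSet {
        std::unordered_set<std::uint64_t> seen;
        std::size_t add(std::uint64_t x) { seen.insert(x); return seen.size(); }
    };

    // Sorted-vector side: a binary search per value, plus an O(n) shift whenever
    // the value is new, which is what the re-sorting cost becomes when order is
    // maintained continuously.
    struct OnlineSortedVector {
        std::vector<std::uint64_t> seen;                 // kept sorted at all times
        std::size_t add(std::uint64_t x) {
            auto it = std::lower_bound(seen.begin(), seen.end(), x);
            if (it == seen.end() || *it != x) seen.insert(it, x);
            return seen.size();
        }
    };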
kevinwang about 8 years ago
Wow, I would definitely not expect that.
woliveirajr about 8 years ago
10 REM Using GW-BASIC
15 end_of_list = 10000000
17 DIM a(end_of_list)
20 FOR x = 1 TO end_of_list
30 a(x) = a(x) + 1
40 NEXT x
50 REM Done counting, print the result
60 FOR x = 1 TO end_of_list
70 PRINT "Value " x " has " a(x) " repetitions"
80 NEXT x