At first I found this paragraph confusing:

> Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

Why would you delete a word that's already on the list by flipping coins? Doesn't that reduce accuracy by counting fewer words than expected? And will the word be added back to the list later?

After thinking about it for a while and reading the paper, I've finally developed a mental model for how this algorithm works, which should convince even a high schooler why it works:

1. You are given a stream of elements from [n] (a finite set of n distinct elements). Attach to each occurrence in the stream an independent uniform random real number in [0,1]. This gives us a way to choose the elements we want: if we keep all elements whose number is below 0.5, we get about half of them.

2. Assume for a moment we have unbounded memory. We maintain a table of already-seen elements, and for each element we keep the *last* real number attached to it: so if element A appears three times with the numbers 0.3, 0.7, 0.6, we keep A with the number 0.6.

3. But our memory is bounded! So we keep only the elements whose number is below a threshold 2^-r, that is, 1, 0.5, 0.25, etc. At first we keep all elements, but when we run out of memory we filter the existing elements against a smaller threshold. It's easy to see that when we decrease the threshold, only elements already in memory can meet the criterion and stay in memory; no other element in the stream can be below it. Also note that whether an element is in memory depends only on its *last* occurrence: it can sit in memory for a while, get dropped because a later occurrence of it draws a number above the threshold, and get back in later.

4. We don't actually have a real number attached to every occurrence! But we can pretend there is one. For each new occurrence of an element X in the stream, we replace X's number in memory with the freshly drawn one, and all we care about is whether that number is below 2^-r. If it is, X should be in memory; if not, it should be out. Once a number is in memory, it's uniform in [0, 2^-r] and we care about nothing else.

When we increase r by one, all we care about is whether such a number also lies in [0, 2^-(r+1)]. This happens with probability exactly 1/2, so each number in memory has probability 1/2 of being evicted and 1/2 of being kept.

5. Now it's easy to see that whether an element is in memory depends solely on the real number attached to its last occurrence. That is, each distinct element is in memory with probability exactly 2^-r, so 2^r multiplied by the number of elements in memory gives a good estimate of the number of distinct elements. (This also answers my original confusion: flipping coins again for a word that's already on the list just re-randomizes its last occurrence, and every distinct word still ends up in memory with the same probability.)
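To make the mental model concrete, here is a minimal sketch in Python. The function name `estimate_distinct` and the `capacity` parameter are just illustrative, and it stores the drawn numbers explicitly, following the mental model above rather than the paper's exact pseudocode:

    import random

    def estimate_distinct(stream, capacity):
        # threshold = 2^-r; memory maps element -> the number drawn at its
        # last occurrence, kept only while that number is below the threshold
        threshold = 1.0
        memory = {}

        for x in stream:
            u = random.random()        # the number attached to this occurrence
            memory.pop(x, None)        # only the last occurrence matters
            if u < threshold:
                memory[x] = u
            while len(memory) > capacity:
                # out of room: halve the threshold and keep only the elements
                # whose stored number still qualifies (each survives w.p. 1/2)
                threshold /= 2
                memory = {k: v for k, v in memory.items() if v < threshold}

        # each distinct element is in memory with probability `threshold`,
        # so scaling up by 1/threshold estimates the distinct count
        return len(memory) / threshold

Storing the numbers is wasteful (the real algorithm only needs the set and coin flips), but it makes the correspondence with the description above direct.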
The paper's authors criticized previous approaches for relying on hash functions. A simplified version of the hash-based approach is as follows (see the sketch at the end of this comment):

1. Make a hash function that maps distinct elements to different uniform real numbers. So every A hashes to 0.3, every B to 0.7, etc.

2. Use that hash function to transform all elements. The same element always maps to the same real number, and different elements map to different real numbers.

3. Keep a set of the N smallest distinct real numbers, by inserting numbers from the stream into the set. It's easy to see that adding the same element multiple times has no effect: its number is either too big to be in the set or identical to a number already in it.

4. Take the Nth smallest real number in the set and call it K. The number of distinct elements is approximated as N/K.

This algorithm is remarkable, but not because it's efficient or new. Both algorithms using hash functions and algorithms without them have been around, and optimal ones exist. I assume this is the first optimal algorithm without hash functions that is very easy to understand, even by undergraduates.
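For comparison, here is an equally minimal sketch of that hash-based estimator (this idea is often called a k-minimum-values sketch; the name `kmv_estimate` and the use of SHA-256 as a stand-in for an ideal hash are my choices, not the paper's):

    import hashlib

    def kmv_estimate(stream, n=100):
        def to_unit(x):
            # stand-in for an ideal hash: maps each distinct element to its
            # own (pseudo-)uniform real number in [0, 1)
            digest = hashlib.sha256(str(x).encode()).digest()
            return int.from_bytes(digest[:8], "big") / 2.0**64

        smallest = set()                 # the n smallest distinct hash values so far
        for x in stream:
            smallest.add(to_unit(x))     # a repeat adds the same value: no effect
            if len(smallest) > n:
                smallest.discard(max(smallest))   # keep only the n smallest

        if len(smallest) < n:
            return len(smallest)         # fewer than n distinct elements: count is exact
        return n / max(smallest)         # max(smallest) is K, the n-th smallest value

Here n plays the role of N above; since the Nth smallest of d uniform random points lands around N/(d+1), returning N/K recovers roughly the number of distinct elements.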