At first I found this paragraph confusing:

> Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

Why would you delete a word that's already on the list by flipping coins? Doesn't that reduce accuracy by counting fewer words than expected? And will the word be added back to the list later?

After thinking about it for a while and reading the paper, I've finally developed a mental model for how this algorithm works, which should convince even a high schooler why it works:

1. You are given a stream of elements from [n] (a finite set of n distinct elements). Attach to each occurrence in the stream an independent uniform random real number in [0,1]. This gives us a way to choose the elements we want: if we keep all elements whose number is below 0.5, we get about half of them.

2. Assume for a moment we have unbounded memory. We maintain a table of already-seen elements, and for each element we keep the *last* real number attached to it: so if element A appears three times with the numbers 0.3, 0.7, 0.6, we keep A with the number 0.6.

3. But our memory is bounded! So we keep only the elements whose number is below a threshold 2^-r, that is, 1, 0.5, 0.25, etc. At first we keep all elements, but when we run out of memory we filter the existing elements against a smaller threshold. It's easy to see that when we decrease the threshold, only elements already in memory can meet the criterion and stay in memory; no other element in the stream can be below it. Also note that whether an element is in memory depends only on its *last* occurrence: it can sit in memory for a while, get dropped because a later occurrence of it draws a number above the threshold, and get back in later.

4. We don't actually have a real number attached to every occurrence! But we can pretend there is one. For each new occurrence of an element X in the stream, we replace X's number in memory with the freshly drawn one, and all we care about is whether that number is below 2^-r. If it is, X should be in memory; if not, it should be out. Once a number is in memory, it's uniform in [0, 2^-r] and we care about nothing else.

When we increase r by one, all we care about is whether such a number also lies in [0, 2^-(r+1)]. This happens with probability exactly 1/2, so each number in memory has probability 1/2 of being evicted and 1/2 of being kept.

5. Now it's easy to see that whether an element is in memory depends solely on the real number attached to its last occurrence. That is, each distinct element is in memory with probability exactly 2^-r, so 2^r multiplied by the number of elements in memory gives a good estimate of the number of distinct elements. (This also answers my original confusion: flipping coins again for a word that's already on the list just re-randomizes its last occurrence, and every distinct word still ends up in memory with the same probability.)
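To make the mental model concrete, here is a minimal sketch in Python. The function name `estimate_distinct` and the `capacity` parameter are just illustrative, and it stores the drawn numbers explicitly, following the mental model above rather than the paper's exact pseudocode:

    import random

    def estimate_distinct(stream, capacity):
        # threshold = 2^-r; memory maps element -> the number drawn at its
        # last occurrence, kept only while that number is below the threshold
        threshold = 1.0
        memory = {}

        for x in stream:
            u = random.random()        # the number attached to this occurrence
            memory.pop(x, None)        # only the last occurrence matters
            if u < threshold:
                memory[x] = u
            while len(memory) > capacity:
                # out of room: halve the threshold and keep only the elements
                # whose stored number still qualifies (each survives w.p. 1/2)
                threshold /= 2
                memory = {k: v for k, v in memory.items() if v < threshold}

        # each distinct element is in memory with probability `threshold`,
        # so scaling up by 1/threshold estimates the distinct count
        return len(memory) / threshold

Storing the numbers is wasteful (the real algorithm only needs the set and coin flips), but it makes the correspondence with the description above direct.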
The paper's authors criticized previous approaches for relying on hash functions. A simplified version of the hash-based approach is as follows (see the sketch at the end of this comment):

1. Make a hash function that maps distinct elements to different uniform real numbers. So every A hashes to 0.3, every B to 0.7, etc.

2. Use that hash function to transform all elements. The same element always maps to the same real number, and different elements map to different real numbers.

3. Keep a set of the N smallest distinct real numbers, by inserting numbers from the stream into the set. It's easy to see that adding the same element multiple times has no effect: its number is either too big to be in the set or identical to a number already in it.

4. Take the Nth smallest real number in the set and call it K. The number of distinct elements is approximated as N/K.

This algorithm is remarkable, but not because it's efficient or new. Both algorithms using hash functions and algorithms without them have been around, and optimal ones exist. I assume this is the first optimal algorithm without hash functions that is very easy to understand, even by undergraduates.
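For comparison, here is an equally minimal sketch of that hash-based estimator (this idea is often called a k-minimum-values sketch; the name `kmv_estimate` and the use of SHA-256 as a stand-in for an ideal hash are my choices, not the paper's):

    import hashlib

    def kmv_estimate(stream, n=100):
        def to_unit(x):
            # stand-in for an ideal hash: maps each distinct element to its
            # own (pseudo-)uniform real number in [0, 1)
            digest = hashlib.sha256(str(x).encode()).digest()
            return int.from_bytes(digest[:8], "big") / 2.0**64

        smallest = set()                 # the n smallest distinct hash values so far
        for x in stream:
            smallest.add(to_unit(x))     # a repeat adds the same value: no effect
            if len(smallest) > n:
                smallest.discard(max(smallest))   # keep only the n smallest

        if len(smallest) < n:
            return len(smallest)         # fewer than n distinct elements: count is exact
        return n / max(smallest)         # max(smallest) is K, the n-th smallest value

Here n plays the role of N above; since the Nth smallest of d uniform random points lands around N/(d+1), returning N/K recovers roughly the number of distinct elements.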