Count min sketches are really neat. A colleague at my old company used them to implement DDoS mitigations in eBPF and wrote it up: <a href="https://blog.cloudflare.com/building-rakelimit/" rel="nofollow">https://blog.cloudflare.com/building-rakelimit/</a><p>The code is also open source, and I've improved on it a bit: <a href="https://github.com/lmb/socklimit" rel="nofollow">https://github.com/lmb/socklimit</a> Not production ready but a cool idea and implementation.
This reminds me of the HyperLoglog [1] algorithm for approximating the cardinality of a huge set with limited memory.<p>[1] <a href="https://en.wikipedia.org/wiki/HyperLogLog" rel="nofollow">https://en.wikipedia.org/wiki/HyperLogLog</a>
I wonder if this would be a good data stucture to use to approximately find the most common substrings for inputs where doing so exhaustively would lead to too large memory overhead. For example to "train" a dictionary for compression. I was thinking something like:<p><pre><code> 1. pick a substring length L to look for
2. pick a number M for how many "most common substrings
of length L" we wish to find
3. initiate a count-min sketch for approximate counting
of substrings, and an array of size M for keeping
track of the top M substrings
4. do a linear scan over the input (length N) to count
each substring of length L (so ideally we'd be using
rolling hashes for the count-min sketch)
</code></pre>
We probably would have some false positives, and partialy overlapping substrings (I expect the latter is a difficult problem to solve elegantly, unless scanning over the text in steps of L works better than I think it would), but it might be an efficient heuristic.<p>The exhaustive search would be almost the same but use a prefix tree (aka trie) instead of a CMS, which has good average case overhead because tree, but if we're dealing with a "every substring is as unique as possible" worst case scenarios memory overhead could blow up to [input size times fairly big constant].<p>Said worst case scenario is quite easy to construct btw, just start with <a href="https://oeis.org/A142150" rel="nofollow">https://oeis.org/A142150</a> using 8 bit numbers, then modify to exhaust all remaining three-number combinations after exhausting all unique pair combinations, and so on.