TechEcho

3 comments

lmbover 2 years ago

Count min sketches are really neat. A colleague at my old company used them to implement DDoS mitigations in eBPF and wrote it up: <a href="https://blog.cloudflare.com/building-rakelimit/" rel="nofollow">https://blog.cloudflare.com/building-rakelimit/</a>The code is also open source, and I've improved on it a bit: <a href="https://github.com/lmb/socklimit" rel="nofollow">https://github.com/lmb/socklimit</a> Not production ready but a cool idea and implementation.

评论 #32797213 未加载

trompover 2 years ago

This reminds me of the HyperLoglog [1] algorithm for approximating the cardinality of a huge set with limited memory.[1] <a href="https://en.wikipedia.org/wiki/HyperLogLog" rel="nofollow">https://en.wikipedia.org/wiki/HyperLogLog</a>

评论 #32797991 未加载

评论 #32797601 未加载

vanderZwanover 2 years ago

I wonder if this would be a good data stucture to use to approximately find the most common substrings for inputs where doing so exhaustively would lead to too large memory overhead. For example to "train" a dictionary for compression. I was thinking something like:<pre><code> 1. pick a substring length L to look for 2. pick a number M for how many "most common substrings of length L" we wish to find 3. initiate a count-min sketch for approximate counting of substrings, and an array of size M for keeping track of the top M substrings 4. do a linear scan over the input (length N) to count each substring of length L (so ideally we'd be using rolling hashes for the count-min sketch) </code></pre> We probably would have some false positives, and partialy overlapping substrings (I expect the latter is a difficult problem to solve elegantly, unless scanning over the text in steps of L works better than I think it would), but it might be an efficient heuristic.The exhaustive search would be almost the same but use a prefix tree (aka trie) instead of a CMS, which has good average case overhead because tree, but if we're dealing with a "every substring is as unique as possible" worst case scenarios memory overhead could blow up to [input size times fairly big constant].Said worst case scenario is quite easy to construct btw, just start with <a href="https://oeis.org/A142150" rel="nofollow">https://oeis.org/A142150</a> using 8 bit numbers, then modify to exhaust all remaining three-number combinations after exhausting all unique pair combinations, and so on.

Count-Min Sketch (2019)

3 comments

Count-Min Sketch (2019)

3 comments