TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Count-Min Sketch (2019)

44 pointsby colinbover 2 years ago

3 comments

lmbover 2 years ago
Count min sketches are really neat. A colleague at my old company used them to implement DDoS mitigations in eBPF and wrote it up: <a href="https:&#x2F;&#x2F;blog.cloudflare.com&#x2F;building-rakelimit&#x2F;" rel="nofollow">https:&#x2F;&#x2F;blog.cloudflare.com&#x2F;building-rakelimit&#x2F;</a><p>The code is also open source, and I&#x27;ve improved on it a bit: <a href="https:&#x2F;&#x2F;github.com&#x2F;lmb&#x2F;socklimit" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;lmb&#x2F;socklimit</a> Not production ready but a cool idea and implementation.
评论 #32797213 未加载
trompover 2 years ago
This reminds me of the HyperLoglog [1] algorithm for approximating the cardinality of a huge set with limited memory.<p>[1] <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;HyperLogLog" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;HyperLogLog</a>
评论 #32797991 未加载
评论 #32797601 未加载
vanderZwanover 2 years ago
I wonder if this would be a good data stucture to use to approximately find the most common substrings for inputs where doing so exhaustively would lead to too large memory overhead. For example to &quot;train&quot; a dictionary for compression. I was thinking something like:<p><pre><code> 1. pick a substring length L to look for 2. pick a number M for how many &quot;most common substrings of length L&quot; we wish to find 3. initiate a count-min sketch for approximate counting of substrings, and an array of size M for keeping track of the top M substrings 4. do a linear scan over the input (length N) to count each substring of length L (so ideally we&#x27;d be using rolling hashes for the count-min sketch) </code></pre> We probably would have some false positives, and partialy overlapping substrings (I expect the latter is a difficult problem to solve elegantly, unless scanning over the text in steps of L works better than I think it would), but it might be an efficient heuristic.<p>The exhaustive search would be almost the same but use a prefix tree (aka trie) instead of a CMS, which has good average case overhead because tree, but if we&#x27;re dealing with a &quot;every substring is as unique as possible&quot; worst case scenarios memory overhead could blow up to [input size times fairly big constant].<p>Said worst case scenario is quite easy to construct btw, just start with <a href="https:&#x2F;&#x2F;oeis.org&#x2F;A142150" rel="nofollow">https:&#x2F;&#x2F;oeis.org&#x2F;A142150</a> using 8 bit numbers, then modify to exhaust all remaining three-number combinations after exhausting all unique pair combinations, and so on.