科技回声

3 条评论

armon大约 11 年前

My previous job was at an advertising firm, and we used HyperLogLogs for almost all of our real-time analytics infrastructure. They are incredibly space and time efficient. Each "counter" fits into about a single page of memory, and can count into the trillions with <2% error.We developed an extremely high performance server around it (hlld): <a href="https://github.com/armon/hlld" rel="nofollow">https://github.com/armon/hlld</a>.We were typically hitting with with tens of thousands of requests per second across about 50K counters. Although it was benchmarked to >1MM ops a second.Similarly, we also make bloomd, which is an equivalent for using bloom filters, which provide a more set-like abstraction: <a href="https://github.com/armon/bloomd" rel="nofollow">https://github.com/armon/bloomd</a>

millisecond大约 11 年前

HLL also has two nice real-world optimizations possible depending on use-case.We're storing 100,000+ unique counters, but only around 1% have more than 100 unique objects counted. Some of those 1% have millions of records so HLL is very useful. As the HLL itself is a fixed size (~10kb for decent accuracy) regardless of #counted objects, in the small case you can replace the HLL with a pure set of counted values and produce a HLL when it grows beyond a bound. Because you're storing the raw values, the transition to HLL is seamless.Once you've moved beyond raw storage of values there's a harder but still space-saving technique. If you look at the raw bytes of a ~10kb HLL structure with "only" 10's of thousands of counted values around 90% of them will be zero. Below a certain bound it can save a lot of space to have a map of locations and non-zero byte values rather than a raw array of bytes.

评论 #7754720 未加载

iaw大约 11 年前

I must be missing something...

评论 #7753970 未加载

3 条评论

armon大约 11 年前

millisecond大约 11 年前

评论 #7754720 未加载

iaw大约 11 年前

I must be missing something...

评论 #7753970 未加载

HyperLogLog – Cornerstone of a Big Data Infrastructure

3 条评论

HyperLogLog – Cornerstone of a Big Data Infrastructure

3 条评论