TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

HyperLogLog – Cornerstone of a Big Data Infrastructure

63 点作者 martyhu大约 11 年前

3 条评论

armon大约 11 年前
My previous job was at an advertising firm, and we used HyperLogLogs for almost all of our real-time analytics infrastructure. They are incredibly space and time efficient. Each &quot;counter&quot; fits into about a single page of memory, and can count into the trillions with &lt;2% error.<p>We developed an extremely high performance server around it (hlld): <a href="https://github.com/armon/hlld" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;armon&#x2F;hlld</a>.<p>We were typically hitting with with tens of thousands of requests per second across about 50K counters. Although it was benchmarked to &gt;1MM ops a second.<p>Similarly, we also make bloomd, which is an equivalent for using bloom filters, which provide a more set-like abstraction: <a href="https://github.com/armon/bloomd" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;armon&#x2F;bloomd</a>
millisecond大约 11 年前
HLL also has two nice real-world optimizations possible depending on use-case.<p>We&#x27;re storing 100,000+ unique counters, but only around 1% have more than 100 unique objects counted. Some of those 1% have millions of records so HLL is very useful. As the HLL itself is a fixed size (~10kb for decent accuracy) regardless of #counted objects, in the small case you can replace the HLL with a pure set of counted values and produce a HLL when it grows beyond a bound. Because you&#x27;re storing the raw values, the transition to HLL is seamless.<p>Once you&#x27;ve moved beyond raw storage of values there&#x27;s a harder but still space-saving technique. If you look at the raw bytes of a ~10kb HLL structure with &quot;only&quot; 10&#x27;s of thousands of counted values around 90% of them will be zero. Below a certain bound it can save a lot of space to have a map of locations and non-zero byte values rather than a raw array of bytes.
评论 #7754720 未加载
iaw大约 11 年前
I must be missing something...
评论 #7753970 未加载