I love the article on HyperLogLog! It's a really good read even if you're not interested in algorithms. I've always liked number theory, and I think it's very interesting that you can guess how many uniques there are by counting how long the longest run of zeroes in a hash is.<p>I suppose this could be broken by injecting a unique visitor ID that hashes to something with an absurd number of zeroes? That's assuming the user has control over their user ID and that I'm understanding the algorithm correctly.
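To make the "longest run of zeroes" idea concrete, here is a minimal HyperLogLog-style sketch in Python. It is not Reddit's or Redis's implementation: the names `TinyHLL`, `hash64` and `rank`, the use of SHA-1, and the choice of 1024 registers are all assumptions for illustration. The constants follow the standard HyperLogLog estimator.

```python
import hashlib
import math


def hash64(value: str) -> int:
    # Assumption: take the first 8 bytes of SHA-1 as a 64-bit hash.
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")


def rank(w: int, bits: int) -> int:
    # 1-based position of the leftmost 1-bit in a `bits`-wide word,
    # i.e. (number of leading zeroes) + 1; returns bits + 1 when w == 0.
    return bits - w.bit_length() + 1


class TinyHLL:
    def __init__(self, p: int = 10):
        self.p = p                 # number of bits used to pick a register
        self.m = 1 << p            # number of registers
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = hash64(value)
        idx = h >> (64 - self.p)                   # top p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining bits feed the zero count
        self.registers[idx] = max(self.registers[idx], rank(rest, 64 - self.p))

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = TinyHLL()
    for i in range(10_000):
        hll.add(f"user-{i}")
    print("estimate before:", round(hll.estimate()))

    # Simulate one crafted ID whose hash lands in register 0 with the
    # maximum possible run of zeroes.
    hll.registers[0] = 64 - hll.p + 1
    print("estimate after crafted ID:", round(hll.estimate()))
```

In this sketch a single crafted ID can only max out one register out of 1024, and the estimate uses a harmonic mean over all registers, so one outlier moves the result only slightly. An attacker would need many IDs spread across many registers to inflate the count substantially. A naive version with a single "longest run" counter would be fooled completely by one such ID.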
"We want to better communicate the scale of Reddit to our users."<p>If that's true why did they hide vote numbers on comments and posts? It used to say "xxx upvotes xxx downvotes" now it just gives a number and hides that.
Counting views/impressions in combination with Apache Kafka sounds like the ideal use case for a stream processor like Apache Flink. It supports very large state, which can be managed off-heap. This should enable you to count the exact number of unique views in real time with exactly-once semantics. Here is a blog post on large-scale counting with more details. It also includes a comparison with other streaming technologies like Samza and Spark: <a href="https://data-artisans.com/blog/counting-in-streams-a-hierarchy-of-needs" rel="nofollow">https://data-artisans.com/blog/counting-in-streams-a-hierarc...</a><p>Also check out this blog post by a Twitter engineer on counting ad impressions: <a href="https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark" rel="nofollow">https://data-artisans.com/blog/extending-the-yahoo-streaming...</a>
So how do they determine whether a user has already viewed a post? I would think that unique counting is accomplished using the HyperLogLog counter, but the article says that this decision is made by the Nazar system, which doesn't use the HyperLogLog counter in Redis.
Wouldn't it have been easier to simply increment a counter for each visit and then set a short-lived cookie in the browser for that post?
And put the spam detection system before the counter increment
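A rough sketch of what the two comments above propose, in plain Python. Everything here is hypothetical: `handle_view`, `looks_like_spam`, the cookie name, and the 15-minute lifetime are made up for illustration, and a dict stands in for the browser's cookie jar.

```python
import time
from collections import Counter

COOKIE_TTL_SECONDS = 15 * 60   # hypothetical "short-lived" cookie lifetime
view_counts = Counter()        # stand-in for whatever stores the counts


def looks_like_spam(request: dict) -> bool:
    # Placeholder rule only; a real spam check would be far more involved.
    return request.get("user_agent", "") == ""


def handle_view(post_id: str, request: dict, cookies: dict, now=None) -> dict:
    """Count a view unless it is spam or a repeat; return the updated cookies."""
    now = time.time() if now is None else now

    # Spam detection runs before the counter is touched.
    if looks_like_spam(request):
        return cookies

    key = f"viewed_{post_id}"
    expires = cookies.get(key)
    if expires is not None and expires > now:
        return cookies          # cookie still valid: repeat view, not counted

    view_counts[post_id] += 1
    return {**cookies, key: now + COOKIE_TTL_SECONDS}
```

Note that in this sketch deduplication depends entirely on the client sending the cookie back, so a client that drops cookies is counted on every request.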
A weird thing I've been seeing on Reddit is comment upvote counts being off by one periodically on page refreshes. Reload, you get 3. Reload again, you get 4. Again, you get 3. Seems like a replication issue?
Very interesting article, thanks for publishing.<p>I have two related questions:
1. I assume the process that reads from Cassandra and writes back to Redis is parallelized, if not distributed. How do you ensure correctness? Implementing 2PC seems like extreme overhead. Or do you lock in Redis?
2. What database is used to actually store the view counts? Cassandra's counters are, AFAIK, not very reliable...
Slightly OT, but I wish Reddit would use traditional forum-style replies to push threads up, instead of the positive feedback loop of votes: opinions that agree with the majority get upvotes, which bring more views, which bring proportionally more upvotes.
Probably a noob question, but:<p>>> Nazar will then alter the event, adding a Boolean flag indicating whether or not it should be counted, before sending the event back to Kafka.<p>Why don't they just discard it instead of putting the event back into Kafka?
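For reference, the quoted step can be sketched like this. It is an illustration only, not the article's code: plain Python lists stand in for Kafka topics, and the field name `should_count` and the functions `annotate` and `count_views` are assumptions.

```python
def annotate(event: dict, should_count: bool) -> dict:
    # Return a copy of the event with the Boolean flag added;
    # the event itself is kept, not dropped.
    return {**event, "should_count": should_count}


def count_views(events: list) -> dict:
    # A downstream consumer that only counts events flagged as countable.
    counts = {}
    for event in events:
        if event.get("should_count"):
            post_id = event["post_id"]
            counts[post_id] = counts.get(post_id, 0) + 1
    return counts


if __name__ == "__main__":
    incoming = [
        {"post_id": "a", "user": "u1"},
        {"post_id": "a", "user": "u1"},   # suppose this one is judged a repeat
        {"post_id": "b", "user": "u2"},
    ]
    verdicts = [True, False, True]        # hypothetical decisions
    outgoing = [annotate(e, v) for e, v in zip(incoming, verdicts)]
    print(count_views(outgoing))
```

One consequence of flagging rather than discarding, at least in this sketch, is that rejected events stay in the stream for any other consumer that wants them. Whether that is the actual reason is not stated in the quoted passage.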
At <a href="https://trackingco.de/" rel="nofollow">https://trackingco.de/</a> we store events on Redis and compile them daily into a reduced string format, storing these on CouchDB.