I'm surprised that there isn't a mention of Google Cloud Dataflow, whose programming model is open-sourced as Apache Beam. The Beam programming model is specifically designed to solve nearly all of the problems this post is addressing.<p>> It is likely that one day, I'll need to shard the data and distribute the processing amongst multiple servers. But no company I've used this with currently has enough data flowing through its analytics system, or intense amounts of real-time processing, to warrant such a complexity.<p>This solution is heavily over-engineered for such a low data volume. You could capture all of the business value for 10% of the engineering effort by just dumping this data into a database meant for analytics. And then you'd at least have an answer for things like fixing broken data, making full-history business logic changes, merging events, etc.<p>If you're in AWS, sending your events from Snowplow into S3 and then into Redshift/Athena/Presto/PostgreSQL is the way to go.
I’d recommend the event producer send a UUID that is then the primary key on the events table. The producer should also send the timestamp at which the event occurred.<p>I could be missing something, but that seems to solve both the duplicate event firing (an upsert command based on the UUID makes duplicate event writing a non-issue) and the timing issues.<p>Though I’m still incredibly skeptical of “real-time analytics.” The number of business cases that require actual real-time analysis is pretty limited. High frequency trading and...?
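To make the UUID-plus-upsert idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the column names and the `record_event` helper are all made up for illustration and are not from the article; the `ON CONFLICT` clause assumes SQLite 3.24 or newer, and other databases spell the upsert differently.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# In-memory database for illustration; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        event_id    TEXT PRIMARY KEY,  -- UUID generated by the producer
        occurred_at TEXT NOT NULL,     -- timestamp set by the producer
        name        TEXT NOT NULL
    )
    """
)


def record_event(conn, event_id, occurred_at, name):
    """Hypothetical helper: write an event, ignoring repeats of the same UUID."""
    conn.execute(
        """
        INSERT INTO events (event_id, occurred_at, name)
        VALUES (?, ?, ?)
        ON CONFLICT(event_id) DO NOTHING
        """,
        (event_id, occurred_at, name),
    )
    conn.commit()


# The producer creates the id and timestamp once, when the event happens.
event_id = str(uuid.uuid4())
occurred_at = datetime.now(timezone.utc).isoformat()

record_event(conn, event_id, occurred_at, "page_view")
record_event(conn, event_id, occurred_at, "page_view")  # duplicate delivery

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
stored_at = conn.execute(
    "SELECT occurred_at FROM events WHERE event_id = ?", (event_id,)
).fetchone()[0]
```

Because the id and timestamp travel with the event, a retry or a late arrival lands on the same row with the same `occurred_at`, so the consumer never has to guess whether it has already seen the event or when it happened.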
I have a lot of issues with this project and I’m on my phone so I can’t outline them all. At a quick glance, a few comments have already addressed some of these concerns.<p>But my biggest question is WHY did this person feel it necessary to do this project? From first glance, there is no way Crystal is producing the traffic required to justify rolling this solution. There are dozens of companies that can solve this problem for a few hundred dollars a month and have handled all the problems discussed in this article at serious scale for their customers.<p>The most irritating part is that the developer states in the beginning why he did this: because it’s fun.<p>Disclosure: I’m a founder of a company whose core product is a real-time analytics platform for web and mobile. Of course I’m going to recommend “Buy” for a small company like this in a Build vs Buy analysis. But when “professional engineers” say they’re building projects “because they are fun”, a lot of people suffer.