I'm surprised that there isn't a mention of Google Dataflow, aka Apache Beam. The Beam programming model is specifically designed to solve nearly all of the problems this post is addressing.

> It is likely that one day, I'll need to shard the data and distribute the processing amongst multiple servers. But no company I've used this with currently has enough data flowing through its analytics system, or intense amounts of real-time processing, to warrant such a complexity.

This solution is wildly over-engineered for such a low data volume. You could capture all of the business value for 10% of the engineering effort by just dumping the data into a database meant for analytics. Then you'd at least have an answer for things like fixing broken data, making full-history business logic changes, merging events, etc.

If you're in AWS, sending your events from Snowplow into S3 and then into Redshift/Athena/Presto/PostgreSQL is the way to go.
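
To make that concrete, here's a rough Python/boto3 sketch of the S3 + Athena path. The bucket, database, and table names are made up for illustration, and it assumes an Athena table has already been declared over the Snowplow enriched output; the boto3 calls themselves are the real API.

```python
# Sketch: land events in S3, query full history with Athena.
# Assumes hypothetical bucket "analytics-events", Athena database
# "analytics", and table "events" already exist.
import json
import time

import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# Land a raw event in S3, partitioned by day so Athena scans stay cheap.
event = {"event_id": "e1", "user_id": "u42", "event_name": "page_view"}
s3.put_object(
    Bucket="analytics-events",
    Key="enriched/dt=2019-01-01/e1.json",
    Body=json.dumps(event),
)

# Querying the full history is plain SQL -- fixing broken data or
# re-running business logic is just another query, not a code deploy.
query = athena.start_query_execution(
    QueryString="SELECT event_name, count(*) FROM events GROUP BY 1",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={
        "OutputLocation": "s3://analytics-events/athena-results/"
    },
)

# Poll until the query finishes, then print the result rows.
qid = query["QueryExecutionId"]
while athena.get_query_execution(QueryExecutionId=qid)[
    "QueryExecution"
]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)
for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([f.get("VarCharValue") for f in row["Data"]])
```

That's the whole pipeline: no servers to shard, and the "reprocess everything" story is just running a different query.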