Streaming joins are hard

117 points by danthelion 7 months ago

12 comments

ryzhyk 7 months ago

The correct way to think about the problem is in terms of evaluating joins (or any other queries) over changing datasets. And for that you need an engine designed for *incremental* processing from the ground up: algorithms, data structures, the storage layer, and of course the underlying theory. If you don't have such an engine, you're doomed to build layers of hacks, and still fail to do it well.

We've been building such an engine at Feldera (https://www.feldera.com/), and it can compute joins, aggregates, window queries, and much more fully incrementally. All you have to do is write your queries in SQL, attach your data sources (stream or batch), and watch results get incrementally updated in real time.
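To make "joins over changing datasets" concrete, here is a minimal Python sketch of an incrementally maintained equi-join. It is not Feldera's implementation; the class name `IncrementalJoin`, its methods, and the `+1`/`-1` weight convention are assumptions made for illustration. Each insert or delete on one side is matched only against the indexed state of the other side, so the output is a stream of changes rather than a recomputed result.

```python
from collections import defaultdict


class IncrementalJoin:
    """Hypothetical sketch: maintains an equi-join of two changing
    collections and emits only the changes to the join result."""

    def __init__(self):
        # key -> {row: multiplicity} for each input side
        self.left = defaultdict(lambda: defaultdict(int))
        self.right = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def _apply(index, key, row, weight):
        # Record the change and drop entries whose multiplicity returns to 0.
        index[key][row] += weight
        if index[key][row] == 0:
            del index[key][row]
            if not index[key]:
                del index[key]

    def update_left(self, key, row, weight):
        """weight=+1 inserts a left row, weight=-1 deletes it.
        Returns a list of ((key, left_row, right_row), weight) changes."""
        out = [((key, row, r), weight * w)
               for r, w in self.right.get(key, {}).items()]
        self._apply(self.left, key, row, weight)
        return out

    def update_right(self, key, row, weight):
        """Same as update_left, for the right input."""
        out = [((key, l, row), weight * w)
               for l, w in self.left.get(key, {}).items()]
        self._apply(self.right, key, row, weight)
        return out


join = IncrementalJoin()
first = join.update_left("u1", "alice", +1)       # no matching right rows yet
second = join.update_right("u1", "order-7", +1)   # one joined row appears
third = join.update_left("u1", "alice", -1)       # the joined row is retracted
```

The work per update is proportional to the number of matching rows on the other side, not to the size of either input. The cost is that both sides must stay indexed for as long as they can still change.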

crazygringo 7 months ago

Can someone explain what the use case is for streaming joins in the first place?

I've written my fair share of joins in SQL. They're indispensable.

But I've never come across a situation where I needed to join data from two streams in real time as they're both coming in. I'm not sure I even understand what that's supposed to mean conceptually.

It's easy enough to dump streams into a database and query the database, but clearly this isn't about that.

So what's the use case for joins on raw stream data?

fifilura 7 months ago

A couple of years ago Materialize had all the buzz; I'm not sure what the difference is.

https://materialize.com/

10000truths 7 months ago

Streams are conceptually infinite, yes, but many streaming use cases are dealing with a finite amount of data that's larger than memory but fits on disk. In those cases, you can typically get away with materializing your inputs to a temporary file in order to implement joins, sorts, percentile aggregations, etc.
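One way to picture the temporary-file approach is the Python sketch below, which spills both finite inputs into a throwaway on-disk SQLite database and lets SQLite run the join. The function name `disk_backed_join`, the table names, and the choice of SQLite are assumptions for illustration; the comment does not name a specific tool.

```python
import os
import sqlite3
import tempfile


def disk_backed_join(left, right):
    """Hypothetical sketch: join two finite iterables of (key, value)
    pairs by materializing them in a temporary on-disk database."""
    with tempfile.TemporaryDirectory() as tmp:
        con = sqlite3.connect(os.path.join(tmp, "join.db"))
        try:
            con.execute("CREATE TABLE l (k, v)")
            con.execute("CREATE TABLE r (k, v)")
            # executemany consumes the iterables row by row, so the inputs
            # never need to be held in memory all at once.
            con.executemany("INSERT INTO l VALUES (?, ?)", left)
            con.executemany("INSERT INTO r VALUES (?, ?)", right)
            con.execute("CREATE INDEX r_k ON r (k)")
            con.commit()
            yield from con.execute(
                "SELECT l.k, l.v, r.v FROM l JOIN r ON l.k = r.k "
                "ORDER BY l.k, l.v, r.v"
            )
        finally:
            con.close()


rows = list(disk_backed_join(
    iter([(1, "a"), (2, "b")]),
    iter([(2, "x"), (2, "y"), (3, "z")]),
))
```

This only works because the inputs end: the join starts after the last row has been written, which is exactly the step a truly unbounded stream never reaches.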

tombert 7 months ago

A large part of my job in the last few months has been figuring out how to optimize joins in Kafka Streams.

Kafka Streams, by default, uses either RocksDB or an in-memory system for the join buffer, which is fine but completely devours your RAM, and so I have been writing something more tuned for our work that actually uses Postgres as the state store.

It works, but optimizing JOINs is almost as much of an art as it is a science. Trying to optimize caches and predict stuff so you can minimize the cost of latency ends up being a lot of “guess and check” work, particularly if you want to keep memory usage reasonable.

mattxxx 7 months ago

JOINs are just hard, period. When you're operating at a large scale, you need to be thinking about exactly how to partition + index your data for the types of queries that you want to write with JOINs.

Streaming joins are so hard that they're an anti-pattern. If you're using external storage to make it work, then your architecture has probably gone really wrong or you're using streams for something that you shouldn't.

jdelman 7 months ago

The ability to express joins in terms of SQL with Estuary is pretty cool. Flink can do a lot of what is described in this post, but you have to set up a lot of intermediate structures, write a lot of Java/Scala, and store your state as protos to support backwards compatibility. Abstracting all of that away would be a huge time saver, but I imagine not having fine-grained control over the results and join methods could be frustrating.

neeleshs 7 months ago

"Unlike batch tables, streams are infinite. You can't "just wait" for all the rows to arrive before performing a join."

I view batch tables as simply a given state of some set of streams at a point in time. Running the same query against "batch" tables at different points in time yields different results (assuming the table is churning over time).

manx 7 months ago

I think it should be possible to create a compiler which transforms arbitrary SQL queries into a set of triggers and temporary tables to get incremental materialized views which are just normal tables. Those can be indexed, joined, etc., with no extra services needed. Such an approach should in theory work for multiple relational database systems if it's all adhering to standards.

ctenb 7 months ago

If both inputs are ordered by a subset of the join key, you can stream the join operation. It depends on your domain whether this can be made the case, of course. If one of the two join operands is much smaller than the other, you can make the join operation streaming for the larger operand.
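Both cases can be sketched in a few lines of Python. The function names `merge_join` and `hash_join` are illustrative, and the sketch assumes the simplest version of the first case, where both inputs are sorted by the full join key and rows are `(key, value)` pairs. The merge join holds only the current key group in memory; the hash join holds only the small operand.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner equi-join of two iterables of (key, value) pairs,
    both assumed to be sorted by key."""
    lgroups = groupby(left, key=itemgetter(0))
    rgroups = groupby(right, key=itemgetter(0))
    lk, lg = next(lgroups, (None, None))
    rk, rg = next(rgroups, (None, None))
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(lgroups, (None, None))   # left key has no partner
        elif lk > rk:
            rk, rg = next(rgroups, (None, None))   # right key has no partner
        else:
            rvals = [v for _, v in rg]             # buffer one key group only
            for _, lv in lg:
                for rv in rvals:
                    yield lk, lv, rv
            lk, lg = next(lgroups, (None, None))
            rk, rg = next(rgroups, (None, None))


def hash_join(big, small):
    """Streams over `big` while holding only `small` in memory."""
    table = defaultdict(list)
    for k, v in small:
        table[k].append(v)
    for k, v in big:
        for sv in table.get(k, ()):
            yield k, v, sv


merged = list(merge_join(
    [(1, "a"), (2, "b"), (2, "c"), (4, "d")],
    [(2, "x"), (3, "y"), (4, "z"), (4, "w")],
))
hashed = list(hash_join(
    iter([(1, "a"), (2, "b"), (2, "c")]),
    [(2, "x")],
))
```

Neither function needs to see the end of its streamed input before producing output, which is what makes them usable on data that keeps arriving, as long as the ordering or size assumption holds.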

bob1029 7 months ago

> Streaming data isn't static like tables in databases - it's unbounded, constantly updating, and poses significant challenges in managing state.

I don't really see the difference between tables & streams. Data in tables changes over time too. You can model a stream as a table with any degree of fidelity you desire. In fact, I believe this could be considered a common approach for implementing streaming abstractions.

hamandcheese 7 months ago

It seems intuitive to me that a correct streaming join is impossible without an infinite buffer and strong guarantees on how events are ordered. The number of real world systems offering both of those guarantees is zero. Anyone espousing streaming joins as a general solution should be avoided at all costs, particularly if they have a title that contains "architect" or "enterprise" (god forbid both in the same title).

At best, it is a trick to be applied in very specific circumstances.
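The trade-off described here shows up directly in a bounded-buffer join. The Python sketch below is an illustration only: the class name `WindowedJoin`, the `window` parameter, and the simple "largest timestamp seen" watermark are assumptions, not any particular system's semantics. State older than the window is evicted, so memory stays bounded, but an event that arrives too late is dropped and its matches are never produced.

```python
class WindowedJoin:
    """Hypothetical sketch: equi-join of two event streams that only
    keeps events from the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.watermark = float("-inf")          # largest timestamp seen so far
        self.state = {"left": [], "right": []}  # buffered (ts, key, value)
        self.dropped = 0                        # late events that were discarded

    def on_event(self, side, ts, key, value):
        """Feed one event; returns the (key, left_value, right_value) matches."""
        self.watermark = max(self.watermark, ts)
        horizon = self.watermark - self.window
        # Evict everything older than the horizon to bound memory use.
        for s in self.state:
            self.state[s] = [e for e in self.state[s] if e[0] >= horizon]
        if ts < horizon:
            self.dropped += 1                   # too late: matching state is gone
            return []
        other = "right" if side == "left" else "left"
        matches = [
            (key, value, v) if side == "left" else (key, v, value)
            for _, k, v in self.state[other]
            if k == key
        ]
        self.state[side].append((ts, key, value))
        return matches


wj = WindowedJoin(window=10)
m1 = wj.on_event("left", 0, "k", "L0")
m2 = wj.on_event("right", 5, "k", "R5")      # within the window: joins with L0
m3 = wj.on_event("right", 100, "k", "R100")  # L0 and R5 have been evicted
m4 = wj.on_event("left", 3, "k", "late")     # arrives after the horizon moved on
```

A larger window trades memory for fewer missed matches; only an unbounded window together with an ordering guarantee would rule them out entirely.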
