Streaming joins are hard

117 points by danthelion, 7 months ago

12 comments

ryzhyk, 7 months ago
The correct way to think about the problem is in terms of evaluating joins (or any other queries) over changing datasets. And for that you need an engine designed for *incremental* processing from the ground up: algorithms, data structures, the storage layer, and of course the underlying theory. If you don't have such an engine, you're doomed to build a layer of hacks and still fail to do it well.

We've been building such an engine at Feldera (https://www.feldera.com/), and it can compute joins, aggregates, window queries, and much more fully incrementally. All you have to do is write your queries in SQL, attach your data sources (stream or batch), and watch results get incrementally updated in real time.
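
As a rough illustration of what "incremental" means for a join (this is not Feldera's implementation, and the class and method names are hypothetical): when batches of new rows dA and dB arrive, the new output rows are dA ⋈ B + A ⋈ dB + dA ⋈ dB, so the engine only needs indexes over what it has already seen. A minimal insert-only sketch:

```python
from collections import defaultdict


class IncrementalJoin:
    """Insert-only incremental equi-join of two inputs on a key.

    Illustrative assumption: rows are (key, value) pairs and are never
    deleted or updated. Real engines must handle retractions as well.
    """

    def __init__(self):
        # Indexes over every row seen so far, grouped by join key.
        self.left = defaultdict(list)
        self.right = defaultdict(list)

    def step(self, left_delta, right_delta):
        """Apply one batch of new rows and return only the new output rows."""
        out = []
        # dA joined with the old contents of B.
        for k, v in left_delta:
            out.extend((k, v, w) for w in self.right[k])
        # Fold dA into A so the next loop covers both A_old x dB and dA x dB.
        for k, v in left_delta:
            self.left[k].append(v)
        for k, w in right_delta:
            out.extend((k, v, w) for v in self.left[k])
            self.right[k].append(w)
        return out
```

Each call to `step` does work proportional to the size of the batch and its matches, not to the size of the accumulated inputs. The cost is the state: both indexes grow for as long as rows keep arriving.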
crazygringo, 7 months ago
Can someone explain what the use case is for streaming joins in the first place?

I've written my fair share of joins in SQL. They're indispensable.

But I've never come across a situation where I needed to join data from two streams in real time as they're both coming in. I'm not sure I even understand what that's supposed to mean conceptually.

It's easy enough to dump streams into a database and query the database, but clearly this isn't about that.

So what's the use case for joins on raw stream data?
fifilura, 7 months ago
A couple of years ago Materialize had all the buzz; I'm not sure what the difference is.

https://materialize.com/
10000truths, 7 months ago
Streams are *conceptually* infinite, yes, but many streaming use cases are dealing with a finite amount of data that's larger than memory but fits on disk. In those cases, you can typically get away with materializing your inputs to a temporary file in order to implement joins, sorts, percentile aggregations, etc.
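
A minimal sketch of that spill-to-disk idea, assuming both inputs are finite iterables of `(key, value)` rows and using a throwaway SQLite file as the temporary store. The function name and the choice of SQLite are illustrative, not something the comment specifies:

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def join_via_temp_file(left_rows, right_rows):
    """Join two finite iterables of (key, value) rows by spilling them to disk."""
    fd, path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)  # SQLite opens the (empty) file itself
    try:
        with closing(sqlite3.connect(path)) as con:
            con.execute("CREATE TABLE l (k, v)")
            con.execute("CREATE TABLE r (k, v)")
            # executemany pulls rows from the iterables one at a time, so
            # neither input has to fit in memory as a whole.
            con.executemany("INSERT INTO l VALUES (?, ?)", left_rows)
            con.executemany("INSERT INTO r VALUES (?, ?)", right_rows)
            con.execute("CREATE INDEX r_k ON r (k)")
            con.commit()
            yield from con.execute(
                "SELECT l.k, l.v, r.v FROM l JOIN r ON l.k = r.k"
            )
    finally:
        os.remove(path)
```

The join only starts once both inputs have been fully written, which is exactly why this works for large-but-finite data and not for a stream that never ends.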
tombert, 7 months ago
A large part of my job in the last few months has been figuring out how to optimize joins in Kafka Streams.

Kafka Streams, by default, uses either RocksDB or an in-memory system for the join buffer, which is fine but completely devours your RAM, so I have been writing something more tuned for our work that actually uses Postgres as the state store.

It works, but optimizing JOINs is almost as much of an art as it is a science. Trying to optimize caches and predict stuff so you can minimize the cost of latency ends up being a lot of “guess and check” work, particularly if you want to keep memory usage reasonable.
mattxxx, 7 months ago
JOINs are just hard, *period*. When you're operating at a large scale, you need to be thinking about exactly *how* to partition and index your data for the types of queries that you want to write with JOINs.

Streaming joins are *so hard* that they're an anti-pattern. If you're using external storage to make it work, then your architecture has probably gone really wrong or you're using streams for something that you shouldn't.
jdelman, 7 months ago
The ability to express joins in terms of SQL with Estuary is pretty cool. Flink can do a lot of what is described in this post, but you have to set up a lot of intermediate structures, write a lot of Java/Scala, and store your state as protos to support backwards compatibility. Abstracting all of that away would be a huge time saver, but I imagine not having fine-grained control over the results and join methods could be frustrating.
neeleshs, 7 months ago
"Unlike batch tables, streams are infinite. You can't "just wait" for all the rows to arrive before performing a join."

I view batch tables as simply a given state of some set of streams at a point in time. Running the same query against "batch" tables at different points in time yields different results (assuming the table is churning over time).
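
To make that view concrete, here is a small sketch that folds a changelog stream into a table snapshot as of a given time. The event shape `(timestamp, op, key, value)` and the function name are assumptions made for illustration:

```python
def table_at(changelog, as_of):
    """Fold (timestamp, op, key, value) events up to `as_of` into a dict 'table'.

    Assumes the changelog is ordered by timestamp and that `op` is either
    "upsert" or "delete".
    """
    table = {}
    for ts, op, key, value in changelog:
        if ts > as_of:
            break  # later events are not part of this snapshot
        if op == "upsert":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table


log = [
    (1, "upsert", "a", 1),
    (2, "upsert", "b", 2),
    (3, "delete", "a", None),
    (4, "upsert", "b", 5),
]
early = table_at(log, as_of=2)
late = table_at(log, as_of=4)
```

The same "query" (here, just reading the table) returns different contents for `early` and `late`, because each is the state of the same stream at a different point in time.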
manx, 7 months ago
I think it should be possible to create a compiler which transforms arbitrary SQL queries into a set of triggers and temporary tables to get incremental materialized views which are just normal tables. Those can be indexed, joined, etc., with no extra services needed. Such an approach should in theory work for multiple relational database systems if it's all adhering to standards.
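
A hand-written sketch of what such a compiler might emit for a single two-table join, using SQLite through Python's `sqlite3` module. The schema and trigger names are hypothetical, only inserts are handled, and trigger syntax differs between database systems, so this is an illustration of the idea and not a portable recipe:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- The "materialized view" is an ordinary table that the triggers keep current.
CREATE TABLE order_names (order_id INTEGER, name TEXT, amount INTEGER);

-- New order: join it against the customers already present.
CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
  INSERT INTO order_names
  SELECT NEW.id, c.name, NEW.amount FROM customers c WHERE c.id = NEW.customer_id;
END;

-- New customer: join it against the orders already present.
CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
  INSERT INTO order_names
  SELECT o.id, NEW.name, o.amount FROM orders o WHERE o.customer_id = NEW.id;
END;
""")

con.execute("INSERT INTO orders VALUES (1, 10, 50)")     # customer not known yet
con.execute("INSERT INTO customers VALUES (10, 'Ada')")  # matches order 1
con.execute("INSERT INTO orders VALUES (2, 10, 75)")     # matches customer 10

view = con.execute("SELECT * FROM order_names ORDER BY order_id").fetchall()
full = con.execute(
    "SELECT o.id, c.name, o.amount FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()
```

Here `view` and `full` contain the same rows, but `view` was maintained one insert at a time. Supporting updates and deletes would need further triggers, and aggregates would need extra bookkeeping tables, which is where the "arbitrary SQL" part gets difficult.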
ctenb, 7 months ago
If both inputs are ordered by a subset of the join key, you can stream the join operation. It depends on your domain whether this can be made the case, of course. If one of the two join operands is much smaller than the other, you can make the join operation streaming for the larger operand.
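
Both cases can be sketched in a few lines. The sketch assumes rows are `(key, value)` pairs and, for the first function, that both inputs are sorted by the full join key (a simplification of the "subset of the join key" case); the function names are illustrative:

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Stream an equi-join of two iterables of (key, value) sorted by key.

    Only the right-hand rows for the current key are buffered.
    """
    key = itemgetter(0)
    lgroups = groupby(left, key)
    rgroups = groupby(right, key)
    lk, lrows = next(lgroups, (None, None))
    rk, rrows = next(rgroups, (None, None))
    while lrows is not None and rrows is not None:
        if lk < rk:
            lk, lrows = next(lgroups, (None, None))
        elif lk > rk:
            rk, rrows = next(rgroups, (None, None))
        else:
            rbuf = [v for _, v in rrows]
            for _, lv in lrows:
                for rv in rbuf:
                    yield (lk, lv, rv)
            lk, lrows = next(lgroups, (None, None))
            rk, rrows = next(rgroups, (None, None))


def hash_join_small(big, small):
    """Stream `big` against an in-memory index built from the smaller input."""
    index = {}
    for k, v in small:
        index.setdefault(k, []).append(v)
    for k, v in big:
        for w in index.get(k, ()):
            yield (k, v, w)
```

`merge_join` never holds more than one key's worth of rows, and `hash_join_small` holds only the small side, so in both cases the large input can be an unbounded iterator.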
bob1029, 7 months ago
> Streaming data isn't static like tables in databases: it's unbounded, constantly updating, and poses significant challenges in managing state.

I don't really see the difference between tables and streams. Data in tables changes over time too. You can model a stream as a table with any degree of fidelity you desire. In fact, I believe this could be considered a common approach for implementing streaming abstractions.
hamandcheese, 7 months ago
It seems intuitive to me that a correct streaming join is impossible without an infinite buffer and strong guarantees on how events are ordered. The number of real world systems offering both of those guarantees is zero. Anyone espousing streaming joins as a general solution should be avoided at all costs, particularly if they have a title that contains "architect" or "enterprise" (god forbid both in the same title).

At best, it is a trick to be applied in very specific circumstances.
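
The trade-off can be shown with a toy window join that keeps only recent rows. The event shape `(timestamp, side, key, value)`, the function name, and the assumption that events arrive in timestamp order are all illustrative:

```python
from collections import deque


def windowed_join(events, window):
    """Join 'L' and 'R' events whose timestamps differ by at most `window`.

    `events` yields (timestamp, side, key, value) in arrival order, assumed
    non-decreasing in timestamp. Rows older than `window` are evicted, so
    memory stays bounded but matches outside the window are never produced.
    """
    buffers = {"L": deque(), "R": deque()}
    for ts, side, key, value in events:
        other = "R" if side == "L" else "L"
        # Evict rows that are too old to match anything arriving from now on.
        for buf in buffers.values():
            while buf and buf[0][0] < ts - window:
                buf.popleft()
        for _, okey, ovalue in buffers[other]:
            if okey == key:
                pair = (value, ovalue) if side == "L" else (ovalue, value)
                yield (key, *pair)
        buffers[side].append((ts, key, value))


events = [
    (0, "L", "k1", "l0"),
    (1, "R", "k1", "r1"),
    (20, "R", "k1", "r20"),
]
narrow = list(windowed_join(events, window=5))
wide = list(windowed_join(events, window=100))
```

With the narrow window the late `r20` row finds nothing to join with, because `l0` has already been evicted; with the wide window it matches. Getting every match for arbitrary inputs would require a window, and therefore a buffer, with no upper bound.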