Building a streaming SQL engine with Arrow and DataFusion

112 points by necubi about 1 year ago

11 comments

pantsforbirds about 1 year ago
Arrow has been the most exciting piece of technology I've seen in the last few years. The ecosystem being built around it is amazing, and it's standardizing a bunch of disparate data ecosystems.

The Arrow ecosystem nets you a great compute implementation, storage (Parquet), and a great RPC framework (Arrow Flight).
rahulrs about 1 year ago
SQL streaming engines really seem to be having a moment.

As someone who is less familiar with all the players in the space, how should I think about Arroyo vs. streaming databases like Materialize or caching tools like Readyset?
amath about 1 year ago
Nice work on the performance boost :).

How does it compare with things like:
1. https://github.com/bytewax/bytewax
2. https://github.com/pathwaycom/pathway

I recently read this article (https://materializedview.io/p/from-samza-to-flink-a-decade-of-stream) about Flink, and it commented on how Flink grew to fit all of these different use cases (applications, analytics and ETL) with disjoint requirements that Confluent built kafka-streams, ksql and connectors for. Which of those would you say Arroyo is better suited for?
qazxcvbnm about 1 year ago
Not exactly on-topic, but does anyone know of SQL-to-SQL optimisers or simplifiers (perhaps DataFusion would be able to do this)? I work with generated query systems and SQL macro systems that make fairly complex queries quite easy to generate, but that oftentimes come up with unnecessary joins/subqueries etc.

I find myself needing to mechanically transform and simplify SQL every now and then, and it hardly seems out of reach of automation, yet somehow I've never been able to find software that simplifies and transforms SQL source-to-source. When I last looked, I only found optimisers for SQL execution plans.
memset about 1 year ago
Hi! Just reading the docs, this looks really slick. I had a few questions:

- When you create tables, are they always connected to a source? How does that work for the cloud version (i.e., source = filesystem? It seems we would just use S3.)
- Does Arroyo poll an S3 bucket for new files and automatically ingest them?
- Are you able to do ALTER TABLE? (What if data, or data types, are mismatched?)
- Similarly, am I able to change the primary key (i.e., ClickHouse's ORDER BY or projections?) or change indexes?

Any plans for HTTP as a source? (This is what we build and I'd be happy to prototype an integration!)
benrutter about 1 year ago
Especially factoring in the streaming capabilities, an Arrow-based SQL database is an exciting prospect!

My assumption is that throughput could be increased quite a bit for loading data into Arrow-based libraries like polars or pandas, since data doesn't have to be converted. Any idea if that works out?
fifilura about 1 year ago
I have one question that I could not quite find an answer to.

In Flink you can set timers to wake an event up at an arbitrary time without applying a window. Is this supported in Arroyo?
zenbowman about 1 year ago
This is a great writeup. I work on batch/streaming stuff at Google and I'm very excited by some of the stuff I see in the Rust ecosystem, Arroyo included.
mgaunard about 1 year ago
How does it compare to DuckDB, which is an Arrow-compatible OLAP SQL database, easy to embed and just plain awesome?
Pucilowski about 1 year ago
How would I go about calling Python code as a step, say if I wanted to explore a grid of parameters and fit models accordingly?
esafak about 1 year ago
Looking forward to NATS support ;)