Arrow has been the most exciting piece of technology I've seen in the last few years. The ecosystem being built around it is amazing, and it's standardizing a bunch of disparate data ecosystems.<p>The Arrow ecosystem nets you a great compute implementation, a columnar storage format (Parquet), and a great RPC framework (Arrow Flight).
SQL streaming engines really seem to be having a moment.<p>As someone who is less familiar with all the players in the space, how should I think about Arroyo vs. streaming databases like Materialize or caching tools like Readyset?
Nice work on the performance boost :).<p>How does it compare with things like:
1. <a href="https://github.com/bytewax/bytewax">https://github.com/bytewax/bytewax</a>
2. <a href="https://github.com/pathwaycom/pathway">https://github.com/pathwaycom/pathway</a><p>I recently read this article (<a href="https://materializedview.io/p/from-samza-to-flink-a-decade-of-stream" rel="nofollow">https://materializedview.io/p/from-samza-to-flink-a-decade-o...</a>) about Flink, and it commented on how Flink grew to fit all of these different use cases (applications, analytics, and ETL) with disjoint requirements, the same ones Confluent built kafka-streams, ksql, and connectors for. Which of those would you say Arroyo is best suited for?
Not exactly on-topic, but does anyone know of SQL-to-SQL optimisers or simplifiers (perhaps DataFusion would be able to do this)? I work with generated query systems and SQL macro systems that make fairly complex queries quite easy to generate, but that often produce unnecessary joins/subqueries, etc.<p>I find myself needing to mechanically transform and simplify SQL every now and then, and it hardly seems out of reach of automation, yet somehow I've never been able to find software that simplifies and transforms SQL source-to-source. When I last looked, I only found optimisers for SQL execution plans.
Hi! Just reading the docs, this looks really slick. I had a few questions:<p>- When you create tables, are they always connected to a source? How does that work for the cloud version? (i.e., is the source a filesystem? It seems we would just use S3.)
- Does Arroyo poll an S3 bucket for new files and automatically ingest?
- Are you able to do ALTER TABLE? (What if data, or data types, are mismatched?)
- Similarly, am I able to change the primary key (ie, clickhouse's ORDER BY or projections?) or change indexes?<p>Any plans for HTTP as a source? (This is what we build and I'd be happy to prototype an integration!)
Especially factoring in the streaming capabilities, an Arrow-based SQL database is an exciting prospect!<p>My assumption is that throughput could be increased quite a bit when loading data into Arrow-based libraries like polars or pandas, since the data doesn't have to be converted. Any idea if that works out?
I have one question that I couldn't quite find an answer to.<p>In Flink you can set timers to wake an event up at an arbitrary time without applying a window. Is this supported in Arroyo?
This is a great writeup, I work on batch/streaming stuff at Google and I'm very excited by some of the stuff I see in the Rust ecosystem, Arroyo included.