Hey HN,<p>Arroyo is a modern, open-source stream processing engine that lets anyone write complex queries on event streams just by writing SQL: windowing, aggregating, and joining events with sub-second latency.<p>Today, data processing typically happens in batch data warehouses like BigQuery and Snowflake, despite the fact that most of the data arrives as streams. Data teams have to build complex orchestration systems to handle late-arriving data and job failures while trying to minimize latency. Stream processing offers an alternative approach, where the query is compiled into a streaming program that constantly updates as new data comes in, providing low-latency results as soon as the data is available.<p>I started the Arroyo project after spending the past five years building real-time platforms at Lyft and Splunk. I saw firsthand how hard it is for users to build correct, reliable pipelines on top of existing systems like Flink and Spark Streaming, and how hard those pipelines are for infra teams to operate. I saw the need for a new system that would be easy enough for any data team to adopt, built on modern foundations and with the lessons of the past decade of research and industry development.<p>Arroyo works by taking SQL queries and compiling them into an optimized streaming dataflow program, a distributed DAG of computation with nodes that read from sources (like Kafka), perform stateful computations, and eventually write results to sinks. That state is consistently snapshotted using a variation of the Chandy-Lamport checkpointing algorithm for fault tolerance and to enable fast rescaling and updates of the pipelines. The entire system is easy to self-host on Kubernetes and Nomad.<p>See it in action here: <a href="https://www.youtube.com/watch?v=X1Nv0gQy9TA">https://www.youtube.com/watch?v=X1Nv0gQy9TA</a> or follow the getting started guide (<a href="https://doc.arroyo.dev/getting-started">https://doc.arroyo.dev/getting-started</a>) to run it locally.
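For readers unfamiliar with barrier-based checkpointing, here is a minimal sketch of the general idea behind Chandy-Lamport-style snapshots in a dataflow. It is not Arroyo's implementation: the `SummingOperator` class, the `BARRIER` marker, and the two-input setup are all hypothetical, and it shows the aligned-barrier variant, where an operator holds back records from inputs that have already delivered a barrier until every input has done so.

```python
# Hypothetical marker that sources would inject into every stream at checkpoint time.
BARRIER = object()


class SummingOperator:
    """Toy stateful operator: keeps a running sum over several input channels."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.total = 0          # the operator's state
        self.blocked = set()    # inputs whose barrier has already arrived
        self.buffered = []      # records held back from blocked inputs
        self.snapshots = []     # state captured at each completed checkpoint

    def receive(self, input_id, message):
        if message is BARRIER:
            self.blocked.add(input_id)
            if len(self.blocked) == self.num_inputs:
                self._checkpoint()
        elif input_id in self.blocked:
            # This record arrived after the barrier on its channel, so it
            # belongs to the next epoch and must not affect the snapshot.
            self.buffered.append(message)
        else:
            self.total += message

    def _checkpoint(self):
        # All barriers are in: the state now reflects exactly the
        # pre-barrier records from every input.
        self.snapshots.append(self.total)
        self.blocked.clear()
        pending, self.buffered = self.buffered, []
        for value in pending:
            self.total += value


op = SummingOperator(num_inputs=2)
op.receive(0, 1)
op.receive(0, BARRIER)   # input 0 is now blocked
op.receive(0, 10)        # held back until the checkpoint completes
op.receive(1, 2)
op.receive(1, BARRIER)   # all barriers received, snapshot is taken
```

In a real engine the operator would also forward the barrier downstream and persist the snapshot to durable storage; both are left out here to keep the sketch short.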
Unbounded streams, but with watermarks (which right now seem to be fixed-length?):<p><a href="https://doc.arroyo.dev/concepts#watermarks">https://doc.arroyo.dev/concepts#watermarks</a><p>It also works based on fixed, pre-built pipelines. This is all very much in the style of most stream processing platforms today, but I hope we’ll continue to move closer as an industry to having our cake and eating it: ingest everything in real time, while serving any query (with joins) over the full dataset (either incrementally or ad hoc).
Looks cool. What is the difference between this tool and benthos (<a href="https://www.benthos.dev/" rel="nofollow">https://www.benthos.dev/</a>)?
This is a really exciting project! I recently learned about <a href="https://github.com/vmware/database-stream-processor">https://github.com/vmware/database-stream-processor</a> which builds on a new theoretical foundation and claims to be 9x faster than Flink. It is also written in Rust, and there is a compiler from SQL to Rust executables. Can you comment on the differences?
Between Flink, Spark and KSQL, streaming is very JVM-centric. It is nice to see more non-JVM projects emerge.<p>I am not sure about your premise that the operations side is difficult. It tends to come down to submitting a job to a cluster in Flink or Spark.<p>The harder barrier to entry is the functional style of transformation code. Even though other frameworks have it, I think the SQL API as a first-class citizen is the bigger differentiator.
In the watermarks documentation it mentions that events arriving after the watermark are dropped. Are there any plans to make this configurable (to disable dropping or trigger exception handling) and/or alertable?<p>I can think of quite a few use cases (particularly in finance) where we'd want late-arrivals to be recorded and possibly incorporated into later or revised results, not silently dropped on the floor.
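To make the question concrete, here is a minimal sketch of a watermark with a fixed lateness bound where late events are recorded instead of dropped. It is purely illustrative and not how Arroyo behaves or is configured: the `WindowedCounter` class, its `max_lateness` parameter, and the `late` side list are all hypothetical names.

```python
from collections import defaultdict


class WindowedCounter:
    """Toy tumbling-window counter driven by a fixed-lateness watermark."""

    def __init__(self, width, max_lateness):
        self.width = width                  # window size in event-time units
        self.max_lateness = max_lateness    # how far the watermark trails the max timestamp
        self.watermark = float("-inf")
        self.open_windows = defaultdict(int)
        self.emitted = []                   # (window_start, key, count) results
        self.late = []                      # events behind the watermark, kept for later handling

    def process(self, timestamp, key):
        if timestamp < self.watermark:
            # Instead of silently dropping the event, record it so it can be
            # alerted on or folded into a revised result downstream.
            self.late.append((timestamp, key))
            return
        window_start = (timestamp // self.width) * self.width
        self.open_windows[(window_start, key)] += 1
        self.watermark = max(self.watermark, timestamp - self.max_lateness)
        self._close_finished_windows()

    def _close_finished_windows(self):
        # sorted() copies the keys, so removing entries while looping is safe.
        for start, key in sorted(self.open_windows):
            if start + self.width <= self.watermark:
                count = self.open_windows.pop((start, key))
                self.emitted.append((start, key, count))


counter = WindowedCounter(width=10, max_lateness=2)
for ts, key in [(1, "a"), (5, "a"), (13, "a"), (4, "a"), (25, "a")]:
    counter.process(ts, key)
```

Here the event at timestamp 4 arrives after the watermark has moved past it, so it ends up in `counter.late` rather than in any window; a finance-style pipeline could use that list to issue corrections.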
Very interesting project, Arroyo has been on my watch list for a while! How would you say Arroyo compares to Apache Flink, i.e. what are the pros and cons? For instance, given it's implemented in Rust, I'd assume Arroyo's resource consumption might be lower?<p>(Disclaimer: I work for Decodable, where we build a SaaS based on Flink)
Very exciting! How does its feature set compare with Tinybird's?<p><a href="https://www.tinybird.co/" rel="nofollow">https://www.tinybird.co/</a>
This looks great, and it’s very cool that it recommends Nomad to run it in production.<p>I wish more products would support (or at least document how to run on) Nomad.