TechEcho

Hey HN,

Arroyo is a modern, open-source stream processing engine that lets anyone write complex queries on event streams just by writing SQL: windowing, aggregating, and joining events with sub-second latency.

Today, data processing typically happens in batch data warehouses like BigQuery and Snowflake, despite the fact that most of the data arrives as streams. Data teams have to build complex orchestration systems to handle late-arriving data and job failures while trying to minimize latency. Stream processing offers an alternative approach, where the query is compiled into a streaming program that constantly updates as new data comes in, providing low-latency results as soon as the data is available.

I started the Arroyo project after spending the past five years building real-time platforms at Lyft and Splunk. I saw firsthand how hard it is for users to build correct, reliable pipelines on top of existing systems like Flink and Spark Streaming, and how hard those pipelines are for infra teams to operate. I saw the need for a new system that would be easy enough for any data team to adopt, built on modern foundations and with the lessons of the past decade of research and industry development.

Arroyo works by taking SQL queries and compiling them into an optimized streaming dataflow program: a distributed DAG of computation with nodes that read from sources (like Kafka), perform stateful computations, and eventually write results to sinks. That state is consistently snapshotted using a variation of the Chandy-Lamport checkpointing algorithm, both for fault tolerance and to enable fast rescaling and updates of the pipelines. The entire system is easy to self-host on Kubernetes and Nomad.

See it in action here: https://www.youtube.com/watch?v=X1Nv0gQy9TA or follow the getting started guide (https://doc.arroyo.dev/getting-started) to run it locally.
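For readers who want a concrete picture of "stateful computation plus snapshots", here is a minimal single-process Python sketch. It is not Arroyo's code or API: the `TumblingCount` class and its methods are hypothetical names made up for illustration, and `snapshot`/`restore` only copy one operator's local state, which is far simpler than a distributed Chandy-Lamport-style checkpoint.

```python
from collections import defaultdict


class TumblingCount:
    """Toy stateful operator (hypothetical, not Arroyo's API): counts events
    per key in fixed-width event-time windows."""

    def __init__(self, width):
        self.width = width
        self.state = defaultdict(int)  # (window_start, key) -> count

    def process(self, timestamp, key):
        # Assign the event to the window containing its timestamp.
        window_start = timestamp - timestamp % self.width
        self.state[(window_start, key)] += 1

    def on_watermark(self, watermark):
        # Emit and discard every window that ends at or before the watermark.
        closed = sorted(k for k in self.state if k[0] + self.width <= watermark)
        return [(start, key, self.state.pop((start, key))) for start, key in closed]

    def snapshot(self):
        # Copy of local state only; a real engine coordinates this across operators.
        return dict(self.state)

    def restore(self, snap):
        self.state = defaultdict(int, snap)


op = TumblingCount(width=10)
for ts, key in [(1, "a"), (3, "b"), (7, "a"), (12, "a")]:
    op.process(ts, key)

checkpoint = op.snapshot()          # taken before any window is emitted
results = op.on_watermark(10)       # closes the [0, 10) window

# Simulate a failure and recovery from the snapshot.
recovered = TumblingCount(width=10)
recovered.restore(checkpoint)
replayed = recovered.on_watermark(10)
```

Restoring from the snapshot and replaying the same watermark yields the same window results, which is the property a checkpoint is meant to give you after a failure.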

12 comments

thom almost 2 years ago

Unbounded streams, but with watermarks (which right now seem to be fixed length?): https://doc.arroyo.dev/concepts#watermarks

It also works based on fixed, pre-built pipelines. This is all very much in the style of most stream processing platforms today, but I hope we’ll continue to move closer as an industry to having our cake and eating it: ingest everything in real time, while serving any query (with joins) over the full dataset (either incrementally or ad hoc).

yevpats almost 2 years ago

Looks cool. What is the difference between this tool and benthos (https://www.benthos.dev/)?

sorenbs almost 2 years ago

This is a really exciting project! I recently learned about https://github.com/vmware/database-stream-processor which builds on a new theoretical foundation and claims to be 9x faster than Flink. It is also written in Rust, and there is a compiler from SQL to Rust executables. Can you comment on the differences?

benjaminwootton almost 2 years ago

Between Flink, Spark and KSQL, streaming is very JVM-centric. It is nice to see more non-JVM projects emerge.

I am not sure about your premise that the operations side is difficult. It tends to be a matter of submitting a job to a cluster in Flink or Spark.

The harder barrier to entry is the functional style of transformation code. Even though other frameworks have it, I think the SQL API as a first-class citizen is the bigger differentiator.

jsty almost 2 years ago

In the watermarks documentation it mentions that events arriving after the watermark are dropped. Are there any plans to make this configurable (to disable dropping or trigger exception handling) and/or alertable?

I can think of quite a few use cases (particularly in finance) where we'd want late-arrivals to be recorded and possibly incorporated into later or revised results, not silently dropped on the floor.
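As a rough illustration of the behavior being asked for, here is a small Python sketch in which late events are recorded in a separate list instead of being dropped. It is not Arroyo's implementation or configuration: `LateEventRouter`, `max_lateness`, and the fixed-lag watermark rule (watermark = largest timestamp seen minus `max_lateness`) are assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field


@dataclass
class LateEventRouter:
    """Hypothetical sketch: route events that arrive behind the watermark to a
    side list rather than discarding them."""

    max_lateness: int                  # assumed fixed lag behind the max timestamp
    watermark: float = float("-inf")
    on_time: list = field(default_factory=list)
    late: list = field(default_factory=list)

    def process(self, timestamp, payload):
        if timestamp < self.watermark:
            # Late arrival: keep it for auditing or a revised result.
            self.late.append((timestamp, payload))
            return False
        self.on_time.append((timestamp, payload))
        # The watermark only ever moves forward.
        self.watermark = max(self.watermark, timestamp - self.max_lateness)
        return True


router = LateEventRouter(max_lateness=5)
for ts, payload in [(10, "a"), (20, "b"), (12, "c"), (16, "d")]:
    router.process(ts, payload)
```

Here the event at timestamp 12 arrives after the watermark has advanced to 15, so it ends up in `router.late`, where it could feed an alert or a later correction, while the event at 16 is still accepted.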

gunnarmorling almost 2 years ago

Very interesting project, Arroyo has been on my watch list for a while! How would you say Arroyo compares to Apache Flink, i.e. what are the pros and cons? For instance, given it's implemented in Rust, I'd assume Arroyo's resource consumption might be lower?

(Disclaimer: I work for Decodable, where we build a SaaS based on Flink)

dangoodmanUT almost 2 years ago

Would love to know how you look at tools like Materialize in comparison

fasteo almost 2 years ago

Slightly off-topic. "Arroyo" is a Spanish word meaning creek, or stream

KRAKRISMOTT almost 2 years ago

Very exciting, how is feature parity with tinybird? https://www.tinybird.co/

httgp almost 2 years ago

This looks great, and it’s very cool that it recommends Nomad to run it in production.

I wish more products would support (or at least document how to run on) Nomad.

laurensr almost 2 years ago

Would Arroyo be an alternative to Confluent KSQL?

trevyn almost 2 years ago

Any interest in redoing the web console in Rust? 8)

Show HN: Arroyo – Write SQL on streaming data
