科技回声

12 comments

seddonm1, about 4 years ago

After being frustrated with building 'traditional' ETL (Extract-Transform-Load) pipelines, and around the same time as the famous 'Engineers Shouldn’t Write ETL' blog post, we started building a framework/toolkit that lets Technical Business Analysts build reliable data pipelines without much developer support: Arc. It has been implemented as a Jupyter Notebooks extension.

Arc is declarative and currently targets the Apache Spark execution engine, but the abstracted API allows the execution engine to be replaced in future without rewriting the logic or intent of the pipeline. It supports parameterized notebooks for building complex pipelines, which can be executed in CI/CD environments for safe deployment.

We would be interested to hear your feedback.

lordgroff, about 4 years ago

I'm in the process of doing something like this internally, at a smaller scale, and it's interesting to see that many of the concepts I've been experimenting with and playing around with are formalized here in a similar manner. My "solution" doesn't build on Spark, as I just don't have enough data to necessitate it. I think the big difference is really the project's SQL first approach, which is probably going to polarize: personally, it's a decision I can't abide by, but I'm sure a lot of people will love that.

superyesh, about 4 years ago

> Arc is an opinionated framework for defining predictable, repeatable and manageable data transformation pipelines;

I am confused by the title `Arc, an open-source Databricks alternative`. One of the main benefits of Databricks is the managed Spark. This isn't replacing Databricks as such; it is probably an alternative to one of the features in Databricks.

glogla, about 4 years ago

I like it a lot, but how large scale can it be?

If I want to move a whole JDBC-accessible database to a warehouse or lakehouse (like Postgres or Oracle to S3 with Iceberg or Snowflake or something), do I have to build a set of configuration for every table, or can I do some wildcards, autodetections, etc?

xupybd, about 4 years ago

I like the look of this but worry about adopting something as big as this. That said, things tend to grow, and then I wish I'd started with something like this.

crimsoneer, about 4 years ago

As a data person who despairs at the terrible data pipelines I have to work with, this seems cool! Shall follow with interest.

marcinzm, about 4 years ago

I'm curious how this compares to www.getdbt.com which seems to target a similar audience (technical analysts wanting to do ETL) with a similar approach (SQL first).

0x008, about 4 years ago

The idea makes sense, but Databricks exposes the complete Spark API. Is that true for this project as well? Spark is a lot more than Spark SQL.

psing, about 4 years ago

Can you choose between complete pulls of the source and incremental pulls based on a timestamp field?

justosophy, about 4 years ago

Good to see more attention to this. AWS did a presentation on it last year.

robobro, about 4 years ago

Remember when Arc was a Lisp that powered Hacker News? Glad to read she's all grown up.

ozten, about 4 years ago

Arc as a project name on HN ?!? OP account created November 13, 2018... okay, alright.

Show HN: Arc, an open-source Databricks alternative
