Show HN: Arc, an open-source Databricks alternative

175 points by seddonm1, about 4 years ago

12 comments

seddonm1, about 4 years ago
After being frustrated with building 'traditional' ETL (Extract-Transform-Load) pipelines, and around the same time as the famous 'Engineers Shouldn't Write ETL' blog post, we started building a framework/toolkit that lets Technical Business Analysts build reliable data pipelines without much developer support: Arc. It is implemented as a Jupyter Notebooks extension.

Arc is declarative and currently targets the Apache Spark execution engine, but the abstracted API allows the execution engine to be replaced in future without rewriting the logic or intent of the pipeline. It supports parameterized notebooks for building complex pipelines, which can be executed in CI/CD environments for safe deployment.

We would be interested to hear your feedback.
lordgroff, about 4 years ago
I'm in the process of doing something like this internally, at a smaller scale, and it's interesting to see that many of the concepts I've been experimenting with are formalized here in a similar manner. My "solution" doesn't build on Spark, as I just don't have enough data to necessitate it. I think the big difference is really the project's SQL-first approach, which is probably going to polarize: personally, it's a decision I can't abide by, but I'm sure a lot of people will love that.
superyesh, about 4 years ago
> Arc is an opinionated framework for defining predictable, repeatable and manageable data transformation pipelines;

I am confused by the title "Arc, an open-source Databricks alternative". One of the main benefits of Databricks is managed Spark. This isn't replacing Databricks as such; it is probably an alternative to one of the features in Databricks.
glogla, about 4 years ago
I like it a lot, but how large a scale can it handle?

If I want to move a whole JDBC-accessible database to a warehouse or lakehouse (like Postgres or Oracle to S3 with Iceberg or Snowflake or something), do I have to build a configuration for every table, or can I use wildcards, autodetection, etc.?
xupybd, about 4 years ago
I like the look of this but worry about adopting something as big as this. That said, things tend to grow, and then I wish I'd started with something like this.
crimsoneer, about 4 years ago
As a data person who despairs at the terrible data pipelines I have to work with, this seems cool! Shall follow with interest.
marcinzm, about 4 years ago
I'm curious how this compares to www.getdbt.com, which seems to target a similar audience (technical analysts wanting to do ETL) with a similar approach (SQL-first).
0x008, about 4 years ago
The idea makes sense, but Databricks exposes the complete Spark API. Is that true for this project as well? Spark is a lot more than Spark SQL.
psing, about 4 years ago
Can you choose between complete pulls of the source and incremental pulls based on a timestamp field?
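For readers unfamiliar with the distinction in this question, here is a minimal, generic sketch of a full pull versus a timestamp-based incremental pull. It is not Arc's API and says nothing about how Arc handles this: the `events` table, its columns, the function names, and the in-memory SQLite source are all hypothetical stand-ins chosen for illustration.

```python
import sqlite3


def make_source():
    """Build a hypothetical in-memory source table with a timestamp column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            (1, "a", "2021-03-01T00:00:00"),
            (2, "b", "2021-03-02T00:00:00"),
            (3, "c", "2021-03-03T00:00:00"),
        ],
    )
    return conn


def full_pull(conn):
    """Complete pull: read every row, regardless of when it changed."""
    return conn.execute(
        "SELECT id, payload, updated_at FROM events ORDER BY id"
    ).fetchall()


def incremental_pull(conn, watermark):
    """Incremental pull: read only rows changed after the last seen timestamp.

    Returns the new rows and the watermark to store for the next run.
    ISO-8601 strings of the same format compare correctly as text.
    """
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


source = make_source()
all_rows = full_pull(source)
new_rows, watermark = incremental_pull(source, "2021-03-01T00:00:00")
```

In this sketch the caller is responsible for persisting the returned watermark between runs. A strict `>` comparison can miss rows that share the watermark's exact timestamp, so a real pipeline would need to decide how to handle ties.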
justosophy, about 4 years ago
Good to see more attention to this. AWS did a presentation on it last year.
robobro, about 4 years ago
Remember when Arc was a Lisp that powered Hacker News? Glad to read she's all grown up.
ozten, about 4 years ago
Arc as a project name on HN?!? OP account created November 13, 2018... okay, alright.