Show HN: Box – Data Transformation Pipelines in Rust DataFusion

4 points by seddonm1, more than 3 years ago

1 comment

seddonm1, more than 3 years ago

Months ago I posted a link to Arc (https://news.ycombinator.com/item?id=26573930), a declarative method for defining repeatable data pipelines which execute against Apache Spark (https://spark.apache.org/).

Today I would like to present a proof-of-concept implementation of the Arc declarative ETL framework (https://arc.tripl.ai) against Apache DataFusion (https://arrow.apache.org/datafusion/), which is an ANSI SQL (Postgres) execution engine based upon Apache Arrow and built with Rust.

The idea of providing a declarative 'configuration' language for defining data pipelines was planned from the beginning of the Arc project, to allow changing execution engines without having to rewrite the base business logic (the part that is valuable to your business). Instead, by defining an abstraction layer, we can change the execution engine and run the same logic with different execution characteristics.

The benefit of DataFusion over Apache Spark is a significant increase in speed and a reduction in execution resource requirements. Even through a Docker-for-Mac inefficiency layer, the same job completes in ~4 seconds with DataFusion vs ~24 seconds with Apache Spark (including JVM startup time). Without the Docker-for-Mac layer, end-to-end execution times of 0.5 seconds for the same example job (TPC-H) are possible.

* the aim is not to start a benchmarking flamewar but to provide some indicative data *

The purpose of this post is to gather feedback from the community: whether you would use a tool like this, what features would be required for you to use it (MVP), or whether you would be interested in contributing to the project. I would also like to highlight the excellent work being done by the DataFusion/Arrow (and Apache) community in providing such amazing tools to us all as open source projects.

Edit: format
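
To illustrate the abstraction-layer idea described above (the same declarative pipeline definition handed to different execution engines), here is a minimal sketch in Python. It is not Arc's actual configuration format or API: `Stage`, `Engine`, `RecordingEngine`, `run_pipeline`, and the table and view names are all hypothetical, and the stand-in engines only record what they would execute rather than calling Spark or DataFusion.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Stage:
    """One declarative step (hypothetical shape): run `sql`, expose the result as `output_view`."""
    name: str
    sql: str
    output_view: str


class Engine(Protocol):
    """Anything that can execute a Stage; the pipeline never depends on a concrete engine."""

    def run(self, stage: Stage) -> str: ...


class RecordingEngine:
    """Stand-in engine (assumption): records the statement instead of executing it."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.log: list[str] = []

    def run(self, stage: Stage) -> str:
        self.log.append(f"[{self.label}] {stage.output_view} <- {stage.sql}")
        return stage.output_view


def run_pipeline(stages: list[Stage], engine: Engine) -> list[str]:
    """Run every stage in order on the given engine and return the views it produced."""
    return [engine.run(stage) for stage in stages]


# The business logic is defined once, as data, independent of any engine.
PIPELINE = [
    Stage("load", "SELECT * FROM lineitem_raw", "lineitem"),
    Stage(
        "aggregate",
        "SELECT l_returnflag, COUNT(*) AS n FROM lineitem GROUP BY l_returnflag",
        "summary",
    ),
]

# Swapping the engine does not touch PIPELINE.
spark_like = RecordingEngine("spark")
datafusion_like = RecordingEngine("datafusion")
views_a = run_pipeline(PIPELINE, spark_like)
views_b = run_pipeline(PIPELINE, datafusion_like)
```

In this sketch only the object passed to `run_pipeline` changes between the two runs. A real engine adapter would replace `RecordingEngine.run` with calls into the chosen engine.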