Ballista: Distributed Compute with Rust, Apache Arrow, and Kubernetes

194 points by andygrove almost 6 years ago

15 comments

s_Hogg almost 6 years ago
Hang in there mate :) I really don't think you deserve a lot of the crap you've been given in this thread. Someone has to try something new.
Comment #20458262 not loaded
sandGorgon almost 6 years ago
How about Dask - which is fairly production grade and has experimental Arrow integration: https://github.com/apache/arrow/blob/master/integration/dask/Dockerfile

Dask deploys pretty well on k8s - https://kubernetes.dask.org/en/latest/
Comment #20459142 not loaded
Comment #20456929 not loaded
dswalter almost 6 years ago
I'm actually excited about the possibilities. I've watched DataFusion from afar, and I have spent a decent amount of time wishing the Big Data ecosystem had arrived during a time when something like Rust was a viable option, both for memory and for parallel computing.

I use Presto all the time and love how fully-featured it is, but garbage collection is a non-trivial component of time-to-execute for my queries.
fspear almost 6 years ago
Are you looking for contributors? I don't have any Rust, Arrow, or k8s experience, but I've been looking to learn all three. I've also been looking to contribute to OSS projects, so I'm happy to pick up any low-hanging fruit if you're interested.

I do have a few years of experience with Spark and Hadoop if that's worth anything.
Comment #20469565 not loaded
ohnoesjmr almost 6 years ago
I congratulate the effort, as I always thought that Spark is great, but the fact it was written in Java hinders it quite badly (GC, tons of memory required for the runtime, jar hell (want to use proto3 in your Spark job? Good luck)).

I do however worry that Rust has a high bar of entry.
Comment #20458073 not loaded
Comment #20457924 not loaded
Comment #20457467 not loaded
senderista almost 6 years ago
If you're looking for an approachable distributed query planner, https://github.com/uwescience/raco might be a good place to start.
cozos almost 6 years ago
Most "big data" distributed compute frameworks that come to mind are written in a JVM language, so the focus on Rust is interesting.

So then, would Rust be better than a JVM language for a distributed compute framework like Apache Spark?

Based on what others said in this thread, these are the primary arguments for Rust:

1. JVM GC overhead

2. JVM GC pauses

3. JVM memory overhead

4. Native code (i.e. Rust) has better raw performance than a JVM language

My take on it:

(1) I believe Spark basically wrote its own memory management layer with Unsafe that lets it bypass the GC [0], so for DataFrame/SQL we might be OK here. Hopefully value types are coming to Java/Scala soon.

(2) The majority of Apache Spark use cases are batch, right? In that case, who cares about a little stop-the-world pause here and there, as long as we're optimizing the GC for throughput. I recognize that streaming is also a thing, so maybe a non-GC language like Rust is better suited for latency-sensitive streaming workloads. Perhaps the Shenandoah GC would be of help here.

(3) What's the memory overhead of a JVM process, 100-200 MB? That doesn't seem too bad to me when clusters these days have terabytes of memory.

(4) I wonder how much of an impact performance improvements from Rust will have over Spark's optimized code generation [1], which basically converts your code into array loops that utilize cache locality, loop unrolling, and SIMD. I imagine that most of the gains to be had from a Rust rewrite would come from these "bare metal" techniques, so it might be the case that Spark already has that going for it...

Having said that, I can't think of any reasons why a compute engine in Rust is a bad idea. Developer productivity and ecosystem, perhaps?

[0] https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

[1] https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Comment #20462747 not loaded
Comment #20519562 not loaded
Comment #20462836 not loaded
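To make point (4) above concrete, here is a minimal sketch in plain Rust of the kind of tight, flat-array loop that Tungsten-style code generation produces on the JVM and that a native engine gets by default. It is illustrative only and is not taken from Ballista or DataFusion; the Column struct and filtered_sum function are invented for this example.

```rust
// Illustrative only: a hand-written columnar kernel in plain Rust, sketching the
// kind of single-pass loop over contiguous buffers that a native query engine
// runs without GC or per-row object allocation. Not Ballista or DataFusion code.

/// A toy columnar batch: one f64 column plus a validity mask (true = non-null).
struct Column {
    values: Vec<f64>,
    valid: Vec<bool>,
}

/// Sum all non-null values greater than `threshold`.
/// The data is contiguous, so the whole operation is one pass over flat arrays.
fn filtered_sum(col: &Column, threshold: f64) -> f64 {
    col.values
        .iter()
        .zip(&col.valid)
        .filter(|&(v, &ok)| ok && *v > threshold)
        .map(|(v, _)| v)
        .sum()
}

fn main() {
    let col = Column {
        values: vec![1.0, 5.0, 7.5, 2.0],
        valid: vec![true, true, false, true],
    };
    // 5.0 is the only valid value above 4.0, so this prints 5.
    println!("{}", filtered_sum(&col, 4.0));
}
```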
kyllo almost 6 years ago
This is really cool! What do you see as ideally the primary API for something like this?

SQL is great for relational algebra expressions to transform tables, but its limited support for variables and control flow constructs makes it less than ideal for complex, multi-step data analysis scripts. And when it comes to running statistical tests, regressions, or training ML models, it's wholly inappropriate.

Rust is a very expressive systems programming language, but it's unclear at this point how good a fit it can be for data analysis and statistical programming tasks. It doesn't have much in the way of data science libraries yet.

Would you potentially add e.g. a Python interpreter on top of such a framework, or would you focus on building out a more fully-featured Rust API for data analysis, and even go so far as to suggest that data scientists start to learn and use Rust? (There is some precedent for this with Scala and Spark.)
Comment #20462790 not loaded
Comment #20501448 not loaded
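One way to picture the "more fully-featured Rust API" option raised above is a typed, builder-style query plan. The sketch below is purely hypothetical: the Expr and QueryPlan types are invented for illustration and are not DataFusion's or Ballista's real API; it only shows the flavor such an API could have next to an SQL string.

```rust
// Hypothetical sketch of a builder-style Rust query API (invented types, not a
// real library API), shown alongside the roughly equivalent SQL it would express.

enum Expr {
    Col(&'static str),
    GtLit(Box<Expr>, f64),
}

impl Expr {
    /// Render the expression roughly as SQL text, for comparison.
    fn render(&self) -> String {
        match self {
            Expr::Col(name) => name.to_string(),
            Expr::GtLit(col, lit) => format!("{} > {}", col.render(), lit),
        }
    }
}

#[derive(Default)]
struct QueryPlan {
    table: String,
    filters: Vec<Expr>,
    projections: Vec<&'static str>,
}

impl QueryPlan {
    fn scan(table: &str) -> Self {
        QueryPlan { table: table.to_string(), ..Default::default() }
    }
    fn filter(mut self, expr: Expr) -> Self {
        self.filters.push(expr);
        self
    }
    fn select(mut self, cols: &[&'static str]) -> Self {
        self.projections.extend_from_slice(cols);
        self
    }
}

fn main() {
    // Builder-style plan, roughly: SELECT city, temp FROM readings WHERE temp > 30
    let plan = QueryPlan::scan("readings")
        .filter(Expr::GtLit(Box::new(Expr::Col("temp")), 30.0))
        .select(&["city", "temp"]);

    let predicates: Vec<String> = plan.filters.iter().map(|e| e.render()).collect();
    println!(
        "SELECT {} FROM {} WHERE {}",
        plan.projections.join(", "),
        plan.table,
        predicates.join(" AND ")
    );
}
```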
wiradikusuma almost 6 years ago
So Spark is bloated because of the JVM. Does Graal make the point moot?
Comment #20460358 not loaded
polskibus almost 6 years ago
How does this compare to Dremio, which also uses Apache Arrow? Is this a competitor?
Comment #20456605 not loaded
Comment #20456954 not loaded
eb0la almost 6 years ago
This project needs a "how to help" section urgently.
Comment #20469593 not loaded
snicker7 almost 6 years ago
Would this system support custom aggregates? How would I, for example, create a routine that defines a covariance matrix and have Ballista deal with the necessary map-reduce logic?
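For what it's worth, covariance is a textbook example of a decomposable aggregate: each partition can fold its rows into a small partial state (counts and sums), and partial states merge associatively, which is exactly the map/reduce split the question asks about. The sketch below is plain Rust with no Ballista-specific API (CovState, update, merge, and finalize are invented names); a covariance matrix would keep one such state per column pair, and a production implementation would use a numerically more stable update than these raw sums.

```rust
// Sketch of a mergeable partial aggregate for sample covariance, in plain Rust.
// Each partition folds its rows locally ("map"); the coordinator merges the
// partial states ("reduce") and finalizes. Invented names, not a Ballista API.

#[derive(Default, Clone, Copy)]
struct CovState {
    n: f64,
    sum_x: f64,
    sum_y: f64,
    sum_xy: f64,
}

impl CovState {
    /// Fold one partition's (x, y) rows into a partial state.
    fn update(mut self, rows: &[(f64, f64)]) -> Self {
        for &(x, y) in rows {
            self.n += 1.0;
            self.sum_x += x;
            self.sum_y += y;
            self.sum_xy += x * y;
        }
        self
    }

    /// Merge two partial states (associative and commutative, so the order in
    /// which partitions arrive does not matter).
    fn merge(self, other: Self) -> Self {
        CovState {
            n: self.n + other.n,
            sum_x: self.sum_x + other.sum_x,
            sum_y: self.sum_y + other.sum_y,
            sum_xy: self.sum_xy + other.sum_xy,
        }
    }

    /// Finalize: sample covariance = (sum_xy - sum_x * sum_y / n) / (n - 1).
    fn finalize(&self) -> f64 {
        (self.sum_xy - self.sum_x * self.sum_y / self.n) / (self.n - 1.0)
    }
}

fn main() {
    // Two "partitions" of (x, y) pairs, aggregated independently and then merged.
    let p1 = CovState::default().update(&[(1.0, 2.0), (2.0, 4.0)]);
    let p2 = CovState::default().update(&[(3.0, 6.0), (4.0, 8.0)]);
    println!("cov = {}", p1.merge(p2).finalize()); // y = 2x, so cov = 10/3
}
```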
StreamBright almost 6 years ago
Could this be used without Kubernetes?
Comment #20457240 not loaded
blittable almost 6 years ago
Super cool. Perhaps naive, but how does distributing computation with serialization square with Arrow's in-memory design?
Comment #20475646 not loaded
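A rough way to see why serialization and Arrow's in-memory design are not at odds: fixed-width Arrow columns are contiguous buffers, so shipping them between executors is close to a straight byte copy, and the receiver can use the buffer in the same layout it arrived in. The sketch below is plain Rust and deliberately ignores Arrow's real IPC framing (schemas, validity bitmaps, metadata); column_to_bytes and bytes_to_column are illustrative names, not Arrow APIs.

```rust
// Minimal sketch (not Arrow's actual IPC code) of why columnar data is cheap to
// ship between nodes: a column is already one contiguous buffer, so "serializing"
// it is essentially copying bytes, with no per-object traversal or decoding step.

/// Write an f64 column out as little-endian bytes (roughly what putting a
/// fixed-width buffer on the wire amounts to, minus schema/metadata framing).
fn column_to_bytes(values: &[f64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 8);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Read it back: the wire layout and the in-memory layout are the same.
fn bytes_to_column(bytes: &[u8]) -> Vec<f64> {
    bytes
        .chunks_exact(8)
        .map(|c| {
            let mut arr = [0u8; 8];
            arr.copy_from_slice(c);
            f64::from_le_bytes(arr)
        })
        .collect()
}

fn main() {
    let col = vec![1.0, 2.5, -3.75];
    let wire = column_to_bytes(&col);
    assert_eq!(bytes_to_column(&wire), col);
    println!("shipped {} values as {} bytes", col.len(), wire.len());
}
```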
m0zg almost 6 years ago
Serious, non-facetious question: who is this for?
Comment #20457256 not loaded