How about Dask? It is fairly production-grade and has experimental Arrow integration: https://github.com/apache/arrow/blob/master/integration/dask/Dockerfile

Dask also deploys pretty well on k8s: https://kubernetes.dask.org/en/latest/
I'm actually excited about the possibilities. I've watched DataFusion from afar, and I have spent a decent amount of time wishing the Big Data ecosystem had arrived at a time when something like Rust was a viable option, both for memory management and for parallel computing.

I use Presto all the time and love how fully featured it is, but garbage collection is a non-trivial component of time-to-execute for my queries.
Are you looking for contributors? I don't have any Rust, Arrow, or k8s experience, but I've been looking to learn all three, and I've also been looking to contribute to open-source projects, so I'm happy to pick up any low-hanging fruit if you're interested.

I do have a few years of experience with Spark and Hadoop, if that's worth anything.
I applaud the effort. I've always thought Spark is great, but the fact that it runs on the JVM hinders it quite badly: GC pauses, tons of memory required for the runtime, and jar hell (want to use proto3 in your Spark job? Good luck).

I do worry, however, that Rust has a high bar of entry.
If you’re looking for an approachable distributed query planner, https://github.com/uwescience/raco might be a good place to start.
Most "big data" distributed compute frameworks that come to mind are written in a JVM language, so the focus on Rust is interesting.<p>So then, would Rust be better than a JVM language for a distributed compute framework like Apache Spark?<p>Based on what others said in this thread, these are the primary arguments for Rust:<p>1. JVM GC overhead<p>2. JVM GC pauses<p>3. JVM memory overhead.<p>4. Native code (i.e. Rust) has better raw performance than a JVM language<p>My take on it:<p>(1) I believe Spark basically wrote its own memory management layer with Unsafe that let's it bypass the GC [0], so for Dataframe/SQL we might be ok here. Hopefully value types are coming to Java/Scala soon.<p>(2) Majority of Apache Spark use-cases are batch right? In this case who cares about a little stop-the-world pause here and there, as long as we're optimizing the GC for throughput. I recognize that streaming is also a thing, so maybe a non-GC language like Rust is better suited for latency sensitive streaming workloads. Perhaps the Shenandoah GC would be of help here.<p>(3) What's the memory overhead of a JVM process, 100-200 MB? That doesn't seem too bad to me when clusters these days have terabytes of memory.<p>(4) I wonder how much of an impact performance improvements from Rust will have over Spark's optimized code generation [1], which basically converts your code into array loops that utilize cache locality, loop unrolling, and simd. I imagine that most of the gains to be had from a Rust rewrite would come from these "bare metal' techniques, so it might the case that Spark already has that going for it...<p>Having said that, I can't think of any reasons why a compute engine on Rust is a bad idea. Developer productivity and ecosystem perhaps?<p>[0] <a href="https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html" rel="nofollow">https://databricks.com/blog/2015/04/28/project-tungsten-brin...</a><p>[1] <a href="https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html" rel="nofollow">https://databricks.com/blog/2016/05/23/apache-spark-as-a-com...</a>
This is really cool!
What do you see as ideally the primary API for something like this?

SQL is great for relational-algebra expressions that transform tables, but its limited support for variables and control-flow constructs makes it less than ideal for complex, multi-step data analysis scripts. And when it comes to running statistical tests, regressions, or training ML models, it's wholly inappropriate.

Rust is a very expressive systems programming language, but it's unclear at this point how good a fit it can be for data analysis and statistical programming tasks. It doesn't have much in the way of data science libraries yet.

Would you potentially add e.g. a Python interpreter on top of such a framework, or would you focus on building out a more fully featured Rust API for data analysis, and even go so far as to suggest that data scientists start to learn and use Rust? (There is some precedent for this with Scala and Spark.)
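For what a Rust-native query-building layer could look like next to raw SQL strings, here is a small, self-contained sketch. The `Query` and `Expr` types, the `sales` table, and the column names are all made up for illustration; this is not Ballista's actual API, just one shape a typed builder could take.

```rust
// Hypothetical sketch (not Ballista's API) of a Rust query-builder layer.
// The stub types below exist only to make the example compile.

#[derive(Debug)]
enum Expr {
    Column(String),
    Gt(Box<Expr>, f64),
}

#[derive(Debug, Default)]
struct Query {
    table: String,
    projection: Vec<String>,
    filter: Option<Expr>,
}

impl Query {
    fn scan(table: &str) -> Self {
        Query { table: table.to_string(), ..Default::default() }
    }
    fn select(mut self, cols: &[&str]) -> Self {
        self.projection = cols.iter().map(|c| c.to_string()).collect();
        self
    }
    fn filter(mut self, expr: Expr) -> Self {
        self.filter = Some(expr);
        self
    }
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn main() {
    // Equivalent intent to: SELECT city, amount FROM sales WHERE amount > 100
    let plan = Query::scan("sales")
        .select(&["city", "amount"])
        .filter(Expr::Gt(Box::new(col("amount")), 100.0));
    println!("{:?}", plan);
}
```

The appeal of something like this over embedded SQL is that the plan is an ordinary Rust value, so it can be built up with normal variables and control flow; the downside is exactly the learning-curve question raised above.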
Would this system support custom aggregates? How would I, for example, define a routine that computes a covariance matrix and have Ballista deal with the necessary map-reduce logic?
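One common way engines expose custom aggregates is an accumulator with separate update, merge, and evaluate steps, so the engine can run updates per partition and merge the partial states centrally. Below is a hypothetical, self-contained Rust sketch of that split for pairwise covariance (a full covariance matrix would hold one such state per column pair); this is not a real Ballista interface, just an illustration of the map-reduce shape the question is about.

```rust
// Hypothetical sketch of a distributed custom aggregate: per-partition
// update, merge of partial states, and a final evaluate. Not a real
// Ballista trait.

#[derive(Debug, Default)]
struct CovarianceState {
    n: u64,
    sum_x: f64,
    sum_y: f64,
    sum_xy: f64,
}

impl CovarianceState {
    // Map side: fold one row of the partition into the partial state.
    fn update(&mut self, x: f64, y: f64) {
        self.n += 1;
        self.sum_x += x;
        self.sum_y += y;
        self.sum_xy += x * y;
    }

    // Reduce side: combine partial states from different partitions.
    fn merge(&mut self, other: &CovarianceState) {
        self.n += other.n;
        self.sum_x += other.sum_x;
        self.sum_y += other.sum_y;
        self.sum_xy += other.sum_xy;
    }

    // Final step: turn the merged state into the aggregate value.
    fn evaluate(&self) -> Option<f64> {
        if self.n == 0 {
            return None;
        }
        let n = self.n as f64;
        Some(self.sum_xy / n - (self.sum_x / n) * (self.sum_y / n))
    }
}

fn main() {
    // Two "partitions" of (x, y) pairs aggregated independently, then merged.
    let mut p1 = CovarianceState::default();
    for &(x, y) in &[(1.0, 2.0), (2.0, 4.0)] {
        p1.update(x, y);
    }
    let mut p2 = CovarianceState::default();
    for &(x, y) in &[(3.0, 6.0)] {
        p2.update(x, y);
    }
    p1.merge(&p2);
    println!("population covariance: {:?}", p1.evaluate());
}
```

A production version would use a numerically stable update rule rather than the naive sum-of-products shown here; the point is only the update/merge/evaluate split that a distributed engine needs from a custom aggregate.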