> At the time of writing this blog, Polars is the fastest DataFrame library in the benchmark second to R’s data.table, and Polars is top 3 all tools considered

This is a very strange way to write "Polars is the second fastest", but I guess that doesn't grab headlines.
Probably worth pointing out that the sections on microarchitecture stopped being representative with the Pentium Pro in the mid-90s. The processor still has a pipeline, but it's much harder to stall it like that.

http://www.lighterra.com/papers/modernmicroprocessors/ is a good guide (Agner Fog's reference isn't really a book, so I don't recommend it for the uninitiated).
[note: see more nuanced comment below]

The Julia benchmark two links deep at https://github.com/h2oai/db-benchmark doesn't follow even the most basic performance tips listed at https://docs.julialang.org/en/v1/manual/performance-tips/.
My naive interpretation: the canonical Apache Arrow implementation is written in C++ with multiple language bindings like PyArrow. The Rust Arrow crate re-implements the Arrow specification natively, so it can be used as a pure Rust implementation rather than a binding. Andy Grove [1] built two projects on top of Rust-Arrow: 1. DataFusion, a query engine for Arrow that can optimize SQL-like JOIN and GROUP BY queries, and 2. Ballista, which runs DataFusion-style queries on a cluster (competing with Dask and Spark). DataFusion was integrated into the Apache Arrow Rust project.

Ritchie Vink has introduced Polars, which also builds upon Rust-Arrow. It offers an Eager API that is an alternative to PyArrow and a Lazy API that is a query engine and optimizer like DataFusion. The linked benchmark is focused on JOIN and GROUP BY queries on large datasets executed on a server/workstation-class machine (125 GB memory). This seems like a specialized use case that pushes the limits of a single developer machine and overlaps with the use case for a dedicated column store (like Redshift) or a distributed batch processing system like Spark/MapReduce.

Why Polars over DataFusion? Why Python bindings to Rust-Arrow rather than canonical PyArrow/C++? Is there something wrong with PyArrow?

[1] https://andygrove.io/projects/
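To make the Eager vs. Lazy distinction concrete, here is a rough sketch using the Python bindings; the exact method names (groupby/agg and so on) vary between pypolars/polars versions, so treat it as illustrative rather than canonical:

```python
import polars as pl

df = pl.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})

# Eager API: each call executes immediately, much like pandas or a PyArrow Table.
eager = df.groupby("key").agg(pl.col("val").sum())

# Lazy API: build a logical query plan first, let the optimizer rewrite it
# (predicate/projection pushdown, etc.), then execute with collect().
# Conceptually this is the same role DataFusion plays on top of Rust-Arrow.
lazy = (
    df.lazy()
      .groupby("key")
      .agg(pl.col("val").sum())
      .collect()
)
```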
Not sure if anything exists, but I wish something would do in-memory compression plus smart disk spillover. Sometimes I want to work with 5-10 GB compressed data sets (usually log files), and decompressed they end up around 10x larger (plus data structure overhead). There's stuff like Apache Drill, but it's more optimized for multi-node setups than for running locally.
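Not exactly compression-aware spillover, but pandas can at least stream a gzipped file in chunks so the decompressed data never has to fit in memory all at once (the file name and status column here are just placeholders):

```python
import pandas as pd

# Stream a compressed log file chunk by chunk; pandas infers gzip from the
# extension and only one chunk lives in memory at a time.
totals = {}
for chunk in pd.read_csv("access_log.csv.gz", chunksize=1_000_000):
    for status, n in chunk["status"].value_counts().items():
        totals[status] = totals.get(status, 0) + n
```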
> This directly shows a clear advantage over Pandas for instance, where there is no clear distinction between a float NaN and missing data, where they really should represent different things.

Not true anymore:

> Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.

> The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

(https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na)
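A minimal illustration of the nullable dtypes the docs describe:

```python
import pandas as pd
import numpy as np

# Nullable integer dtype: the gap stays pd.NA and the column stays Int64...
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)        # Int64
print(s[2] is pd.NA)  # True

# ...whereas the classic numpy-backed dtype silently upcasts to float64.
s_legacy = pd.Series([1, 2, np.nan])
print(s_legacy.dtype)  # float64
```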
I'm surprised that data.table is so fast, and that pandas is so slow relative to it. It does explain why I've occasionally had memory issues on ~2 GB data files when performing moderately complex operations (to be fair, it's a relatively old Xeon with 12 GB of RAM). I'll have to learn the nuances of data.table syntax now.
It seems like DataFrames.jl still has a ways to go before Julia can close the gap on R/data.table. I don't think these benchmarks include compilation time either?
Looks like a cool project.

It's better to separate benchmarking results for big data technologies and small DataFrame technologies.

Spark & Dask can perform computations on terabytes of data (thousands of Parquet files in parallel). Most of the other technologies in this article can only handle small datasets.

This is especially important for join benchmarking. There are different types of cluster computing joins (broadcast vs shuffle) and they should be benchmarked separately.
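For anyone unfamiliar with the distinction, a hedged PySpark sketch (the paths and the country_id join key are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("s3://bucket/events/")        # hypothetical large fact table
countries = spark.read.parquet("s3://bucket/countries/")  # hypothetical small lookup table

# Broadcast join: the small table is copied to every executor,
# so the large table never gets shuffled across the network.
by_broadcast = events.join(broadcast(countries), on="country_id")

# Shuffle (sort-merge/hash) join: both sides are repartitioned on the join key,
# which is far more expensive but works when neither side is small.
by_shuffle = events.join(countries, on="country_id")
```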
This is very cool. I'm happy to see the decision to use Arrow, which should make it almost trivially easy to transfer data into e.g. Julia, and maybe even to create bindings to Polar.
I tried running your code via docker-compose. After some building time, none of the notebooks in the examples folder worked.

The notebook titled "10 minutes to pypolars" was missing the pip command, which I had to add to your Dockerfile (actually python3-pip). After rebuilding the whole thing and restarting the notebook, I had to change "!pip" to "!pip3" (was too lazy to add an alias) in the first code cell, which installed all dependencies after running. All the other cells resulted in errors.

I suggest focusing on stability and reproducibility first and then on performance.
Claims of "fastest" are often dubious. You often see a micro-benchmark list ten products and then conclude "look, I run in the shortest time".

The problem is that people often compare apples to oranges. Do you know how to correctly use ClickHouse (there are 20-30 engines in ClickHouse to choose from; do you compare an in-memory engine to a disk-persistent database?), Spark, Arrow...? How can you guarantee a fair evaluation among ten or twelve products?
Pretty impressed with the data.table benchmarks. The syntax is a little weird and takes getting used to but once you have the basics it’s a great tool.
If this will read a csv that has columns with mixed integers and nulls *without* converting all of the numbers to float by default, it will replace pandas in my life. 99% of my problems with pandas arise from ints being coerced into floats when a null shows up.
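For what it's worth, pandas can already be coaxed into this with the nullable Int64 dtype, and Polars is supposed to keep integer columns intact with proper nulls by default; a sketch with placeholder file and column names:

```python
import pandas as pd
import polars as pl

# pandas: opt into the nullable integer dtype so a missing value becomes pd.NA
# instead of dragging the whole column to float64.
df_pd = pd.read_csv("data.csv", dtype={"user_id": "Int64"})

# Polars: integer columns with gaps stay integers; the gaps are nulls.
# (Sketch only; exact reader options may differ between versions.)
df_pl = pl.read_csv("data.csv")
```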
I've been intrigued by this library, and specifically the possibility of a Python workflow with a fallback to Rust if needed. I mean, I haven't really looked at what the interop looks like, but it should work, right?

It's not going to happen for now, though, because the project is still immature and there's zero documentation in Python from what I can see. But it's something I'm keeping a close eye on. I often work with R and C++ as a fallback when speed is paramount, but I think I'd rather replace C++ with Rust.
I'm guessing Polars and Ballista (https://github.com/ballista-compute/ballista) have different goals, but I don't know enough about either to say what those might be. Does anyone know enough about either to explain the differences?
"Polars is based on the Rust native implementation Apache Arrow. Arrow can be seen as middleware software for DBMS, query engines and DataFrame libraries. Arrow provides very cache-coherent data structures and proper missing data handling."<p>This is super cool. Anyone know if Pandas is also planning to adopt Arrow ?
We have been rewriting our stack for multi/many-GPU scale-out via Python GPU dataframes, but it's clear that smaller workloads and some others would be fine on CPU (and thus free up the GPUs for our other tenants), so having a good CPU implementation is exciting, especially if they achieve API compatibility with pandas/dask as RAPIDS and others do. I've been eyeing vaex here (I think the other Rust Arrow project isn't dataframes?), so good to have a contender!

I'd love to see a comparison to RAPIDS dataframes for the single-GPU case (ex: 2 GB), single GPU bigger-than-memory (ex: 100 GB), and then the same for multi-GPU. We have started to quote figures like "200 GB/s in-memory and 60 GB/s when bigger than memory", to give perspective.
Pandas does seem to be on the way out, if I'm being honest, and that's coming from someone who has invested heavily in it (it's the backend for my project http://gluedata.io/). JMO.

I would happily adopt Polars if the feature set is expansive enough.

Pandas is great because it's so ubiquitous, but I have always felt that it was slow (especially coming from R).

One thing that is weirdly terrible in pandas is data types. The coupling with numpy is awkward. It's so dependent on numpy, and if pandas isn't moving fast, numpy isn't moving at all. I'd be curious to see how Polars handles this, e.g. null values, datetimes, etc.
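From what I can tell, Polars sidesteps the numpy coupling because Arrow has first-class nulls and datetime types. A rough, version-dependent sketch:

```python
from datetime import datetime
import polars as pl

# Integer and datetime columns keep their dtypes; missing entries are real
# nulls rather than np.nan or pd.NaT stand-ins.
df = pl.DataFrame({
    "id": [1, None, 3],
    "ts": [datetime(2021, 1, 1), datetime(2021, 1, 2), None],
})
print(df.dtypes)              # e.g. [Int64, Datetime]
print(df["id"].null_count())  # 1
```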