> At the time of writing this blog, Polars is the fastest DataFrame library in the benchmark second to R’s data.table, and Polars is top 3 all tools considered

This is a very strange way to write "Polars is the second fastest", but I guess that doesn't grab headlines.
Probably worth pointing out that the sections on microarchitecture stopped being representative with the Pentium Pro in the mid-90s. The processor still has a pipeline, but it's much harder to stall it like that.

http://www.lighterra.com/papers/modernmicroprocessors/ is a good guide (Agner Fog's reference isn't really a book, so I don't recommend it for the uninitiated).
[note: see more nuanced comment below]

The Julia benchmark two links deep at https://github.com/h2oai/db-benchmark doesn't follow even the most basic performance tips listed at https://docs.julialang.org/en/v1/manual/performance-tips/.
My naive interpretation: the canonical Apache Arrow implementation is written in C++ with multiple language bindings like PyArrow. The Rust Arrow crate re-implements the Arrow specification natively, so it can be used as a pure Rust implementation rather than a binding. Andy Grove [1] built two projects on top of Rust-Arrow: 1. DataFusion, a query engine for Arrow that can optimize SQL-like JOIN and GROUP BY queries, and 2. Ballista, which runs DataFusion-style queries on a cluster (competing with Dask and Spark). DataFusion was integrated into the Apache Arrow Rust project.

Ritchie Vink has introduced Polars, which also builds upon Rust-Arrow. It offers an Eager API that is an alternative to PyArrow and a Lazy API that is a query engine and optimizer like DataFusion. The linked benchmark is focused on JOIN and GROUP BY queries on large datasets executed on a server/workstation-class machine (125 GB memory). This seems like a specialized use case that pushes the limits of a single developer machine and overlaps with the use case for a dedicated column store (like Redshift) or a distributed batch processing system like Spark/MapReduce.

Why Polars over DataFusion? Why Python bindings to Rust-Arrow rather than canonical PyArrow/C++? Is there something wrong with PyArrow?

[1] https://andygrove.io/projects/
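To make the Eager vs. Lazy distinction concrete, here is a rough sketch using the Python bindings; the exact method names (groupby/agg and so on) vary between pypolars/polars versions, so treat it as illustrative rather than canonical:

```python
import polars as pl

df = pl.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})

# Eager API: each call executes immediately, much like pandas or a PyArrow Table.
eager = df.groupby("key").agg(pl.col("val").sum())

# Lazy API: build a logical query plan first, let the optimizer rewrite it
# (predicate/projection pushdown, etc.), then execute with collect().
# Conceptually this is the same role DataFusion plays on top of Rust-Arrow.
lazy = (
    df.lazy()
      .groupby("key")
      .agg(pl.col("val").sum())
      .collect()
)
```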
Not sure if anything exists, but I wish something would do in-memory compression plus smart disk spillover. Sometimes I want to work with 5-10 GB compressed data sets (usually log files), and decompressed they end up around 10x larger (plus data structure overhead). There's stuff like Apache Drill, but it's more optimized for multi-node setups than for running locally.
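Not exactly compression-aware spillover, but pandas can at least stream a gzipped file in chunks so the decompressed data never has to fit in memory all at once (the file name and status column here are just placeholders):

```python
import pandas as pd

# Stream a compressed log file chunk by chunk; pandas infers gzip from the
# extension and only one chunk lives in memory at a time.
totals = {}
for chunk in pd.read_csv("access_log.csv.gz", chunksize=1_000_000):
    for status, n in chunk["status"].value_counts().items():
        totals[status] = totals.get(status, 0) + n
```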
> This directly shows a clear advantage over Pandas for instance, where there is no clear distinction between a float NaN and missing data, where they really should represent different things.

Not true anymore:

> Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.

> The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

(https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na)
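A minimal illustration of the nullable dtypes the docs describe:

```python
import pandas as pd
import numpy as np

# Nullable integer dtype: the gap stays pd.NA and the column stays Int64...
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)        # Int64
print(s[2] is pd.NA)  # True

# ...whereas the classic numpy-backed dtype silently upcasts to float64.
s_legacy = pd.Series([1, 2, np.nan])
print(s_legacy.dtype)  # float64
```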
I'm surprised that data.table is so fast, and that pandas is so slow relative to it. It does explain why I've occasionally had memory issues on ~2 GB data files when performing moderately complex operations (to be fair, it's a relatively old Xeon with 12 GB of RAM). I'll have to learn the nuances of data.table syntax now.
It seems like DataFrames.jl still has a ways to go before Julia can close the gap on R/data.table. I don't think these benchmarks include compilation time either?
Looks like a cool project.

It's better to separate benchmarking results for big data technologies and small DataFrame technologies.

Spark & Dask can perform computations on terabytes of data (thousands of Parquet files in parallel). Most of the other technologies in this article can only handle small datasets.

This is especially important for join benchmarking. There are different types of cluster computing joins (broadcast vs shuffle) and they should be benchmarked separately.
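For anyone unfamiliar with the distinction, a hedged PySpark sketch (the paths and the country_id join key are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("s3://bucket/events/")        # hypothetical large fact table
countries = spark.read.parquet("s3://bucket/countries/")  # hypothetical small lookup table

# Broadcast join: the small table is copied to every executor,
# so the large table never gets shuffled across the network.
by_broadcast = events.join(broadcast(countries), on="country_id")

# Shuffle (sort-merge/hash) join: both sides are repartitioned on the join key,
# which is far more expensive but works when neither side is small.
by_shuffle = events.join(countries, on="country_id")
```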
This is very cool. I'm happy to see the decision to use Arrow, which should make it almost trivially easy to transfer data into e.g. Julia, and maybe even to create bindings to Polar.
I tried running your code via docker-compose. After some building time, none of the notebooks in the examples folder worked.

The notebook titled "10 minutes to pypolars" was missing the pip command, which I had to add to your Dockerfile (actually python3-pip). After rebuilding the whole thing and restarting the notebook, I had to change "!pip" to "!pip3" (was too lazy to add an alias) in the first code cell, which installed all dependencies after running. All the other cells resulted in errors.

I suggest focusing on stability and reproducibility first and then on performance.
Claims of "fastest" are often dubious. You often see a micro-benchmark list ten products and then conclude "look, I run in the shortest time".

The problem is that people often compare apples to oranges. Do you know how to correctly use ClickHouse (there are 20-30 engines in ClickHouse to choose from; do you compare an in-memory engine to a disk-persistent database?), Spark, Arrow...? How can you guarantee a fair evaluation among ten or twelve products?
Pretty impressed with the data.table benchmarks. The syntax is a little weird and takes getting used to but once you have the basics it’s a great tool.
If this will read a csv that has columns with mixed integers and nulls *without* converting all of the numbers to float by default, it will replace pandas in my life. 99% of my problems with pandas arise from ints being coerced into floats when a null shows up.
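For what it's worth, pandas can already be coaxed into this with the nullable Int64 dtype, and Polars is supposed to keep integer columns intact with proper nulls by default; a sketch with placeholder file and column names:

```python
import pandas as pd
import polars as pl

# pandas: opt into the nullable integer dtype so a missing value becomes pd.NA
# instead of dragging the whole column to float64.
df_pd = pd.read_csv("data.csv", dtype={"user_id": "Int64"})

# Polars: integer columns with gaps stay integers; the gaps are nulls.
# (Sketch only; exact reader options may differ between versions.)
df_pl = pl.read_csv("data.csv")
```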
I've been intrigued by this library, and specifically the possibility of a Python workflow with a fallback to Rust if needed. I mean, I haven't really looked at what the interop looks like, but it should work, right?

It's not going to happen for now, though, because the project is still immature and there's zero documentation in Python from what I can see. But it's something I'm keeping a close eye on. I often work with R and C++ as a fallback when speed is paramount, but I think I'd rather replace C++ with Rust.
I'm guessing Polars and Ballista (https://github.com/ballista-compute/ballista) have different goals, but I don't know enough about either to say what those might be. Does anyone know enough about either to explain the differences?
"Polars is based on the Rust native implementation Apache Arrow. Arrow can be seen as middleware software for DBMS, query engines and DataFrame libraries. Arrow provides very cache-coherent data structures and proper missing data handling."<p>This is super cool. Anyone know if Pandas is also planning to adopt Arrow ?
We have been rewriting our stack for multi/many-GPU scale-out via Python GPU dataframes, but it's clear that smaller workloads and some others would be fine on CPU (and thus free up the GPUs for our other tenants), so having a good CPU implementation is exciting, especially if they achieve API compatibility with pandas/dask as RAPIDS and others do. I've been eyeing vaex here (I think the other Rust Arrow project isn't dataframes?), so good to have a contender!

I'd love to see a comparison to RAPIDS dataframes for the single-GPU case (ex: 2 GB), single GPU bigger-than-memory (ex: 100 GB), and then the same for multi-GPU. We have started to quote figures like "200 GB/s in-memory and 60 GB/s when bigger than memory", to give perspective.
Pandas does seem to be on the way out, if I'm being honest, and that's coming from someone who has invested heavily in it (it's the backend for my project http://gluedata.io/). JMO.

I would happily adopt Polars if the feature set is expansive enough.

Pandas is great because it's so ubiquitous, but I have always felt that it was slow (especially coming from R).

One thing that is weirdly terrible in pandas is data types. The coupling with numpy is awkward. It's so dependent on numpy, and if pandas isn't moving fast, numpy isn't moving at all. I'd be curious to see how Polars handles this, e.g. null values, datetimes, etc.
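From what I can tell, Polars sidesteps the numpy coupling because Arrow has first-class nulls and datetime types. A rough, version-dependent sketch:

```python
from datetime import datetime
import polars as pl

# Integer and datetime columns keep their dtypes; missing entries are real
# nulls rather than np.nan or pd.NaT stand-ins.
df = pl.DataFrame({
    "id": [1, None, 3],
    "ts": [datetime(2021, 1, 1), datetime(2021, 1, 2), None],
})
print(df.dtypes)              # e.g. [Int64, Datetime]
print(df["id"].null_count())  # 1
```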