Polars: Fast DataFrame library for Rust and Python

238 points by daureg, over 3 years ago

20 comments

civilized, over 3 years ago

In my world, anything that isn't "identical to R's dplyr API but faster" just isn't quite worth switching for. There's absolutely no contest: dplyr has the most productive API and that matters to me more than anything else. But I'm glad to see Polars moves away from the kludgey sprawl of the Pandas API towards the perfection of dplyr... while also being blazingly fast!

Now just mix in a bit of DSL so people aren't obligated* to write lame boilerplate like "pandas.blahblah" or "polars.blahblah" just to reference a freaking column, and you're there!

*If you like the boilerplate for "production robustness" or whatever, go wild, but analysts and scientists benefit from the option to write more concisely.

gpderetta, over 3 years ago

From the Python docs:

> No Index
> They are not needed. Not having them makes things easier. Convince me otherwise

Agree completely. First-class indices in pandas just complicate everything by having a specially blessed column that can't be manipulated consistently. Secondary indices should be "just" an optimization, while primary indices are a constraint on the whole table (not a single column).

The library in general seems interesting. I'm not 100% sold on the syntax (as usual, projection is called select...), but it is not pandas, which is already a huge plus.

sriku, over 3 years ago

Hmmm... in the linked benchmarks [1], DataFrames.jl (Julia library) appears to be fairly competitive.

[1] https://h2oai.github.io/db-benchmark/

abeppu, over 3 years ago

There are so many dataframe libraries, many of which have APIs closely following pandas but are not drop-in replacements. I wish we could agree on a standard describing the core parts of what a dataframe must do, such that code depending only on those operations can easily move between dataframes.
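
To make the idea of a shared core concrete, here is a minimal sketch of what such an agreement could look like in Python. Everything in it is hypothetical: `CoreFrame`, `DictFrame`, and `adults` are invented names, and the four methods are an illustrative guess at a "core", not an existing standard or any library's actual API.

```python
from typing import Any, Callable, Dict, List, Protocol, Sequence

Row = Dict[str, Any]


class CoreFrame(Protocol):
    """Hypothetical minimal interface; not an existing standard."""

    def columns(self) -> List[str]: ...
    def select(self, names: Sequence[str]) -> "CoreFrame": ...
    def filter(self, predicate: Callable[[Row], bool]) -> "CoreFrame": ...
    def to_rows(self) -> List[Row]: ...


class DictFrame:
    """Toy column-oriented backend, used only to exercise the interface."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        if len({len(values) for values in data.values()}) > 1:
            raise ValueError("all columns must have the same length")
        self._data = {name: list(values) for name, values in data.items()}

    def columns(self) -> List[str]:
        return list(self._data)

    def select(self, names: Sequence[str]) -> "DictFrame":
        # Keep only the requested columns, in the requested order.
        return DictFrame({name: self._data[name] for name in names})

    def to_rows(self) -> List[Row]:
        names = list(self._data)
        return [dict(zip(names, row)) for row in zip(*self._data.values())]

    def filter(self, predicate: Callable[[Row], bool]) -> "DictFrame":
        # Evaluate the predicate row by row, then rebuild the columns.
        kept = [row for row in self.to_rows() if predicate(row)]
        return DictFrame({name: [row[name] for row in kept] for name in self._data})


def adults(frame: CoreFrame) -> List[Row]:
    """Depends only on the hypothetical core, so any conforming backend works."""
    return frame.filter(lambda row: row["age"] >= 18).select(["name"]).to_rows()


people = DictFrame({"name": ["ann", "bob"], "age": [30, 12]})
print(adults(people))
```

The point of the sketch is the last function: `adults` is written against the protocol rather than a concrete library, which is the portability the comment asks for.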

vincent-toups, over 3 years ago

God, please, anything to liberate me from pandas, which has one of the wildest APIs I've ever had to routinely work with.

Dowwie, over 3 years ago

Polars could bring the best of both worlds together if it could codegen Python API calls to their Rust equivalents. A user conducts ad-hoc analysis and rapid development with Python. When the work is ready to ship, the user invokes a codegen step that transforms it into the equivalent Rust API calls, resulting in a new Rust module.

ahurmazda, over 3 years ago

I’ve been using it for the past quarter. In addition to the speed, I’m very pleased with the pyspark-esque api. This means migrating code from research to production is that much easier.

riskneutral, over 3 years ago

I'm confused. Polars is built on top of the Rust bindings for Apache Arrow. Arrow already has Python bindings. What does this project add by creating a new Python binding on top of the Rust binding?

Fiahil, over 3 years ago

… and it’s using arrow2, not the official, unsafe arrow crate. Great, it means we can use it!

optimalonpaper, over 3 years ago

I'm reading all these comments and keep asking myself if I'm missing something, because I honestly sort of like pandas' API?

Sure dplyr is nice -- it felt that way on rare occasions that I got to use it, at least -- but you get used to anything.

So since I'm using python and know it quite well, I'm just more comfortable sticking with python's pandas framework rather than switching to R for data processing.

jmakov, over 3 years ago

How does it compare to Vaex?

unixhero, over 3 years ago

What makes Pandas so bad and what makes Dplyr so great?

I have used Pandas a lot for data analysis and for data integration duct tape scenarios. For me it has been a low bar for achieving a lot.

the_biot, over 3 years ago

I've never seen the term "dataframe" used as it is on this website, and the commenters here seem to all use it. Judging by the examples it seems to just refer to a "row" from e.g. a CSV or SQL query. So is that all it is, or am I missing something?

rytill, over 3 years ago

How would this compare to loading a sqlite database into memory and performing queries with it?
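
For reference, the SQLite side of that comparison can be sketched with Python's standard-library `sqlite3` module: copy an on-disk database into memory with `Connection.backup`, then query it. The `trips` table, its columns, and the sample values are made up for illustration, and the sketch says nothing about how the two approaches compare in speed.

```python
import os
import sqlite3
import tempfile


def load_into_memory(path):
    """Copy an on-disk SQLite database into a fresh in-memory connection."""
    disk = sqlite3.connect(path)
    memory = sqlite3.connect(":memory:")
    disk.backup(memory)  # Connection.backup is available from Python 3.7
    disk.close()
    return memory


def mean_fare_by_city(conn):
    """Group-by aggregation expressed as SQL (hypothetical schema)."""
    query = "SELECT city, AVG(fare) FROM trips GROUP BY city ORDER BY city"
    return conn.execute(query).fetchall()


# Build a small throwaway database on disk so there is something to load.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "trips.db")
    disk = sqlite3.connect(db_path)
    disk.execute("CREATE TABLE trips (city TEXT, fare REAL)")
    disk.executemany(
        "INSERT INTO trips VALUES (?, ?)",
        [("Oslo", 12.5), ("Oslo", 7.5), ("Bergen", 20.0)],
    )
    disk.commit()
    disk.close()
    conn = load_into_memory(db_path)

# The in-memory copy remains usable after the file is gone.
result = mean_fare_by_city(conn)
print(result)
```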

pvitz, over 3 years ago

Does anybody here know dataframe systems that are able to handle file sizes bigger than the available RAM? Is polars able to handle this? I am only aware of disk.frame (diskframe.com), but don't know how well it performs.
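
One general technique for data that does not fit in RAM is to stream the file and keep only the running aggregates in memory. The sketch below shows that idea with the standard library only; it is not a description of how Polars or disk.frame work internally, and `streaming_group_sum` plus the `city`/`fare` columns are hypothetical names chosen for the example.

```python
import csv
import io
from collections import defaultdict


def streaming_group_sum(lines, key, value):
    """Sum `value` per `key`, reading one row at a time.

    Memory use grows with the number of distinct keys, not with the
    number of rows, so the input may be far larger than RAM.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[key]] += float(row[value])
    return dict(totals)


# An in-memory stand-in for a large file; any iterable of lines works,
# for example an open file handle.
sample = io.StringIO("city,fare\nOslo,12.5\nBergen,20.0\nOslo,7.5\n")
result = streaming_group_sum(sample, "city", "fare")
print(result)
```

This only covers aggregations that can be updated incrementally; operations such as sorts or joins over data larger than memory need spilling to disk or partitioning, which is what dedicated out-of-core systems provide.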

thenipper, over 3 years ago

We've been thinking about trying this out at work for some of our data pipelines/simplified models. The speed/ergonomics look great.

ZeroGravitas, over 3 years ago

Is there a plugin to use this as a visidata backend? I quite like their UX.

xiaodai, over 3 years ago

It's great to see innovation in this area.

callmerk, over 3 years ago

nas, over 3 years ago

It looks interesting but phrases like "embarrassingly parallel execution" make my marketing hype detectors trigger. Maybe they could tone down their self promotion just a touch. Also "Even though Polars is completely written in Rust (no runtime overhead!) ...". I find that hard to believe.
