Pandas 2.0

325 points by calpaterson about 2 years ago

16 comments

I'm curious if there will be any appreciable performance gains here that are worthwhile. FWIW, last I checked [0], Polars still smokes Pandas in basically every way.

[0] https://www.pola.rs/benchmarks.html

modriano about 2 years ago

Jeff Reback gave a presentation on the roadmap for Pandas at PyData NYC 2022 [0]. In it, he basically says that pandas is used so widely in industry that big breaking changes are a non-starter, there won't be any radical changes to the API, but more performant implementations can/will be built into the library (although not set as defaults, at least not for a long while). Not a revolutionary leap, but a move towards making Wes McKinney's Arrow work more accessible through pandas.

[0] https://www.youtube.com/watch?v=85XdsWz_Q_o

benrutter about 2 years ago

I know Arrow support is only partway there with this release, but this is a huge deal for Pandas: for standardisation as a whole, but also for speed-ups. Benchmarking that was shared here a while back suggests 2x speed-ups in some cases, 30x if you count strings, since pandas uses Python's built-in string data type [1].

[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i

mattrighetti about 2 years ago

What's new here [0], saved you a click.

[0]: https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html

0cf8612b2e1e about 2 years ago

A quick skim shows a lot of quality-of-life improvements. Unless I am misreading, it looks like it is still NumPy-backed by default. I thought one of the drivers for 2.0 was to make Arrow the default.

wodenokoto about 2 years ago

Strings as objects and integers turning into floats when NaNs are introduced have been a much bigger annoyance to me than they ought to be.

I'm excited to try out the new pyarrow dtypes, but it also sounds confusing that there are now 2 classes of types.
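For anyone who hasn't run into the integer-to-float annoyance, here is a minimal sketch of it next to the nullable and Arrow-backed alternatives. It assumes pandas 2.0 or later; the helper name `add_missing_row` and the sample values are made up for illustration, and the Arrow part only runs if the optional pyarrow package is installed.

```python
import importlib.util

import pandas as pd


def add_missing_row(series: pd.Series) -> pd.Series:
    """Reindex onto one extra label so that a missing value is introduced."""
    return series.reindex([0, 1, 2, 3])


# Default NumPy-backed integers: the missing value forces a cast to float64.
numpy_backed = add_missing_row(pd.Series([1, 2, 3]))
print(numpy_backed.dtype)

# Nullable extension dtype: the column stays integer, the gap becomes pd.NA.
nullable = add_missing_row(pd.Series([1, 2, 3], dtype="Int64"))
print(nullable.dtype)

# Arrow-backed dtypes are the second "class" of types mentioned above.
if importlib.util.find_spec("pyarrow") is not None:
    arrow_ints = add_missing_row(pd.Series([1, 2, 3], dtype="int64[pyarrow]"))
    arrow_strings = pd.Series(["a", "b"], dtype="string[pyarrow]")
    print(arrow_ints.dtype, arrow_strings.dtype)
else:
    print("pyarrow is not installed; skipping the Arrow-backed example")
```

The NumPy-backed series comes back as float64, while the other two keep an integer dtype and mark the gap as missing.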

henrydark about 2 years ago

Say what you will about sql, polars, pyspark, or whatever else. But nothing beats pandas' df[col].value_counts().value_counts()
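In case the double call looks odd: the first `value_counts()` counts how often each value occurs, and the second counts how many values share each frequency. A small sketch with invented data (the `user` column and names are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"user": ["ann", "bob", "ann", "cy", "bob", "ann", "dee"]})

# How many rows each user has: ann 3, bob 2, cy 1, dee 1.
rows_per_user = df["user"].value_counts()

# How many users have each row count: two users have 1 row,
# one user has 2 rows, and one user has 3 rows.
users_per_row_count = rows_per_user.value_counts()

print(users_per_row_count.to_dict())
```

It is a quick way to spot things like "most keys appear once, but a handful appear hundreds of times".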

edublancas about 2 years ago

How are people managing the existence of data frame APIs like pandas/polars alongside SQL engines like BigQuery, Snowflake, and DuckDB?

Most of my notebooks are a mix of SQL and Python: SQL for most processing, dump the results as a pandas dataframe (via https://github.com/ploomber/jupysql) and then use Python for operations that are difficult to express with SQL (or that I don't know how to do in SQL), so I end up with 80% SQL, 20% Python.

Unsure if this is the best workflow, but it's the most efficient one I've come up with.

Disclaimer: my team develops JupySQL.
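A rough sketch of that SQL-first, pandas-second split. The standard library's `sqlite3` stands in for the warehouse here (an assumption, not what the comment uses), and the `orders` table, its columns, and the tier boundaries are invented for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for BigQuery, Snowflake, or DuckDB.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('ann', 10.0), ('ann', 30.0), ('bob', 5.0), ('cy', 55.0);
    """
)

# SQL does the bulk of the processing (filtering, grouping, aggregating).
totals = pd.read_sql_query(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer",
    conn,
)
conn.close()

# pandas handles a step that may be more convenient in Python: binning into tiers.
totals["tier"] = pd.cut(
    totals["total"],
    bins=[0, 10, 50, float("inf")],
    labels=["small", "medium", "large"],
)
print(totals)
```

The same shape applies with any engine that can hand back a DataFrame; only the connection and the query dialect change.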

fauxpause_ about 2 years ago

> Accessing a single column of a DataFrame as a Series (e.g. df["col"]) now always returns a new object every time it is constructed when Copy-on-Write is enabled (not returning multiple times an identical, cached Series object). This ensures that those Series objects correctly follow the Copy-on-Write rules (GH49450)

Is this going to mean I can't do df['a'] = 2 to set all values in column a to 2?
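As far as I can tell, the quoted change is about reading a column, not assigning to one, so `df['a'] = 2` should keep working. A minimal sketch to check this, assuming pandas 2.x, where Copy-on-Write is opt-in through the `mode.copy_on_write` option (the frame and values are made up):

```python
import pandas as pd

# Opt in to Copy-on-Write (needed on pandas 2.x, where it is off by default).
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Column assignment goes through __setitem__ and still broadcasts the scalar.
df["a"] = 2
print(df["a"].tolist())

# Each read of the column now builds a new Series object.
first, second = df["a"], df["a"]
distinct_objects = first is not second

# Writing through one of those Series does not leak back into the DataFrame.
first.iloc[0] = 99
parent_unchanged = df["a"].tolist() == [2, 2, 2]
print(distinct_objects, parent_unchanged)
```

What does change is code that relied on mutating `df["a"]` through a separately held Series; under Copy-on-Write that write stays local to the Series.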

cmcconomy about 2 years ago

I'd love to know when geopandas snaps into alignment

villgax about 2 years ago

How I wish they had just changed .apply to add progress reporting and parallelization by default, instead of us having to resort to tqdm and swifter/dask or what have you.

mongol about 2 years ago

Am I right to think of the Chinese bears when I read about this project? Or is Pandas an acronym?

bb1234 about 2 years ago

If you're interested in benchmarks comparing different dataframe implementations, here is one: https://h2oai.github.io/db-benchmark/

HopenHeyHi about 2 years ago

You want to link to the actual release notes: https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html

nonfamous about 2 years ago

Not a lot of people realize that Pandas was inspired by R, and in particular the Tidyverse model of handling rectangular data frames, created originally by Hadley Wickham. These days R is primarily used by data scientists in academia and certain niche industries like pharma, but its impact goes way beyond its core user base.

lvl102 about 2 years ago

I still prefer Stata if I am completely honest with myself.
