Pandas 2.0

325 points by calpaterson, about 2 years ago

16 comments

mynameisash, about 2 years ago
I'm curious whether there will be any appreciable performance gains here that are worthwhile. FWIW, last I checked [0], Polars still smokes Pandas in basically every way.

[0] https://www.pola.rs/benchmarks.html

modriano, about 2 years ago
Jeff Reback gave a presentation on the roadmap for Pandas at PyData NYC 2022 [0]. In it, he basically says that pandas is used so widely in industry that big breaking changes are a non-starter. There won't be any radical changes to the API, but more performant implementations can/will be built into the library (although not set as defaults, at least not for a long while). Not a revolutionary leap, but a move towards making Wes McKinney's Arrow work more accessible through pandas.

[0] https://www.youtube.com/watch?v=85XdsWz_Q_o

benrutter, about 2 years ago
I know Arrow support is only partway there with this release, but this is a huge deal for Pandas: for standardisation as a whole, but also for speed-ups.

Benchmarking that was shared here a while back suggests 2x speed-ups in some cases, and 30x if you count strings, since pandas uses Python's built-in string data type [1].

[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i

mattrighetti, about 2 years ago
What's new is here [0]; saved you a click.

[0] https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html

0cf8612b2e1e, about 2 years ago
A quick skim shows a lot of quality-of-life improvements. Unless I am misreading, it looks like it is still NumPy-backed by default. I thought one of the drivers for 2.0 was to make Arrow the default.

wodenokoto, about 2 years ago
Strings as objects, and integers turning into floats when NaNs are introduced, have been a much bigger annoyance to me than they ought to be.

I'm excited to try out the new pyarrow dtypes, but it also sounds confusing that there are now 2 classes of types.
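
A minimal sketch of the integer-to-float behaviour mentioned above, assuming only pandas is installed. The sample values and the use of `reindex` to introduce a missing entry are illustrative choices, and the nullable `Int64` dtype stands in for the pyarrow dtypes so the example runs without pyarrow.

```python
import pandas as pd

# Default NumPy-backed integers: introducing a missing value forces float64,
# because a NumPy integer array has no way to represent NaN.
s = pd.Series([1, 2, 3])
widened = s.reindex([0, 1, 2, 3])  # label 3 does not exist, so it becomes NaN
print(widened.dtype)

# Nullable extension dtype (illustrative stand-in for the pyarrow dtypes):
# the column stays integer and the gap is represented as pd.NA.
nullable = pd.Series([1, 2, 3], dtype="Int64").reindex([0, 1, 2, 3])
print(nullable.dtype)
```

The second series keeps its integer dtype, which is the behaviour the nullable and Arrow-backed types are meant to provide.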

henrydark, about 2 years ago
Say what you will about SQL, Polars, PySpark, or whatever else. But nothing beats pandas' df[col].value_counts().value_counts()
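
For readers unfamiliar with the idiom, here is a small sketch of what the chained call computes. The column name `user` and the sample data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"user": ["a", "a", "a", "b", "b", "c", "c", "d"]})

# First pass: how many rows each distinct value has (a: 3, b: 2, c: 2, d: 1).
per_value = df["user"].value_counts()

# Second pass: how many distinct values share each row count,
# i.e. a "frequency of frequencies" table.
freq_of_freq = per_value.value_counts()
print(freq_of_freq)
```

In this data, two users appear twice, one appears three times, and one appears once, which is exactly what the second `value_counts()` reports.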

edublancas, about 2 years ago
How are people managing the coexistence of data frame APIs like pandas/polars with SQL engines like BigQuery, Snowflake, and DuckDB?

Most of my notebooks are a mix of SQL and Python: SQL for most processing, dump the results as a pandas dataframe (via https://github.com/ploomber/jupysql), and then use Python for operations that are difficult to express in SQL (or that I don't know how to do in SQL), so I end up with 80% SQL, 20% Python.

Unsure if this is the best workflow, but it's the most efficient one I've come up with.

Disclaimer: my team develops JupySQL.
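
A rough sketch of the "SQL first, pandas after" split described above. It uses the standard-library `sqlite3` module and `pandas.read_sql_query` as a stand-in; it is not JupySQL, BigQuery, Snowflake, or DuckDB, and the `orders` table and its columns are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for a real SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('ann', 10.0), ('ann', 30.0), ('bob', 5.0), ('cy', 55.0);
    """
)

# SQL does the bulk of the processing: aggregate per customer in the engine.
totals = pd.read_sql_query(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY customer",
    conn,
)
conn.close()

# pandas takes over for the follow-up step on the (small) result set.
totals["share"] = totals["total"] / totals["total"].sum()
print(totals)
```

The design point is that only the aggregated result crosses into Python, so the dataframe stays small regardless of the size of the underlying table.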

fauxpause_, about 2 years ago
> Accessing a single column of a DataFrame as a Series (e.g. df["col"]) now always returns a new object every time it is constructed when Copy-on-Write is enabled (not returning multiple times an identical, cached Series object). This ensures that those Series objects correctly follow the Copy-on-Write rules (GH49450)

Is this going to mean I can't do df['a'] = 2 to set all values in column a to 2?
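
A short sketch relevant to the question above. Whole-column assignment goes through `DataFrame.__setitem__` rather than through the Series object returned by `df["a"]`, so the quoted change should not affect it. The sketch does not toggle Copy-on-Write, and the frame and column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Assigning to df["a"] replaces the column on the DataFrame itself;
# it does not depend on any Series previously returned by df["a"].
df["a"] = 2
print(df["a"].tolist())

# .loc is the explicit way to set a subset of rows in a single step,
# avoiding chained indexing such as df["a"][mask] = 0.
df.loc[df["b"] > 4, "a"] = 0
print(df["a"].tolist())
```

The pattern that Copy-on-Write changes is chained assignment, where a value is set on an intermediate object obtained from the frame instead of on the frame itself.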

cmcconomy, about 2 years ago
I'd love to know when geopandas snaps into alignment.

villgax, about 2 years ago
How I wish they had just changed .apply to add progress reporting and parallelization by default, instead of having to resort to tqdm and swifter/dask or what have you.

mongol, about 2 years ago
Am I right to think of the Chinese bear when I read about this project? Or is Pandas an acronym?

bb1234, about 2 years ago
If you are interested in benchmarks comparing different dataframe implementations, here is one:

https://h2oai.github.io/db-benchmark/

HopenHeyHi, about 2 years ago
You want to link to the actual release notes: https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html

nonfamous, about 2 years ago
Not a lot of people realize that Pandas was inspired by R, and in particular the Tidyverse model of handling rectangular data frames, created originally by Hadley Wickham. These days R is primarily used by data scientists in academia and certain niche industries like pharma, but its impact goes way beyond its core user base.

lvl102, about 2 years ago
I still prefer Stata if I am completely honest with myself.