
Tips for saving memory with Pandas

30 points by bigsassy, over 3 years ago

2 comments

MrPowers, over 3 years ago
Here are the big tips I think the article missed:

Use the new string dtype that requires way less memory, see this video: https://youtu.be/_zoPmQ6J1aE. object types are really memory hungry and this new type is a game changer.

Use Parquet and leverage column pruning. `usecols` doesn't leverage column pruning. You need to use columnar file formats and specify the `columns` argument with `read_parquet`. You can never truly "skip" a column when using row-based file formats like CSV. The Spark optimizer does column projections automagically; you need to do them manually with Pandas.

Use predicate pushdown filtering to limit the data that's read into the DataFrame. Here's a blog post I wrote on this: https://coiled.io/blog/parquet-column-pruning-predicate-pushdown/

Use a technology like Dask (each partition in a Dask DataFrame is a Pandas DataFrame) that doesn't require everything to be stored in memory and can run computations in a streaming manner.
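A minimal sketch of the four techniques above, assuming a hypothetical `data.parquet` file with `user_id` and `amount` columns (the file name and columns are placeholders; the `filters` argument requires the pyarrow engine):

```python
import pandas as pd
import dask.dataframe as dd

# 1. The dedicated string dtype uses far less memory than object dtype.
s = pd.Series(["a", "b", "c"], dtype="string")

# 2. Column pruning: read_parquet deserializes only the listed columns,
#    unlike read_csv(usecols=...), which still scans every row in full.
df = pd.read_parquet("data.parquet", columns=["user_id", "amount"])

# 3. Predicate pushdown: the filter is applied while reading, so rows
#    failing the predicate never reach the DataFrame.
df = pd.read_parquet(
    "data.parquet",
    columns=["user_id", "amount"],
    filters=[("amount", ">", 100)],
)

# 4. Dask: each partition is a Pandas DataFrame, and computations stream
#    through partitions instead of loading everything into memory.
ddf = dd.read_parquet("data.parquet", columns=["user_id", "amount"])
total = ddf["amount"].sum().compute()
```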
mint2, over 3 years ago
It's useful to point out pitfalls of lower precision too. I don't usually see these articles go over that.

Operations on a few million rows of float32s can give strange results, for example when summing: `df["colA"].sum()` can be fairly different from `df.sort_index()["colA"].sum()`. It's a trap for the unwary.
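A small sketch of the pitfall mint2 describes, on made-up data: in float32, each addition rounds, and which errors cancel depends on accumulation order, so reordering rows can change the sum. (The column name and values here are illustrative, not from the original post.)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Two million float32 values near 1e6: the running total (~2e12) is far
# beyond float32's exact range, so summation rounds heavily.
vals = rng.normal(loc=1e6, scale=1.0, size=2_000_000).astype("float32")
df = pd.DataFrame({"colA": vals}, index=rng.permutation(len(vals)))

# The same numbers summed in two different row orders can disagree.
print(df["colA"].sum())
print(df.sort_index()["colA"].sum())

# Upcasting to float64 before summing makes the result stable across
# orderings, at the cost of the memory savings.
print(df["colA"].astype("float64").sum())
```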