
Tips for saving memory with Pandas

30 points by bigsassy over 3 years ago

2 comments

MrPowers over 3 years ago
Here are the big tips I think the article missed:

Use the new string dtype, which requires way less memory; see this video: https://youtu.be/_zoPmQ6J1aE. `object` columns are really memory hungry, and this new type is a game changer.

Use Parquet and leverage column pruning. `usecols` doesn't leverage column pruning. You need to use columnar file formats and specify the `columns` argument with `read_parquet`. You can never truly "skip" a column with row-based file formats like CSV. Spark's optimizer does column projections automagically; you need to do them manually with Pandas.

Use predicate pushdown filtering to limit the data that's read into the DataFrame. Here's a blog post I wrote on this: https://coiled.io/blog/parquet-column-pruning-predicate-pushdown/

Use a technology like Dask (each partition in a Dask DataFrame is a Pandas DataFrame) that doesn't require everything to be stored in memory and can run computations in a streaming manner.
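A minimal sketch of the first three tips, assuming a hypothetical "data.parquet" file with made-up column names; `pd.read_parquet` accepts both `columns` and `filters` when backed by the pyarrow engine:

```python
import pandas as pd

# 1. The dedicated string dtype uses far less memory than object columns.
df = pd.DataFrame({"name": ["alice", "bob"]})
df["name"] = df["name"].astype("string")

# 2. Column pruning: with a columnar format like Parquet, only the listed
#    columns are ever read from disk (unlike `usecols` with read_csv,
#    which still parses every column before discarding).
df = pd.read_parquet("data.parquet", columns=["name", "score"])

# 3. Predicate pushdown: `filters` lets the Parquet engine skip row groups
#    whose statistics rule out matching rows, before they reach memory.
df = pd.read_parquet(
    "data.parquet",
    columns=["name", "score"],
    filters=[("score", ">", 90)],
)
```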
mint2 over 3 years ago
It's useful to point out the pitfalls of lower precision too; I don't usually see these articles go over that.

Operations on a few million rows of float32s can give strange results, for example when summing: `df["colA"].sum()` can be fairly different from `df.sort_index()["colA"].sum()`. It's a trap for the unwary.
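A small demonstration of this pitfall, with hypothetical values: float32 addition is not associative, so summing the same column in two different orders can give different totals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"colA": rng.normal(1e3, 1.0, 2_000_000).astype("float32")}
)

# Shuffle the rows, then compare summing in shuffled vs. original order.
shuffled = df.sample(frac=1, random_state=1)
print(shuffled["colA"].sum())               # summed in shuffled order
print(shuffled.sort_index()["colA"].sum())  # summed in original order;
                                            # may differ in the low digits
```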