
Tips for saving memory with Pandas

30 points by bigsassy, over 3 years ago

2 comments

MrPowers, over 3 years ago
Here are the big tips I think the article missed:

Use the new string dtype that requires way less memory, see this video: https://youtu.be/_zoPmQ6J1aE. object types are really memory hungry and this new type is a game changer.

Use Parquet and leverage column pruning. `usecols` doesn't leverage column pruning. You need to use columnar file formats and specify the `columns` argument with `read_parquet`. You can never truly "skip" a column when using row-based file formats like CSV. The Spark optimizer does column projections automagically; you need to do them manually with Pandas.

Use predicate pushdown filtering to limit the data that's read into the DataFrame. Here's a blog post I wrote on this: https://coiled.io/blog/parquet-column-pruning-predicate-pushdown/

Use a technology like Dask (each partition in a Dask DataFrame is a Pandas DataFrame) that doesn't require everything to be stored in memory and can run computations in a streaming manner.
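A minimal sketch of the four techniques above, assuming a hypothetical `data.parquet` file with `user_id` and `amount` columns (the file name and columns are placeholders; the `filters` argument requires the pyarrow engine):

```python
import pandas as pd
import dask.dataframe as dd

# 1. The dedicated string dtype uses far less memory than object dtype.
s = pd.Series(["a", "b", "c"], dtype="string")

# 2. Column pruning: read_parquet deserializes only the listed columns,
#    unlike read_csv(usecols=...), which still scans every row in full.
df = pd.read_parquet("data.parquet", columns=["user_id", "amount"])

# 3. Predicate pushdown: the filter is applied while reading, so rows
#    failing the predicate never reach the DataFrame.
df = pd.read_parquet(
    "data.parquet",
    columns=["user_id", "amount"],
    filters=[("amount", ">", 100)],
)

# 4. Dask: each partition is a Pandas DataFrame, and computations stream
#    through partitions instead of loading everything into memory.
ddf = dd.read_parquet("data.parquet", columns=["user_id", "amount"])
total = ddf["amount"].sum().compute()
```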
mint2, over 3 years ago
It's useful to point out pitfalls of lower precision too. I don't usually see these articles go over that.

Operations on a few million rows of float32s can give strange results, for example when summing: `df["colA"].sum()` can be fairly different from `df.sort_index()["colA"].sum()`. It's a trap for the unwary.
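A small sketch of the pitfall mint2 describes, on made-up data: in float32, each addition rounds, and which errors cancel depends on accumulation order, so reordering rows can change the sum. (The column name and values here are illustrative, not from the original post.)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Two million float32 values near 1e6: the running total (~2e12) is far
# beyond float32's exact range, so summation rounds heavily.
vals = rng.normal(loc=1e6, scale=1.0, size=2_000_000).astype("float32")
df = pd.DataFrame({"colA": vals}, index=rng.permutation(len(vals)))

# The same numbers summed in two different row orders can disagree.
print(df["colA"].sum())
print(df.sort_index()["colA"].sum())

# Upcasting to float64 before summing makes the result stable across
# orderings, at the cost of the memory savings.
print(df["colA"].astype("float64").sum())
```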