
Pandas 2.0 and the Arrow revolution

314 points, by ZeroCool2u, about 2 years ago

17 comments

MrPowers, about 2 years ago
The Arrow revolution is particularly important for pandas users.

pandas DataFrames are persisted in memory. The rule of thumb was for RAM capacity / dataset size in memory to be 5-10x as of 2017. Let's assume that pandas has made improvements and it's more like 2x now.

That means you can process datasets that take up 8GB of RAM in memory on a 16GB machine. But 8GB of RAM in memory is a lot different than what you'd expect with pandas.

pandas historically persisted string columns as objects, which was wildly inefficient. The new string[pyarrow] column type is around 3.5 times more efficient from what I've seen.

Let's say a pandas user can only process a string dataset that has 2GB of data on disk (8GB in memory) on their 16GB machine for a particular analysis. If their dataset grows to 3GB, then the analysis errors out with an out-of-memory exception.

Perhaps this user can now start processing string datasets up to 7GB (3.5 times bigger) with this more efficient string column type. This is a big deal for a lot of pandas users.
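The memory gap described above is easy to measure directly. A minimal sketch — the ~3.5x figure depends on the actual string contents, and the `string[pyarrow]` dtype is only available when pyarrow is installed, so that part is guarded:

```python
import pandas as pd

# A column of short repeated strings stored the historical way: as
# generic Python objects, one heap allocation per element.
words = ["alpha", "beta", "gamma", "delta"] * 250_000
obj_col = pd.Series(words, dtype=object)
print(obj_col.memory_usage(deep=True))

# The Arrow-backed string dtype packs the same data into contiguous
# buffers; pandas raises ImportError if pyarrow is missing.
try:
    arrow_col = pd.Series(words, dtype="string[pyarrow]")
    print(arrow_col.memory_usage(deep=True))
except ImportError:
    print("pyarrow not installed")
```

`memory_usage(deep=True)` counts the per-element Python string objects, which is where the object-dtype overhead lives; the shallow figure only counts the 8-byte pointers.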
yamrzou, about 2 years ago
The submission link points to the blog instead of the specific post. It should be: https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i
markers, about 2 years ago
I highly recommend checking out polars if you're tired of pandas' confusing API and poor ergonomics. It takes some time to make the mental switch, but polars is so much more performant and has a much more consistent API (especially if you're used to SQL). I find the code is much more readable and it takes fewer lines of code to do things.

Just beware that polars is not as mature, so take this into consideration if choosing it for your next project. It also currently lacks some of the more advanced data operations, but you can always convert back and forth to pandas for anything special (of course paying a price for the conversion).
loveparade, about 2 years ago
There is also Polars [0], which is backed by Arrow and a great alternative to pandas.

[0] https://www.pola.rs/
v8xi, about 2 years ago
Finally! I use pandas all the time, particularly for handling strings (dna/aa sequences) and tuples (often nested). Some of the most annoying bugs I encounter in my code are a result of random dtype changes in pandas. Things like it auto-converting str -> np.string (which is NOT a string) during pivot operations.

There are also all types of annoying workarounds you have to do while using tuples as indexes, resulting from pandas converting to a MultiIndex. For example

    srs = pd.Series({('a'): 1, ('b', 'c'): 2})

is a len(2) Series. srs.loc[('b', 'c')] throws an error while srs.loc[('a')] and srs.loc[[('b', 'c')]] do not. Not to vent my frustrations, but this maybe gives an idea of why this change is important, and I very much look forward to improvements in the area!
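The quirk above is reproducible as written. A runnable sketch — the exact exception type raised for the bare-tuple lookup has varied across pandas versions, so it is caught broadly here rather than asserted:

```python
import pandas as pd

# ('a') is just the string 'a' -- only ('b', 'c') is a real tuple key,
# so the index ends up holding a mix of scalars and tuples.
srs = pd.Series({('a'): 1, ('b', 'c'): 2})
print(len(srs))

# Scalar and list-of-labels lookups resolve as labels...
print(srs.loc['a'])
print(srs.loc[[('b', 'c')]])

# ...but a bare tuple is interpreted as a multi-level key, not a label.
try:
    srs.loc[('b', 'c')]
except Exception as exc:
    print(f"bare tuple lookup failed: {type(exc).__name__}")
```

Wrapping the tuple in a list (`srs.loc[[('b', 'c')]]`) is the usual workaround, since a list is unambiguously a collection of labels.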
pama, about 2 years ago
Polars can do a lot of useful processing while streaming a very large dataset without ever having to load in memory much more than one row at a time. Are there any simple ways to achieve such map/reduce tasks with pandas on datasets that may vastly exceed the available RAM?
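For the pandas side of this question, `read_csv(chunksize=...)` gives a plain map/reduce loop over fixed-size chunks. A small self-contained sketch — the file written here is just a stand-in for a dataset that would not fit in RAM:

```python
import pandas as pd

# Write a small CSV to stand in for a file too large to load at once.
pd.DataFrame({"value": range(1_000)}).to_csv("big.csv", index=False)

# chunksize makes read_csv yield DataFrames lazily, so the aggregation
# below never holds more than one chunk in memory at a time.
total, count = 0, 0
for chunk in pd.read_csv("big.csv", chunksize=100):
    total += chunk["value"].sum()
    count += len(chunk)
print(total / count)  # mean of 0..999 -> 499.5
```

This handles simple reductions (sums, counts, means); operations that need the whole dataset at once, like a global sort, are where out-of-core engines have the real advantage.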
vegabook, about 2 years ago
I do a lot of...

    some_pandas_object.values

to get to the raw numpy, because often dealing with a raw np buffer is more efficient or ergonomic. Hopefully losing the numpy foundations will not affect (the efficiency of) code which does this.
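For numpy-backed columns this pattern keeps working; `.to_numpy()` is the documented accessor that also keeps working when the backing store changes, at the cost of a conversion copy for non-numpy dtypes. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# For a numpy-backed column, .values exposes the underlying ndarray.
arr = s.values
print(isinstance(arr, np.ndarray))  # True

# .to_numpy() always returns an ndarray regardless of the backing
# store; for Arrow-backed columns that means a copy, not a view.
print(s.to_numpy().mean())  # 2.0
```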
college_physics, about 2 years ago
My feeling is that the pandas community should be bold and consider also overhauling the API besides the internals. Maybe keep the existing API for backward compatibility, but rethink what would be desirable for the next decade of pandas, so to speak. Borrowing liberally from what works in other ecosystems' APIs would be the idea. E.g. R, while far from beautiful, can be more concise, etc.
hopfenspergerj, about 2 years ago
Obviously this improves interoperability and the handling of nulls and strings. My naïve understanding is that polars columns are immutable because that makes multiprocessing faster/easier. I'm assuming pandas will not change their API to make columns immutable, so they won't be targeting multiprocessing like polars?
agolio, about 2 years ago
> As mentioned earlier, one of our top priorities is not breaking existing code or APIs

This is doomed then. The pandas API is already extremely bloated.
nathan_compton, about 2 years ago
I recently switched to Polars because Pandas is so absurdly weird. Polars is much, much better. I'll be interested in seeing how Pandas 2 is.
Wonnk13, about 2 years ago
So where does this place Polars? My perhaps incorrect understanding was that this (Arrow integration) was a key differentiator of Polars vs Pandas.
jimmyechan, about 2 years ago
Yes! Swapping out NumPy with Arrow underneath! Excited to have the performance of Arrow with the API of Pandas. Huge win for the data community!
sireat, about 2 years ago
The changes to handling strings and Python data types are welcome.

However, I am curious how Arrow beats NumPy on regular ints and floats.

For the last 10 years I've been under the impression that int and float columns in Pandas are basically NumPy ndarrays with extra methods.

Then NumPy ndarrays are basically C arrays with well-defined vector operations which are often trivially parallelizable.

So how does Arrow beat NumPy when calculating

    mean (int64)     2.03 ms    1.11 ms    1.8x
    mean (float64)   3.56 ms    1.73 ms    2.1x

What is the trick?
est, about 2 years ago
Now can mysql/pg provide binary Arrow protocols directly?
Kalanos, about 2 years ago
If you were facing memory issues, then why not use numpy memmap? It's effing incredible: https://stackoverflow.com/a/72240526/5739514

pandas is just for 2D columnar stuff; it's sugar on numpy.
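For reference, the memmap approach pointed to above looks like this: slices of a `np.memmap` are paged in from disk on demand instead of the whole array being loaded. The file name here is arbitrary for the demo:

```python
import numpy as np

# Create a disk-backed array and fill it; the data lives in the file,
# not (necessarily) in RAM.
mm = np.memmap("demo.dat", dtype="float64", mode="w+", shape=(1_000,))
mm[:] = np.arange(1_000)
mm.flush()

# Reopen read-only and aggregate; pages are faulted in as touched.
ro = np.memmap("demo.dat", dtype="float64", mode="r", shape=(1_000,))
print(ro.sum())  # 499500.0
```

The trade-off is that memmap only covers the flat-array case; it gives no help with the string columns and heterogeneous dtypes the rest of the thread is about.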
Gepsens, about 2 years ago
Pandas is still way too slow, if only there was an integration with datafusion or arrow2.