PandaPy has the speed of NumPy and the usability of Pandas

159 points by firedup, over 5 years ago

https://github.com/firmai/pandapy

PandaPy has the speed of NumPy and the usability of Pandas (10x to 50x faster)

13 comments

shoyer, over 5 years ago

It's a lovely idea to build pandas-like functionality on top of NumPy's structured dtypes, but these benchmarks comparing PandaPy to Pandas are extremely misleading. The largest input dataset has 1258 rows and 9 columns, so basically all these tests show is that PandaPy has less Python overhead.

For a more representative comparison, let's make everything 1000x larger, e.g., closing = np.concatenate(1000 * [closing])

Here's how a few representative benchmarks change:

- describe: PandaPy was 5x faster, now 5x slower
- add: PandaPy was 2-3x faster than pandas, now ~15x slower
- concat: PandaPy was 25-70x faster, now 1-2x slower
- drop/rename: PandaPy is now ~1000x faster (NumPy can clearly do these operations without any data copies)

I couldn't test merge because it needs a sorted dataset, but hopefully you get the idea -- these benchmarks are meaningless, unless for some reason you only care about manipulating small datasets very quickly.

At large scale, pandas has two major advantages over NumPy/PandaPy:

- Pandas (often) uses a columnar data format, which makes it much faster to manipulate large datasets.
- Pandas has hash tables which it can rely upon for fast look-ups instead of sorting.
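
To make the zero-copy and data-layout points concrete, here is a minimal NumPy sketch. It is not the benchmark code from the comment: the field names and the zero-filled data are hypothetical stand-ins, and it assumes NumPy 1.16 or later, where selecting several fields returns a view.

```python
import numpy as np

# Hypothetical stand-in for the benchmark data: 1258 rows, three named fields.
dtype = np.dtype([("open", "f8"), ("close", "f8"), ("volume", "i8")])
small = np.zeros(1258, dtype=dtype)
small["close"] = np.arange(1258)

# Scale the input 1000x, using the same trick as in the comment.
large = np.concatenate(1000 * [small])

# "Dropping" a field by selecting the remaining ones returns a view
# (NumPy >= 1.16), so no row data is copied, whatever the array size.
dropped = large[["open", "close"]]
shares = np.shares_memory(large, dropped)

# Structured arrays are row-oriented: one field is a strided view whose
# step is the full record size (24 bytes here), not one 8-byte float.
close_stride = large["close"].strides[0]

print(large.shape, shares, close_stride)
```

The stride is what the columnar-format remark is getting at: scanning one field of a structured array has to step over every other field in each record, whereas a columnar layout keeps each column contiguous.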

smabie, over 5 years ago

Pandas is usable? I had no idea...

Pandas is really badly designed, in the same way that most Python libraries are: each function has so many parameters, and a parameter can often be any of a bunch of different types. Pandas is useful, especially for time-series data, but no one particularly loves it. And it's embarrassingly slow. Maybe PandaPy is better, but I doubt it. When you start trying to use functions implemented in Python (vs. C ones), things are going to get bad no matter what you do.

Speaking of which, I decided to port a statistical model for betting from Python to Julia a week ago. I'm not done yet, and this is my first major experience with Julia, but it's been so much nicer than using Python. The performance can easily be 10x-50x faster without really doing any extra work.

Also, the language feels explicitly designed for scientific computing and really meshes well with the domain. Python the language never really was good for this, but the libraries were pretty compelling. Julia libraries have almost caught up or, in some domains like linear algebra, have actually exceeded what's available for Python. Moreover, if you need to, PyCall is really easy to use.

I'm going to go out on a limb and say that people shouldn't be using Python for new scientific computing projects. Julia has arrived, and is better in every way (I'm still unsure about the 1-based indexing, but I'm sure I'll get over it. 0-based was never that great in the first place).

fjp, over 5 years ago

Some Python devs seem to pull in Pandas whenever any math is required.

IMO the Pandas documentation somehow manages to document every parameter of every method and yet is almost as helpful as no documentation at all. Combined with the fact that it's a huge package, I avoid it unless I really, really need it.

A version with human-understandable docs could convince me otherwise.

gewa, over 5 years ago

I've worked with Pandas and NumPy on different projects, and I really like NumPy's low-level, component-based way of working. In most cases where I used Pandas, I regretted it at some point. Starting with OOP and NumPy would've been a better solution, especially because of the ease of Numba integration.

sriku, over 5 years ago

Nice to see... but I think Julia is pretty much targeted at not having to do this kind of juggling.

(Don't get me wrong. I actually appreciate the work, but I also use Julia.)

anakaine, over 5 years ago

The one reference I didn't see was to chunking. I'm currently using Dask because of its graceful chunking of large and medium data, but PandaPy doesn't make reference to this capability.

enriquto, over 5 years ago

My whole work consists of manipulating arrays of numbers, mostly in Python, and I have never found any use for pandas. Whenever I receive some code that uses pandas, it is easy to remove this dependency without much ado (it was not really necessary for anything).

Can anybody point me to a reasonable use case for pandas? I mean, besides printing a matrix with lines of alternating colors.

beefield, over 5 years ago

Slightly off-topic: I have occasionally tried to learn pandas, but having worked quite a lot with SQL, there is one thing that I can't get over. Is there a way to force pandas to have the same data type for each element in a column? (In particular, pandas seems to think that NaN is a valid replacement for None, and after that you really can't trust anything to run on a column because the data types may change.)

Or, more likely, I have missed some idiomatic way to work with pandas.
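
One possible answer, sketched under the assumption of a recent pandas release (1.0 or later) that provides the nullable "Int64" extension dtype. The sample values are made up for illustration.

```python
import pandas as pd

# Default behaviour: a missing value forces an integer column to float64,
# and None is stored as NaN.
default = pd.Series([1, None, 3])

# Requesting the nullable "Int64" extension dtype keeps the integers as
# integers and marks the gap as missing instead of converting to float.
nullable = pd.Series([1, None, 3], dtype="Int64")

print(default.dtype)
print(nullable.dtype)
```

Declaring the dtype up front is the closest pandas equivalent to a typed SQL column: the missing entry no longer changes the type of the column.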

gww, over 5 years ago

There's a cool Python library called anndata (https://icb-anndata.readthedocs-hosted.com/en/stable/anndata.AnnData.html). It's designed for single-cell RNA-seq experiments where datasets have multiple 2D matrices of data along with row/column annotation data. Its use of NumPy structured arrays is interesting.

ben509, over 5 years ago

If you've mucked with numpy dtypes, they're shockingly powerful, but this seems like a much nicer way to do it. Great idea!
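
For readers who have not used them, this is a minimal sketch of what plain NumPy structured dtypes already allow. The tickers and numbers are invented for illustration, and this is ordinary NumPy, not PandaPy's API.

```python
import numpy as np

# A small table with named, typed fields (illustrative values only).
prices = np.array(
    [("AAA", 310.5, 100), ("BBB", 165.0, 250), ("CCC", 1480.0, 40)],
    dtype=[("ticker", "U4"), ("close", "f8"), ("volume", "i8")],
)

# Sort the records by one field.
by_close = np.sort(prices, order="close")

# Filter rows with a boolean mask built from another field.
liquid = prices[prices["volume"] >= 100]

print(by_close["ticker"])
print(liquid["ticker"])
```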

kristianp, over 5 years ago

Has anyone here compared Turi Create with pandas and NumPy recently? It was open-sourced by Apple: https://github.com/apple/turicreate

Seems like it's good for creating ML models and deploying them to Apple devices.

hsaliak, over 5 years ago

nice to see more libraries in python that embrace optional static typing

throwlaplace, over 5 years ago

isn't pandas already built on top of numpy? so what does this mean and owing to what is it faster?
