TechEcho

12 comments

sweezyjeezy over 3 years ago

I've always been interested to know why pandas implemented the index the way it did. I generally find myself doing .reset_index on everything by default because it's just one less thing to think about, but it's clear from the API that the pandas devs are very fond of it. Where it still feels weird is when e.g. groupby/pivot return everything with a custom index by default, when I've given no indication that it needs to be treated differently from a column, yet merge doesn't do this. It's also written out by default in .to_csv, like just... why? Not useful for any CSV that is to be used outside pandas. God help you if you end up needing to use a multi-index for something: deeply unpleasant.

Was this just a high-level (possibly misguided) paradigm that the pandas devs fell in love with, or is there a good, performance-related reason to embed it so deeply in the API?
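For anyone hitting the same friction, here is a minimal sketch of the three behaviors mentioned (groupby, merge, to_csv). The `sales` and `managers` frames and their column names are made up for illustration; `as_index=False` and `index=False` are the standard pandas parameters for opting out of the index in each case.

```python
import io

import pandas as pd

# Hypothetical data, for illustration only.
sales = pd.DataFrame({
    "region": ["north", "north", "south"],
    "amount": [10, 20, 5],
})

# Default groupby: the grouping key becomes the index of the result.
by_index = sales.groupby("region").sum()

# as_index=False keeps the key as an ordinary column instead.
by_column = sales.groupby("region", as_index=False).sum()

# merge joins on columns and returns a fresh default integer index.
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ann", "Bo"]})
merged = by_column.merge(managers, on="region")

# to_csv writes the index as an extra first column unless index=False.
buf = io.StringIO()
merged.to_csv(buf, index=False)
csv_text = buf.getvalue()
```

This only shows how to opt out; it does not answer the design question of why the index is the default.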

peatmoss over 3 years ago

Dplyr (and its spin-off dbplyr) is to me a fantastic practical example of the power of lispy metaprogramming. While R gets knocked a lot for not being a "real" programming language, or being weird, or just generally being different than e.g. Python, I don't know that I've seen a cooler example of high-level programming language ideas expressed as cleanly.

Ditto the rest of the tidyverse.

dunefox over 3 years ago

As a comparison to another language: for Julia there is DataFrames.jl: https://dataframes.juliadata.org/stable/

Comparison to dplyr, ...: https://dataframes.juliadata.org/stable/man/comparisons/#Comparison-with-the-R-package-dplyr

Comparison to Pandas: https://dataframes.juliadata.org/stable/man/comparisons/#Comparison-with-the-Python-package-pandas

mgradowski over 3 years ago

DuckDB and Polars are my bets in the Python data-wrangling space. I grew tired of Pandas' weird-ass API.

closed over 3 years ago

Author here, happy to answer questions :)

Siuba has come a long way since I wrote this, and can now optimize for fast grouped operations:

* https://github.com/machow/siuba
* https://siuba.readthedocs.io/en/latest/developer/pandas-group-ops.html

psimm over 3 years ago

This article compares dplyr syntax with pandas, siuba, polars, ibis and duckdb: https://simmering.dev/blog/dataframes/

As others have said, escaping pandas is hard. Many visualization, data manipulation, validation and analysis libraries expect pandas input.

Siuba is really cool in that it offers a convenient syntax on top of pandas (and SQL databases) without requiring its own data format.

lysecret over 3 years ago

Regarding groupby operations in pandas: for groupbys that are too complicated, I have been moving more and more to running them something like this:

    out_rec = []
    for group_id, group in data_frame.groupby("id"):
        # ... any other per-group work ...
        result = f(group)
        out_rec.append(result)

In my experience it isn't much slower than a groupby.apply.
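A self-contained version of that loop might look like the sketch below. The data, the `value` column, and the body of `f` are all hypothetical stand-ins; the only part taken from the comment is the pattern of iterating over the groups and collecting one result per group. No timing is claimed here.

```python
import pandas as pd

# Hypothetical data, for illustration only.
data_frame = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "value": [3.0, 5.0, 1.0, 2.0, 6.0],
})


def f(group: pd.DataFrame) -> dict:
    """Hypothetical per-group function: summarize one group as a record."""
    return {
        "n_rows": len(group),
        "value_range": group["value"].max() - group["value"].min(),
    }


out_rec = []
for group_id, group in data_frame.groupby("id"):
    result = f(group)
    result["id"] = group_id  # keep the group key alongside the result
    out_rec.append(result)

# Assemble the per-group records into a single frame at the end.
looped = pd.DataFrame(out_rec)
```

Returning plain dicts and building the frame once at the end keeps each step easy to inspect or debug in isolation.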

armanboyaci over 3 years ago

I am distracted by the example provided in the post. I am pretty sure that most pandas users would just put the course_ids on the columns. I mean, the shape of the user_courses dataframe is not suitable for the task.

    (user_courses
        .set_index(["student_id", "course_id"])
        .unstack()
        .apply(lambda x: x + 1))
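To make the reshaping concrete, here is a small sketch on made-up data. The `score` column and all values are assumptions; only the `student_id` and `course_id` names and the method chain come from the snippet above, and the post's actual data may look different.

```python
import pandas as pd

# Hypothetical long-format data; the "score" column is an assumption.
user_courses = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "course_id": ["a", "b", "a", "b"],
    "score": [70, 80, 90, 60],
})

wide = (
    user_courses
    .set_index(["student_id", "course_id"])
    .unstack()               # course_id values move into the columns
    .apply(lambda x: x + 1)  # stand-in for any column-wise computation
)
```

After `unstack`, there is one row per student and one column per (value column, course) pair, so per-course operations become plain column-wise operations.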

tpoacher over 3 years ago

Meanwhile, every time I use an external library (including pandas) I still think numpy does everything you need, and does it well, and people just haven't bothered to learn it properly and keep reinventing the wheel.

(No disrespect to the package in the article or to OP, who I know is active in this thread. It's just a general motif that I keep coming across in Python.)

usermi over 3 years ago

There is a project called Datar (https://github.com/pwwang/datar), which mimics dplyr in Python.

mint2 over 3 years ago

That looks neat and was very interesting. With pandas, knowing how to do something fast enough is sometimes a dark art.

Port the functionality of the R package but try to keep it Python. Run flake8.

bobolito over 3 years ago

Check out Tidypolars

What would it take to recreate dplyr in Python? (2020)
