I've always been curious why pandas implemented the index the way it did. I generally find myself doing .reset_index on everything by default because it's just one less thing to think about, but it's clear that the pandas devs are very fond of it based on the API. Where it still feels weird is when e.g. groupby/pivot by default return everything with a custom index, when I've given no indication that it needs to be treated differently from a column, but then e.g. merge doesn't do this? It's also written out by default in .to_csv, like just... why? Not useful for any csv that is to be used outside pandas. God help you if you end up needing to use a multi-index for something - deeply unpleasant.<p>Was this just a high-level (possibly misguided) paradigm that the pandas devs fell in love with - or is there a good, performance-related reason to embed it so deeply in the API?
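For anyone hitting the same friction, here's a minimal sketch (toy column names) of the two escape hatches: `as_index=False` on groupby, and `index=False` on to_csv:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# groupby puts the group keys into the index by default...
summed = df.groupby("key").sum()

# ...but as_index=False (or a trailing .reset_index()) keeps them as a column
summed_flat = df.groupby("key", as_index=False).sum()

# and to_csv can be told not to write the index out
csv_text = df.to_csv(index=False)
```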
Dplyr (and its spin-off dbplyr) is to me a fantastic <i>practical</i> example of the power of lispy metaprogramming. While R gets knocked a lot for not being a "real" programming language, or being weird, or just generally being different than e.g. Python, I don't know that I've seen a cooler example of high-level programming language ideas expressed as cleanly.<p>Ditto the rest of the tidyverse.
As a comparison to another language: for Julia there is DataFrames.jl: <a href="https://dataframes.juliadata.org/stable/" rel="nofollow">https://dataframes.juliadata.org/stable/</a><p>Comparison to dplyr, ...: <a href="https://dataframes.juliadata.org/stable/man/comparisons/#Comparison-with-the-R-package-dplyr" rel="nofollow">https://dataframes.juliadata.org/stable/man/comparisons/#Com...</a><p>Comparison to Pandas: <a href="https://dataframes.juliadata.org/stable/man/comparisons/#Comparison-with-the-Python-package-pandas" rel="nofollow">https://dataframes.juliadata.org/stable/man/comparisons/#Com...</a>
Author here--happy to answer questions :)<p>Siuba has come a long way since I wrote this, and now can optimize for fast grouped operations!:<p>* <a href="https://github.com/machow/siuba" rel="nofollow">https://github.com/machow/siuba</a><p>* <a href="https://siuba.readthedocs.io/en/latest/developer/pandas-group-ops.html" rel="nofollow">https://siuba.readthedocs.io/en/latest/developer/pandas-grou...</a>
This article compares dplyr syntax with pandas, siuba, polars, ibis and duckdb: <a href="https://simmering.dev/blog/dataframes/" rel="nofollow">https://simmering.dev/blog/dataframes/</a><p>As others have said, escaping pandas is hard. Many visualization, data manipulation, validation, and analysis libraries expect pandas input.<p>Siuba is really cool in that it offers a convenient syntax on top of pandas (and SQL databases) without requiring its own data format.
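For context on "no data format of its own": a siuba pipeline lowers to ordinary pandas calls on a plain DataFrame. A rough pandas equivalent of a dplyr-style group_by/summarize step (hypothetical column names) would be:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 3.0, 5.0]})

# roughly what a group_by(g) / summarize(avg = mean(x)) pipeline
# compiles down to in plain pandas
result = (
    df.groupby("g", as_index=False)
      .agg(avg=("x", "mean"))
)
```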
Hey, regarding groupby operations in pandas: for overly complicated groupbys I have been moving more and more to running them something like:<p><pre><code> out_rec = []
 for group_id, group in data_frame.groupby("id"):
     # ... whatever per-group logic ...
     result = f(group)
     out_rec.append(result)
 </code></pre>
In my experience it isn't much slower than a groupby.apply.
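To make the pattern concrete, here's a self-contained version with toy data and a placeholder `f` standing in for the real per-group function, stitching the results back together with pd.concat:

```python
import pandas as pd

data_frame = pd.DataFrame({"id": [1, 1, 2], "x": [10.0, 20.0, 30.0]})

def f(group):
    # placeholder for the real per-group computation
    return group.assign(x_centered=group["x"] - group["x"].mean())

out_rec = []
for group_id, group in data_frame.groupby("id"):
    result = f(group)
    out_rec.append(result)

combined = pd.concat(out_rec)
```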
I am distracted by the example provided in the post. I am pretty sure that most pandas users would just put course_ids on the columns. I mean, the shape of the user_courses dataframe is not suitable for the task.<p><pre><code> (user_courses
     .set_index(["student_id", "course_id"])
     .unstack()
     .apply(lambda x: x + 1))</code></pre>
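A runnable version of the same reshaping, with a toy user_courses frame (hypothetical values; note that missing student/course pairs become NaN after the unstack):

```python
import pandas as pd

user_courses = pd.DataFrame({
    "student_id": [1, 1, 2],
    "course_id": ["bio", "math", "math"],
    "grade": [90, 80, 70],
})

wide = (user_courses
        .set_index(["student_id", "course_id"])
        .unstack()               # course_ids move onto the columns
        .apply(lambda x: x + 1))  # elementwise op now spans all courses
```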
Meanwhile, every time I use an external library (including pandas) I still think numpy does everything you need, and does it well, and people just haven't bothered to learn it properly and keep reinventing the wheel.<p>(No disrespect to the package in the article or OP, who I know is active in this thread - just a general motif that I keep coming across in Python.)
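In that spirit: a grouped sum, for example, needs nothing beyond numpy itself - given integer group labels, `np.bincount` does it in one call:

```python
import numpy as np

groups = np.array([0, 0, 1, 2, 1])          # integer group labels
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# per-group sums without any dataframe library:
# entry i of the result is the sum of values where groups == i
group_sums = np.bincount(groups, weights=values)
```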
There is a project called Datar (<a href="https://github.com/pwwang/datar" rel="nofollow">https://github.com/pwwang/datar</a>), which mimics dplyr in Python.
That looks neat and was very interesting. Knowing how to do something fast enough in pandas is sometimes a dark art.<p>Port the functionality of the R package, but try to keep it Pythonic. Run flake8.