What would it take to recreate dplyr in Python? (2020)

64 points by bshanks over 3 years ago

12 comments

sweezyjeezy over 3 years ago
Always been interested to know why pandas implemented index the way it did. I generally find myself doing .reset_index on everything by default because it's just one less thing to think about, but it's clear that pandas devs are very fond of it based on the API. Where it still feels weird is when e.g. groupby/pivot by default return everything with a custom index, when I've given no indication that it needs to be treated differently to a column, but then e.g. merge doesn't do this? It's also written out by default in .to_csv, like just... why? Not useful for any csv that is to be used outside pandas. God help you if you end up needing to use a multi-index for something - deeply unpleasant.

Was this just a high level (possibly misguided) paradigm that the pandas devs fell in love with - or is there a good, performance related reason to embed it so deeply in the API?
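
For reference, a minimal sketch of the index behaviour described above. The frame and its column names (`team`, `score`) are made up for illustration; `as_index=False`, `reset_index()` and `to_csv(index=False)` are the usual pandas options for keeping the grouping key as a plain column.

```python
import io

import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 2, 3]})

# Default: the grouping key becomes the index of the result.
indexed = df.groupby("team").sum()

# as_index=False keeps the key as an ordinary column instead.
flat = df.groupby("team", as_index=False).sum()

# reset_index() turns an existing index back into a column.
flat_again = indexed.reset_index()

# to_csv writes the index by default; index=False leaves it out.
buf = io.StringIO()
flat.to_csv(buf, index=False)
csv_text = buf.getvalue()
```

Whether to pass `as_index=False` up front or call `reset_index()` afterwards is mostly a matter of taste; both give the same flat frame here.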
peatmoss over 3 years ago
Dplyr (and its spin-off dbplyr) is to me a fantastic *practical* example of the power of lispy metaprogramming. While R gets knocked a lot for not being a "real" programming language, or being weird, or just generally being different than e.g. Python, I don't know that I've seen a cooler example of high-level programming language ideas expressed as cleanly.

Ditto the rest of the tidyverse.
dunefox over 3 years ago
As a comparison to another language: for Julia there is DataFrames.jl: https://dataframes.juliadata.org/stable/

Comparison to dplyr, ...: https://dataframes.juliadata.org/stable/man/comparisons/#Comparison-with-the-R-package-dplyr

Comparison to Pandas: https://dataframes.juliadata.org/stable/man/comparisons/#Comparison-with-the-Python-package-pandas
mgradowski over 3 years ago
DuckDB and Polars are my bets in the Python data-wrangling space. I grew tired of Pandas' weird-ass API.
closed over 3 years ago
Author here--happy to answer questions :)

Siuba has come a long way since I wrote this, and now can optimize for fast grouped operations!:

* https://github.com/machow/siuba

* https://siuba.readthedocs.io/en/latest/developer/pandas-group-ops.html
psimm over 3 years ago
This article compares dplyr syntax with pandas, siuba, polars, ibis and duckdb: https://simmering.dev/blog/dataframes/

As others have said, escaping pandas is hard. Many visualization, data manipulation, validation and analysis libraries expect pandas input.

Siuba is really cool in that it offers a convenient syntax on top of pandas (and SQL databases) without requiring its own data format.
lysecret over 3 years ago
Hey, regarding groupby operations in pandas: for overly complicated groupbys I have been moving more and more to running them something like:

    out_rec = []
    for id, group in data_frame.groupby("id"):
        # ladidida... (any other per-group work)
        result = f(group)
        out_rec.append(result)

In my experience it isn't much slower than a groupby.apply.
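
A runnable sketch of the loop pattern above, under stated assumptions: the frame, the `value` column and the body of `f` are hypothetical stand-ins, since the comment does not say what the per-group function does. Collecting one dict per group and building a frame at the end is one common way to finish the pattern.

```python
import pandas as pd

# Hypothetical data; only the "id" column name comes from the comment.
data_frame = pd.DataFrame({"id": [1, 1, 2, 2, 2], "value": [10, 20, 1, 2, 3]})


def f(group):
    # Hypothetical per-group computation returning one record per group.
    return {
        "id": group["id"].iloc[0],
        "total": group["value"].sum(),
        "rows": len(group),
    }


out_rec = []
for group_id, group in data_frame.groupby("id"):
    # Each `group` is an ordinary DataFrame holding the rows for one id.
    out_rec.append(f(group))

# Assemble the per-group records into a single result frame.
result = pd.DataFrame(out_rec)
```

The explicit loop makes it easy to debug a single group or to return records of any shape, at the cost of doing the iteration in Python.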
armanboyaci over 3 years ago
I am distracted by the example provided in the post. I am pretty sure that most pandas users would just put the course_ids on the columns; I mean, the shape of the user_courses dataframe is not suitable for the task.

    (user_courses
        .set_index(["student_id", "course_id"])
        .unstack()
        .apply(lambda x: x+1))
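
To make the reshaping concrete, here is a small sketch with invented data. Only the names `user_courses`, `student_id` and `course_id` come from the comment; the `score` column and all values are assumptions.

```python
import pandas as pd

# Hypothetical long-format data: one row per (student, course) pair.
user_courses = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "course_id": ["x", "y", "x", "y"],
    "score": [10, 20, 30, 40],
})

wide = (
    user_courses
    .set_index(["student_id", "course_id"])
    .unstack()                # course_id values become column labels
    .apply(lambda x: x + 1)   # applied to each course column in turn
)
```

After `unstack()` the columns form a MultiIndex such as `("score", "x")`, so each course can be addressed as its own column and the `apply` runs once per course.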
tpoacher over 3 years ago
Meanwhile, every time I use an external library (including pandas) I still think numpy does everything you need, and does it well, and people just haven't bothered to learn it properly and keep reinventing the wheel.

(No disrespect to the package in the article or OP, who I know is active in this thread. Just a general motif that I keep coming across in Python.)
usermi over 3 years ago
There is a project called Datar (https://github.com/pwwang/datar), which mimics dplyr in Python.
mint2 over 3 years ago
That looks neat and was very interesting. Pandas is sometimes a dark art to know how to do something fast enough.

Port the functionality of the R package but try to keep it python. Run flake8.
bobolito over 3 years ago
Check out Tidypolars