The problem with this snippet isn't really the chaining; it's all the inlining. All the lists, and the many lambdas used, could be variables. Does this approach make it "professional code"?<p>The responses seem out of context, too:<p>>David: What's the elevator pitch for writing pandas code the way that you do?<p>>Matt: One common thing that you'll see in the data science world is this notion that there's like Untitled1.ipynb and Untitled2.ipynb[...]. My goal is to help with that so (...) you have Analysis_for_ClientA.ipynb and that's the only notebook you have. And you can come back to it tomorrow and pick it up where you left off and you're going to be productive. Your code will be easier to read[...].<p>This is a tweet. Filenames aren't even argued. This doesn't answer the interviewer's question either. Writing code != naming files.<p>>David: What is it that separates beginner pandas code from professional pandas code?<p>>Matt: I would say that if you want to write good pandas code (...) you should know how to write lambdas. You should know how to do list and dictionary comprehensions. Dictionary unpacking (...) is super useful in pandas world.<p>Absolutely. But professionals use variables, too. Possibly even more so.
> In my 20-plus years of working with data, I have multiple steps and I don't care about the intermediate steps.<p>Oh boy, do I care about every single intermediate step though!<p>Especially in pandas, where we play "where's the NaN" <i>all the damn time</i>.
I've done chaining myself and seen other people do it as well. The folks writing these massive functions may think they are gurus, but it makes functions virtually impossible to debug in prod. It flies in the face of the wisdom of "make your functions small".<p>I think this is one area where pandas and Polars can be improved. How do you write long chains and slot in breaks and testing?
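One way to slot checks into a long chain, sketched here rather than offered as the answer, is pandas' own `DataFrame.pipe`, which passes the intermediate frame to any function. The `check_no_nans` helper, the column names, and the data below are all made up for illustration.

```python
import pandas as pd


def check_no_nans(df, columns, step):
    """Hypothetical checkpoint: fail fast if any listed column contains NaN."""
    counts = df[columns].isna().sum()
    bad = counts[counts > 0]
    if not bad.empty:
        raise ValueError(f"NaNs after step {step!r}: {bad.to_dict()}")
    return df  # hand the frame back unchanged so the chain keeps going


# Made-up data with one missing price.
raw = pd.DataFrame({"price": [100.0, None, 300.0], "area": [50.0, 60.0, 75.0]})

result = (
    raw
    .dropna(subset=["price"])
    .pipe(check_no_nans, columns=["price"], step="dropna")
    .assign(price_per_area=lambda d: d["price"] / d["area"])
    .pipe(check_no_nans, columns=["price_per_area"], step="assign")
)
```

Because the checkpoint is an ordinary function, it can be unit tested on its own, and commenting out everything after a given `.pipe` line gives you a place to stop and inspect the frame.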
I had a whole rant queued up on "Pandas and its consequences have been a disaster for the human race" (well, at least for newbie programmers), but I think instead I want to focus on the damn dictionary splats. I just don't get it - it's pure "clever" code in the pejorative Dijkstra sense. It's hard to edit, it's hard to typecheck. Why not pay the very low whitespace tax to give each key/value pair its own longhand line:<p><pre><code> .astype({
    'central_air': bool,
    'ms_subclass': 'uint8',
    ...
})
</code></pre>
Now if, say, ms_subclass and overall_qual need <i>different</i> types, that's an easy diff to read. Ah, but I suppose that wouldn't be as Twitter-friendly.
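To make the comparison concrete, here is a small sketch of the two styles side by side. The grouping of columns and the `uint8` type for `overall_qual` are assumptions for illustration, not taken from the original snippet.

```python
# The "clever" version: build the dtype mapping from comprehensions and splats.
# Which columns belong to which group is assumed for this example.
splatted = {
    **{col: bool for col in ["central_air"]},
    **{col: "uint8" for col in ["ms_subclass", "overall_qual"]},
}

# The longhand version: one key/value pair per line.
longhand = {
    "central_air": bool,
    "ms_subclass": "uint8",
    "overall_qual": "uint8",
}

# Both build the same mapping, so either can be passed to .astype().
same_mapping = splatted == longhand

# Giving overall_qual a different type is a one-line change in the longhand
# version; in the splatted version it means splitting a list and adding a group.
longhand_changed = {**longhand, "overall_qual": "uint16"}
```

The two dictionaries are equal, so nothing is gained at runtime from the splats; the only difference is how the next edit reads in a diff.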
Random lists of strings are hard to decipher. What is that set of values supposed to represent? And it interrupts the flow of figuring out what’s going on.<p>I prefer assigning lists like that to informatively named variables rather than leaving them the subject of speculation. It’s easier to add clarifying comments that way too.<p>In SQL or pandas, long lists of values that aren’t broken up are hard to read. It’s easy to scan down a single value on each row, not random-length values spread randomly across the screen.<p>Also, that is chaining far too much in a single go.
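As a rough illustration of the named-list approach, with column names and data invented for the example:

```python
import pandas as pd

# Hypothetical column groups; the names say what each list is for.
ID_COLUMNS = ["order_id", "customer_id"]

# Everything that contributes to the amount charged, one value per row
# so the list can be scanned top to bottom.
MONEY_COLUMNS = [
    "subtotal",
    "tax",
    "shipping",
]

# Made-up data for the sketch.
orders = pd.DataFrame({
    "order_id": [1, 2],
    "customer_id": [10, 11],
    "subtotal": [10.0, 20.0],
    "tax": [1.0, 2.0],
    "shipping": [5.0, 0.0],
    "notes": ["gift", ""],
})

summary = (
    orders[ID_COLUMNS + MONEY_COLUMNS]
    .assign(total=lambda d: d[MONEY_COLUMNS].sum(axis=1))
)
```

The chain itself stays short, and the comment about what the list means sits next to the list instead of in the middle of the pipeline.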
I personally don’t like method chaining in Pandas because it makes troubleshooting difficult for me. On the other hand I love piping functions in tidyverse in R. I think there are a few libraries in Python that bring pipes to Pandas. I haven’t used any though so can’t comment on their usefulness.<p>Edit: Here is a library that brings pipes to pandas <a href="https://github.com/pwwang/datar" rel="nofollow">https://github.com/pwwang/datar</a>
I love the function chaining. It's basically functional programming with "immutable" intermediates (yes I know they're not really immutable, but we don't modify them in place).<p>Another good example of this style is tf.data pipelines. Also a very nice API.
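A tiny sketch of what "we don't modify them in place" looks like in practice, using made-up data: each method in the chain returns a new frame, so the starting frame is left as it was.

```python
import pandas as pd

# Made-up input frame.
raw = pd.DataFrame({"area": [50, 60], "price": [100, 300]})

cleaned = (
    raw
    .rename(columns={"area": "sq_m"})  # returns a new frame
    .assign(price_per_sq_m=lambda d: d["price"] / d["sq_m"])
    .loc[lambda d: d["price_per_sq_m"] > 3]  # keep only the pricier rows
)

# raw still has its original columns and both rows.
raw_columns_after = list(raw.columns)
```

This only holds as long as nobody passes `inplace=True` or assigns into a column of an intermediate frame, which is the caveat behind the scare quotes around "immutable".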
Quite frankly this is unreadable and unmaintainable code.<p>He doesn't articulate any of the virtues of it either, aside from some hand waving about 'memory' that doesn't get fleshed out.