So many people don't realize pandas can be horribly slow if you use it "wrong" -- i.e., for computations that don't vectorize the way pandas expects. Working with dataframes that contain millions of rows is like playing Russian roulette -- there are usually many ways to do the same thing in pandas, and if you guess right you'll wait a minute or two until the computation's done; if you guess wrong it'll run out of RAM, segfault, or never finish.<p>For big datasets, I stopped using pandas a few years back for anything other than printing dataframes, datetime-indexed series, quick plots, or tiny/toy datasets -- in favor of numpy structured/record arrays. It's roughly the same thing, without all the groupby/index fluff, but very fast.<p>Just last week I helped a colleague speed up her code (a numerical solver for financial data) by more than 100x; the biggest win was ditching pandas entirely and using numpy.
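For anyone unfamiliar with them, here's a minimal sketch of what a numpy structured array looks like (the field names and data are invented for illustration) -- you get dataframe-style column access with plain ndarray speed:

```python
import numpy as np

# A structured array: named columns like a dataframe, but it's still
# one contiguous numpy array with no pandas machinery on top.
trades = np.zeros(5, dtype=[("price", "f8"), ("qty", "i4")])
trades["price"] = [10.0, 10.5, 9.8, 10.2, 10.1]
trades["qty"] = [100, 50, 200, 75, 125]

# Column math is fully vectorized:
notional = trades["price"] * trades["qty"]
total = notional.sum()  # 5512.5
```

There's no groupby or index here, which is exactly the point -- if your workload is numerical column math, this is often all you need.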
<p><pre><code> But pandas’ magical simplicity makes things like computed columns immediately intuitive:
> data['% of total'] = data.amount / data.amount.sum()
</code></pre>
Is that immediately intuitive? I'm staring at this trying to understand what it's doing. Is the / operator overloaded? Is data.amount one particular amount, and data.amount.sum() the sum of all amounts? Why does the "computed column" go on the same data object as the actual data? Maybe it's immediately intuitive if you've used pandas.
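To unpack it (with made-up sample data): data.amount is the whole "amount" column (a pandas Series), not a single value; .sum() collapses it to a scalar; and / is indeed overloaded, dividing every element by that scalar. Assigning to data['% of total'] adds a new column to the same dataframe:

```python
import pandas as pd

data = pd.DataFrame({"amount": [20.0, 30.0, 50.0]})

# Series / scalar: the division broadcasts over every row.
data["% of total"] = data.amount / data.amount.sum()

# Roughly equivalent plain-Python version:
total = sum(data["amount"])            # 100.0
shares = [x / total for x in data["amount"]]  # [0.2, 0.3, 0.5]
```

So it's shorthand for "divide each amount by the grand total and store the result as a new column alongside the data."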
For installing Jupyter, Anaconda works well across all platforms, even on most slightly older OSes.<p><a href="http://jupyter.readthedocs.io/en/latest/install.html" rel="nofollow">http://jupyter.readthedocs.io/en/latest/install.html</a><p>It works better for people to install Jupyter with Anaconda rather than use virtual environments, because there isn't the overhead of also having to learn about virtual environments. People tend to think of virtual environments as something just associated with the class, and don't use them much for their own work outside of the workshop or course.
I spend about 8 months of the year teaching pandas to journalism students, and it's a wild ride! Despite some of the iffy syntax and pandas' seeming inability to standardize parameter names, the students seem to grok the workflow much more quickly than wrangling lists and dictionaries in the "normal" world of Python.<p>I know everyone loves the reproducibility Notebooks supposedly bring to the table, but without a doubt my favorite part is the ability to export super-unattractive matplotlib charts as PDF, clean them up in Illustrator, and suddenly find yourself with publication-quality graphics. Knowing you're producing something more than just some numbers to toss in a story can be a strong sell to a lot of folks.
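The PDF-export workflow is just matplotlib's savefig with a .pdf extension -- a minimal sketch with invented sample data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# A deliberately rough chart; the polish happens later in Illustrator.
fig, ax = plt.subplots()
ax.bar(["2016", "2017", "2018"], [120, 95, 140])
ax.set_title("Complaints per year (sample data)")

# Saving as PDF keeps everything as vector shapes and text,
# so each bar and label stays individually editable in Illustrator.
fig.savefig("chart.pdf")
```

Because PDF is a vector format, Illustrator opens the chart as editable objects rather than pixels, which is what makes the cleanup step practical.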
I really like Jupyter, but somehow I'm not in love with it. Like, every time I fire it up to use it for quick data analysis, I seem to inevitably end up back in sublime + bash, sending plots to disk. Am I the odd one out?
It is hard to overstate just how ferociously bad the experience of getting Jupyter from blank computer to the equivalent of "Hello world" actually is.
I've found that most of the queries journalists are trying to run are pretty basic, mostly filtering and histograms. Setting up a virtualenv, dependencies, etc. can be tough, and RTFM isn't sufficient for someone just getting started. I was surprised that nothing existed for this, so I built it.<p>It has the basics of a Jupyter notebook - filter, sum, average, plot. So far it's attracted a pretty interesting audience including journalists, but also lawyers and consultants.<p>www.CSVExplorer.com
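For context, the "basic queries" in question amount to a handful of one-liners in pandas (column names and data invented here for illustration):

```python
import pandas as pd

# Toy data standing in for a typical uploaded CSV.
df = pd.DataFrame({
    "city": ["Austin", "Dallas", "Austin", "Houston"],
    "fine": [250, 100, 400, 150],
})

# Filter: keep rows matching a condition
austin = df[df["city"] == "Austin"]

# Sum and average of a column
total_fines = df["fine"].sum()   # 900
avg_fine = df["fine"].mean()     # 225.0

# Histogram-style counts per category
counts = df["city"].value_counts()
```

Each of these is one line, but the surrounding setup (Python, virtualenv, Jupyter, pandas) is the part that stops non-programmers -- which is the gap a hosted tool fills.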
Side note, I googled "pandas" and get a lot of results related to the python library, and very few related to the large mammal. Bing doesn't give me any related to the python library. Google knows me too well.