I've had to dive into the pandas code over the last year for a project [0], and my attitude has shifted dramatically from...<p><pre><code> * old attitude: why does pandas have to make things so hard
* new attitude: pandas has a crazy difficult job
</code></pre>
I think this is most apparent in the functions that decide what "[d]type" a Block--the most basic thing that stores data in pandas--should be.<p><a href="https://github.com/pandas-dev/pandas/blob/4edcc5541ff3f6470f5e3c083cb83136119e6f0c/pandas/core/internals/blocks.py#L2973" rel="nofollow">https://github.com/pandas-dev/pandas/blob/4edcc5541ff3f6470f...</a><p>And then, for the ubiquitous object dtype, pandas often has to figure out which of the many possible more specific types to cast it to.<p>If you think that is easy, ask yourself what this outputs:<p><pre><code> import numpy as np
np.array([np.nan, 'a'])
</code></pre>
Lo and behold--it produces an array where the np.nan has been converted to the string "nan".<p>And yet<p><pre><code> import pandas as pd
pd.Series([np.nan, "a"])
</code></pre>
Knows this, has your back, and does not stringify it.<p>It also has a pathological fixation on <i>when</i> it tries to convert dtypes, since avoiding all the bad conversion outcomes is a relatively time-intensive process (compared to e.g. creating a numpy array).<p>I realize things could be much easier in pandas' user-facing interface, but I really appreciate the sheer amount of effort that has gone into its dtype wrangling.<p>[0]: <a href="http://github.com/machow/siuba" rel="nofollow">http://github.com/machow/siuba</a>
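<p>For anyone who wants to check the NumPy/pandas contrast described above, here is a minimal sketch. The variable names are mine, and the exact dtype pandas reports for the Series may differ between versions, so the sketch only inspects the missing-value behaviour.

```python
import numpy as np
import pandas as pd

# NumPy picks one common dtype for the whole array, so the float NaN
# is coerced to the string "nan".
arr = np.array([np.nan, "a"])

# pandas keeps the first element as a genuine missing value.
ser = pd.Series([np.nan, "a"])

print(arr.dtype.kind, repr(arr[0]))   # dtype kind and the stringified NaN
print(ser.isna().tolist())            # which Series entries count as missing
```

<p>The point is only that the NumPy array no longer contains anything recognisable as missing, while the Series still does.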
Great accomplishment and kudos to the dedicated maintainers. That being said, I've always had a love-hate relationship with pandas. It is a very powerful library and does a ton, yet the API is all over the place, and unless you use it regularly for a long period of time, it is almost impossible to get fluent with it. Every time I am away from it for a couple of months, I find even the most basic things complicated and confusing, and I end up on Stack Overflow way too often.<p>By comparison, the API of something like PyTorch is an absolute pleasure to use, and even though I'm not using it all the time, I have almost no trouble every time I begin training models or trying out new things in PyTorch.<p>All that being said, this is definitely a step in the right direction and hopefully the API gets a bit more coherent over time.
I know I'm not the only one, but it's hard to imagine doing my job the last several years without Pandas. Even though Pandas has been used in production by many people as basically a 1.0.0 release for a long time, this is an amazing milestone and I think everyone in my office smiled when they saw the release news.<p>I think it's worth acknowledging the great stewardship of the community by all the Pandas developers (and the rest of the people in the PyData ecosystem). It has been an inspiration for me as I create and contribute to open source libraries for data science [0][1].<p>[0] <a href="https://github.com/FeatureLabs/featuretools/" rel="nofollow">https://github.com/FeatureLabs/featuretools/</a>
[1] <a href="https://github.com/FeatureLabs/compose/" rel="nofollow">https://github.com/FeatureLabs/compose/</a>
I am looking forward to a decade of fewer API breaking changes. However, 1.0 introduces a new column type for strings, recommends its use over the old "object" column type, yet says it is "Experimental and may change at any time."<p>How are we supposed to interpret this in light of the promise that there will be no more API breakages until 2.0? It reads as if this promise does not apply to string data, which impacts rather a lot of use cases.
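<p>For context, a minimal sketch of the two options being discussed, assuming pandas 1.0 or later. The variable names are illustrative, and what the default inference reports depends on the pandas version, so the sketch prints it rather than assuming it.

```python
import pandas as pd

# Default inference for a list of strings with a missing entry
# (the "object" column type in pandas 1.0).
inferred = pd.Series(["a", None])

# Explicitly opting in to the dedicated string dtype introduced in 1.0.
as_string = pd.Series(["a", None], dtype="string")

print(inferred.dtype)
print(as_string.dtype)
print(as_string.isna().tolist())  # the None entry is tracked as missing
```

<p>Since the new dtype is opt-in, code that never passes dtype="string" keeps whatever the default inference gives it.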
Could we collect some recommendations for really good books, online guides, tutorials, and recipes for current Pandas?<p>There are quite a few complaints here about the interface being confusing and difficult to use, and I feel like some of this is due to there being significant differences between versions. I would love to read a medium-length, free online tutorial on Pandas 1.0, but it seems like most of what turns up on Google consists of short, idiosyncratic tutorials on specific tasks in various versions.
Pandas is my least favorite necessary evil. Its always-changing, far too expansive API costs me about an hour a week.<p>Whenever one can use a UTC epoch column for time-indexed data in a raw numpy array instead, one should.
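<p>A rough sketch of what the epoch-column approach can look like, assuming sorted timestamps stored as int64 seconds. The data and the select_range helper are hypothetical, made up purely for illustration.

```python
import numpy as np

# Hypothetical data: one reading per minute, timestamps as UTC epoch seconds.
start = 1_580_000_000  # arbitrary example epoch value
timestamps = start + 60 * np.arange(10, dtype=np.int64)
values = np.arange(10, dtype=np.float64)

def select_range(ts, vals, lo, hi):
    """Return the vals whose timestamp t satisfies lo <= t < hi.

    Assumes ts is sorted ascending, so binary search can be used.
    """
    left = np.searchsorted(ts, lo, side="left")
    right = np.searchsorted(ts, hi, side="left")
    return vals[left:right]

# Readings from minute 2 (inclusive) to minute 5 (exclusive).
window = select_range(timestamps, values, start + 120, start + 300)
print(window)
```

<p>This trades away resampling, time zones, and label-based alignment for a slice over two plain arrays, which is the trade the comment is arguing for.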
No mention of vaex?<p><a href="https://vaex.readthedocs.io/en/latest/" rel="nofollow">https://vaex.readthedocs.io/en/latest/</a><p>It has a cleaner, leaner API + the ability to use memory-mapped files.
I've been waiting for this release for years and I hoped for one thing and one thing only - for pandas to have a proper way of dealing with NULLs. And it does have it... OPTIONALLY.<p>It's great that the whole thing with extension arrays, custom types etc. has led to this, but when the devs have, after 10+ years, the biggest chance for a backward-incompatible change, this is the one to make. By making it optional, they are fixing it only for the very few who know of its existence.<p>I love pandas and a sizeable part of my career depended on it - and while I don't use it anymore (partly because of the NULLs), I wish it the best and I hope there will be a future release with this breaking change.
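<p>To make the "optionally" concrete, here is a small sketch using an integer column, assuming pandas 1.0 or later. The variable names are illustrative.

```python
import pandas as pd

# Default behaviour: a missing value forces the integer column to float64,
# with NaN standing in for the missing entry.
default = pd.Series([1, 2, None])

# Opt-in nullable integer dtype: values stay integers, the gap is pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(default.dtype)
print(nullable.dtype)
print(nullable.sum())  # missing entries are skipped by default
```

<p>You only get the second behaviour by asking for dtype="Int64" explicitly, which is the complaint above.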
Congratulations to the Pandas team.
You lot have saved my bacon so many times over the last few years, I owe you many breakfasts.<p>Long live the King.
"pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."<p>If anyone else is wondering what this is. (Source: project homepage