Hope this helps someone:

https://github.com/capitalone/DataProfiler

We're working to automate much of that as well in a Python library. The end goal is to:

1. Point at any dataset and load it with one command.

2. Calculate statistics and identify entities with one command.

3. Generate robust reports with one command.

Regarding the data wrangling... believe it or not, even automatically detecting a delimited file with a header is hard work. Imagine the header is on the 3rd row, with a title and authorship on rows 1 and 2, respectively. Further, the delimiter might be the "@" symbol!

The linked library handles that scenario. But that's just CSVs; there's also JSON, Parquet, Avro, etc.

This is an extraordinarily deep and complex field.
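For anyone curious what that "one command per step" workflow looks like, here is a minimal sketch using DataProfiler's documented API. The file path is hypothetical and exact report options may vary between versions:

```python
# Minimal sketch of the three-step workflow with DataProfiler.
# The file path is hypothetical; report options may differ by version.
import json
import dataprofiler as dp

# 1. Point at any dataset and load it with one command.
#    Data() auto-detects the format (CSV/JSON/Parquet/Avro), delimiter, and header row.
data = dp.Data("messy_export.csv")

# 2. Calculate statistics and identify entities with one command.
profile = dp.Profiler(data)

# 3. Generate a report with one command.
report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(report, indent=2, default=str))
```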
I spend a good portion of my time creating / cleaning / structuring / feature engineering / running ML on data, and I am extremely skeptical of automated ways of doing that.
Everything I have seen so far was laughably beside the point. This whole process involves an enormous amount of judgment and trade-offs and, most importantly, a lot of knowledge of the respective domain. I have seen so many people with no domain knowledge try to run an "automated approach" and fail horribly.
I think a large part of the problem is that the Tolstoy "All clean data is clean in the same way, but all dirty data is dirty in its own way" riff is itself an over-simplification. In practice, "clean" just means "fit for my purpose" and "dirty" just means "unfit for my purpose".

If counting the instances of a particular string is all you need to do (and it quite often is), then de facto all data is "clean", because you can get the job done with grep. If, on the other hand, you require a relational database with strict primary and foreign key definitions and NULL constraints, then you're going to have to put a lot more work into scrubbing.

Added to that, the more stringent the definition of "clean" you're working with, the more trade-offs you'll have to deal with. E.g. what do you do with missing required field values? Loosen the requirement? Add defaults? Dump those rows? And what do you do with data that makes no sense, e.g. where end dates come before start dates? What if you need to anonymize the data to meet data privacy requirements? Is rounding dates to the nearest month or year fine, or will it render the data useless?

There are recurrent problems, and recurrent solutions, and the author is absolutely right to say that more open source work should be done in that area. But due to the complexity and inconsistency of the definitions of "clean" and "dirty", we will unfortunately never get to the point where we just feed some library the path to an arbitrary dirty data directory, press enter, go off to grab a coffee, and come back to find our freshly cleaned data waiting for us.
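To make those trade-offs concrete, here is a toy pandas sketch (file and column names are hypothetical). Every step below is a judgment call, not a universal rule:

```python
# Toy illustration of the trade-offs: drop vs. default vs. flag vs. coarsen.
# File name and column names are hypothetical.
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["start_date", "end_date"])

# Missing required field: loosen the requirement, add a default, or drop?
# Here we drop, which silently shrinks the dataset.
df = df.dropna(subset=["customer_id"])

# Data that makes no sense: end date before start date.
# Flag it rather than silently "fixing" it.
df["suspect_dates"] = df["end_date"] < df["start_date"]

# Anonymization by coarsening: round dates to the month.
# Fine for some analyses, useless for others.
df["start_month"] = df["start_date"].dt.to_period("M")
```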
1. Let's use machine learning on this dataset.

2. The data is dirty. We need to clean it up.

3. Let's clean it up using machine learning!

4. Goto 1.

It's machine learning all the way down...
There is a grey area between cleaning data and feature engineering. At some point, the more sophisticated methods for data cleaning become imputation and inference of features. I wonder what the ramifications of this are.
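A small scikit-learn sketch of where that line blurs (the toy matrix is made up): mean-filling feels like "cleaning", while model-based imputation is already learned inference baked into the inputs of the downstream model.

```python
# Simple imputation vs. model-based imputation on a toy matrix with missing values.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [8.0, 9.0]])

# "Cleaning": fill gaps with the column mean.
X_clean = SimpleImputer(strategy="mean").fit_transform(X)

# "Feature inference": predict each missing value from the other columns,
# so a learned estimate is now part of the feature matrix itself.
X_inferred = IterativeImputer(random_state=0).fit_transform(X)
```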
I hate SPSS, SAS, and other GUI-enhanced statistical software because they make it far too easy for people with no sound training to analyze data and come up with laughable conclusions. This is a nightmare.