Automated Data Wrangling

92 points by makaimc almost 4 years ago

9 comments

lettergram almost 4 years ago
Hope this helps someone:

https://github.com/capitalone/DataProfiler

We're working to automate much of that as well in a Python library.

The end goal is to:

1. Point at any dataset and load it with one command.
2. Calculate statistics and identify entities with one command.
3. Generate robust reports with one command.

Regarding the data wrangling... Believe it or not, even automatically detecting a delimited file with a header is hard work. Imagine a header that sits on the 3rd row, with a title and authorship on rows 1 and 2 respectively. Further, the delimiter might be the "@" symbol!

The linked library handles that scenario. But that's just CSVs; there's also JSON, Parquet, Avro, etc.

This is an extraordinarily deep and complex field.
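To make the header and delimiter scenario above concrete, here is a minimal heuristic sketch using only the Python standard library. It is not how DataProfiler does it; the function name `sniff_delimited`, the candidate delimiter list, and the sample text are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical candidate list; a real tool would consider many more.
CANDIDATE_DELIMITERS = [",", "\t", ";", "|", "@"]


def sniff_delimited(text, candidates=CANDIDATE_DELIMITERS):
    """Guess (delimiter, header_row_index) for a blob of delimited text.

    Heuristic only: for each candidate, find the most common non-zero
    number of occurrences per line, and pick the candidate for which that
    count covers the most lines. The first line with that count is
    treated as the header row (0-based index).
    """
    lines = text.splitlines()
    best = None  # (lines_covered, delimiter, count_per_line)
    for delim in candidates:
        counts = Counter(line.count(delim) for line in lines)
        counts.pop(0, None)  # lines without the delimiter tell us nothing
        if not counts:
            continue
        count, covered = counts.most_common(1)[0]
        if best is None or covered > best[0]:
            best = (covered, delim, count)
    if best is None:
        raise ValueError("no candidate delimiter found")
    _, delim, count = best
    header_index = next(
        i for i, line in enumerate(lines) if line.count(delim) == count
    )
    return delim, header_index


# Title on row 1, authorship on row 2, header on row 3, "@" as delimiter.
sample = (
    "Quarterly sales report\n"
    "Author: J. Doe\n"
    "region@units@revenue\n"
    "north@10@250.0\n"
    "south@7@175.5\n"
)
print(sniff_delimited(sample))  # ('@', 2)
```

Even this toy version breaks easily, for example when quoted fields contain the delimiter or when the preamble happens to contain it as often as the data rows do, which is part of why the problem is harder than it looks.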
lysecret almost 4 years ago
I spend a good portion of my time creating / cleaning / structuring / feature engineering / running ML on data, and I am extremely skeptical of automated ways of doing that. Everything I have seen so far was laughably beside the point. This whole process involves an extreme amount of judgment and tradeoffs and, most importantly, a lot of knowledge of the respective domain. I have seen so many people with no domain knowledge try to run an "automated approach" and fail horribly.
ab111111111 almost 4 years ago
I think a large part of the problem is that the Tolstoy "All clean data is clean in the same way, but all dirty data is dirty in its own way" riff is itself an over-simplification. In practice, "clean" just means "fit for my purpose" and "dirty" just means "unfit for my purpose".

If counting the instances of a particular string is all you need to do (and it quite often is), then de facto all data is "clean", as you can get the job done with grep. If, on the other hand, you require a relational database with strict primary and foreign key definitions and NULL limitations, then you're gonna have to put a lot more work into scrubbing.

Added to that, the more stringent the definition of "clean" you're working with, the more you're going to have to deal with trade-offs. E.g. what do you do with missing required data field values? Loosen the requirement? Add defaults? Dump those rows? And what do you do with data that makes no sense, e.g. where end dates come before start dates? What if you need to anonymize the data to meet data privacy requirements? Is rounding dates to the nearest month or year fine, or will it render the data useless?

There are recurrent problems, and recurrent solutions, and the author is absolutely right to say that more open source work should be done in that area. But due to the complexity and inconsistency of the definitions of "clean" and "dirty", we will unfortunately never get to the point where we just feed some library the path to an arbitrary dirty data directory, press enter, go off to grab a coffee and come back to find our freshly cleaned data waiting for us.
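One way to see why those trade-offs resist full automation is to write them down as explicit parameters. The sketch below is purely illustrative: the function `clean_rows`, its policy names, and the sample rows are hypothetical, and truncating dates to the first of the month stands in for the "rounding" mentioned above.

```python
from datetime import date


def clean_rows(rows, missing="drop", default_end=None,
               bad_range="drop", truncate_to_month=False):
    """Apply explicit cleaning policies to rows with 'start'/'end' dates.

    missing:   "drop" rows lacking an end date, fill them with
               `default_end` ("default"), or "keep" them as-is.
    bad_range: "drop" rows whose end precedes start, or "swap" the dates.
    truncate_to_month: coarsen dates to the first of the month, a crude
               stand-in for anonymization by rounding.
    """
    if missing not in {"drop", "default", "keep"}:
        raise ValueError(f"unknown missing policy: {missing}")
    if bad_range not in {"drop", "swap"}:
        raise ValueError(f"unknown bad_range policy: {bad_range}")

    cleaned = []
    for row in rows:
        row = dict(row)  # never mutate the caller's data
        if row["end"] is None:
            if missing == "drop":
                continue
            if missing == "default":
                row["end"] = default_end
        if row["end"] is not None and row["end"] < row["start"]:
            if bad_range == "drop":
                continue
            row["start"], row["end"] = row["end"], row["start"]
        if truncate_to_month:
            row["start"] = row["start"].replace(day=1)
            if row["end"] is not None:
                row["end"] = row["end"].replace(day=1)
        cleaned.append(row)
    return cleaned


rows = [
    {"id": 1, "start": date(2021, 1, 15), "end": date(2021, 3, 2)},
    {"id": 2, "start": date(2021, 5, 20), "end": None},
    {"id": 3, "start": date(2021, 6, 10), "end": date(2021, 6, 1)},
]
strict = clean_rows(rows)
lenient = clean_rows(rows, missing="keep", bad_range="swap")
print([r["id"] for r in strict])   # [1]
print([r["id"] for r in lenient])  # [1, 2, 3]
```

The same input yields one row or three depending on the policy, and nothing in the data itself says which answer is right; that choice comes from the purpose the data is meant to serve.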
hermitcrab almost 4 years ago
1. Let's use machine learning on this dataset.
2. The data is dirty. We need to clean it up.
3. Let's clean it up using machine learning!
4. Goto 1.

It's machine learning all the way down...
daemonk almost 4 years ago
There is a grey area between cleaning data and feature engineering. At some point, the more sophisticated methods for data cleaning become imputation and inference of features. I wonder what the ramifications of this are.
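A small sketch of where that grey area starts: filling a missing value with a mean looks like cleaning, but the filled value depends on a modelling choice. The function `impute_mean` and the sample rows are hypothetical, chosen only to show the effect.

```python
from collections import defaultdict
from statistics import mean


def impute_mean(rows, field, group_by=None):
    """Fill missing `field` values with a mean.

    With group_by=None a single global mean is used; otherwise each row
    gets the mean of the rows sharing its `group_by` value. A group with
    no observed values raises KeyError in this sketch.
    """
    def group_of(row):
        return row[group_by] if group_by is not None else None

    observed = defaultdict(list)
    for row in rows:
        if row[field] is not None:
            observed[group_of(row)].append(row[field])
    fills = {group: mean(values) for group, values in observed.items()}

    return [
        {**row, field: fills[group_of(row)]} if row[field] is None else dict(row)
        for row in rows
    ]


rows = [
    {"team": "a", "age": 30},
    {"team": "a", "age": None},
    {"team": "b", "age": 50},
    {"team": "b", "age": 52},
    {"team": "b", "age": None},
]
print([r["age"] for r in impute_mean(rows, "age")])
# [30, 44, 50, 52, 44]
print([r["age"] for r in impute_mean(rows, "age", group_by="team")])
# [30, 30, 50, 52, 51]
```

Both outputs are "cleaned" in the sense of having no gaps, yet they disagree on the filled values, and any model trained downstream inherits whichever assumption was baked in.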
stewbrew almost 4 years ago
I hate SPSS, SAS and other GUI-enhanced statistical software because they make it too easy for people with no sound training to analyze data blindly and come up with laughable conclusions. This is a nightmare.
AntonioL almost 4 years ago
An impressive startup in the UK in this space was Wrapidity.

It is my favourite university spinout story here on this side of the pond.
nagarc almost 4 years ago
It's nice to see your thoughts. What are the next steps? I liked the idea of cleaned data for public good.
fishcakes almost 4 years ago
This is a terrific idea!