Using Google Refine to Clean a Data Set

69 points by craig552uk almost 13 years ago

6 comments

richardv almost 13 years ago
I found this really helpful. Not so much because of the article itself, but because it made me aware of Google Refine.

Installation was a breeze. I couldn't find any instructions, but it was as simple as downloading the Linux build, extracting it, then running the shell script.

http://code.google.com/p/google-refine/downloads/detail?name=google-refine-2.5-r2407.tar.gz&can=1&q=

The application automatically opens in a new Chrome window.

From here, I grabbed a data dump from one of our external providers.

We work with a lot of providers who are really technologically challenged. I'd love to be able to say, here you are, here is our API, start pushing your content to us. But in practice they don't even know what their XML feeds do. We need their data, but getting a consistent dataset from them when they seem to change their format regularly is a pain! And when importing only 10 or so items at a time, it's excruciatingly painful.

Today I learnt how easy that can be with Google Refine!
danso almost 13 years ago
As a data-analyst-type person, I can't recommend Google Refine enough. When someone told me about it, I thought "that's dumb, I would just write a cleaning/regex script and connect to my DB"... but I tried it out anyway, because my colleague is a much better power programmer than I am.

That's how good Refine is... it adds an extra, GUI-driven step to the workflow, but it's so well executed that it makes data exploration (and cleaning) effortless.

I wrote a tutorial a while back about how I used it in an investigative reporting project: http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning
frankc almost 13 years ago
Is this worth looking into for someone who already knows Perl, R and the Unix zoo? Or is it more targeted at people who don't deal with data on a regular basis?
guard-of-terra almost 13 years ago
I wonder why they won't let you open local files without passing their content through the browser. That would be very useful when running it locally.
dpcx almost 13 years ago
This seems, on the surface at least, very similar to what ScraperWiki is trying to do: converting messy, publicly available data into a more structured format.

Am I correct in that understanding, or did I miss the boat?
chucknelson almost 13 years ago
Not very impressive for people who work with data sets often and probably have tools like SAS or Excel, but good to know it exists as a free alternative.