I found this really helpful. Not so much because of the article itself, but because it made me aware of Google Refine.<p>Installation was a breeze. I couldn't find any instructions, but it was as simple as downloading the Linux build, extracting it, then running the shell script.<p><a href="http://code.google.com/p/google-refine/downloads/detail?name=google-refine-2.5-r2407.tar.gz&can=1&q=" rel="nofollow">http://code.google.com/p/google-refine/downloads/detail?name...</a><p>The application automatically opens in a new Chrome window.<p>From here, I grabbed a data dump from one of our external providers.<p>We work with a lot of providers who are <i>really</i> technologically challenged. I'd love to be able to say, "Here you are, here is our API, start pushing your content to us." But in practice they don't even know what their XML feeds do. We need their data, but getting a consistent dataset from them when they seem to change their format regularly is a pain! And when importing only 10 or so items at a time, it's excruciatingly painful.<p>Today I learnt how easy that can be with Google Refine!
As a data analyst-type-person, I can't recommend Google Refine enough. When someone told me about it, I thought "that's dumb, I would just write a cleaning/regex script and connect to my DB"... but I tried it out anyway, because my colleague is a much better power programmer than I am.<p>That's how good Refine is: it adds an extra, GUI-driven step to the workflow, but it's so well executed that it makes data exploration (and cleaning) effortless.<p>I wrote a tutorial a while back about how I used it in an investigative reporting project:
<a href="http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning" rel="nofollow">http://www.propublica.org/nerds/item/using-google-refine-for...</a>
Is this worth looking into for someone who already knows Perl, R and the Unix zoo? Or is it more targeted at people who don't deal with data on a regular basis?
This seems, on the surface at least, very similar to what ScraperWiki is trying to do: converting messy publicly available data into a more structured format.<p>Am I correct in that understanding, or did I miss the boat?
Not very impressive for people who work with data sets often and probably have tools like SAS or Excel, but good to know it exists as a free alternative.