Whew, reading the first few paragraphs after seeing the title started to scare me. I was afraid they were going to advocate locking the data up inside a proprietary app and releasing only that to the public in place of the raw data!

I ran into this years ago with the IMDB dataset. It appears to be formatted such that it aggressively resists sane parsing. Since I expected to want to update the data and whatnot, I built code to download the data files or updates, parse them, and put them into a Sane Format (in my book, only CSV and JSON qualify right now). Then I wrote a simple tool that takes any generic JSON, creates tables from it, and inserts all the data (rough sketch below). That always seemed like the obviously right thing to do. Just hacking the file into a usable format and plunging ahead with the analysis seemed like a bad option to me, but I take it from this article that it's the common approach?

It may just be an artifact of the kinds of systems I've worked on (bank, govt), but I'm not comfortable unless 'deployment' consists of executing one script that can take a system from absolute barebones (no DB schema, no existing tables, no preinstalled libraries, nothing) to production-ready. What if you have a catastrophe and your backups are hosed? What if you want to spin up a new environment for testing? The idea that there has to be pre-existing state whose history is assumed to be just so, or that after the system deploys someone has to grab some scripts out of their home directory and remember to apply them (in the right order) before things can get going, just terrifies me. What if that employee gets a brain tumor? I suppose it matters less when your system being down for 5 minutes doesn't result in a report on the national news and impact hundreds of millions of people, but still... don't most people have a personal investment in knowing their system isn't just an array of spinning plates with a chasm of chaos awaiting an earthquake?
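To give a rough idea of the JSON-to-tables tool, here's a minimal sketch, not the actual code: SQLite, flat one-object-per-record JSON, and names like movies.json are just stand-ins for illustration.

```python
# Minimal sketch of the "generic JSON -> tables" idea, assuming the input is
# a JSON file containing a list of flat objects (one per record) and using
# SQLite as the target database. File and table names are hypothetical.
import json
import sqlite3


def load_json_records(path):
    """Read a JSON file containing a list of flat objects."""
    with open(path) as f:
        return json.load(f)


def create_table_from_records(conn, table, records):
    """Infer columns from the union of keys, create the table, insert rows."""
    columns = sorted({key for record in records for key in record})
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')

    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})',
        [tuple(record.get(c) for c in columns) for record in records],
    )
    conn.commit()


if __name__ == "__main__":
    records = load_json_records("movies.json")   # hypothetical input file
    conn = sqlite3.connect("movies.db")
    create_table_from_records(conn, "movies", records)
```

Everything lands as TEXT here; the real thing would want type inference and nested-object handling, but the point is that any JSON dump becomes queryable without hand-writing a schema.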
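And to make the "one script" point concrete, a minimal sketch of the shape such a deploy script can take, assuming a versioned directory of SQL scripts and SQLite standing in for the real database; every path and table name here is hypothetical.

```python
# Minimal sketch of "one script from barebones to production-ready": apply
# every schema/data script in a versioned directory, in order, recording
# what has already run so the same command works on an empty machine or an
# existing environment. Directory layout and names are hypothetical.
import pathlib
import sqlite3

MIGRATIONS_DIR = pathlib.Path("migrations")   # e.g. 001_schema.sql, 002_seed.sql
DB_PATH = "app.db"


def deploy():
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}

    for script in sorted(MIGRATIONS_DIR.glob("*.sql")):
        if script.name in applied:
            continue                          # already applied in this environment
        conn.executescript(script.read_text())
        conn.execute(
            "INSERT INTO applied_migrations (name) VALUES (?)", (script.name,)
        )
        conn.commit()


if __name__ == "__main__":
    deploy()
```

The tooling itself doesn't matter; what matters is that running the same script on a bare box or a live environment gets you to the same place, with nothing depending on scripts hiding in somebody's home directory.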