Whew, reading the first few paragraphs after seeing the title started to scare me. I was afraid they were going to advocate locking the data up inside a proprietary app and releasing only that to the public in place of the raw data!

I ran into this years ago with the IMDB dataset. It appears to be formatted such that it aggressively resists sane parsing. Since I expected to want to update the data and whatnot, I built code to download the data files or updates, parse them, and put them into a Sane Format (in my book, only CSV and JSON qualify right now). Then I wrote a simple tool that takes any generic JSON, creates tables from it, and inserts all the data (rough sketch below). That always seemed like the obviously right thing to do. Just hacking the file into a usable format and plunging ahead with the analysis seemed like a bad option to me, but I take it from this article that it's the common approach?

It may just be an artifact of the kinds of systems I've worked on (bank, govt), but I'm not comfortable unless 'deployment' consists of executing one script that can take a system from absolute barebones (no DB schema, no existing tables, no preinstalled libraries, nothing) to production-ready. What if you have a catastrophe and your backups are hosed? What if you want to spin up a new environment for testing? The idea that there has to be pre-existing state whose history is assumed to be just so, or that after the system deploys someone has to grab some scripts out of their home directory and remember to apply them (in the right order) before things can get going, just terrifies me. What if that employee gets a brain tumor? I suppose it matters less when your system being down for 5 minutes doesn't result in a report on the national news and impact hundreds of millions of people, but still... don't most people have a personal investment in knowing their system isn't just an array of spinning plates with a chasm of chaos awaiting an earthquake?
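To give a rough idea of the JSON-to-tables tool, here's a minimal sketch, not the actual code: SQLite, flat one-object-per-record JSON, and names like movies.json are just stand-ins for illustration.

```python
# Minimal sketch of the "generic JSON -> tables" idea, assuming the input is
# a JSON file containing a list of flat objects (one per record) and using
# SQLite as the target database. File and table names are hypothetical.
import json
import sqlite3


def load_json_records(path):
    """Read a JSON file containing a list of flat objects."""
    with open(path) as f:
        return json.load(f)


def create_table_from_records(conn, table, records):
    """Infer columns from the union of keys, create the table, insert rows."""
    columns = sorted({key for record in records for key in record})
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')

    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})',
        [tuple(record.get(c) for c in columns) for record in records],
    )
    conn.commit()


if __name__ == "__main__":
    records = load_json_records("movies.json")   # hypothetical input file
    conn = sqlite3.connect("movies.db")
    create_table_from_records(conn, "movies", records)
```

Everything lands as TEXT here; the real thing would want type inference and nested-object handling, but the point is that any JSON dump becomes queryable without hand-writing a schema.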
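And to make the "one script" point concrete, a minimal sketch of the shape such a deploy script can take, assuming a versioned directory of SQL scripts and SQLite standing in for the real database; every path and table name here is hypothetical.

```python
# Minimal sketch of "one script from barebones to production-ready": apply
# every schema/data script in a versioned directory, in order, recording
# what has already run so the same command works on an empty machine or an
# existing environment. Directory layout and names are hypothetical.
import pathlib
import sqlite3

MIGRATIONS_DIR = pathlib.Path("migrations")   # e.g. 001_schema.sql, 002_seed.sql
DB_PATH = "app.db"


def deploy():
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}

    for script in sorted(MIGRATIONS_DIR.glob("*.sql")):
        if script.name in applied:
            continue                          # already applied in this environment
        conn.executescript(script.read_text())
        conn.execute(
            "INSERT INTO applied_migrations (name) VALUES (?)", (script.name,)
        )
        conn.commit()


if __name__ == "__main__":
    deploy()
```

The tooling itself doesn't matter; what matters is that running the same script on a bare box or a live environment gets you to the same place, with nothing depending on scripts hiding in somebody's home directory.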