
Package Data Like Software

30 points by palewire over 10 years ago

5 comments

otakucode over 10 years ago

Whew, reading the first few paragraphs after seeing the title started to scare me. I was afraid they were going to advocate locking data up inside a proprietary app and only releasing that to the public in place of the raw data!

I ran into this years ago with the IMDB dataset. It appears to be formatted such that it aggressively resists sane parsing. Since I expected to want to update the data and whatnot, I built code to download the data files or updates, parse them, and put them into a sane format (in my book, only CSV and JSON qualify right now). Then I wrote a simple tool to take any generic JSON, create tables from it, and insert all the data. That always seemed like the right thing to do. Just hacking the file into a usable format and plunging ahead with analysis seemed like a bad option to me, but I take it from this article that it is the common approach?

It may just be an artifact of the kinds of systems I've worked on (bank, govt), but I'm not comfortable unless 'deployment' consists of executing one script that can take a system from absolute bare bones (no DB schema, no existing tables, no prearranged libraries, nothing) to production-ready. What if you have a catastrophe and your backups are hosed? What if you want to spin off a new environment for testing? The idea that there has to be pre-existing state assumed to have a particular history, or that after the system deploys someone has to grab some scripts out of their home directory and remember to apply them (in the right order) before things can get going, just terrifies me. What if that employee gets a brain tumor? I suppose it doesn't matter quite as much if your system being down for five minutes doesn't result in a report on the national news and impact hundreds of millions of people, but still... don't most people have a personal investment in knowing their system isn't just an array of spinning plates with a chasm of chaos awaiting an earthquake?
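The "generic JSON to tables" tool described above can be sketched in a few lines. This is a hypothetical reconstruction, not the commenter's actual code: it assumes flat JSON objects, stores every column as TEXT in SQLite, and takes the column set from the union of keys across records.

```python
import json
import sqlite3

def load_json_table(conn, table, records):
    """Create a table from a list of flat JSON objects and insert all rows.

    Columns are the union of keys across records; everything is stored
    as TEXT for simplicity, so missing keys become NULL.
    """
    columns = sorted({key for rec in records for key in rec})
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = [tuple(rec.get(c) for c in columns) for rec in records]
    conn.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})', rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
records = json.loads('[{"title": "Vertigo", "year": "1958"},'
                     ' {"title": "M", "year": "1931"}]')
load_json_table(conn, "films", records)
print(conn.execute('SELECT COUNT(*) FROM "films"').fetchone()[0])  # 2
```

A real tool would also need to handle nested objects and type inference, which is where most of the complexity hides.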
Blahah over 10 years ago

Beautiful idea, not dissimilar from dat [0] (if you haven't already, you guys should talk).

I find the Django relationship to be an odd choice: the vast majority of people working with data are not using Django. Why pair the two?

[0]: dat-data.com
jboggan over 10 years ago

A lot of the nitty-gritty data munging and processing often gets discarded after a project, or never included in the project repo in a meaningful way. I like Drake [0] because we used it a lot at Factual and it really made data generation and formatting very repeatable and easy.

I really think the packaging system the author is going for would be best built on top of Drake or a similar workflow management program. Instead of following their laundry list of configuration steps, one could manage that automatically with a source-controlled workflow. Drake has the advantage that non-linear and async workflows are pretty easy to build, maintain, and update.

What I would love to see is a data package manager that downloads the raw data and processing workflow, updates any software packages needed to run the workflow, and then spits out the data in the form you need it, whether CSV/TSV/JSON/etc. I don't know much about dat yet, but it looks like it would be a good end-point for serving the data as well.

[0]: https://github.com/Factual/drake
mshron over 10 years ago

Love the idea!

I would ask for a little more separation of concerns: one package for raw-but-cleaned data with a collection of schemas, and a second for loading arbitrary data + schemas into Django (and probably accomplishing all of the extra administrative steps provided in the example).

That way, if I want to add other schemas for a non-Django use in the same package (say, if I care more about analysis than clicky interfaces), or not use Django at all, I can still use a package manager for the same data.
palewire over 10 years ago
A humble suggestion from your friends at the California Civic Data Coalition