Where can I get large datasets open to the public?

215 points by helwr about 14 years ago

22 comments

physcab about 14 years ago

Asking "What datasets are available to me?" is sometimes the wrong question. A better way of going about the problem is to ask something more specific, like "How can I create a heat map of U.S. poverty?" The latter is better because it not only focuses your attention on something doable but also teaches you more about data analysis than just searching for datasets.

For example, to answer the question above you are going to be asking yourself the following follow-up questions:

1) Where do I get a map of the U.S.?
2) How do I make a heat map?
3) How do I feed my own data into this heat map?
4) What colors do I use?
5) Can I do this in real time? Do I need a database? What language do I use?
6) What's a FIPS code?
7) How do I find a poverty dataset with FIPS codes?
8) This poverty dataset doesn't have FIPS codes, but I can join it with this other dataset that does have FIPS codes.
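To make the last step concrete, here is a rough sketch of that kind of join in plain Python. It assumes both datasets carry state and county names and that only the lookup table has FIPS codes. The function name `attach_fips`, the column names, and the poverty rates are all made up for illustration; only the two FIPS codes are real.

```python
def attach_fips(poverty_rows, fips_rows):
    """Add a 'fips' key to each poverty row by matching on (state, county)."""
    def key(row):
        # Normalize case and whitespace so "autauga " matches "Autauga".
        return (row["state"].strip().lower(), row["county"].strip().lower())

    lookup = {key(r): r["fips"] for r in fips_rows}
    joined, unmatched = [], []
    for row in poverty_rows:
        fips = lookup.get(key(row))
        if fips is None:
            # Keep failures visible instead of dropping them silently.
            unmatched.append(row)
        else:
            joined.append({**row, "fips": fips})
    return joined, unmatched


# FIPS codes are kept as strings so the leading zero survives.
fips_table = [
    {"state": "Alabama", "county": "Autauga", "fips": "01001"},
    {"state": "Alabama", "county": "Baldwin", "fips": "01003"},
]
# Poverty rates below are invented for illustration only.
poverty = [
    {"state": "Alabama", "county": "autauga ", "poverty_rate": 10.0},
    {"state": "Alabama", "county": "Baldwin", "poverty_rate": 20.0},
    {"state": "Alabama", "county": "Nowhere", "poverty_rate": 30.0},
]

joined, unmatched = attach_fips(poverty, fips_table)
```

Joining on names is fragile, which is why the unmatched rows are returned rather than discarded: they are usually the first hint that the two sources spell something differently.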

machinespit about 14 years ago

data.gov and other US gov data sites are getting severe cuts even though they're saving money (http://www.federalnewsradio.com/?nid=35&sid=2327798)

Very upsetting for fans of open / accessible (government) data.

FWIW, petition at http://sunlightfoundation.com/savethedata/

iamelgringo about 14 years ago

Hackers & Founders SV is hosting a hackathon[1] in two weeks at the Hacker Dojo in Mountain View. It's going to be geared towards working with Factual's open data API.

Factual's[2] goal is to provide an API to connect all those available data sets, and they have a fairly impressive list of data sets available. Factual is very interested in hearing what datasets you want to work with, and they are willing to bust ass to get them available before the hackathon.

We still have around 40 RSVP slots open. You can register here: http://factualhackathon.eventbrite.com/

</shameless plug>

[1] http://www.hackersandfounders.com/events/16535156/
[2] http://www.factual.com/
[3] http://factualhackathon.eventbrite.com/

bigiain about 14 years ago

http://jacquesmattheij.com/Free%2C+Public+Data+Sets

And discussion: http://news.ycombinator.com/item?id=2165497

bOR_ about 14 years ago

http://www.hiv.lanl.gov/content/index

For sentimental value: HIV sequence data (and other data) from 1980 till now. Did my thesis on these ;-).

In general, there is an enormous amount of gene sequence data around, not just HIV: http://www.ncbi.nlm.nih.gov/sites/

Whole genome sequences of eukaryotes (including humans): http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi

shii about 14 years ago

http://www.reddit.com/r/datasets/

svag about 14 years ago

Previous discussions:

http://news.ycombinator.com/item?id=2165497
http://news.ycombinator.com/item?id=764982
http://news.ycombinator.com/item?id=1024966

espeed about 14 years ago

Linked Data Sets: http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets

Web Services Directory: http://www.programmableweb.com/apis/directory/1?sort=mashups

raghus about 14 years ago

Also, check out http://aws.amazon.com/datasets

drblast about 14 years ago

Edit: Whoops, I thought this was an "Ask HN." The post below still stands for anyone who finds it useful.

The U.S. Census has an extremely well-documented large data set: http://www2.census.gov/census_2000/datasets/

And the documentation is here: http://www.census.gov/prod/cen2000/doc/sf1.pdf

The software that they provide to go through the data is crappy, however ('90s era).

I have an equally crappy, but more useful to a computer scientist, Common Lisp program that will pull out specific fields from the data set based on a list of field names. If you want that, I can dig it up for you.

Also, before you start parsing this, it's worthwhile to read the documentation to find out how the files are laid out and what each field really means. These files are not relational databases, so if you look at them through that lens, confusion will result. In particular, some things are already aggregated within the data set.
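For anyone who wants the general idea of pulling fields out by name, a minimal sketch in Python follows. It is not the Common Lisp program mentioned above. It assumes the records are comma-delimited with no header row and that you have copied the field positions out of the documentation yourself; the `LAYOUT` table, the field names, and the sample records here are all hypothetical.

```python
import csv
import io


def extract_fields(lines, layout, wanted):
    """Yield one dict per record, holding only the wanted fields.

    layout maps a field name to its zero-based column position.
    """
    missing = [name for name in wanted if name not in layout]
    if missing:
        raise KeyError(f"fields not in layout: {missing}")
    for record in csv.reader(lines):
        yield {name: record[layout[name]] for name in wanted}


# Hypothetical layout; the real positions come from the documentation.
LAYOUT = {"STATE": 0, "RECORD_NO": 1, "POP": 2}

# Made-up sample records standing in for a real data file.
sample = io.StringIO("AL,0000001,100\nAK,0000001,200\n")

rows = list(extract_fields(sample, LAYOUT, ["STATE", "POP"]))
```

Values come back as strings on purpose: deciding which fields are counts, codes, or already-aggregated totals is exactly the part the documentation has to answer.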

barefoot about 14 years ago

How many of these allow me to create for-profit websites with them?

Maro about 14 years ago

There's a startup called kaggle.com that is all about hosting data mining competitions around datasets, like Netflix did.

buss about 14 years ago

http://aws.amazon.com/publicdatasets/ includes my former advisor's dataset (the UF sparse matrix collection), which in turn includes a matrix or two from my research.

latch about 14 years ago

I believe Steven Levitt used the Fatality Analysis Reporting System (FARS) from the National Highway Traffic Safety Administration (NHTSA) for his seatbelts vs. car seats work: ftp://ftp.nhtsa.dot.gov/fars/

nowarninglabel about 14 years ago

At http://build.kiva.org there are some nice datasets in the "data snapshots" section. I have high hopes we will be releasing a lot more data.

brandnewlow about 14 years ago

On that topic, anyone have any suggestions for the easiest way to prepopulate a directory of local businesses in the U.S.?

arethuza about 14 years ago

UK Government data sets: http://data.gov.uk/

shafqat about 14 years ago

We provide API access to more than 20 million articles (headlines, excerpts). People have done all sorts of interesting things with it: http://platform.newscred.com

kordless about 14 years ago

Infochimps?

thesuperformula about 14 years ago

You can find many large datasets here: http://beta.fcc.gov/data/download-fcc-datasets (some are over a gigabyte).

plannerball about 14 years ago

Freebase?

mrzerga about 14 years ago

Microsoft Azure - they have some large datasets...