If you want to play around with data, here's another good list of open/free datasets: <a href="http://bitly.com/bundles/hmason/1" rel="nofollow">http://bitly.com/bundles/hmason/1</a>
Here are some other data hubs, search engines, and endless lists:<p><a href="http://datahub.io/" rel="nofollow">http://datahub.io/</a><p><a href="http://blog.bigml.com/2013/02/28/data-data-data-thousands-of-public-data-sources/" rel="nofollow">http://blog.bigml.com/2013/02/28/data-data-data-thousands-of...</a><p><a href="http://tm.durusau.net/?p=39312" rel="nofollow">http://tm.durusau.net/?p=39312</a><p><a href="http://dvn.iq.harvard.edu/dvn/" rel="nofollow">http://dvn.iq.harvard.edu/dvn/</a><p>_____________<p>This subreddit seems like a decent place to ask questions:<p><a href="http://www.reddit.com/r/datasets" rel="nofollow">http://www.reddit.com/r/datasets</a>
Another one from Google: 1000 scanned books for OCR and other scanned-document processing research: <a href="http://commondatastorage.googleapis.com/books/icdar2007/README.txt" rel="nofollow">http://commondatastorage.googleapis.com/books/icdar2007/READ...</a>
BitTorrent Please!
Why does it cost so much? They grabbed our data for free, and they have enough free bandwidth. Even assuming they are greedy, they could at least offer it through BitTorrent. Shipping DVDs for that amount of data is ridiculous. I don't even have a DVD reader…<p><i>I can't afford to buy all that plus shipping to Europe, but I would like to play with the data for my NLP project.</i>
Here is a good one: <a href="http://cleandatahub.org/" rel="nofollow">http://cleandatahub.org/</a>. They are trying to aggregate cleaned data sets from across the web.
no links...<p>Remember the days when people used to make links on the web because they weren't greedy with their pagerank?<p>At least Google left us some machine learning data sets after they took all the links. You just can't find them because nobody links to them.
Fantastic links throughout this thread.<p>When playing with new programming languages, instead of a 'todo' list, I always end up building an XKCD password generator. Interestingly enough, I've never found a frequency/comprehension word list worth using to populate it for public consumption.
The ML competition site Kaggle should also get a mention here. <a href="http://www.kaggle.com/competitions" rel="nofollow">http://www.kaggle.com/competitions</a>