Ben from Kaggle.<p>Open up the ~50 different individual datasets linked in separate tabs, and then quickly flip through all of them trying to get a sense of what each one is.<p>That experience will demonstrate one of the main challenges we're aiming to solve by making Kaggle Datasets your default place to publish data online (<a href="https://www.kaggle.com/datasets" rel="nofollow">https://www.kaggle.com/datasets</a>)
The link at the bottom should be emphasized: <a href="https://github.com/awesomedata/awesome-public-datasets" rel="nofollow">https://github.com/awesomedata/awesome-public-datasets</a><p>It is a very expansive collection of datasets, some well-prepped for ML and most not (which is part of the fun of it, anyways).
Two sources that are missing:<p>opendatanetwork.com: this is effectively a Google for public Socrata data portals, and for me, the best way to discover datasets across different municipalities. For example, when I was interested in trying to replicate the NYT's <i>"Do ‘Fast and Furious’ Movies Cause a Rise in Speeding?"</i> [0] article, it was pretty easy to find a bunch of other traffic/motor vehicle violation datasets with opendatanetwork's search.<p>Enigma public (<a href="https://public.enigma.com" rel="nofollow">https://public.enigma.com</a>): a huge collection of scraped public datasets, including flattened versions of data that originally comes in annoying-to-parse formats, such as U.S. lobbying disclosures [1]<p>[0] <a href="https://www.nytimes.com/2018/01/30/upshot/do-fast-and-furious-movies-cause-a-rise-in-speeding.html" rel="nofollow">https://www.nytimes.com/2018/01/30/upshot/do-fast-and-furiou...</a><p>[1] <a href="https://public.enigma.com/datasets/lobbying-disclosures-lobbyists-2013/f3ce179f-9171-4754-9f71-71d7596d900a?&filter=%2B%5B%3E%5Blobbyist%5D%5D" rel="nofollow">https://public.enigma.com/datasets/lobbying-disclosures-lobb...</a>
Parallel corpora for training machine translation models are available in many languages on the WMT conference site: <a href="http://www.statmt.org/wmt17/translation-task.html" rel="nofollow">http://www.statmt.org/wmt17/translation-task.html</a> (and on the pages for previous years).
Inspired by this post, I was looking for a fun way to browse datasets randomly, which led me to build this Kaggle Random Dataset Generator:<p><a href="https://news.ycombinator.com/item?id=17313374" rel="nofollow">https://news.ycombinator.com/item?id=17313374</a><p>Thanks Gengo!
Here are thousands more open datasets for anyone to explore, use, or build upon:
<a href="https://dataturks.com/projects/trending" rel="nofollow">https://dataturks.com/projects/trending</a>
From the title 'The 50 Best Free Datasets...' I was expecting a curated list of datasets. But the list has a mix of individual datasets and sites that provide/host datasets :(
Here is a dataset for abstractive summarization created from Reddit.<p>Dataset: <a href="https://zenodo.org/record/1168855#.WyJG3I7pdhE" rel="nofollow">https://zenodo.org/record/1168855#.WyJG3I7pdhE</a>
Paper: <a href="http://aclweb.org/anthology/W17-4508" rel="nofollow">http://aclweb.org/anthology/W17-4508</a>
Security industry related datasets always seem to be omitted from this type of thing. Please check out the excellent <a href="http://www.secrepo.com/" rel="nofollow">http://www.secrepo.com/</a>.