15 comments

benhamner, almost 7 years ago

Ben from Kaggle.

Open up the ~50 different individual datasets linked in separate tabs, and then quickly flip through all of them trying to get a sense of what each one is.

That experience will demonstrate one of the main challenges we're aiming to solve by making Kaggle Datasets your default place to publish data online (https://www.kaggle.com/datasets).

logancg, almost 7 years ago

The link at the bottom should be emphasized: https://github.com/awesomedata/awesome-public-datasets

It is a very expansive collection of datasets, some well-prepped for ML and most not (which is part of the fun of it, anyways).

danso, almost 7 years ago

Two sources that are missing:

opendatanetwork.com: this is effectively a Google for public Socrata data portals, and for me, the best way to discover datasets across different municipalities. For example, when I was interested in trying to replicate the NYT's "Do ‘Fast and Furious’ Movies Cause a Rise in Speeding?" [0] article, it was pretty easy to find a bunch of other traffic/motor vehicle violation datasets with opendatanetwork's search.

Enigma public (https://public.enigma.com): a huge collection of scraped public datasets, including flattened versions of data that originally comes in annoying-to-parse formats, such as U.S. lobbying disclosures [1]

[0] https://www.nytimes.com/2018/01/30/upshot/do-fast-and-furious-movies-cause-a-rise-in-speeding.html

[1] https://public.enigma.com/datasets/lobbying-disclosures-lobbyists-2013/f3ce179f-9171-4754-9f71-71d7596d900a?&filter=%2B%5B%3E%5Blobbyist%5D%5D

andy-wu, almost 7 years ago

Surprised that CIFAR wasn’t mentioned under Images. I feel like that’s one of the standards, even more so than some of the ones that are listed.

rerx, almost 7 years ago

To train machine translation models, parallel corpora in many languages are provided on the WMT conference site: http://www.statmt.org/wmt17/translation-task.html and previous years

Smerity, almost 7 years ago

My original comment was meant for a separate HN article on machine learning and I posted in the wrong tab.

My apologies.

loisaidasam, almost 7 years ago

Inspired by this post, I was looking for a fun way to browse datasets randomly, which led me to build this Kaggle Random Dataset Generator:

https://news.ycombinator.com/item?id=17313374

Thanks Gengo!

mohi13, almost 7 years ago

Here are 1000s more open datasets for anyone to explore, use or build upon: https://dataturks.com/projects/trending

rahimnathwani, almost 7 years ago

From the title 'The 50 Best Free Datasets...' I was expecting a curated list of datasets. But the list has a mix of individual datasets and sites that provide/host datasets :(

codemetro53, almost 7 years ago

Here is a dataset for abstractive summarization created from Reddit.

Dataset: https://zenodo.org/record/1168855#.WyJG3I7pdhE

Paper: http://aclweb.org/anthology/W17-4508

mrphilroth, almost 7 years ago

Security industry related datasets always seem to be omitted from this type of thing. Please check out the excellent http://www.secrepo.com/.

kokimame, almost 7 years ago

For audio, LibriSpeech, M-AILABS, LJ-Speech, VCTK, TIMIT, Mocha-Timit, VoxForge, Blizzard Challenge, and so on.

greentuna, almost 7 years ago

Does anyone know of good datasets for Concept Drift analysis?

bhnmmhmd, almost 7 years ago

Can these datasets be used for academic and research purposes?

fwdpropaganda, almost 7 years ago

Can't open this website.

Datasets for Machine Learning
