Datasets for Machine Learning

456 points, posted by mromaine almost 7 years ago

15 comments

benhamner almost 7 years ago
Ben from Kaggle.

Open up the ~50 different individual datasets linked in separate tabs, and then quickly flip through all of them trying to get a sense of what each one is.

That experience will demonstrate one of the main challenges we're aiming to solve by making Kaggle Datasets your default place to publish data online (https://www.kaggle.com/datasets).
logancg almost 7 years ago
The link at the bottom should be emphasized: https://github.com/awesomedata/awesome-public-datasets

It is a very expansive collection of datasets, some well-prepped for ML and most not (which is part of the fun of it, anyway).
danso almost 7 years ago
Two sources that are missing:

opendatanetwork.com: this is effectively a Google for public Socrata data portals, and for me, the best way to discover datasets across different municipalities. For example, when I was interested in trying to replicate the NYT's "Do ‘Fast and Furious’ Movies Cause a Rise in Speeding?" [0] article, it was pretty easy to find a bunch of other traffic/motor vehicle violation datasets with opendatanetwork's search.

Enigma Public (https://public.enigma.com): a huge collection of scraped public datasets, including flattened versions of data that originally comes in annoying-to-parse formats, such as U.S. lobbying disclosures [1].

[0] https://www.nytimes.com/2018/01/30/upshot/do-fast-and-furious-movies-cause-a-rise-in-speeding.html

[1] https://public.enigma.com/datasets/lobbying-disclosures-lobbyists-2013/f3ce179f-9171-4754-9f71-71d7596d900a?&filter=%2B%5B%3E%5Blobbyist%5D%5D
andy-wu almost 7 years ago
Surprised that CIFAR wasn’t mentioned under Images. I feel like that’s one of the standards, even more so than some of the ones that are listed.
rerx almost 7 years ago
To train machine translation models, parallel corpora in many languages are provided on the WMT conference site: http://www.statmt.org/wmt17/translation-task.html (and on the pages for previous years).
Smerity almost 7 years ago
My original comment was meant for a separate HN article on machine learning and I posted in the wrong tab.

My apologies.
loisaidasam almost 7 years ago
Inspired by this post, I was looking for a fun way to browse datasets randomly, which led me to build this Kaggle Random Dataset Generator:

https://news.ycombinator.com/item?id=17313374

Thanks Gengo!
mohi13 almost 7 years ago
Here are thousands more open datasets for anyone to explore, use, or build upon: https://dataturks.com/projects/trending
rahimnathwani almost 7 years ago
From the title 'The 50 Best Free Datasets...' I was expecting a curated list of datasets. But the list has a mix of individual datasets and sites that provide/host datasets :(
codemetro53 almost 7 years ago
Here is a dataset for abstractive summarization created from Reddit.

Dataset: https://zenodo.org/record/1168855#.WyJG3I7pdhE
Paper: http://aclweb.org/anthology/W17-4508
mrphilroth almost 7 years ago
Security-industry datasets always seem to be omitted from this type of list. Please check out the excellent http://www.secrepo.com/.
kokimame almost 7 years ago
For audio: LibriSpeech, M-AILABS, LJ-Speech, VCTK, TIMIT, Mocha-Timit, VoxForge, Blizzard Challenge, and so on.
greentuna almost 7 years ago
Does anyone know of good datasets for Concept Drift analysis?
bhnmmhmd almost 7 years ago
Can these datasets be used for academic and research purposes?
fwdpropaganda almost 7 years ago
Can't open this website.