The Reddit comment corpus is an awesome dataset. There's relatively little markup to scrub out, low duplication, good metadata, and a variety of topics.<p>We used it to train a syntax-enriched word2vec model. Write-up and demo: <a href="https://explosion.ai/blog/sense2vec-with-spacy" rel="nofollow">https://explosion.ai/blog/sense2vec-with-spacy</a><p>Btw, the training above ran on CPU in a couple of days, because spaCy doesn't use GPUs yet. I've applied for a grant from NVIDIA so I can fix that. If anyone from NVIDIA is reading, email me? :)