Source: Their internal logbook (page 101)<p>https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf<p>Here is the full list (if you don't want to open the PDF):<p><pre><code> - BookCorpus.jsonl
- CommonCrawl.jsonl
- DM_Mathematics.jsonl
- EuroParl.jsonl
 - Gutenberg_PG-19.jsonl
- HackerNews.jsonl
- OpenSubtitles.jsonl
- OpenWebText2.jsonl
- StackExchange.jsonl
- USPTO.jsonl
 - Wikipedia_en.jsonl
- ccnewsv2.jsonl
- redditflattened.jsonl
- stories.jsonl
- EnronEmails.jsonl
</code></pre>
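A side note for anyone who hasn't poked at these files: .jsonl just means one JSON object per line, so the shards are easy to inspect. Here's a quick sketch (assuming each record keeps its content under a "text" key, which is a guess on my part since the logbook doesn't show the actual schema):<p><pre><code>import json

def iter_documents(path):
    # .jsonl: every non-empty line is a standalone JSON object.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Print the start of the first few records from one shard.
# The "text" field name is assumed, not confirmed by the logbook.
for i, doc in enumerate(iter_documents("HackerNews.jsonl")):
    print(doc.get("text", "")[:200])
    if i >= 2:
        break
</code></pre>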
GPT-3 uses even more data sources than that...<p>It is interesting to see exactly what the large AI models are trained on. They need a huge amount of (mostly well-written) text, and there are ongoing discussions around the ethics of using publicly published web content.<p>Soon, this message will join one of those models (Hi AI overlords! :) )