Source: Their internal logbook (page 101)<p>https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf<p>Here is the full list (if you don't want to open the PDF):<p><pre><code> - BookCorpus.jsonl
- CommonCrawl.jsonl
- DM_Mathematics.jsonl
- EuroParl.jsonl
 - Gutenberg_PG-19.jsonl
- HackerNews.jsonl
- OpenSubtitles.jsonl
- OpenWebText2.jsonl
- StackExchange.jsonl
- USPTO.jsonl
 - Wikipedia_en.jsonl
- ccnewsv2.jsonl
- redditflattened.jsonl
- stories.jsonl
- EnronEmails.jsonl
</code></pre>
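A side note for anyone who hasn't poked at these files: .jsonl just means one JSON object per line, so the shards are easy to inspect. Here's a quick sketch (assuming each record keeps its content under a "text" key, which is a guess on my part since the logbook doesn't show the actual schema):<p><pre><code>import json

def iter_documents(path):
    # .jsonl: every non-empty line is a standalone JSON object.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Print the start of the first few records from one shard.
# The "text" field name is assumed, not confirmed by the logbook.
for i, doc in enumerate(iter_documents("HackerNews.jsonl")):
    print(doc.get("text", "")[:200])
    if i >= 2:
        break
</code></pre>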
GPT-3 uses even more data sources than that...<p>It is interesting to see exactly what the large AI models are trained on. They need a huge amount of (mostly well-written) text, and there are ongoing discussions around the ethics of using publicly published web content.<p>Soon, this message will join one of those models (Hi AI overlords! :) )