
RedPajama v2 Open Dataset with 30T Tokens for Training LLMs

236 points by programd over 1 year ago

11 comments

visarga over 1 year ago

Great work, may I suggest more analysis features?

- example summary, for better topic embedding
- RAG based summary, to have the model critically assess its training data distribution and answer questions on it; to bring together information sitting in separate examples
- named entities, for knowledge base; maybe it helps with fact checking later
- implicit tasks present in the text, what are the tasks a LLM could learn from a given example?
- chain-of-thought augmentation, to bring out implicit deductions and reduce information fragmentation; it has been shown in the Phi-1.5 paper and Orca that synthetic CoT datasets are superior source materials

What data fragmentation? Look at the Reversal Curse paper. Models that train on "A is the father of B" fail to generate "B is the son of A". This kind of connection needs to be explicitly added, and would improve task solving as well.

Training on purely organic data is not good enough anymore. All powerful models train on a mix of organic and synthetic data, some models on 50-50 proportions, like the web+synth variant from Phi-1.5.

The main idea is to go deeper into the raw data, to infuse it with insight. LLM dataset preprocessing is going to be expensive, comparable to training costs, but the results are worth the effort.
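As a concrete illustration of the "explicitly added" reversal connections described above, here is a toy sketch that rewrites statements of the form "A is the <relation> of B" into their inverse. The relation table and sentence pattern are made up for illustration and are not part of any RedPajama tooling.

```python
import re

# Toy "reversal" augmentation: given "A is the <relation> of B", also emit the
# inverse statement so both directions appear in the training data. The
# relation/inverse table below is made up for illustration.
INVERSE_RELATION = {
    "father": "son",  # simplification: assumes a male child, as in the example
    "teacher": "student",
    "employer": "employee",
}

PATTERN = re.compile(r"^(?P<a>\w+) is the (?P<rel>\w+) of (?P<b>\w+)\.$")

def augment_with_reversal(sentence: str) -> list[str]:
    """Return the sentence plus its reversed form when the inverse relation is known."""
    match = PATTERN.match(sentence)
    if not match or match["rel"] not in INVERSE_RELATION:
        return [sentence]
    reversed_form = f"{match['b']} is the {INVERSE_RELATION[match['rel']]} of {match['a']}."
    return [sentence, reversed_form]

print(augment_with_reversal("Alan is the father of Ben."))
# ['Alan is the father of Ben.', 'Ben is the son of Alan.']
```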
natch over 1 year ago

Can someone explain to me like a noob how this ("this" being the data hosting and download access) works? Am I understanding correctly that they are releasing code for filtering common crawl data that is out there, and the result of this filtering is the dataset?

To further elaborate on this (possibly wrong) understanding:

- Each person can then run their own processing, possibly duplicating effort(?) ...but on the good side, giving each person the ability to tweak the pipeline to suit their needs.
- There is no torrent of already processed data because __________?
- Looking at file lists for this on Hugging Face, some files seem to be stored in Git Large File Storage. Are these already processed files that together constitute the dataset? Or are these Common Crawl files that are selectively listed and pulled for processing?

What options are there to preemptively obtain a copy, in case of any possible eventual takedown of the dataset, any assurances about access aside? I am reminded of parts of the Pile.

Obviously I'm super clueless here... please be gentle and share anything you know or correct anything I've got wrong.

I'm not asking about training, if that wasn't obvious. Just about obtaining the dataset.
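On the "preemptively obtain a copy" question, a minimal sketch of mirroring the Hugging Face repository with huggingface_hub follows. The allow_patterns filter is an assumption about which small files (loader script, docs, metadata) are worth grabbing first; the bulk document files are served from data.together.xyz (see the loader script linked in the next comment) rather than stored in the repo itself.

```python
from huggingface_hub import snapshot_download

# Mirror the repository's small files (loader script, docs, metadata listings).
# Dropping allow_patterns attempts a full mirror of whatever the repo hosts.
local_path = snapshot_download(
    repo_id="togethercomputer/RedPajama-Data-V2",
    repo_type="dataset",
    local_dir="./redpajama-v2-mirror",
    allow_patterns=["*.py", "*.md", "*.json"],
)
print(f"Mirrored repo files to {local_path}")
```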
gardnr over 1 year ago

Anyone know how large it is?

They state the 1 trillion token dataset is 5TB.

Is it safe to assume this is 5TB * 30 = 150TB?

The code in the HuggingFace repo downloads data from url base: https://data.together.xyz/redpajama-data-v2/v1.0.0

https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2/blob/main/RedPajama-Data-V2.py
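To poke at the data without committing to a download on that scale, one option is to stream a few documents through the loader script linked above. This is a sketch assuming the standard datasets streaming API; the "sample" config name and the trust_remote_code requirement are assumptions, not confirmed in this thread.

```python
from datasets import load_dataset

# Stream a handful of documents instead of materializing terabytes on disk.
# "sample" is assumed to be a small demo config; trust_remote_code may be
# required on newer datasets versions because the repo ships a loader script.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",
    streaming=True,
    trust_remote_code=True,
)

for i, doc in enumerate(ds["train"]):
    print(doc.keys())  # inspect whichever fields the loader exposes
    if i >= 2:
        break
```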
tydunn over 1 year ago

This is a lot of tokens. Llama 2 was trained on two trillion tokens [1]

[1] https://arxiv.org/abs/2307.09288
artninja1988 over 1 year ago

Nice. Hope somebody makes a torrent of it / hosts it in a way that it can't be taken down. Also, what are some estimates of how many tokens of text are out there? Seems like we are hitting that number pretty quick?
timcobb over 1 year ago
Super cool people are doing this. But I wonder: how will training data be any different from password lists of yore, which were the arms race secret sauce that no one ever shared?
shoelessone over 1 year ago

There are so many articles these days posted on HN like this recently, but I'm realizing I am too far out of touch with the technology to be able to appreciate it.

Any recommendations as to how I get a bit of hands-on experience in the AI "domain" so when I read some news articles like this it means something more to me? Or is this type of thing really only relevant to a very small subset of software people?
deepsquirrelnet over 1 year ago
I’ve been impressed with “fuzzy” deduplication at this data scale. I’ve used minhash and networkx for small amounts of data, but I really appreciated the write up on your GitHub about how you implemented it for this dataset.
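For readers unfamiliar with the minhash + networkx approach mentioned above, here is a small-scale sketch using the datasketch library: near-duplicate candidates found by MinHash LSH become edges in a graph, and one document per connected component is kept. The threshold, permutation count, and shingle size are arbitrary illustrative choices, not the settings used for RedPajama.

```python
from datasketch import MinHash, MinHashLSH
import networkx as nx

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog today",  # near-duplicate of "a"
    "c": "an entirely different piece of text about datasets",
}

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Hash overlapping 3-word shingles into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for shingle in zip(words, words[1:], words[2:]):
        m.update(" ".join(shingle).encode("utf-8"))
    return m

# Index all signatures; querying returns candidates above the Jaccard threshold.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
sigs = {doc_id: minhash(text) for doc_id, text in docs.items()}
for doc_id, sig in sigs.items():
    lsh.insert(doc_id, sig)

# Connect near-duplicates and keep one representative per connected component.
g = nx.Graph()
g.add_nodes_from(docs)
for doc_id, sig in sigs.items():
    for other in lsh.query(sig):
        if other != doc_id:
            g.add_edge(doc_id, other)

keep = {sorted(component)[0] for component in nx.connected_components(g)}
print(keep)  # expected: {"a", "c"} -- one copy of the near-duplicate pair survives
```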
applgo443 over 1 year ago

If it's 5 common crawls, isn't data across multiple common crawls mostly similar?
jprete over 1 year ago
It looks like mass copyright infringement, frankly.
e12e over 1 year ago

Nice. I admit I find the language selection a bit uninspired and odd:

> Five languages: English, French, Spanish, German, and Italian

Otoh I'm surprised that when counting first and second language proficiency, German is actually ahead of Japanese...

https://en.m.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers