
Show HN: Search the Pile: An 800GB Dataset of Diverse Text for Language Modeling

2 points by throwaway71271 over 2 years ago

1 comment

throwaway71271 over 2 years ago
Hi,<p>I was working on tag based search for The Pile dataset, you can get the whole torrent from: <a href="https:&#x2F;&#x2F;academictorrents.com&#x2F;details&#x2F;0d366035664fdf51cfbe9f733953ba325776e667" rel="nofollow">https:&#x2F;&#x2F;academictorrents.com&#x2F;details&#x2F;0d366035664fdf51cfbe9f7...</a><p>I index the whole dataset in a Lucene index and also run the documents through DMM (just as proof of concept) to extract some tags and then index those as well. The whole index + data is 900gb.<p>The cool part is you can just get a stream of documents to feed into your LLM for fine tuning, e.g.: stream link: &quot;<a href="http:&#x2F;&#x2F;167.235.117.207&#x2F;stream&#x2F;+feynman%20+pile_set_name:ArXiv&#x2F;1000" rel="nofollow">http:&#x2F;&#x2F;167.235.117.207&#x2F;stream&#x2F;+feynman%20+pile_set_name:ArXi...</a>&quot; will give you 1000 documents matching on feynman from the ArXiv dataset, that you can just requests.get() into your notebook.<p>I rented 80core 256gb ram machine from hetzner for 250 euro, but decided am not gonna pay for it next month, so decided to share the project before I delete it.<p>I will push the code soon (just have to clean it up a bit), but not sure when I can do it and the machine&#x27;s timer is ticking, so I decided to show it now.<p>You can use the lucene query parser syntax to search, including phrases and etc (<a href="https:&#x2F;&#x2F;lucene.apache.org&#x2F;core&#x2F;2_9_4&#x2F;queryparsersyntax.html" rel="nofollow">https:&#x2F;&#x2F;lucene.apache.org&#x2F;core&#x2F;2_9_4&#x2F;queryparsersyntax.html</a>)<p>There are two main problems I was looking to work on, one is fine tuning Large Language Models, and the other is working on &quot;Offline&quot; internet, having local curated subsets of the internet that are not AI generated.<p>I think that in the next year most of the internet will be generated by LLMs, but storage being so cheap, maybe we can build local libraries and we can share them, like 20 years ago with cassette tapes.<p>PS: If HN hugs the machine to death, I took a screenshot of how the page looks: <a href="https:&#x2F;&#x2F;punkx.org&#x2F;feynman.png" rel="nofollow">https:&#x2F;&#x2F;punkx.org&#x2F;feynman.png</a> at the top is the query input, the left is tags and datasets, and the right is sample of the first 3000 bytes of 1% of the documents