Hi,

I was working on tag-based search for The Pile dataset; you can get the whole torrent here: https://academictorrents.com/details/0d366035664fdf51cfbe9f733953ba325776e667

I indexed the whole dataset into a Lucene index and also ran the documents through DMM (just as a proof of concept) to extract some tags, which are indexed as well. The whole index plus data is 900 GB.

The cool part is that you can get a stream of documents to feed into your LLM for fine-tuning. For example, the stream link http://167.235.117.207/stream/+feynman%20+pile_set_name:ArXiv/1000 will give you 1000 documents matching "feynman" from the ArXiv subset, which you can simply requests.get() into your notebook (see the sketch at the end of this post).

I rented an 80-core, 256 GB RAM machine from Hetzner for 250 euros, but I'm not going to pay for it next month, so I decided to share the project before I delete it.

I will push the code soon (I just have to clean it up a bit), but I'm not sure when I can do that, and the machine's timer is ticking, so I'm showing it now.

You can use the Lucene query parser syntax to search, including phrases and so on: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

There are two main problems I was looking to work on: one is fine-tuning large language models, and the other is an "offline" internet, i.e. local curated subsets of the internet that are not AI-generated.

I think that within the next year most of the internet will be generated by LLMs, but with storage being so cheap, maybe we can build local libraries and share them, like we did with cassette tapes 20 years ago.

PS: In case HN hugs the machine to death, I took a screenshot of how the page looks: https://punkx.org/feynman.png. At the top is the query input, on the left are the tags and datasets, and on the right is a sample of the first 3000 bytes of 1% of the documents.
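Here is a minimal sketch of pulling a stream into a Python notebook. The URL shape is copied from the example link above; the stream_documents helper, the one-document-per-line framing, and the raw-text parsing are my assumptions, so adjust to whatever the endpoint actually returns:

    import requests
    from urllib.parse import quote

    def stream_documents(query, limit=1000, host="http://167.235.117.207"):
        # Hypothetical helper: builds a URL of the same shape as the
        # example link, e.g. /stream/+feynman%20+pile_set_name:ArXiv/1000.
        # safe="+:" keeps Lucene's '+' and field ':' unescaped, as in the example.
        url = f"{host}/stream/{quote(query, safe='+:')}/{limit}"
        resp = requests.get(url, stream=True)
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Assumption: one document per line; swap in json.loads(line)
            # or similar if the stream is framed differently.
            if line:
                yield line.decode("utf-8", errors="replace")

    # 1000 ArXiv documents matching "feynman", as in the example link
    for doc in stream_documents("+feynman +pile_set_name:ArXiv", 1000):
        print(doc[:200])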