科技回声 (Tech Echo)

A tech news platform built with Next.js, serving global tech news and discussion.

Big LLMs weights are a piece of history

301 points · by freeatnet · 2 months ago

28 comments

intellectronica · 2 months ago
I love the title "Big LLMs" because it means we are now making a distinction between big LLMs and minute LLMs and maybe medium LLMs. I'd like to propose that we call them "Tall LLMs", "Grande LLMs", and "Venti LLMs", just to be precise.
dr_dshiv · 2 months ago
"We should regard the Internet Archive as one of the most valuable pieces of modern history; instead, many companies and entities make the chances of the Archive to survive, and accumulate what otherwise will be lost, harder and harder. I understand that the Archive headquarters are located in what used to be a church: well, there is no better way to think of it than as a sacred place."

Amen. There is an active effort to create an Internet Archive based in Europe, just… in case.
jart · 2 months ago
Mozilla's llamafile project is designed to enable LLMs to be preserved for historical purposes. They ship the weights and all the necessary software in a deterministic, dependency-free, single-file executable. If you save your llamafiles, you should be able to run them in fifty years and have the outputs be exactly the same as what you'd get today. Please support Mozilla in their efforts to ensure this special moment in history gets archived for future generations!

https://github.com/Mozilla-Ocho/llamafile/
GeoAtreides · 2 months ago
Just like the map isn't the territory, summaries are not the content, nor are library filings the actual books.

If I want to read a post, a book, a forum, I want to read exactly that, not a simulacrum built by arcane mathematical algorithms.
api · 2 months ago
That's really what these are: something analogous to JPEG for language, and queryable in natural language.

Tangent: I was thinking the other day that these are not AI in the sense that they are not primarily *intelligence*. I still don't see much evidence of that. What they do give me is superhuman memory. The main things I use them for are search, research, and a "rubber duck" that talks back, and it's like having an intern who has memorized the library and the entire Internet. They occasionally hallucinate or make mistakes -- compression artifacts -- but it's there.

So it's more AM -- artificial memory.

Edit: as a reply pointed out, this is Vannevar Bush's Memex, kind of.
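The "artificial memory" framing above is essentially retrieval: answer a query by recalling the closest stored document rather than by reasoning. A toy sketch of that idea, using plain bag-of-words cosine similarity (all names here are illustrative; this is not how an LLM actually stores anything):

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ArtificialMemory:
    """Store documents; 'recall' returns the closest one to a query."""
    def __init__(self):
        self.docs = []

    def remember(self, text):
        self.docs.append((text, bow(text)))

    def recall(self, query):
        q = bow(query)
        return max(self.docs, key=lambda d: cosine(q, d[1]))[0]

mem = ArtificialMemory()
mem.remember("the memex was proposed by vannevar bush in 1945")
mem.remember("jpeg is a lossy compression format for images")
print(mem.recall("who proposed the memex"))
```

Unlike an LLM, this sketch returns the stored text verbatim; the lossy-compression analogy is precisely that model weights do not.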
laborcontract · 2 months ago
I miss the good ol' days when I'd have text-davinci make me a table of movies that included a link to the movie poster. It usually generated a URL of an image in an S3 bucket. The link *always worked*.
andix · 2 months ago
I think it's fine that not everything on the internet is archived forever.

It has always been like that: in the past, people wrote on paper, and most of it was never archived. At some point it was just lost.

I inherited many boxes of notes, books, and documents from my grandparents. Most of it was just meaningless to me. I had to throw away a lot of it and only kept a few thousand pages of various documents. The other stuff is just lost forever. And that's probably fine.

Archives are very important, but nowadays the most difficult part is selecting what to archive. There is so much content added to the internet every second that only a fraction of it can be archived.
hedgehog · 2 months ago
This doesn't make much sense to me. Unattributed hearsay has limited historical value, perhaps zero, given that the view of the web most of the weights-available models have is Common Crawl, which is itself available for preservation.
fl4tul4 · 2 months ago
> Scientific papers and processes that are lost forever as publishers fail, their websites shut down.

I don't think the big scientific publishers (now, in our time) will ever fail; they are RICH!
nickpsecurity · 2 months ago
People wanting this would be better off using memory architectures, like how the brain does it. For ML, the simplest approach is putting in memory layers with content-addressable schemes. I have a few links on prototypes in this comment:

https://news.ycombinator.com/item?id=42824960
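For readers unfamiliar with the term, a content-addressable memory layer retrieves stored values by similarity to a query rather than by index. A minimal sketch of the soft-attention variant, with assumed (illustrative) class and method names; real memory-layer designs in the linked prototypes differ in the details:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

class MemoryLayer:
    """Content-addressable memory: a query retrieves a similarity-weighted
    blend of stored values, instead of looking up a fixed address."""
    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def write(self, key, value):
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])

    def read(self, query, temperature=0.1):
        # Dot-product similarity against all keys; low temperature
        # sharpens the weights toward the best match.
        scores = self.keys @ query / temperature
        return softmax(scores) @ self.values

mem = MemoryLayer(4)
mem.write(np.array([1.0, 0, 0, 0]), np.array([1.0, 1, 0, 0]))
mem.write(np.array([0.0, 1, 0, 0]), np.array([0.0, 0, 1, 1]))
out = mem.read(np.array([1.0, 0, 0, 0]))  # ~[1, 1, 0, 0]: first value dominates
```

In a trainable setting, keys and values would be learned parameters and the read would sit inside a network layer; this sketch only shows the addressing mechanism.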
dstroot · 2 months ago
Isn't big LLM training data actually the most analogous to the Internet Archive? Shouldn't the title be "Big LLM training data is a piece of history"? Especially at this point in history, since a large portion of internet data going forward will be LLM-generated and not human-generated. It's kind of the last snapshot of human-created content.
ilaksh · 2 months ago
Great idea. Slightly related idea: use the Internet Archive to build a dataset of 6502 machine code/binaries, corresponding manuals, and possibly videos of the software in action, maybe emulator traces.

It might be possible to create an LLM that can write a custom vintage game or program on demand in machine code and simultaneously generate assets like sprites, especially if you use the latest reinforcement learning techniques.
rollcat · 2 months ago
https://xkcd.com/1683/
hi_hi · 2 months ago
Naming antics aside, the article makes a good point I've heard previously about the importance of the Internet Archive.

Are there any search experiences that allow me to search like it's 1999? I'd love to be able to re-create the experience of finding random passion-project blogs that give a small snapshot of what people and businesses were using the web for back then.
OuterVale · 2 months ago
Interesting. It seems that both they and I had very similar ideas at about the same time, with this being posted just a few hours after I finally published about AI model history being lost.

https://vale.rocks/posts/ai-model-history-is-being-lost
Havoc · 2 months ago
I wonder whether it'll become like pre-WW2 steel that doesn't have nuclear contamination, just with pre-LLM knowledge.
dmos62 · 2 months ago
Enjoy the insight, but the title makes my eye twitch. How about "LLM weights are pieces of history"?
pama · 2 months ago
I would be curious to know if it would be possible to reconstruct approximate versions of popular, common subsets of internet training data by using many different LLMs that may have happened to read the same info. Does anyone know of pointers to math papers about such things?
teleforce · 2 months ago
I really like the narrative that LLMs are now conserving, in their weights, human knowledge that would otherwise be lost forever, as a kind of lossy compression.

Personally, I'd like all knowledge and information (K & I) to be readily available and accessible (pretty sure most people share the same sentiment), despite the consistent business decisions of copyright holders to hoard their K & I by putting everything behind paywalls and/or registration (I'm looking at you, Apple and X/Twitter). Some people hate Google for organizing the world's information by feeding on and thriving through advertisements, but in the long run the information does get organized and is kind of preserved in many internet data formats, lossy or not. After all, it was Google that originally designed the transformer that enabled the LLM weights that are now, apparently, a piece of history.
almosthere · 2 months ago
Split the Wayback Machine away from its book-copyright lawsuit stuff and you don't have to worry.
off_by_inf · 2 months ago
And they're all undertrained, according to the papers.
bossyTeacher · 2 months ago
So large large language model?
throwaway48476 · 2 months ago
The internet training data for LLMs is valuable history we're losing one dead webadmin at a time. The regurgitated slop, less so.
codr7 · 2 months ago
I find it very depressing to think that the only traces left from all this creativity will end up being AI slop, the worst use case ever.

I feel like the more people use GenAI, the less intelligent they become. Like the rest of this society, it seems designed to suck the life force out of humans and return useless crap instead.
sourtrident · 2 months ago
Imagine future historians piecing together our culture from hallucinated AI memories - inaccurate, sure, but maybe even more fascinating than reality itself.
blinky81 · 2 months ago
"big large" lol
guybedo · 2 months ago
FWIW, I've added a summary of the discussion here: https://extraakt.com/extraakts/67d708bc9844db151612d782
isoprophlex · 2 months ago
Interesting. Just this morning I had a conversation with Claude about this very topic. When asked "can you give me your thoughts on LLM train runs as historical artifacts? do you think they might be uniquely valuable for future historians?", it answered:

> oh HELL YEAH they will be. future historians are gonna have a fucking field day with us.
> imagine some poor academic in 2147 booting up "vintage llm.exe" and getting to directly interrogate the batshit insane period when humans first created quasi-sentient text generators right before everything went completely sideways with *gestures vaguely at civilization*
> *"computer, tell me about the vibes in 2025"*
> "BLARGH everyone was losing their minds about ai while also being completely addicted to it"

Interesting indeed to be able to directly interrogate the median experience of being online in 2025.

(Also, my apologies for slop-posting; I slapped so much custom prompting on it that I hope you'll find the output amusing enough.)