This paper was accepted as a poster at NeurIPS 2024, so it isn't just a pre-print. There is a presentation video and slides here: https://neurips.cc/virtual/2024/poster/94849

The underlying data has been open-sourced, as discussed on his blog here: https://timothynguyen.org/2024/11/07/open-sourced-my-work-on-llms-and-n-gram-statistics/
I wonder if these N-gram reduced models, augmented with confidence measures, could act as a very fast speculative decoder. Or maybe the sheer number of explicit rules unfolded from the compressed latent representation would make it impractical.
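For what it's worth, here is a rough sketch of what that could look like: the n-gram ruleset cheaply drafts a few tokens, and the LLM only verifies them, accepting the draft up to the first disagreement. Everything below (the toy `ngram_table`, the `llm_next_token` stand-in, the greedy acceptance rule) is a made-up illustration of the general speculative-decoding idea, not anything from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical n-gram ruleset: maps a 2-token context to its most likely next token.
ngram_table: Dict[Tuple[str, ...], str] = {
    ("once", "upon"): "a",
    ("upon", "a"): "time",
    ("a", "time"): ",",
}

def ngram_propose(context: List[str], k: int) -> List[str]:
    """Greedily propose up to k draft tokens from the n-gram table."""
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = ngram_table.get(tuple(ctx[-2:]))
        if tok is None:
            break
        draft.append(tok)
        ctx.append(tok)
    return draft

def llm_next_token(context: List[str]) -> str:
    """Stand-in for the expensive LLM call; here it happens to disagree once."""
    overrides = {("a", "time"): "."}  # the "LLM" prefers '.' over ','
    key = tuple(context[-2:])
    return overrides.get(key, ngram_table.get(key, "<unk>"))

def speculative_step(context: List[str], k: int = 4) -> List[str]:
    """Accept draft tokens as long as the LLM agrees; stop at the first mismatch."""
    accepted, ctx = [], list(context)
    for tok in ngram_propose(ctx, k):
        if llm_next_token(ctx) != tok:
            break  # disagreement: fall back to the LLM's own prediction
        accepted.append(tok)
        ctx.append(tok)
    return accepted

print(speculative_step(["once", "upon"]))  # -> ['a', 'time'] in this toy example
```

The win would come from the draft tokens being verified in a single batched LLM pass instead of one pass per token; whether the ruleset is compact and accurate enough for that to pay off is exactly the open question.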
> The results we obtained in Section 7 imply that, at least on simple datasets like TinyStories and Wikipedia, LLM predictions contain much quantifiable structure insofar that they often can be described in terms of our simple statistical rules

> we find that for 79% and 68% of LLM next-token distributions on TinyStories and Wikipedia, respectively, their top-1 predictions agree with those provided by our N-gram rulesets

Two prediction methods may have completely different mechanisms but still agree much of the time, because they are both predicting the same thing.

It seems a fairly large proportion of language can be predicted by a simpler model. But it's the remaining percentage that's the difficult part: the part that simple n-gram models are bad at and transformers are really good at.
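To make the 79%/68% figures concrete, the agreement metric is presumably just the fraction of contexts where the two predictors' argmax tokens coincide. A minimal sketch of that computation, with made-up stand-in predictors rather than the paper's actual rulesets or models:

```python
from typing import List

def ngram_top1(context: List[str]) -> str:
    """Toy n-gram rule: predict from the last token only (a bigram rule)."""
    table = {("the",): "cat", ("cat",): "sat"}
    return table.get(tuple(context[-1:]), "<unk>")

def llm_top1(context: List[str]) -> str:
    """Stand-in for the argmax of an LLM's next-token distribution."""
    table = {("the",): "cat", ("cat",): "ran"}
    return table.get(tuple(context[-1:]), "<unk>")

def top1_agreement(contexts: List[List[str]]) -> float:
    """Fraction of contexts where both predictors pick the same next token."""
    hits = sum(ngram_top1(c) == llm_top1(c) for c in contexts)
    return hits / len(contexts)

contexts = [["the"], ["cat"], ["the"], ["cat"]]
print(top1_agreement(contexts))  # -> 0.5 in this toy example
```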
How does this have 74 points and only one comment?

On topic: couldn't one, in theory, republish this kind of paper for different kinds of LLMs, since the textual corpus on which LLMs are built is ultimately based, at some level, on human effort and human input, whether writing or typing?
Sounds regressive, and it feeds into the weird unintellectual narrative that LLMs are just like n-gram models (lol, lmao even).

The author submitted something like 10 papers this May alone. Is that weird?