Understanding Transformers via N-gram Statistics

132 points by pona-a 2 days ago

6 comments

cschmidt 1 day ago
This paper was accepted as a poster at NeurIPS 2024, so it isn't just a pre-print. There is a presentation video and slides here: https://neurips.cc/virtual/2024/poster/94849

The underlying data has been open sourced, as discussed on his blog here: https://timothynguyen.org/2024/11/07/open-sourced-my-work-on-llms-and-n-gram-statistics/
pona-a 1 day ago
I wonder if these N-gram reduced models, augmented with confidence measures, can act as a very fast speculative decoder. Or maybe the sheer number of explicit rules unfolded from the compressed latent representation will make it impractical.
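For concreteness, a minimal sketch of that speculative-decoding idea, assuming a hypothetical cheap n-gram predictor and an expensive target model exposed as simple callables (ngram_draft and target_argmax are invented names, not anything from the paper):

    # Sketch of n-gram-draft speculative decoding (hypothetical interfaces).
    from typing import Callable, List

    def speculative_step(
        context: List[int],
        ngram_draft: Callable[[List[int]], int],    # cheap next-token guess
        target_argmax: Callable[[List[int]], int],  # expensive "true" model
        k: int = 4,
    ) -> List[int]:
        """Propose k tokens with the n-gram model, keep the prefix the target agrees with."""
        # 1. Draft k tokens greedily with the cheap model.
        draft = []
        ctx = list(context)
        for _ in range(k):
            t = ngram_draft(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. Verify: accept draft tokens while the target model's greedy choice matches.
        accepted = []
        ctx = list(context)
        for t in draft:
            true_t = target_argmax(ctx)
            if true_t == t:
                accepted.append(t)
                ctx.append(t)
            else:
                # 3. First disagreement: take the target's token and stop.
                accepted.append(true_t)
                break
        return accepted

Real speculative decoding verifies all k draft positions in one batched forward pass and uses an acceptance test over the full distributions; greedy agreement here is only the simplest variant of the idea.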
montebicyclelo 1 day ago
> The results we obtained in Section 7 imply that, at least on simple datasets like TinyStories and Wikipedia, LLM predictions contain much quantifiable structure insofar that they often can be described in terms of our simple statistical rules

> we find that for 79% and 68% of LLM next-token distributions on TinyStories and Wikipedia, respectively, their top-1 predictions agree with those provided by our N-gram rulesets

Two prediction methods may have completely different mechanisms but still agree some of the time, because they are both predicting the same thing.

It seems a fairly large proportion of language can be predicted by a simpler model. But it's the remaining percentage that's the difficult part, which simple n-gram models are bad at and transformers are really good at.
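The quoted 79% / 68% figures are top-1 agreement rates. A toy illustration of how such a rate is computed, using made-up predictions rather than the paper's data:

    # Toy top-1 agreement computation (illustrative data, not the paper's).
    llm_top1   = ["the", "cat", "sat", "on", "the", "mat"]
    ngram_top1 = ["the", "dog", "sat", "on", "the", "mat"]

    agree = sum(a == b for a, b in zip(llm_top1, ngram_top1))
    rate = agree / len(llm_top1)
    print(f"top-1 agreement: {rate:.0%}")  # -> 83%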
maz1b 2 days ago
How does this have 74 points and only one comment?

On topic: couldn't one, in theory, re-publish this kind of paper for different kinds of LLMs, since the textual corpus on which LLMs are built is ultimately, at some level, the product of human effort and human input, whether that's writing or typing?
bilsbie 1 day ago
Interesting! Makes me wonder if you could replace transformers with some sort of fancy Markov chain. Maybe with a meta chain that acts as attention.
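As a reference point for what such a Markov chain would start from, here is a minimal bigram (order-1 Markov) next-token predictor with a unigram backoff; anything attention-like would have to replace the fixed-length context, and this sketch makes no attempt at that:

    from collections import Counter, defaultdict

    class BigramModel:
        """Order-1 Markov chain over tokens with a unigram fallback."""
        def __init__(self):
            self.bigrams = defaultdict(Counter)
            self.unigrams = Counter()

        def fit(self, tokens):
            for prev, nxt in zip(tokens, tokens[1:]):
                self.bigrams[prev][nxt] += 1
                self.unigrams[nxt] += 1

        def predict(self, prev):
            # Back off to the global unigram distribution for unseen contexts.
            counts = self.bigrams.get(prev) or self.unigrams
            return counts.most_common(1)[0][0]

    model = BigramModel()
    model.fit("the cat sat on the mat".split())
    print(model.predict("the"))  # -> "cat" (ties broken by insertion order)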
justanotherjoe 2 days ago
Sounds regressive, and it feeds into the weird, unintellectual narrative that LLMs are just n-gram models (lol, lmao even).

The author submitted something like 10 papers this May alone. Is that weird?