Automatically Detecting Under-Trained Tokens in Large Language Models

182 points by veryluckyxyz about 1 year ago

6 comments

helsinkiandrew about 1 year ago

Good Computerphile video on glitch tokens a year ago: https://www.youtube.com/watch?v=WO2X3oZEJOA
londons_explore about 1 year ago

We shouldn't just be looking for under-trained tokens. Tokens are effectively the first layer of the network, but we should also be looking for training data imbalances at every weight at every other layer of the network.

When we find them, it might be best to delete weights with hardly any data flowing through them (which might make the model smaller or help generalisation).
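A minimal sketch of what "delete weights with hardly any data flowing through them" could look like in practice, assuming a PyTorch model: gather average input activations per linear layer with forward hooks, then zero out weight columns whose inputs are essentially never active. The hook-based statistics, the threshold, and the column-wise criterion are illustrative assumptions, not anything specified in the comment or the paper.

```python
# Hypothetical sketch: prune weight columns whose inputs carry almost no data,
# estimated from mean |activation| collected on a few sample batches.
import torch
import torch.nn as nn

def collect_input_stats(model, dataloader, num_batches=10):
    """Record mean |activation| feeding each Linear layer via forward hooks."""
    stats, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # average absolute activation per input feature
            x = inputs[0].detach().abs().mean(dim=tuple(range(inputs[0].dim() - 1)))
            stats[name] = stats.get(name, 0) + x
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        for i, (x, _) in enumerate(dataloader):  # assumes (input, target) batches
            if i >= num_batches:
                break
            model(x)

    for h in hooks:
        h.remove()
    return {k: v / num_batches for k, v in stats.items()}

def prune_dead_inputs(model, stats, threshold=1e-4):
    """Zero out weight columns whose input feature is (almost) never used."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and name in stats:
            dead = stats[name] < threshold  # input features with ~no data flow
            module.weight.data[:, dead] = 0.0
```

Zeroing columns keeps the architecture intact; actually shrinking the model would require rebuilding the layers around the surviving features.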
65a about 1 year ago

I find it hard to believe that a Canadian company's model contained an undertrained token related to hockey (albeit in German). In all seriousness, this is pretty cool and I'm excited to see understanding of tokenization impacts on models improve. One notable finding is that a lot of the earlier open source models have issues with carriage returns, which are not that uncommonly introduced depending on where the data is coming from, etc.
esafak about 1 year ago

There is a random matrix theory derived diagnostic of training that relies on the spectral density of the correlation matrix of the weights. Each layer's spectral density is fit to a truncated power law, and deemed properly trained if the power law exponent alpha is just above two.

https://jmlr.org/beta/papers/v22/20-410.html
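A rough sketch of the kind of diagnostic being described, assuming a single weight matrix: take the eigenvalues of the correlation matrix W^T W (the empirical spectral density) and fit a power-law tail exponent with a simple Hill-style MLE. This is only an illustration of the idea; the tail fraction is an arbitrary assumption and this is not the reference implementation from the linked JMLR paper.

```python
# Illustrative sketch: empirical spectral density of a layer's weights and a
# Hill-type MLE fit of the power-law tail exponent alpha.
import numpy as np

def esd(weight: np.ndarray) -> np.ndarray:
    """Eigenvalues of the correlation matrix W^T W (squared singular values)."""
    sv = np.linalg.svd(weight, compute_uv=False)
    return sv ** 2

def fit_power_law_alpha(eigs: np.ndarray, tail_fraction: float = 0.5) -> float:
    """Continuous power-law MLE on the largest `tail_fraction` of eigenvalues:
    alpha_hat = 1 + n / sum(log(lambda_i / lambda_min))."""
    eigs = np.sort(eigs)[::-1]
    tail = eigs[: max(2, int(len(eigs) * tail_fraction))]
    lam_min = tail.min()
    return 1.0 + len(tail) / np.sum(np.log(tail / lam_min))

# Example on a random matrix (so the spectrum is roughly Marchenko-Pastur,
# not a heavy power-law tail as reported for well-trained layers):
W = np.random.randn(768, 3072) / np.sqrt(3072)
alpha = fit_power_law_alpha(esd(W))
print(f"fitted tail exponent alpha ~ {alpha:.2f}")
# Heuristic from the comment: alpha a bit above 2 suggests a well-trained layer.
```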
anewhnaccount3 about 1 year ago

Isn't the solution to just train the tokeniser on the same corpus as the LLM? I'm not sure why reusing tokenisers is so common. Anybody know?
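For what it's worth, training a tokenizer on the pretraining corpus itself is straightforward with the Hugging Face `tokenizers` library. A minimal sketch, assuming the text is available as a line iterator; the corpus path, vocab size, and special token are placeholders, not values from the paper.

```python
# Hypothetical sketch: train a byte-level BPE tokenizer on the same text
# that will be used to pretrain the LLM, so every token in the vocabulary
# actually occurs in the training data.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def corpus_lines(path="pretrain_corpus.txt"):  # placeholder path
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                  # illustrative size
    special_tokens=["<|endoftext|>"],   # illustrative special token
)
tokenizer.train_from_iterator(corpus_lines(), trainer=trainer)
tokenizer.save("tokenizer.json")
```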
djamconway about 1 year ago
Amazing name for the paper