
Automatically Detecting Under-Trained Tokens in Large Language Models

182 points by veryluckyxyz, about 1 year ago

6 comments

helsinkiandrew, about 1 year ago
Good Computerphile video on glitch tokens from a year ago: https://www.youtube.com/watch?v=WO2X3oZEJOA
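For readers who want to poke at this themselves, here is a minimal, hedged sketch of one common glitch-token heuristic (not necessarily the paper's exact method): flag tokens whose embedding vectors have unusually small norm, on the theory that tokens rarely seen in training barely move from their initialization. The model name and cutoff below are purely illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice; any causal LM on the Hub works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Norm of each row of the input embedding matrix (vocab_size, hidden_dim).
emb = model.get_input_embeddings().weight.detach()
norms = emb.norm(dim=-1)

# Tokens whose embeddings barely moved from initialization tend to have
# unusually small norms; they are only *candidates* for under-trained
# ("glitch") tokens and should be confirmed by actually prompting with them.
k = 20
for i in norms.argsort()[:k].tolist():
    print(f"{i:>7}  norm={norms[i].item():.3f}  {tok.convert_ids_to_tokens(i)!r}")
```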
londons_explore, about 1 year ago
We shouldn't just be looking for under-trained tokens. Tokens are effectively the first layer of the network, but we should also be looking for training data imbalances at every weight in every other layer of the network.

When we find them, it might be best to delete weights with hardly any data flowing through them (which might make the model smaller or help generalisation).
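As a rough illustration of that idea (a sketch under my own assumptions, not an established recipe): for a single linear layer one could measure how much activation actually flows through each input feature on a calibration set, then zero the weight columns fed by features that are almost never active. Layer sizes, calibration data, and the threshold below are all made up for illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)      # stand-in for one layer of a real model
calib = torch.randn(1024, 512)   # stand-in for calibration-set activations

with torch.no_grad():
    mean_abs_act = calib.abs().mean(dim=0)            # per-input-feature activity
    dead = mean_abs_act < 0.01 * mean_abs_act.mean()  # "hardly any data" threshold
    layer.weight[:, dead] = 0.0                       # delete those connections

print(f"zeroed columns for {int(dead.sum())} of {dead.numel()} input features")
```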
65a, about 1 year ago
I find it hard to believe that a Canadian company's model contained an under-trained token related to hockey (albeit in German). In all seriousness, this is pretty cool and I'm excited to see understanding of tokenization impacts on models improve. One notable finding is that a lot of the earlier open-source models have issues with carriage returns, which are not that uncommonly introduced depending on where the data is coming from, etc.
esafak, about 1 year ago
There is a random-matrix-theory-derived diagnostic of training that relies on the spectral density of the correlation matrix of the weights. Each layer's spectral density is fit to a truncated power law, and the layer is deemed properly trained if the power-law exponent alpha is just above two.

https://jmlr.org/beta/papers/v22/20-410.html
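In the spirit of that diagnostic, here is a hedged sketch (not the linked paper's exact fitting procedure, which uses a more careful truncated power-law fit): compute the eigenvalue spectrum of a layer's correlation matrix W^T W / N and estimate the tail exponent with the standard continuous power-law MLE. The matrix shape and tail fraction below are arbitrary stand-ins.

```python
import numpy as np

def layer_alpha(W: np.ndarray, tail_frac: float = 0.5) -> float:
    """Estimate the power-law exponent of the tail of the ESD of W^T W / N."""
    N = W.shape[0]
    eigvals = np.linalg.eigvalsh(W.T @ W / N)             # empirical spectral density
    eigvals = np.sort(eigvals[eigvals > 1e-12])
    tail = eigvals[int(len(eigvals) * (1 - tail_frac)):]  # fit only the upper tail
    xmin = tail[0]
    # Continuous power-law MLE: alpha = 1 + n / sum(log(lambda_i / xmin))
    return 1.0 + len(tail) / float(np.sum(np.log(tail / xmin)))

W = np.random.randn(4096, 1024) / np.sqrt(4096)  # stand-in for a real weight matrix
print(f"alpha = {layer_alpha(W):.2f}")
```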
anewhnaccount3, about 1 year ago
Isn't the solution to just train the tokeniser on the same corpus as the LLM? I'm not sure why reusing tokenisers is so common. Anybody know?
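For what that suggestion looks like in practice, here is a minimal sketch using the Hugging Face `tokenizers` library to train a fresh byte-level BPE vocabulary on the same corpus the model will be pretrained on; the corpus file name and vocabulary size are illustrative.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Fresh byte-level BPE tokenizer trained from scratch on the pretraining corpus.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "<|endoftext|>"])
tokenizer.train(files=["pretraining_corpus.txt"], trainer=trainer)  # hypothetical file
tokenizer.save("tokenizer.json")

# Every merge in the resulting vocab was learned from text that actually occurs
# in this corpus, which makes "never seen during training" tokens less likely,
# though data filtering between tokenizer and model training can still
# reintroduce a mismatch.
```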
djamconway, about 1 year ago
Amazing name for the paper