科技回声

6 comments

londons_explore, about 1 year ago

We shouldn't just be looking for undertrained tokens. Tokens are effectively the first layer of the network, but we should also be looking for training data imbalances at every weight in every other layer of the network.

When we find them, it might be best to delete weights with hardly any data flowing through them (which might make the model smaller or help generalisation).
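A minimal sketch of that idea, assuming "data flowing through" a weight can be approximated by how often the unit it feeds is active over a sample of inputs. The function name `rarely_active_units`, the `eps` threshold, and the `min_fraction` cutoff are illustrative assumptions, not something from the paper or any library.

```python
def rarely_active_units(activations, eps=1e-6, min_fraction=0.01):
    """Return indices of units that are active on fewer than
    min_fraction of the recorded inputs.

    activations: list of rows, one row of per-unit activations per input.
    A unit counts as "active" on an input when |activation| > eps.
    """
    if not activations:
        return []
    n_inputs = len(activations)
    n_units = len(activations[0])
    active_counts = [0] * n_units
    for row in activations:
        for j, value in enumerate(row):
            if abs(value) > eps:
                active_counts[j] += 1
    return [j for j, c in enumerate(active_counts) if c / n_inputs < min_fraction]


# Toy post-activation values for 4 inputs and 3 units (made-up numbers).
acts = [
    [0.9, 0.0, 0.3],
    [0.4, 0.0, 0.0],
    [0.7, 0.0, 0.2],
    [0.1, 0.0, 0.5],
]
flagged = rarely_active_units(acts, min_fraction=0.25)
print(flagged)  # candidate units for pruning
```

Units flagged this way are only candidates: a unit that fires rarely may still matter for rare inputs, so any pruning would need to be checked against held-out data.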

65a, about 1 year ago

I find it hard to believe that a Canadian company's model contained an undertrained token related to hockey (albeit in German). In all seriousness, this is pretty cool, and I'm excited to see our understanding of how tokenization affects models improve. One notable finding is that a lot of the earlier open source models have issues with carriage returns, which are not that uncommon depending on where the data comes from.

esafak, about 1 year ago

There is a diagnostic of training, derived from random matrix theory, that relies on the spectral density of the correlation matrix of the weights. Each layer's spectral density is fit to a truncated power law, and the layer is deemed properly trained if the power-law exponent alpha is just above two.

https://jmlr.org/beta/papers/v22/20-410.html
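A minimal sketch of the exponent-fitting step, assuming the eigenvalues of a layer's weight correlation matrix have already been computed elsewhere (for example with a linear algebra library). It uses the standard continuous power-law maximum-likelihood estimator on the tail above a cutoff, which is a simplification of the truncated power-law fit mentioned above. The name `powerlaw_alpha` and the `xmin` cutoff are illustrative assumptions, and the eigenvalues here are synthetic.

```python
import math
import random


def powerlaw_alpha(eigenvalues, xmin):
    """Estimate alpha for a density p(x) ~ x**(-alpha) on x >= xmin,
    using the continuous maximum-likelihood estimator
    alpha = 1 + n / sum(ln(x_i / xmin))."""
    tail = [x for x in eigenvalues if x >= xmin]
    if len(tail) < 2:
        raise ValueError("need at least two eigenvalues at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0.0:
        raise ValueError("all tail values equal xmin; alpha is undefined")
    return 1.0 + len(tail) / log_sum


# Synthetic stand-in for a layer's eigenvalue spectrum (assumption):
# paretovariate(a) draws from a density proportional to x**-(a + 1),
# so shape 1.5 corresponds to a true exponent of 2.5.
rng = random.Random(0)
fake_eigenvalues = [rng.paretovariate(1.5) for _ in range(20000)]

alpha_hat = powerlaw_alpha(fake_eigenvalues, xmin=1.0)
print(f"estimated alpha: {alpha_hat:.2f}")
```

On real weights the choice of `xmin` matters a lot, and the estimate is only meaningful if the tail actually looks like a power law.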

anewhnaccount3, about 1 year ago

Isn't the solution to just train the tokeniser on the same corpus as the LLM? I'm not sure why reusing tokenisers is so common. Anybody know?
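A minimal sketch of why a reused tokeniser can matter, assuming a toy whitespace tokeniser and a made-up vocabulary: any vocabulary entry that never occurs in the training corpus gets no training signal. The function name `unseen_tokens` and the placeholder token `zzz_unused` are hypothetical.

```python
from collections import Counter


def unseen_tokens(vocab, token_stream):
    """Return vocabulary entries that never occur in the token stream."""
    counts = Counter(token_stream)
    return sorted(tok for tok in vocab if counts[tok] == 0)


# Vocabulary inherited from some other corpus (made-up example).
vocab = {"the", "cat", "sat", "zzz_unused"}

# Toy training corpus, tokenised by whitespace (assumption).
corpus_tokens = "the cat sat the cat".split()

missing = unseen_tokens(vocab, corpus_tokens)
print(missing)  # tokens the model would never see during training
```

A real check would run the actual tokeniser over the actual training data and would also look at very low counts, not just zeros.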

djamconway, about 1 year ago

Amazing name for the paper

helsinkiandrew, about 1 year ago

Good Computerphile video on glitch tokens a year ago:

https://www.youtube.com/watch?v=WO2X3oZEJOA

Automatically Detecting Under-Trained Tokens in Large Language Models
