Good Computerphile video on glitch tokens from a year ago:

https://www.youtube.com/watch?v=WO2X3oZEJOA
We shouldn't just be looking for undertrained tokens. Tokens are effectively the first layer of the network, but we should also be looking for training-data imbalances at every weight in every other layer of the network.

When we find them, it might be best to delete weights with hardly any data flowing through them, which might make the model smaller or help generalisation.
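As a rough illustration of that pruning idea (the "traffic" metric, the quantile cutoff, and the function name below are my own invention, not anything from the article):

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def prune_low_traffic_weights(layer: nn.Linear, sample_inputs: torch.Tensor, q=0.01):
        """Zero out weights that see almost no data on a sample batch.

        "Traffic" through weight w_ij is approximated as mean |x_j| * |w_ij|
        over the batch -- a crude proxy for how much signal actually flows
        through that weight.
        """
        mean_abs_in = sample_inputs.abs().mean(dim=0)      # (in_features,)
        traffic = layer.weight.abs() * mean_abs_in         # broadcast over output rows
        cutoff = torch.quantile(traffic.flatten(), q)
        keep = (traffic >= cutoff).to(layer.weight.dtype)
        layer.weight.mul_(keep)                            # zero the lowest-traffic fraction q
        return keep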
I find it hard to believe that a Canadian company's model contained an undertrained token related to hockey (albeit in German). In all seriousness, this is pretty cool, and I'm excited to see understanding of tokenization's impact on models improve. One notable finding is that a lot of the earlier open-source models have issues with carriage returns, which are not uncommonly introduced depending on where the data comes from.
There is a random-matrix-theory-derived diagnostic of training that relies on the spectral density of the correlation matrix of the weights. Each layer's spectral density is fit to a truncated power law, and the layer is deemed properly trained if the power-law exponent alpha is just above two.

https://jmlr.org/beta/papers/v22/20-410.html
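For anyone who wants to poke at this on their own weights, a minimal numpy sketch of the per-layer alpha is below. It uses a plain Hill MLE on the eigenvalue tail rather than the paper's proper truncated power-law fit, and the tail-size choice is arbitrary; the authors' tooling does this more carefully.

    import numpy as np

    def layer_alpha(W, tail_frac=0.25):
        """Crude power-law exponent for one layer's weight matrix W (out x in).

        Eigenvalues of the correlation matrix X = W^T W / N are the squared
        singular values of W divided by N; alpha is estimated on the top
        tail_frac of them with a Hill MLE.
        """
        N = W.shape[0]
        evals = np.sort(np.linalg.svd(W, compute_uv=False) ** 2 / N)
        k = min(len(evals), max(int(len(evals) * tail_frac), 5))
        tail = evals[-k:]
        xmin = tail[0]
        # MLE for a continuous power law p(x) ~ x^(-alpha), x >= xmin
        return 1.0 + k / np.sum(np.log(tail / xmin))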
Isn't the solution to just train the tokeniser on the same corpus as the LLM? I'm not sure why reusing tokenisers is so common. Anybody know?
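If anyone wants to try it, training a fresh byte-level BPE tokeniser on your own corpus is only a few lines with the HuggingFace tokenizers library; the file name, vocab size, and special tokens below are placeholders.

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import ByteLevel
    from tokenizers.trainers import BpeTrainer

    # Byte-level BPE trained directly on the same corpus the LLM will see.
    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = ByteLevel()

    trainer = BpeTrainer(vocab_size=32000, special_tokens=["<|endoftext|>"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder

    tokenizer.save("tokenizer.json")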