Good Computerphile video on glitch tokens from a year ago:

https://www.youtube.com/watch?v=WO2X3oZEJOA
We shouldn't just be looking for undertrained tokens. Tokens are effectively the first layer of the network, but we should also be looking for training-data imbalances at every weight in every other layer of the network.

When we find them, it might be best to delete weights with hardly any data flowing through them, which might make the model smaller or help generalisation.
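As a rough illustration of that pruning idea (the "traffic" metric, the quantile cutoff, and the function name below are my own invention, not anything from the article):

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def prune_low_traffic_weights(layer: nn.Linear, sample_inputs: torch.Tensor, q=0.01):
        """Zero out weights that see almost no data on a sample batch.

        "Traffic" through weight w_ij is approximated as mean |x_j| * |w_ij|
        over the batch -- a crude proxy for how much signal actually flows
        through that weight.
        """
        mean_abs_in = sample_inputs.abs().mean(dim=0)      # (in_features,)
        traffic = layer.weight.abs() * mean_abs_in         # broadcast over output rows
        cutoff = torch.quantile(traffic.flatten(), q)
        keep = (traffic >= cutoff).to(layer.weight.dtype)
        layer.weight.mul_(keep)                            # zero the lowest-traffic fraction q
        return keep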
I find it hard to believe that a Canadian company's model contained an undertrained token related to hockey (albeit in German). In all seriousness, this is pretty cool, and I'm excited to see understanding of tokenization's impact on models improve. One notable finding is that a lot of the earlier open-source models have issues with carriage returns, which are not uncommonly introduced depending on where the data comes from.
There is a random-matrix-theory-derived diagnostic of training that relies on the spectral density of the correlation matrix of the weights. Each layer's spectral density is fit to a truncated power law, and the layer is deemed properly trained if the power-law exponent alpha is just above two.

https://jmlr.org/beta/papers/v22/20-410.html
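For anyone who wants to poke at this on their own weights, a minimal numpy sketch of the per-layer alpha is below. It uses a plain Hill MLE on the eigenvalue tail rather than the paper's proper truncated power-law fit, and the tail-size choice is arbitrary; the authors' tooling does this more carefully.

    import numpy as np

    def layer_alpha(W, tail_frac=0.25):
        """Crude power-law exponent for one layer's weight matrix W (out x in).

        Eigenvalues of the correlation matrix X = W^T W / N are the squared
        singular values of W divided by N; alpha is estimated on the top
        tail_frac of them with a Hill MLE.
        """
        N = W.shape[0]
        evals = np.sort(np.linalg.svd(W, compute_uv=False) ** 2 / N)
        k = min(len(evals), max(int(len(evals) * tail_frac), 5))
        tail = evals[-k:]
        xmin = tail[0]
        # MLE for a continuous power law p(x) ~ x^(-alpha), x >= xmin
        return 1.0 + k / np.sum(np.log(tail / xmin))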
Isn't the solution to just train the tokeniser on the same corpus as the LLM? I'm not sure why reusing tokenisers is so common. Anybody know?
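If anyone wants to try it, training a fresh byte-level BPE tokeniser on your own corpus is only a few lines with the HuggingFace tokenizers library; the file name, vocab size, and special tokens below are placeholders.

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import ByteLevel
    from tokenizers.trainers import BpeTrainer

    # Byte-level BPE trained directly on the same corpus the LLM will see.
    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = ByteLevel()

    trainer = BpeTrainer(vocab_size=32000, special_tokens=["<|endoftext|>"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder

    tokenizer.save("tokenizer.json")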