We shouldn't just be looking for undertrained tokens. Tokens are effectively the first layer of the network, but we should also be looking for training-data imbalances at every weight in every other layer of the network.<p>When we find them, it might be best to delete the weights that hardly any data flows through (which might make the model smaller or help generalisation).
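<p>A rough sketch of what that could look like for a single linear layer, assuming "data flowing through a weight" is measured as the mean absolute contribution |x_j * W_ij| over a batch of inputs. That proxy, the function names `weight_traffic` and `prune_low_traffic`, and the threshold value are all my own assumptions for illustration, not an established method:

```python
import numpy as np

def weight_traffic(W, X):
    """Mean absolute contribution |x_j * W_ij| of each weight over a batch.

    W: (out_features, in_features) weight matrix of one linear layer.
    X: (n_samples, in_features) inputs that reached this layer.
    """
    # |x_j * W_ij| = |x_j| * |W_ij|, so averaging |x_j| over samples is enough.
    mean_abs_input = np.abs(X).mean(axis=0)       # shape: (in_features,)
    return np.abs(W) * mean_abs_input[None, :]    # shape: (out, in)

def prune_low_traffic(W, X, threshold):
    """Zero out weights whose traffic falls below `threshold` (hypothetical cutoff)."""
    mask = weight_traffic(W, X) >= threshold
    return W * mask, mask

# Toy data: input feature 2 never fires, so its weights see no data at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
X[:, 2] = 0.0
W = rng.normal(size=(3, 4))

pruned_W, mask = prune_low_traffic(W, X, threshold=1e-6)
```

Here every weight attached to the dead input gets masked out. Whether doing this on a real model actually shrinks it or helps generalisation is exactly the open question.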