Current embeddings are poorly trained and hold networks back significantly. A core issue is something I call 'token drag': when a low-frequency token finally comes up, it drags the model back towards an earlier state, wasting a lot of training. The result is that the first few layers of a model end up acting as little more than a buffer against the bad embeddings feeding it. Luckily, fixing this is actually really easy. Adding a sacrificial two-layer network that predicts the embeddings during training (and then just calculating the embeddings once for prod inference) gives a massive boost to training. To see this in action check out the unified embeddings in this project: <a href="https://github.com/jmward01/lmplay">https://github.com/jmward01/lmplay</a>
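
In case it helps, here's a rough sketch of the idea in plain PyTorch. The names, sizes, and structure are mine for illustration, not the lmplay code: the embedding lookup is followed by a small two-layer MLP during training, and once training is done the MLP is run over the whole table once so inference uses an ordinary embedding. My reading of why this helps with token drag is that the shared MLP weights get updated on every step, so rare tokens aren't stuck with a stale representation between occurrences.

    import torch
    import torch.nn as nn

    class SacrificialEmbedding(nn.Module):
        """Embedding table followed by a small two-layer MLP (the "sacrificial" part).

        During training every lookup passes through the MLP, so gradients update a
        shared transformation rather than only the rows for tokens seen in a batch.
        For inference the MLP can be baked into a plain embedding table, so there is
        no extra runtime cost. Illustrative sketch, not the lmplay implementation.
        """

        def __init__(self, vocab_size: int, dim: int, hidden: int | None = None):
            super().__init__()
            hidden = hidden or 4 * dim
            self.table = nn.Embedding(vocab_size, dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.GELU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            # Training path: raw lookup refined by the sacrificial network.
            return self.mlp(self.table(token_ids))

        @torch.no_grad()
        def bake(self) -> nn.Embedding:
            # Run every vocabulary row through the MLP once and freeze the result
            # into an ordinary embedding table for production inference.
            baked = nn.Embedding(self.table.num_embeddings, self.table.embedding_dim)
            baked.weight.copy_(self.mlp(self.table.weight))
            return baked

After training you'd call bake() once and swap the result in for the module, so the sacrificial network never runs at inference time.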