Hey HN! Chinchilla (DeepMind 2022) tells us that when we scale up language model training, we should scale parameters and training data in equal proportion.

Over the last several months I've been hacking on a research project to determine whether the optimal compute allocation (scaling law) for training an LLM is sensitive to training data complexity. I found that as data complexity increases, you need even more data than Chinchilla suggests!

I released the preprint just yesterday: https://arxiv.org/abs/2405.16684
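
For anyone who wants the Chinchilla rule in concrete terms, here's a rough sketch (not from my preprint, just the standard C ≈ 6·N·D FLOPs approximation plus the widely cited ~20 tokens-per-parameter heuristic) of how compute-optimal parameters and tokens both grow like the square root of compute:

    # Rough Chinchilla-style allocation sketch.
    # Assumptions: training compute C ~ 6 * N * D FLOPs, and a
    # compute-optimal ratio of ~20 tokens per parameter (D = 20 * N).
    # Then C = 120 * N^2, so N = sqrt(C / 120) and D = 20 * N,
    # i.e. both N and D scale as sqrt(C).
    import math

    def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
        params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
        tokens = tokens_per_param * params
        return params, tokens

    # e.g. a 1e21 FLOP training budget
    n, d = chinchilla_optimal(1e21)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")

Under these assumptions parameters and tokens scale equally with compute; the question the preprint looks at is whether that balance still holds once you account for how complex the training data is.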