In all the debate about open source, I don't think people realize that this model is most likely not reproducible ever again, even given the code. Here's what you would need to reproduce it:

1. An exact snapshot of the data used. Many companies don't have this; they have rough dataset versions, but if even one token is different, the resulting model won't be the same.

2. The data must be fed to the training algorithm in exactly the same order as it was originally, so every data loader needs a fixed random seed.

3. All the probabilistic parts of the model need a fixed random seed. I'm thinking of things like dropout, and for autoregressive models you might be sampling your previous output, so you have to make sure all of that is properly seeded. You do generally see fixed seeds in academic papers, but it's easy to miss something, especially in distributed training jobs (see the sketch at the end of this comment).

4. Here's another interesting one: you start your training job on 1000 GPUs and suddenly 4 of them fail. What do you do? There might be deterministic ways to handle this, but the standard approach is to discard the updates those GPUs were going to contribute and restart them from scratch. You can see why this is a problem: to reproduce the training run, you would need those same GPUs to fail at the same points in the new run.

I suspect there are even more things I haven't thought of that make this model unique and irreproducible by retraining, almost like a human brain.

In fact, the notion of exact reproducibility in the world of LLMs is silly; there is only approximate reproducibility (models with similar benchmark scores), nothing exact. That said, I can see the value of releasing source code, but I'm completely fine with Grok not releasing it. Source code can reveal tricks a company discovered to improve its model that haven't been published in papers yet. Seeing Grok's performance, I'm pretty confident there aren't any great tricks to be found in their code, so I don't really care. I would be pretty curious about OpenAI's or Anthropic's source code, though.
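To make points 2 and 3 concrete, here is roughly what that seeding looks like, as a minimal sketch assuming a PyTorch setup (the original comment doesn't name a framework; the seed value and helper names here are just illustrative):

    import os
    import random

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    SEED = 1234  # illustrative value; any fixed seed works, it just has to be recorded


    def seed_everything(seed: int) -> None:
        """Pin every global source of randomness we know about (point 3)."""
        random.seed(seed)                 # Python's own RNG
        np.random.seed(seed)              # NumPy (shuffling/augmentation helpers)
        torch.manual_seed(seed)           # CPU and CUDA RNGs (dropout, init, sampling)
        torch.cuda.manual_seed_all(seed)  # all GPUs on this node
        # Ask PyTorch for deterministic kernels where it supports them; some ops
        # slow down or warn, and a few cuBLAS ops also need
        # CUBLAS_WORKSPACE_CONFIG=":4096:8" set in the environment.
        torch.use_deterministic_algorithms(True, warn_only=True)
        torch.backends.cudnn.benchmark = False


    def seed_worker(worker_id: int) -> None:
        """Give each DataLoader worker process a derived, reproducible seed."""
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)
        random.seed(worker_seed)


    seed_everything(SEED)

    # Point 2: the shuffle order has to come from a seeded generator, otherwise
    # batches arrive in a different order on every run and the weights diverge.
    dataset = TensorDataset(torch.randn(1000, 16))  # stand-in for the real corpus
    loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=True,
        num_workers=4,
        worker_init_fn=seed_worker,
        generator=torch.Generator().manual_seed(SEED),
    )

Even with all of this pinned down, nondeterministic GPU kernels, different hardware, and the failure/restart scheduling in point 4 can still break bitwise reproduction across runs, which is exactly the comment's point.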