The author of the article appears to have misunderstood one important detail about Code Llama. They state:

> The Code Llama models were trained on 500B tokens, whereas Llama 2 models were trained on 2T tokens. Since the Code Llama model was trained on 4x fewer tokens, maybe a CodeLlama 70B version did not perform well enough due to LLM scaling laws—there was not enough training data.

But if you read the paper, it says on page 1:

> Our approach is based on gradually specializing and increasing the capabilities of Llama 2 models by applying a cascade of training and fine-tuning steps [...]

In fact, the diagram at the top of page 3 details the process, starting with the Llama 2 foundation models: Llama 2 foundation models (7B, 13B, 34B) -> code training (500B tokens) -> Python / long-context fine-tuning. In other words, Code Llama is initialized from Llama 2 checkpoints that had already seen 2T tokens; the 500B code tokens are additional training on top of that, not the total training budget.

See the paper here: https://arxiv.org/abs/2308.12950
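To make the correction concrete, here is a minimal sketch (not the authors' code; the Checkpoint class and continue_training helper are hypothetical, for illustration only) that simply tracks the cumulative token budget through the cascade. The point is that the 500B code tokens are added on top of the ~2T tokens Llama 2 had already seen, rather than replacing them.

    from dataclasses import dataclass

    @dataclass
    class Checkpoint:
        name: str
        tokens_seen: float  # cumulative training tokens across all stages

    def continue_training(ckpt: Checkpoint, stage: str, extra_tokens: float) -> Checkpoint:
        # Hypothetical helper: each stage starts from the previous checkpoint's
        # weights, so its tokens are added to what the model has already seen.
        return Checkpoint(name=f"{ckpt.name} -> {stage}",
                          tokens_seen=ckpt.tokens_seen + extra_tokens)

    # Llama 2 foundation model: ~2T tokens of general pretraining.
    llama2 = Checkpoint("Llama 2", 2e12)

    # Code Llama: +500B tokens of code training on top of Llama 2.
    # (The Python and long-context stages in the paper continue this cascade
    # with further tokens on top of this checkpoint.)
    code_llama = continue_training(llama2, "Code Llama (code, 500B)", 5e11)

    for ckpt in (llama2, code_llama):
        print(f"{ckpt.name}: {ckpt.tokens_seen / 1e12:.2f}T tokens seen")

Running this prints ~2.00T for Llama 2 and ~2.50T for Code Llama, which is why the "trained on 4x fewer tokens" framing in the article doesn't hold.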