The author of the article appears to have misunderstood one important detail about Code Llama.

They state:

> The Code Llama models were trained on 500B tokens, whereas Llama 2 models were trained on 2T tokens. Since the Code Llama model was trained on 4x fewer tokens, maybe a CodeLlama 70B version did not perform well enough due to LLM scaling laws—there was not enough training data.

But if you read the paper, on page 1, it says:

> Our approach is based on gradually specializing and increasing the capabilities of Llama 2 models by applying a cascade of training and fine-tuning steps [...]

In fact, they show a diagram at the top of page 3 that details the process, starting from the Llama 2 foundation models:

Llama 2 foundation models (7B, 13B, 34B) -> code training on 500B tokens -> Python / long-context fine-tuning.

See the paper here:
https://arxiv.org/abs/2308.12950
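To make the distinction concrete: the Code Llama models are continued pretraining of existing Llama 2 checkpoints on code, not models trained from scratch on only 500B tokens. A minimal sketch of what that looks like with Hugging Face transformers (the dataset file and hyperparameters below are placeholders, and none of this is Meta's actual pipeline):

    # Continued pretraining: start from a Llama 2 checkpoint, keep training on code.
    # Placeholder corpus and hyperparameters; illustrative only.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "meta-llama/Llama-2-7b-hf"          # Llama 2 foundation weights, not a random init
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Hypothetical code corpus in JSONL form with a "text" field.
    code = load_dataset("json", data_files="code_corpus.jsonl", split="train")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=4096)

    tokenized = code.map(tokenize, batched=True, remove_columns=code.column_names)

    args = TrainingArguments(
        output_dir="llama2-code-continued",
        per_device_train_batch_size=4,
        learning_rate=3e-4,
        max_steps=10_000,          # the real run covers ~500B tokens; this is a toy setting
        bf16=True,
    )

    Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()

So the 500B code tokens sit on top of the 2T tokens the base model already saw, which is why the scaling-law argument in the article doesn't hold as stated.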
> GPT-3.5 has 175B parameters versus 70B parameters in Llama 2

We know that for the original version of GPT-3.5, but my assumption was that Turbo is a distilled, smaller model (which would explain why it uses OpenAI's new vocabulary and is so much faster).

If that's not the case, what would explain it being faster?
I'm not sure if I'm the only one, but I find the StarCoder model to be muuuuch better than quantized Code Llama 34B. I can't seem to find any good coding benchmarks online comparing the two.

Anyone else having a similar experience?
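One rough way to compare them yourself is to run both through HumanEval with OpenAI's human-eval harness. A sketch along these lines (unquantized HF checkpoints, arbitrary greedy-decoding settings, so not a substitute for a proper benchmark):

    # Generate HumanEval completions from each model, then score each file with
    # `evaluate_functional_correctness <name>_samples.jsonl` from the human-eval package.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from human_eval.data import read_problems, write_jsonl   # pip install human-eval

    MODELS = {
        "starcoder": "bigcode/starcoder",
        "codellama-34b": "codellama/CodeLlama-34b-hf",
    }

    problems = read_problems()

    for name, repo in MODELS.items():
        tok = AutoTokenizer.from_pretrained(repo)
        model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16, device_map="auto")
        samples = []
        for task_id, problem in problems.items():
            inputs = tok(problem["prompt"], return_tensors="pt").to(model.device)
            out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
            # Keep only the newly generated tokens as the completion.
            completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            samples.append({"task_id": task_id, "completion": completion})
        write_jsonl(f"{name}_samples.jsonl", samples)

It won't reflect the quantized setup, but at least it gives a like-for-like pass@1 number on the same prompts.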