This post is misleading, in a way that is hard to do accidentally.<p><pre><code> - They compare the performance of this model to the weakest 7B Code Llama variant. The base Code Llama 7B Python model scores 38.4% on HumanEval, versus the non-Python model, which only scores 33%.
 - They compare their instruct-tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to HumanEval performance. For example, WizardLM 7B scores 55% on HumanEval [1], and I've trained a 7B model that scores 62% [2].
 - For another example of the effect of instruction tuning: the StableCode instruct-tuned model benchmarks at 26%, not the 20% they cite for the base model [3]
 - StarCoder, when prompted properly, scores 40% on HumanEval [4] (see the sketch after this list for what "prompted properly" typically means)
- They do not report their base model performance (as far as I can tell)
</code></pre>
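For context on where these HumanEval numbers come from: they are typically pass@1 scores over 164 Python problems, usually produced with something like OpenAI's human-eval harness, and the prompt format matters a great deal. Below is a minimal sketch of that loop; the checkpoint name, the generate() helper, and the instruction template are illustrative assumptions, not the setup used in the post or by any of the models above.<p><pre><code># Sketch of HumanEval sampling with the openai/human-eval harness.
# Assumptions: the model checkpoint, generation settings, and the instruct
# template below are illustrative, not taken from the post.
from human_eval.data import read_problems, write_jsonl
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-7b-code-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate(prompt, max_new_tokens=384):
    # Greedy decoding for simplicity; reported scores are usually pass@1.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the continuation, which is what HumanEval expects as the completion.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def base_prompt(problem):
    # Base models are scored on the raw signature + docstring.
    return problem["prompt"]

def instruct_prompt(problem):
    # Generic Alpaca-style wrapper; every instruct model has its own template,
    # and evaluating without the right template can cost many points.
    return ("Below is an instruction that describes a task.\n\n"
            "### Instruction:\nComplete the following Python function.\n\n"
            + problem["prompt"] + "\n\n### Response:\n")

problems = read_problems()
samples = [dict(task_id=task_id,
                completion=generate(base_prompt(problems[task_id])))
           for task_id in problems]
write_jsonl("samples.jsonl", samples)
# Score the output with:  evaluate_functional_correctness samples.jsonl
</code></pre>Swapping base_prompt for instruct_prompt (plus stripping the echoed signature from the response) is the kind of difference that moves these benchmarks by double digits, which is why comparing an instruct-tuned model against base-model numbers is apples to oranges.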
This is interesting work, and a good contribution, but it's important to compare similar models.<p>[1] <a href="https://github.com/nlpxucan/WizardLM">https://github.com/nlpxucan/WizardLM</a><p>[2] <a href="https://huggingface.co/vikp/llama_coder" rel="nofollow noreferrer">https://huggingface.co/vikp/llama_coder</a><p>[3] <a href="https://stability.ai/blog/stablecode-llm-generative-ai-coding" rel="nofollow noreferrer">https://stability.ai/blog/stablecode-llm-generative-ai-coding</a><p>[4] <a href="https://github.com/huggingface/blog/blob/main/starcoder.md">https://github.com/huggingface/blog/blob/main/starcoder.md</a>