We have a use case for code conversion (think SQL to Python) and are looking at other models we can use and fine-tune. Does anyone know if there are LLM leaderboards specific to code generation? We won't be using it for generic stuff like writing essays, etc.
You're looking for "HumanEval" tests. Not saying this is the best way to test it, but it's the only standard test I know of that code models are compared with and are commonly benchmarked for<p>The current best models you'd want to try that I'm aware of is WizardCoder(15B), Starcoder(15B), and replit's code model(3B). Replit's instruct model is interesting because of it's competitive performance while only being a 3B model so it's the easiest/fastest to use.<p>Perhaps interestingly none of these are based on LLama<p><a href="https://github.com/abacaj/code-eval">https://github.com/abacaj/code-eval</a> - This is a large mostly up to date list of benchmarks<p><a href="https://huggingface.co/WizardLM/WizardCoder-15B-V1.0" rel="nofollow noreferrer">https://huggingface.co/WizardLM/WizardCoder-15B-V1.0</a> - has a chart with a mostly up to date comparison<p>EDIT: License-wise I think you might be able to commercially use Replit's model and Starcoder, I don't think you're allowed to use WizardCoder outside of academic work.
> and are looking at other models we can use and fine tune.<p>This seems to be a common misconception in the industry - fine-tuning a model will almost certainly lower the quality of your response in these situations. You <i>do not</i> want to waste your time fine-tuning on low-quality code that will not exist in-context.<p>You are going to be stuck with off-the-shelf commercially licensed models for now, which will be effectively useless on codebases that extend beyond their fairly limited context (8k tokens, <1k SLOC). It is very likely that the tool you're searching for simply isn't ready yet.