Ask HN: LLM Leaderboard for Code Generation?

1 point by palidanx, almost 2 years ago
We have a use case for code conversion (think SQL to Python), and are looking at other models we can use and fine-tune. Does anyone know if there is an LLM leaderboard specific to code generation? We won't be using the model for generic stuff like writing essays, etc.

2 comments

thewataccount, almost 2 years ago
You're looking for "HumanEval" tests. I'm not saying this is the best way to test it, but it's the only standard test I know of that code models are commonly benchmarked on and compared with.

The current best models you'd want to try that I'm aware of are WizardCoder (15B), StarCoder (15B), and Replit's code model (3B). Replit's instruct model is interesting because of its competitive performance while only being a 3B model, so it's the easiest/fastest to use.

Perhaps interestingly, none of these are based on LLaMA.

https://github.com/abacaj/code-eval - This is a large, mostly up-to-date list of benchmarks.

https://huggingface.co/WizardLM/WizardCoder-15B-V1.0 - Has a chart with a mostly up-to-date comparison.

EDIT: License-wise, I think you might be able to use Replit's model and StarCoder commercially; I don't think you're allowed to use WizardCoder outside of academic work.
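
For readers unfamiliar with how HumanEval-style results are reported: scores are usually given as pass@k, the probability that at least one of k sampled completions passes a problem's unit tests. Below is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. The function names and the sample counts are illustrative assumptions, not something taken from this thread or from any particular leaderboard's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that passed the unit tests
    k: the k in pass@k (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples failed, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical (n, c) counts for three problems, purely for illustration.
    hypothetical_results = [(10, 5), (10, 0), (10, 10)]
    print(mean_pass_at_k(hypothetical_results, k=1))
```

For a code-conversion use case like SQL to Python, the same scoring would only be meaningful if each problem comes with tests that check the converted code's behavior, which is an assumption about how such a benchmark would be set up.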

smoldesu, almost 2 years ago
> and are looking at other models we can use and fine tune.

This seems to be a common misconception in the industry: fine-tuning a model will almost certainly lower the quality of your responses in these situations. You *do not* want to waste your time fine-tuning on low-quality code that will not exist in-context.

You are going to be stuck with off-the-shelf commercially licensed models for now, which will be effectively useless on codebases that extend beyond their fairly limited context (8k tokens, <1k SLOC). It is very likely that the tool you're searching for simply isn't ready yet.