科技回声

1 comment

GavCo, about 1 year ago

It's interesting that with direct prompting there's only a 1 point difference between GPT-4o and GPT-4 turbo, but with the AlphaCodium flow it becomes a substantial 6 point difference.

AlphaCodium works by decomposing a competitive programming problem into simple steps and has an automated flow that uses the LLM for each step. It's iterative, so compilation errors and test results are fed back into the model and the model can fix mistakes.

IMO this is a much more useful benchmark than a typical eval because it reflects how LLMs are actually used in the real world. It seems like it also surfaces subtle differences in reasoning abilities that zero-shot evals don't capture.
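To make the "errors and test results are fed back" part concrete, here is a rough sketch of that kind of loop. It is not AlphaCodium's actual code: `iterative_flow`, `run_tests`, and `fake_model` are hypothetical names, the model is a stub standing in for a real LLM call, and the real flow has more stages than this single generate-and-repair cycle.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[str, str]  # (input text, expected output text)


def run_tests(solve: Callable[[str], str], tests: List[TestCase]) -> List[str]:
    """Run a candidate on every test and return human-readable failures."""
    failures = []
    for given, expected in tests:
        try:
            actual = solve(given)
        except Exception as exc:  # runtime error inside the candidate
            failures.append(f"input={given!r}: raised {exc!r}")
            continue
        if actual != expected:
            failures.append(
                f"input={given!r}: expected {expected!r}, got {actual!r}"
            )
    return failures


def iterative_flow(
    generate: Callable[[List[str]], str],
    tests: List[TestCase],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Ask the model for code, test it, and feed failures back until it passes.

    `generate` is a stand-in for the LLM: it receives the feedback from the
    previous round (empty on the first round) and returns Python source that
    is assumed to define `solve(s: str) -> str`.
    """
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        source = generate(feedback)
        try:
            namespace: dict = {}
            # Never exec untrusted model output outside a sandbox in practice.
            exec(compile(source, "<candidate>", "exec"), namespace)
            solve = namespace["solve"]
        except Exception as exc:  # syntax error or missing `solve`
            feedback = [f"compile error: {exc!r}"]
            continue
        feedback = run_tests(solve, tests)
        if not feedback:
            return source, round_no  # all tests passed
    return None, max_rounds  # gave up


def fake_model(feedback: List[str]) -> str:
    """Stub model: wrong on the first try, fixed once it sees any feedback."""
    if not feedback:
        return "def solve(s):\n    return s"
    return "def solve(s):\n    return str(sum(int(x) for x in s.split()))"


TESTS: List[TestCase] = [("1 2", "3"), ("10 20 30", "60")]
```

With the stub above, `iterative_flow(fake_model, TESTS)` fails the first round, passes the failure messages back, and accepts the second candidate. A zero-shot eval corresponds to `max_rounds=1`, which is why the two setups can rank models differently.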


GPT-4o hit 54% accuracy on CodeContests with AlphaCodium, up from 48% for GPT-4T
