It's interesting that with direct prompting there's only a 1-point difference between GPT-4o and GPT-4 Turbo, but with the AlphaCodium flow it becomes a substantial 6-point difference.

AlphaCodium works by decomposing a competitive programming problem into simpler steps, with an automated flow that uses the LLM at each step. It's iterative: compilation errors and test results are fed back into the model so it can fix its own mistakes (roughly the loop sketched below).

IMO this is a much more useful benchmark than a typical eval because it reflects how LLMs are actually used in the real world. It also seems to surface subtle differences in reasoning ability that zero-shot evals don't capture.
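For concreteness, here's a rough sketch of that kind of generate/test/repair loop. This is not AlphaCodium's actual code; `call_llm`, the prompt strings, and the test format are hypothetical stand-ins for whatever model API and harness you'd actually use.

```python
# Rough sketch of an iterative generate/test/repair loop in the spirit of
# AlphaCodium -- not its actual implementation. call_llm and the prompt
# strings are hypothetical placeholders.
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TestResult:
    passed: bool
    feedback: str  # compiler/runtime errors or failing-test details


def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call to whatever model is being evaluated."""
    raise NotImplementedError


def run_tests(code: str, tests: list[tuple[str, str]]) -> TestResult:
    """Run the candidate program against (stdin, expected stdout) pairs."""
    for stdin_data, expected in tests:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_data, capture_output=True, text=True, timeout=10,
        )
        if proc.returncode != 0:
            return TestResult(False, proc.stderr)  # feed the traceback back to the model
        if proc.stdout.strip() != expected.strip():
            return TestResult(
                False,
                f"input {stdin_data!r}: expected {expected!r}, got {proc.stdout!r}",
            )
    return TestResult(True, "all tests passed")


def solve(problem: str, tests: list[tuple[str, str]], max_iters: int = 5) -> str:
    # Decompose: have the model reason about the problem before writing code.
    analysis = call_llm(f"Restate this problem and outline an approach:\n{problem}")
    code = call_llm(f"{analysis}\n\nWrite a Python program solving:\n{problem}")

    # Iterate: feed errors and failing tests back until the tests pass.
    for _ in range(max_iters):
        result = run_tests(code, tests)
        if result.passed:
            break
        code = call_llm(
            f"Problem:\n{problem}\n\nCurrent code:\n{code}\n\n"
            f"It fails with:\n{result.feedback}\n\nReturn a fixed program."
        )
    return code
```

The real flow has more stages than this (problem reflection, self-generated tests, ranking candidate solutions), but the feed-the-errors-back loop is the core of what separates it from single-shot prompting.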