Not sure if this is even possible, but is there any site or benchmark for testing which AI model is best at coding?

Like Claude 3.5 vs GPT-4o vs Gemini 2, etc.

What exists beyond our opinions to measure the quality of these models' code output more objectively?
The two that I know of are SWE-bench and CodeElo. SWE-bench is oriented towards "real world" performance (resolution of GitHub issues), and CodeElo is oriented towards competitive programming (CodeForces).

https://www.swebench.com/

https://codeelo-bench.github.io/
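For a rough sense of what the SWE-bench leaderboard number means: the headline score is just the fraction of benchmark issues where the model's generated patch made the repo's tests pass ("resolved"). Here's a minimal sketch of that calculation; the file name and JSON schema are assumptions for illustration, not the official harness output format.

    import json

    # Toy illustration of the SWE-bench headline metric: percent of issues
    # whose generated patch "resolved" the issue (i.e. made the tests pass).
    # "results.json" and its schema are assumptions, not the real harness format.
    with open("results.json") as f:
        results = json.load(f)  # e.g. {"django__django-12345": true, ...}

    resolved = sum(1 for ok in results.values() if ok)
    print(f"resolved {resolved}/{len(results)} = {100 * resolved / len(results):.1f}%")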
As far as I know, we don't have any way to objectively measure or compare "quality" or "coding performance" or "best" when looking at code produced by human programmers.

You may find this useful:

https://www.gitclear.com/coding_on_copilot_data_shows_ais_downward_pressure_on_code_quality

Or this analysis, if you don't want to sign up to download that white paper:

https://arc.dev/talent-blog/impact-of-ai-on-code/