I recently conducted a detailed benchmark of various LLMs for AI code review, specifically focusing on Pull Request analysis. The results were quite surprising and contradict some recent marketing claims.<p>Test setup:<p><pre><code> Models tested: Mistral-Large-2411, Gemini 2.0 Flash (thinking-exp-1219), Mistral-Nemo-12B, DeepSeek V3, and a ReAct agent implementation (also based on Mistral-Nemo-12B)
Consistent testing environment: Same prompts, temperature settings, and max_tokens (8192 for DeepSeek)
Test data: Real-world PRs from various open-source projects (all in English)
Evaluation: Results were assessed by Claude 3.5 Sonnet V2 for consistency
</code></pre>
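For concreteness, the test harness is shaped roughly like the sketch below: every model is called with the same review prompt, temperature, and max_tokens, and a separate judge model scores each review against one rubric. This is a minimal illustration, not LlamaPReview's actual code; the base URLs, model IDs, rubric wording, and the judge endpoint are placeholders/assumptions.<p><pre><code> from openai import OpenAI

# One fixed review prompt for every model (the real prompt is longer).
REVIEW_PROMPT = "Review the following pull request diff and list the issues you find:\n\n{diff}"

# name -> (OpenAI-compatible base_url, model id); these are placeholders.
MODELS = {
    "deepseek-v3":      ("https://api.deepseek.example/v1", "deepseek-chat"),
    "mistral-large":    ("https://api.mistral.example/v1",  "mistral-large-2411"),
    "mistral-nemo-12b": ("https://api.mistral.example/v1",  "open-mistral-nemo"),
}

def get_review(base_url: str, model: str, diff: str, api_key: str) -> str:
    # Same temperature and max_tokens for every model, so differences
    # in the output come from the model, not the call parameters.
    client = OpenAI(base_url=base_url, api_key=api_key)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
        temperature=0.1,
        max_tokens=8192,
    )
    return resp.choices[0].message.content

def judge_review(judge_client: OpenAI, judge_model: str, diff: str, review: str) -> str:
    # LLM-as-judge step: a separate model (Claude 3.5 Sonnet in my runs,
    # shown here behind an assumed OpenAI-compatible endpoint) scores each
    # review against the same rubric.
    rubric = ("Score this PR review from 1 to 10 for correctness, depth, "
              "and adherence to the requested output format. Explain briefly.\n\n"
              "Diff:\n" + diff + "\n\nReview:\n" + review)
    resp = judge_client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": rubric}],
        temperature=0.0,
        max_tokens=1024,
    )
    return resp.choices[0].message.content
</code></pre>
The point is that only the model id and endpoint change between runs; the prompt, sampling parameters, and judging rubric stay fixed, so score differences come from the models themselves.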
Results (in order of performance):<p><pre><code> 1. Gemini 2.0 Flash (thinking-exp-1219) - the most in-depth code reviews, but its output did not follow the required format as reliably as the Mistral models
2. Mistral-Large-2411
3. Mistral-Nemo-12B + ReAct AI Agent
4. DeepSeek V3
5. Mistral-Nemo-12B
</code></pre>
Key findings:<p><pre><code> - Despite recent marketing claims, DeepSeek V3 only marginally outperformed Mistral-Nemo-12B, a 12B model released back in July
 - DeepSeek V3's price-performance ratio is concerning, especially after its February 8th pricing changes
 - The much larger parameter count (671B) didn't translate into better PR review quality
</code></pre>
For transparency: I developed LlamaPReview (<https://jetxu-llm.github.io/LlamaPReview-site/>), a GitHub App for automated PR reviews, and used its core code as the testing framework. The app is free, so you can reproduce these PR reviews yourself.<p>Questions for the community:<p><pre><code> 1. Has anyone else noticed similar performance gaps with DeepSeek V3?
2. What metrics should we standardize for comparing LLM performance in specific tasks like code review?
3. How much should marketing claims influence our technical evaluations?
</code></pre>
Would love to hear your experiences and thoughts, especially from those who've tested multiple models in production environments.