
Ask HN: DeepSeek V3's AI Code Review Performance – A Reality Check with Data

2 points | by Jet_Xu | 5 months ago
I recently conducted a detailed benchmark of various LLMs for AI code review, specifically focusing on Pull Request analysis. The results were quite surprising and contradict some recent marketing claims.

Test setup:

- Models tested: Mistral-Large-2411, Gemini 2.0 Flash (thinking-exp-1219), Mistral-Nemo-12B, DeepSeek V3, and a ReAct AI agent implementation (also based on Mistral-Nemo-12B)
- Consistent testing environment: the same prompts, temperature settings, and max_tokens (8192 for DeepSeek) for every model (see the harness sketch below)
- Test data: real-world PRs from various open-source projects (all in English)
- Evaluation: results were assessed by Claude 3.5 Sonnet V2 for consistency

Results (in order of performance):

1. Gemini 2.0 Flash (thinking-exp-1219): the most in-depth code reviews, but its output format did not meet the requirements as reliably as the Mistral models
2. Mistral-Large-2411
3. Mistral-Nemo-12B + ReAct AI agent
4. DeepSeek V3
5. Mistral-Nemo-12B

Key findings:

- Despite recent marketing claims, DeepSeek V3 only marginally outperformed a 12B model released in July
- The price-performance ratio is concerning, especially after the February 8th pricing changes
- The much larger parameter count (671B) did not translate into better PR review quality

For transparency: I developed LlamaPReview (https://jetxu-llm.github.io/LlamaPReview-site/), a GitHub App for automated PR reviews, whose core code I used as the testing framework. The app is free and can help you reproduce these PR reviews.

Questions for the community:

1. Has anyone else noticed similar performance gaps with DeepSeek V3?
2. What metrics should we standardize for comparing LLM performance on specific tasks like code review?
3. How much should marketing claims influence our technical evaluations?

I would love to hear your experiences and thoughts, especially from those who have tested multiple models in production environments.
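For anyone who wants to reproduce a setup along these lines, here is a minimal sketch of what such a harness could look like. It is not the LlamaPReview code; it assumes each provider exposes an OpenAI-compatible chat-completions endpoint, and the base URLs, model IDs, prompts, scoring rubric, and file names below are placeholders.

    # Minimal benchmark-harness sketch (assumptions: OpenAI-compatible endpoints,
    # placeholder base URLs / model IDs / prompts -- not the author's actual code).
    from openai import OpenAI

    MODELS = {
        # name -> (base_url, model_id); adjust per provider
        "deepseek-v3":      ("https://api.deepseek.com", "deepseek-chat"),
        "mistral-large":    ("https://api.mistral.ai/v1", "mistral-large-2411"),
        "mistral-nemo-12b": ("https://api.mistral.ai/v1", "open-mistral-nemo"),
    }

    REVIEW_PROMPT = "You are a senior engineer. Review the following pull request diff:\n\n{diff}"
    JUDGE_PROMPT = ("Score the following PR review from 1-10 for correctness, depth, "
                    "and adherence to the requested format. Reply with the number only.\n\n{review}")

    def review_pr(base_url: str, model_id: str, api_key: str, diff: str) -> str:
        """Ask one candidate model to review a PR diff under identical settings."""
        client = OpenAI(base_url=base_url, api_key=api_key)
        resp = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": REVIEW_PROMPT.format(diff=diff)}],
            temperature=0.2,   # same temperature for every candidate model
            max_tokens=8192,   # matches the max_tokens mentioned for DeepSeek
        )
        return resp.choices[0].message.content

    def judge(review: str, judge_client: OpenAI, judge_model: str) -> str:
        """Have a single judge model score every review, for consistency."""
        resp = judge_client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(review=review)}],
            temperature=0.0,
            max_tokens=10,
        )
        return resp.choices[0].message.content.strip()

    if __name__ == "__main__":
        diff = open("sample_pr.diff").read()   # one real-world PR diff per run
        judge_client = OpenAI()                # judge endpoint/key taken from env vars
        for name, (base_url, model_id) in MODELS.items():
            review = review_pr(base_url, model_id, api_key="<provider-key>", diff=diff)
            print(name, judge(review, judge_client, judge_model="<judge-model-id>"))

Holding the prompt, temperature, and max_tokens constant across candidates, and routing every review through a single judge model, is what keeps the comparison apples-to-apples.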

1 comment

homarp | 5 months ago
> Evaluation: Results were assessed by Claude 3.5 Sonnet V2 for consistency

Why not do a human assessment on top as well, to ensure that Claude's assessment is correct?

> conducted a detailed benchmark

I suggest you post a sample so others can try to reproduce it.