科技回声


Comparing GPT-4.1 against other models in "did code change cause this incident"

1 point · by lawrjone · 29 days ago
We've been testing GPT-4.1 in our investigation system, which is used to triage and debug production incidents.

I thought it would be useful to share, as we have evaluation metrics and scorecards for investigations, so you can see how real-world performance compares between models.

I've written the post on LinkedIn so I could share a picture of the scorecards and how they compare:

https://www.linkedin.com/posts/lawrence2jones_like-many-others-we-were-excited-about-openai-activity-7317907307634323457-FdL7

Our takeaways were:

- 4.1 is much fussier than Sonnet 3.7 at claiming a code change caused an incident, leading to a 38% drop in recall
- When 4.1 does suggest a PR caused an incident, it's right 33% more often than Sonnet 3.7
- 4.1 blows 4o out of the water, with 4o finding just 3/31 of the code changes in our dataset, showing how much of an upgrade 4.1 is on this task

In short, 4.1 is a totally different beast to 4o when it comes to software tasks, and at a much lower price point than Sonnet 3.7, we'll be considering it carefully across our agents.

We are also yet to find a metric where 4.1 is worse than 4o, so at minimum this release means >20% cost savings for us.

Hopefully useful to people!
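The precision/recall trade-off described above (a fussier model claims fewer causal PRs, lowering recall but raising precision) can be made concrete with a small scoring helper. This is a hedged sketch with made-up incident and PR IDs; the author's actual scorecard system is not public, and the data here is purely illustrative.

```python
# Sketch: scoring "did a code change cause this incident?" predictions.
# All incident/PR IDs below are hypothetical, not from the post's dataset.

def score(predictions, labels):
    """predictions/labels: dicts mapping incident id -> causal PR id, or None
    if no code change is believed to have caused the incident."""
    tp = sum(1 for inc, pr in predictions.items()
             if pr is not None and labels.get(inc) == pr)   # correct attribution
    fp = sum(1 for inc, pr in predictions.items()
             if pr is not None and labels.get(inc) != pr)   # wrong or spurious PR
    fn = sum(1 for inc, pr in labels.items()
             if pr is not None and predictions.get(inc) != pr)  # missed causal PR
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

labels = {"inc-1": "pr-10", "inc-2": "pr-20", "inc-3": None}
# A "fussy" model stays quiet on inc-2: precision stays perfect, recall drops.
preds = {"inc-1": "pr-10", "inc-2": None, "inc-3": None}
precision, recall = score(preds, labels)
print(precision, recall)  # 1.0 0.5
```

A model that guessed a PR for every incident would trade the other way: recall could rise while precision falls, which is the axis the 4.1-vs-Sonnet comparison above is measuring.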

No comments yet