
Comparing GPT-4.1 against other models in "did code change cause this incident"

1 point by lawrjone about 1 month ago
We've been testing GPT-4.1 in our investigation system, which is used to triage and debug production incidents.

I thought it would be useful to share, as we have evaluation metrics and scorecards for investigations, so you can see how real-world performance compares between models.

I've written the post on LinkedIn so I could share a picture of the scorecards and how they compare:

https://www.linkedin.com/posts/lawrence2jones_like-many-others-we-were-excited-about-openai-activity-7317907307634323457-FdL7

Our takeaways were:

- 4.1 is much fussier than Sonnet 3.7 about claiming a code change caused an incident, leading to a 38% drop in recall

- When 4.1 does suggest a PR caused an incident, it's right 33% more often than Sonnet 3.7

- 4.1 blows 4o out of the water: 4o found just 3/31 of the code changes in our dataset, showing how much of an upgrade 4.1 is on this task

In short, 4.1 is a totally different beast from 4o when it comes to software tasks, and at a much lower price point than Sonnet 3.7 we'll be considering it carefully across our agents.

We have also yet to find a metric where 4.1 is worse than 4o, so at minimum this release means >20% cost savings for us.

Hopefully useful to people!
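To make the precision/recall trade-off in the takeaways concrete, here is a minimal sketch of how a scorecard metric for the "did this code change cause the incident" task could be computed. The Case fields, score function, and example data are hypothetical illustrations, not the author's actual evaluation harness:

```python
# Hypothetical sketch (Python 3.10+): scoring model predictions on
# "which PR caused this incident". Field names and data are assumptions,
# not the author's real system.
from dataclasses import dataclass


@dataclass
class Case:
    culprit_pr: str | None     # ground-truth PR that caused the incident, if any
    predicted_pr: str | None   # PR the model blamed, or None if it declined


def score(cases: list[Case]) -> tuple[float, float]:
    """Return (precision, recall) over the evaluation dataset."""
    true_pos = sum(
        1 for c in cases
        if c.predicted_pr is not None and c.predicted_pr == c.culprit_pr
    )
    predicted = sum(1 for c in cases if c.predicted_pr is not None)
    actual = sum(1 for c in cases if c.culprit_pr is not None)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# A "fussier" model declines to blame more often, so it makes fewer
# predictions: recall falls, but the predictions it does make are more
# often right -- the GPT-4.1 vs. Sonnet 3.7 trade-off described above.
cases = [
    Case(culprit_pr="pr-101", predicted_pr="pr-101"),  # correct blame
    Case(culprit_pr="pr-202", predicted_pr=None),      # declined to blame
    Case(culprit_pr=None,     predicted_pr=None),      # correctly silent
]
print(score(cases))  # (1.0, 0.5): perfect precision, half the recall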

no comments