科技回声


Comparing GPT-4.1 against other models in "did code change cause this incident"

1 point · by lawrjone · 29 days ago
We've been testing GPT-4.1 in our investigation system, which is used to triage and debug production incidents.

I thought it would be useful to share, as we have evaluation metrics and scorecards for investigations, so you can see how real-world performance compares between models.

I've written the post on LinkedIn so I could share a picture of the scorecards and how they compare:

https://www.linkedin.com/posts/lawrence2jones_like-many-others-we-were-excited-about-openai-activity-7317907307634323457-FdL7

Our takeaways were:

- 4.1 is much fussier than Sonnet 3.7 at claiming a code change caused an incident, leading to a 38% drop in recall
- When 4.1 does suggest a PR caused an incident, it's right 33% more often than Sonnet 3.7
- 4.1 blows 4o out of the water, with 4o finding just 3/31 of the code changes in our dataset, showing how much of an upgrade 4.1 is on this task

In short, 4.1 is a totally different beast to 4o when it comes to software tasks, and at a much lower price point than Sonnet 3.7, we'll be considering it carefully across our agents.

We are also yet to find a metric where 4.1 is worse than 4o, so at minimum this release means >20% cost savings for us.

Hopefully useful to people!
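The precision/recall trade-off described above (a fussier model claims fewer causal PRs, lowering recall but raising precision) can be made concrete with a small scoring helper. This is a hedged sketch with made-up incident and PR IDs; the author's actual scorecard system is not public, and the data here is purely illustrative.

```python
# Sketch: scoring "did a code change cause this incident?" predictions.
# All incident/PR IDs below are hypothetical, not from the post's dataset.

def score(predictions, labels):
    """predictions/labels: dicts mapping incident id -> causal PR id, or None
    if no code change is believed to have caused the incident."""
    tp = sum(1 for inc, pr in predictions.items()
             if pr is not None and labels.get(inc) == pr)   # correct attribution
    fp = sum(1 for inc, pr in predictions.items()
             if pr is not None and labels.get(inc) != pr)   # wrong or spurious PR
    fn = sum(1 for inc, pr in labels.items()
             if pr is not None and predictions.get(inc) != pr)  # missed causal PR
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

labels = {"inc-1": "pr-10", "inc-2": "pr-20", "inc-3": None}
# A "fussy" model stays quiet on inc-2: precision stays perfect, recall drops.
preds = {"inc-1": "pr-10", "inc-2": None, "inc-3": None}
precision, recall = score(preds, labels)
print(precision, recall)  # 1.0 0.5
```

A model that guessed a PR for every incident would trade the other way: recall could rise while precision falls, which is the axis the 4.1-vs-Sonnet comparison above is measuring.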

No comments yet