A statistical approach to model evaluations

66 points by RobinHirst11 6 months ago

4 comments

fnordpiglet 6 months ago
This does feel a bit like an undergrad introduction to statistical analysis, and it's surprising anyone felt the need to explain these things. But I also suspect most AI people out there nowadays have limited math skills, so maybe it's helpful?
Unlisted6446 6 months ago
All things considered, although I'm in favor of Anthropic's suggestions, I'm surprised that they're not recommending more (nominally) advanced statistical methods. I wonder if this is because more advanced methods don't have any benefits or if they don't want to overwhelm the ML community.

For one, they could consider using equivalence testing for comparing models, instead of significance testing. I'd be surprised if their significance tests were not significant given 10000 eval questions and I don't see why they couldn't ask the competing models 10000 eval questions?

My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume that they know what they're doing.
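
To make the equivalence-testing suggestion concrete, here is a minimal sketch of a two one-sided tests (TOST) procedure on per-question scores from two models answering the same eval questions. The function name `tost_paired`, the equivalence margin, and the toy scores are all assumptions made for illustration, and the normal approximation is only reasonable when there are many questions. It is not the method from the linked article.

```python
import math
from statistics import NormalDist, mean, stdev


def tost_paired(scores_a, scores_b, margin, alpha=0.05):
    """TOST equivalence test on paired per-question scores.

    Hypothetical helper: returns (p_value, equivalent). The two models are
    declared equivalent when the mean score difference is significantly
    inside (-margin, +margin). Uses a normal approximation.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    d = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)
    if se == 0:
        # Every question has the same difference, so there is no sampling noise.
        inside = abs(d) < margin
        return (0.0 if inside else 1.0), inside
    z = NormalDist()
    # One-sided test against H0: true difference <= -margin.
    p_lower = 1 - z.cdf((d + margin) / se)
    # One-sided test against H0: true difference >= +margin.
    p_upper = z.cdf((d - margin) / se)
    # Equivalence requires rejecting both one-sided nulls.
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Toy data (made up): 1 = correct, 0 = incorrect, on 1000 shared questions.
model_a = [1, 0] * 500
model_b = list(model_a)
model_b[0], model_b[1] = 0, 1  # the models disagree on only two questions

p_value, equivalent = tost_paired(model_a, model_b, margin=0.02)
print(p_value, equivalent)
```

Unlike a significance test, a large eval set makes it easier, not harder, to reach a conclusion here: more questions shrink the standard error, so a small true difference ends up clearly inside the margin. The margin itself is a judgment call that has to be fixed before looking at the data.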
ipunchghosts 6 months ago
I have been promoting this and saying it since at least 2018. You can see my publication record as evidence!!!

"Random seed xxx is all you need" was another demonstration of this need.

You actually want a Wilcoxon rank-sum test, as many metrics are not Gaussian, especially as they get to their limits!! I.e. accuracy of roughly 99 or 100! Then it becomes highly sub-Gaussian.
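
As a rough illustration of the rank-based comparison suggested above, the sketch below implements a two-sided Wilcoxon rank-sum (Mann-Whitney U) test with a normal approximation, using only the standard library. The function name `rank_sum_test` and the per-seed accuracies are hypothetical, and with this few runs the normal approximation is crude; an exact test, such as the one SciPy offers in `scipy.stats.mannwhitneyu`, would be the safer choice in practice.

```python
import math
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Hypothetical helper: returns (u_statistic_for_x, p_value). Tied values
    get average ranks and the variance is tie-corrected. No continuity
    correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                    # size of this group of tied values
        avg_rank = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t**3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        # All values are identical, so there is nothing to distinguish.
        return u, 1.0
    z = (u - mu) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p_value


# Made-up accuracies from five random seeds per model, close to the ceiling.
acc_a = [0.991, 0.993, 0.995, 0.997, 0.999]
acc_b = [0.981, 0.983, 0.985, 0.987, 0.989]

u_stat, p_value = rank_sum_test(acc_a, acc_b)
print(u_stat, p_value)
```

Because the test only uses ranks, it makes no assumption that accuracies are normally distributed, which is the property that matters when scores are squeezed against 100%.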
intended 6 months ago
Since when the heck did evals change what they referred to? Evals were what you did to check if the output of a model was correct. What happened?