This does feel a bit like an undergrad introduction to statistical analysis, and I'm surprised anyone felt the need to explain these things. But I also suspect most AI people out there nowadays have limited math skills, so maybe it's helpful?
All things considered, although I'm in favor of Anthropic's suggestions, I'm surprised that they're not recommending more (nominally) advanced statistical methods. I wonder if this is because more advanced methods don't have any benefits or because they don't want to overwhelm the ML community.<p>For one, they could consider using equivalence testing for comparing models, instead of significance testing. With 10,000 eval questions, even a trivially small difference between models is likely to come out as statistically significant, so I'd be surprised if their significance tests were not significant. And I don't see why they couldn't ask the competing models 10,000 eval questions.<p>My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume that they know what they're doing.
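For anyone unfamiliar with the equivalence-testing idea above, here is a minimal sketch of TOST (two one-sided tests) on paired per-question score differences, using a normal approximation. The function name `tost_paired`, the `margin` of 2 percentage points, and the toy data are all made up for illustration; nothing here comes from Anthropic's write-up, and a real analysis would also need to handle clustered questions.

```python
import math
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, margin, alpha=0.05):
    """TOST on paired differences (score_A - score_B per question).

    Hypothetical helper: uses a z-approximation, so it assumes a large
    number of independent questions. Returns (mean_diff, p_value, equivalent).
    """
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)
    if se == 0:
        raise ValueError("all differences are identical; z-statistic undefined")
    norm = NormalDist()
    # H0: true difference <= -margin; rejected when d_bar is well above -margin.
    p_lower = 1.0 - norm.cdf((d_bar + margin) / se)
    # H0: true difference >= +margin; rejected when d_bar is well below +margin.
    p_upper = norm.cdf((d_bar - margin) / se)
    # Equivalence is claimed only if BOTH one-sided nulls are rejected.
    p_value = max(p_lower, p_upper)
    return d_bar, p_value, p_value < alpha


# Toy data (assumed): 10,000 questions, A wins 1,020, B wins 1,000, rest tie.
diffs = [1] * 1020 + [-1] * 1000 + [0] * 7980
d_bar, p_value, equivalent = tost_paired(diffs, margin=0.02)
print(d_bar, p_value, equivalent)
```

The choice of `margin` is the hard part in practice: it encodes how small a difference you are willing to call "no meaningful difference", and that is a judgment call rather than something the test provides.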
I have been promoting this and saying it since at least 2018. You can see my publication record as evidence!!!<p>"Random seed xxx is all you need" was another demonstration of this need.<p>You actually want a Wilcoxon rank-sum test, as many metrics are not Gaussian, especially as they get close to their limits!! E.g. accuracy around 99% or 100%! Then it becomes highly sub-Gaussian.
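As a rough illustration of the Wilcoxon rank-sum test mentioned above, here is a stdlib-only sketch using the normal approximation with a tie correction (no continuity correction). The function name `rank_sum_test` and the per-seed accuracies are invented for the example; with only a handful of seeds the normal approximation is crude, and an exact test (for instance via `scipy.stats.mannwhitneyu`) would be the safer choice.

```python
import math
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test.

    Hypothetical helper: normal approximation with tie correction.
    Returns (U statistic for x, z, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    u = w - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        raise ValueError("all values are tied; test is undefined")
    z = (u - mu) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, z, p_value


# Made-up per-seed accuracies near the ceiling, where Gaussian assumptions fail.
model_a = [0.991, 0.993, 0.992, 0.995, 0.994]
model_b = [0.996, 0.997, 0.998, 0.999, 0.9965]
print(rank_sum_test(model_a, model_b))
```

Because the test only uses ranks, it does not care that the accuracies are squashed against 100%, which is the point being made about metrics near their limits.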