42 points by veryluckyxyz 12 months ago

1 comment

jerpint 12 months ago

One point they don’t seem to spend much time on is the difficulty of reproducing outputs from closed-source models. Setting temperature to 0 and fixing the seed isn’t always enough to get exactly the same results for a given prompt.


Lessons from the trenches on reproducible evaluation of language models
