42 points by veryluckyxyz 12 months ago

1 comment

jerpint 12 months ago

One point they don’t seem to spend much time on is the difficulty of reproducing outputs from closed-source models. Setting temperature to 0 and fixing a seed doesn’t always seem to be enough to get exactly the same results for a given prompt.
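A rough way to check this yourself is to send the same prompt several times and count how many distinct completions come back. This is only a sketch: `generate` is a hypothetical wrapper you would write around whatever API client you use, with temperature 0 and a fixed seed set inside it. It is not part of any real library.

```python
from collections import Counter
from typing import Callable


def count_distinct_outputs(
    generate: Callable[[str], str], prompt: str, runs: int = 5
) -> Counter:
    """Call `generate` on the same prompt `runs` times and tally each output.

    `generate` is a hypothetical user-supplied function (assumption): it takes
    a prompt and returns the model's completion as a string.
    """
    return Counter(generate(prompt) for _ in range(runs))


def is_reproducible(
    generate: Callable[[str], str], prompt: str, runs: int = 5
) -> bool:
    """Return True only if every run produced exactly the same string."""
    return len(count_distinct_outputs(generate, prompt, runs)) == 1
```

If `is_reproducible` comes back False, the tally from `count_distinct_outputs` shows how many variants appeared and how often, which is usually more useful than a bare yes/no.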

Lessons from the trenches on reproducible evaluation of language models
