
Evaluating 55 LLMs with GPT-4

36 points by vincelt over 1 year ago

7 comments

bradknowles over 1 year ago
How is this benchmark not inherently biased towards GPT?

If I did the same sort of thing but used Claude to grade the tests, would I get similar results? Or would that be inherently biased towards Claude scoring high?
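One way to probe that question, sketched below: run the same graded answers through more than one judge model and compare the score distributions. The judge callables and the 1-10 scale here are hypothetical scaffolding, not the benchmark's actual setup.

```python
import statistics
from typing import Callable

# A judge is any function mapping (question, answer) -> score on a 1-10 scale,
# e.g. a thin wrapper around a GPT-4 or Claude API call.
Judge = Callable[[str, str], float]

def compare_judges(
    pairs: list[tuple[str, str]],
    judges: dict[str, Judge],
) -> dict[str, float]:
    """Grade identical (question, answer) pairs with each judge and return
    the mean score per judge; a large gap suggests judge-specific bias."""
    return {
        name: statistics.mean(judge(q, a) for q, a in pairs)
        for name, judge in judges.items()
    }

# Usage (with hypothetical judge wrappers):
# compare_judges(eval_pairs, {"gpt-4": gpt4_judge, "claude": claude_judge})
```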
crashocaster over 1 year ago
I always find evals of this flavor off-putting, given that 3.5 and 4 likely share preference models (or at least feedback data).
habitue over 1 year ago
Should be evaluating each prompt multiple times to see how much variance there is in the scores. Even GPT-4 grading GPT-4 should probably be done several times.
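A minimal sketch of that repeated-grading idea, assuming the official `openai` Python client (v1+); the judge model name, the 1-10 scale, and the helper names are illustrative, not from the benchmark:

```python
import statistics

from openai import OpenAI  # assumes the official openai Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_once(question: str, answer: str, judge: str = "gpt-4") -> float:
    """Ask the judge model for a single 1-10 score."""
    resp = client.chat.completions.create(
        model=judge,
        messages=[
            {"role": "system",
             "content": "Score the answer from 1 to 10. Reply with only the number."},
            {"role": "user",
             "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=1.0,  # leave sampling on so repeated runs expose grading noise
    )
    return float(resp.choices[0].message.content.strip())

def grade_with_spread(question: str, answer: str, runs: int = 5) -> tuple[float, float]:
    """Grade the same answer several times; return (mean, stdev) of the scores."""
    scores = [grade_once(question, answer) for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)
```

A nonzero standard deviation from `grade_with_spread` would quantify exactly the grading noise the comment is worried about.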
natsucks over 1 year ago
Why no multi-turn evaluation? A lot of these benchmarks fail to capture the strength of the ghost attention used in Llama 2 chat models.
aiunboxed over 1 year ago
Any reason why PaLM or Cohere models are not here?
londons_explore over 1 year ago
GPT-4-0314 is top of the league table (i.e., not the latest version, but the version released in March).

Is this our Concorde moment?
ionwake over 1 year ago
Really cool, thanks.