
What are the best ways of evaluating LLMs for specific use-cases?

7 points by saqadri almost 2 years ago

1 comment

saqadri almost 2 years ago
We did some research into the different ways of evaluating LLMs, but there is a lot of literature on many different approaches, e.g. scores like BLEURT, precision/recall if you have ground truth data, all the way to asking GPT to be a human rater.

Are there evaluation strategies that worked best for you? We basically want to allow users/developers to evaluate which LLM (and specifically, which LLM + prompts + parameters) has the best performance for *their* use case (which seems different from the OpenAI evals framework, or benchmarks like BigBench).
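For the ground-truth case mentioned in the comment, here is a minimal sketch of what precision/recall scoring could look like, assuming a classification-style task where each prompt has a single expected label; the data and function names are hypothetical, and for free-form outputs you would swap the exact-match comparison for something like BLEURT or an LLM-as-judge rating.

# Minimal sketch: precision/recall for an LLM treated as a classifier,
# assuming each prompt has exactly one expected label (hypothetical data below).

from collections import Counter

# Hypothetical ground-truth labels and model outputs for a sentiment task.
ground_truth = ["positive", "negative", "positive", "neutral", "negative"]
model_output = ["positive", "positive", "positive", "neutral", "negative"]

def per_label_precision_recall(truth, predicted):
    """Compute per-label precision and recall from exact-match predictions."""
    labels = set(truth) | set(predicted)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(truth, predicted):
        if t == p:
            tp[t] += 1      # correct prediction for label t
        else:
            fp[p] += 1      # predicted p when it wasn't the true label
            fn[t] += 1      # missed the true label t
    results = {}
    for label in labels:
        precision = tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if (tp[label] + fn[label]) else 0.0
        results[label] = {"precision": precision, "recall": recall}
    return results

if __name__ == "__main__":
    for label, scores in per_label_precision_recall(ground_truth, model_output).items():
        print(f"{label}: precision={scores['precision']:.2f} recall={scores['recall']:.2f}")

Comparing LLM + prompt + parameter combinations would then amount to running each combination over the same labeled set and comparing these per-label scores.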