We did some research into the different ways of evaluating LLMs, but there is a lot of literature covering many different approaches: reference-based scores like BLEURT, precision/recall when you have ground-truth data, all the way to asking GPT to stand in for a human rater.<p>Are there evaluation strategies that have worked best for you? We basically want to let users/developers evaluate which LLM (and specifically, which combination of LLM + prompt + parameters) performs best for <i>their</i> use case, which seems different from the OpenAI evals framework or benchmarks like BigBench.
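<p>To make concrete the kind of per-use-case comparison we mean, here is a rough sketch of scoring several (model, prompt, parameters) configurations on a user's own examples with an LLM judge. This is just an illustration, not our implementation: `call_llm` is a placeholder for whatever client you use, and the judge model and rubric are arbitrary assumptions.<p><pre><code>from statistics import mean

def call_llm(model: str, prompt: str, **params) -> str:
    # Placeholder: plug in whichever LLM client/SDK you actually use.
    raise NotImplementedError("wire up your LLM client here")

# Hypothetical judge rubric; in practice this would be tuned to the use case.
JUDGE_PROMPT = (
    "Rate the following answer on a 1-5 scale for correctness and helpfulness. "
    "Reply with only the number.\n\nQuestion: {question}\n\nAnswer: {answer}"
)

def judge(question: str, answer: str, judge_model: str = "gpt-4") -> float:
    # LLM-as-rater: ask a (presumably stronger) model to score the answer.
    reply = call_llm(judge_model, JUDGE_PROMPT.format(question=question, answer=answer))
    return float(reply.strip())

def evaluate(configs, test_cases):
    """configs: list of dicts with 'name', 'model', 'prompt_template', 'params'.
    test_cases: user-supplied dicts with at least a 'question' key (no ground truth needed).
    Returns the mean judge score per configuration."""
    results = {}
    for cfg in configs:
        scores = []
        for case in test_cases:
            answer = call_llm(
                cfg["model"],
                cfg["prompt_template"].format(**case),
                **cfg["params"],
            )
            scores.append(judge(case["question"], answer))
        results[cfg["name"]] = mean(scores)
    return results
</code></pre>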