We did some research into the different ways of evaluating LLMs, but there is a lot of literature covering many different approaches: reference-based scores like BLEURT, precision/recall when you have ground-truth data, all the way to asking GPT to stand in for a human rater.<p>Are there evaluation strategies that have worked best for you? We basically want to let users/developers evaluate which LLM (and specifically, which combination of LLM + prompt + parameters) performs best for <i>their</i> use case, which seems different from the OpenAI evals framework or benchmarks like BigBench.
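<p>To make concrete the kind of per-use-case comparison we mean, here is a rough sketch of scoring several (model, prompt, parameters) configurations on a user's own examples with an LLM judge. This is just an illustration, not our implementation: `call_llm` is a placeholder for whatever client you use, and the judge model and rubric are arbitrary assumptions.<p><pre><code>from statistics import mean

def call_llm(model: str, prompt: str, **params) -> str:
    # Placeholder: plug in whichever LLM client/SDK you actually use.
    raise NotImplementedError("wire up your LLM client here")

# Hypothetical judge rubric; in practice this would be tuned to the use case.
JUDGE_PROMPT = (
    "Rate the following answer on a 1-5 scale for correctness and helpfulness. "
    "Reply with only the number.\n\nQuestion: {question}\n\nAnswer: {answer}"
)

def judge(question: str, answer: str, judge_model: str = "gpt-4") -> float:
    # LLM-as-rater: ask a (presumably stronger) model to score the answer.
    reply = call_llm(judge_model, JUDGE_PROMPT.format(question=question, answer=answer))
    return float(reply.strip())

def evaluate(configs, test_cases):
    """configs: list of dicts with 'name', 'model', 'prompt_template', 'params'.
    test_cases: user-supplied dicts with at least a 'question' key (no ground truth needed).
    Returns the mean judge score per configuration."""
    results = {}
    for cfg in configs:
        scores = []
        for case in test_cases:
            answer = call_llm(
                cfg["model"],
                cfg["prompt_template"].format(**case),
                **cfg["params"],
            )
            scores.append(judge(case["question"], answer))
        results[cfg["name"]] = mean(scores)
    return results
</code></pre>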