The most widely used benchmarks for evaluating LLMs

1 point by kavaivaleri about 1 year ago

Commonsense Reasoning
- HellaSwag
- Winogrande
- PIQA
- SIQA
- OpenBookQA
- ARC
- CommonsenseQA

Logical Reasoning
- MMLU
- BBHard

Mathematical Reasoning
- GSM-8K
- MATH
- MGSM
- DROP

Code Generation
- HumanEval
- MBPP

World Knowledge & QA
- NaturalQuestions
- TriviaQA
- MMMU
- TruthfulQA

I collected their descriptions and links to their original papers here: https://www.turingpost.com/p/llm-benchmarks

1 comment

andy99 about 1 year ago

I've never been able to open a Turingpost link; they all give an SSL error...