Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to:<p>* run LLM evaluations systematically and at scale<p>* share the data with the public in a rigorous and transparent way<p>We use the UK government's Inspect [1] library to run the evaluations.<p>As soon as I saw this news on HN, I evaluated Mistral Small 3 on MATH [2] level 5 (the hardest subset, 1,324 questions). I got an accuracy of 0.45 (± 0.011). We sample the LLM 8 times for each question, which gives us less noisy estimates of mean accuracy and a measure of how consistent the LLM's answers are. The 1,324 × 8 = 10,592 samples represent 8.5M tokens (2M in, 6.5M out).<p>You can browse the full transcripts in Inspect’s interactive viewer: <a href="https://epoch.ai/inspect-viewer/484131e0/viewer?log_file=https%3A%2F%2Fepoch-benchmarks-production-public.s3.us-east-2.amazonaws.com%2Finspect_ai_logs%2FNbsnvBsMoMizozbPZY8LLb.eval" rel="nofollow">https://epoch.ai/inspect-viewer/484131e0/viewer?log_file=htt...</a><p>Note that MATH is a different benchmark from MathInstruct [3], the dataset mentioned in the OP.<p>It's still early days for Epoch AI's benchmarking work. I'm developing a systematic database of evaluations run directly by us (so we can share the full details transparently), which we hope to release very soon.<p>[0]: <a href="https://epoch.ai/" rel="nofollow">https://epoch.ai/</a><p>[1]: <a href="https://github.com/UKGovernmentBEIS/inspect_ai">https://github.com/UKGovernmentBEIS/inspect_ai</a><p>[2]: <a href="https://arxiv.org/abs/2103.03874" rel="nofollow">https://arxiv.org/abs/2103.03874</a><p>[3]: <a href="https://huggingface.co/datasets/TIGER-Lab/MathInstruct" rel="nofollow">https://huggingface.co/datasets/TIGER-Lab/MathInstruct</a>
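<p>For the curious, here is roughly what such a run looks like with Inspect. This is a minimal sketch rather than our actual harness: the Hugging Face dataset id, record field names, the "mistral/..." provider string, and the match() scorer are assumptions (proper MATH grading extracts the final answer and checks expression equivalence), and exact parameter names can vary across Inspect versions.

    from inspect_ai import Task, eval, task
    from inspect_ai.dataset import Sample, hf_dataset
    from inspect_ai.scorer import match
    from inspect_ai.solver import generate

    def record_to_sample(record):
        # field names assumed from the Hendrycks MATH dataset
        return Sample(
            input=record["problem"],
            target=record["solution"],  # real grading would extract the boxed final answer
            metadata={"level": record["level"]},
        )

    @task
    def math_level5():
        dataset = hf_dataset(
            path="hendrycks/competition_math",  # assumed dataset id
            split="test",
            sample_fields=record_to_sample,
        )
        # keep only the hardest subset
        dataset = dataset.filter(lambda s: s.metadata["level"] == "Level 5")
        return Task(dataset=dataset, solver=[generate()], scorer=match())

    # epochs=8 gives the 8 independent samples per question
    eval(math_level5(), model="mistral/mistral-small-latest", epochs=8)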
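<p>And a toy illustration (plain numpy, synthetic data, not our pipeline) of why the repeated sampling helps: averaging correctness per question before aggregating gives a standard error on mean accuracy that respects the per-question clustering, plus a simple consistency measure.

    import numpy as np

    rng = np.random.default_rng(0)
    n_questions, n_samples = 1324, 8

    # correct[i, j] = 1 if sample j on question i was graded correct (synthetic here;
    # real questions vary in difficulty, which is what the clustering accounts for)
    correct = rng.binomial(1, 0.45, size=(n_questions, n_samples))

    per_question = correct.mean(axis=1)                       # per-question accuracy
    accuracy = per_question.mean()                            # overall mean accuracy
    stderr = per_question.std(ddof=1) / np.sqrt(n_questions)  # clustered by question

    # consistency: fraction of questions where all 8 samples agree
    consistency = np.mean((per_question == 0) | (per_question == 1))
    print(f"{accuracy:.3f} ± {stderr:.3f}; fully consistent on {consistency:.0%} of questions")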