
Open-source, sanitized evaluation datasets for models that reason and code

3 points by thejash 11 months ago

1 comment

thejash 11 months ago
When training our 70B model, we sought to accurately evaluate models for natural language understanding and reasoning abilities. Surprisingly, we found that both open and closed models achieve nearly 100% accuracy when evaluated only on unambiguous questions. We cleaned evaluation datasets to isolate true failures of reasoning from failures due to ambiguous or low-quality questions, and have open-sourced many. This includes:

• 11 sanitized and extended NLP reasoning benchmarks, including ARC, GSM8K, HellaSwag, and Social IQa

• An original code-focused reasoning benchmark

• A new dataset of 450,000 human judgments about ambiguity in NLP questions