When training our 70B model, we sought to evaluate models accurately on natural language understanding and reasoning. Surprisingly, we found that both open and closed models achieve nearly 100% accuracy when evaluated only on unambiguous questions. We cleaned evaluation datasets to isolate true reasoning failures from failures caused by ambiguous or low-quality questions, and have open-sourced many of these cleaned datasets. This release includes:

• 11 sanitized and extended NLP reasoning benchmarks, including ARC, GSM8K, HellaSwag, and Social IQa
• An original code-focused reasoning benchmark
• A new dataset of 450,000 human judgments about ambiguity in NLP questions
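To make the filtering idea concrete, here is a minimal sketch (not our actual pipeline) of how per-question human ambiguity judgments could be used to restrict accuracy measurement to unambiguous questions. The `EvalItem` fields, the `ambiguity_votes` representation, and the 20% vote threshold are all illustrative assumptions, not part of the released datasets' schema:

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    question: str
    answer: str              # gold answer
    model_answer: str        # the model's answer to score
    ambiguity_votes: list[bool]  # hypothetical: one entry per annotator, True = "ambiguous"


def is_unambiguous(item: EvalItem, max_ambiguous_fraction: float = 0.2) -> bool:
    """Keep a question only if few annotators flagged it as ambiguous."""
    votes = item.ambiguity_votes
    return sum(votes) / len(votes) <= max_ambiguous_fraction


def clean_accuracy(items: list[EvalItem]) -> float:
    """Accuracy restricted to the unambiguous subset of the benchmark."""
    clean = [it for it in items if is_unambiguous(it)]
    if not clean:
        return float("nan")
    return sum(it.model_answer == it.answer for it in clean) / len(clean)
```

Under this kind of filter, a gap between accuracy on the full benchmark and `clean_accuracy` on the unambiguous subset reflects question quality rather than reasoning ability, which is what motivated the cleaning effort described above.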