LLM benchmarks are normalized between 0 and 100.<p>The main benchmarks are already close to 100:
- common sense reasoning (WinoGrande)
- grade-school math word problems (GSM8K)
- multitasking (MMLU)
- sentence completion (HellaSwag)
- common sense reasoning 'challenge' (ARC)<p>The open question is whether the Transformer architecture changes, or whether new benchmarks appear that measure models on properties the current ones don't cover.<p>What's next? Increasing performance and decreasing token cost have the potential to open up more complex use cases.<p>That could lead to dedicated LLM processors, with models running entirely locally (local inference is already possible today, as in the sketch at the end of this post). This seems like a likely development scenario.<p>Any thoughts?
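<p>For the "entirely locally" point, a minimal sketch of what local inference looks like today, assuming llama-cpp-python is installed and a quantized GGUF checkpoint has been downloaded; the model path below is only a placeholder, not a specific recommendation.

    from llama_cpp import Llama

    # Placeholder path to a locally downloaded quantized model file (assumption)
    llm = Llama(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

    # Single completion, runs on-device with no network access
    out = llm("Q: What is 17 * 24? A:", max_tokens=32)
    print(out["choices"][0]["text"])

Everything above runs on commodity hardware; dedicated LLM processors would mainly change how large and how fast such local models can be.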