Most LLM evaluation methods are based on a fixed set of criteria, many of which can be optimized for without actually meaning better performance, much like a student who does well by memorizing exam questions but performs poorly in the field. The same issue has been debated even for so-called standardized tests.

Since domain- and task-specific agents represent a more complex landscape, we should instead develop tests built around 'activities' that measure things like goal completion, and run competitions for these solutions, much like human coding competitions or debate clubs judged by people with subject-matter expertise.

Models should also be tested on things like adherence to examples given in the prompt (i.e. few-shot) and how that context is actually used. As fine-tuned open-source models grow more complex, we should move away from parameter counts and a default set of criteria, and toward testing these LLMs the way we would evaluate a team member being hired for a specific set of functions in a given industry. A rough sketch of what such an activity-based check might look like is below.
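To make that concrete, here is a minimal sketch (Python, all names and the rubric format are hypothetical, not any existing eval framework) of an activity-style check: it scores a model's answer on goal completion against a task-specific rubric, and on whether the output follows the format established by the few-shot examples in the prompt, rather than producing a single generic benchmark number.

    import re
    from dataclasses import dataclass

    @dataclass
    class Activity:
        # One "activity": a realistic task plus a rubric, not a multiple-choice item.
        prompt: str                        # includes the few-shot examples
        required_goals: list               # behaviors a complete answer must cover
        example_format: str = r"^Answer:"  # regex pattern the few-shot examples follow

    def score_activity(model_fn, activity: Activity) -> dict:
        """Score one activity on goal completion and few-shot format adherence."""
        output = model_fn(activity.prompt)
        goals_hit = sum(1 for g in activity.required_goals if g.lower() in output.lower())
        return {
            "goal_completion": goals_hit / len(activity.required_goals),
            "format_adherence": bool(re.search(activity.example_format, output, re.M)),
        }

    # Usage with a stand-in "model" (swap in a real model call).
    if __name__ == "__main__":
        def fake_model(prompt: str) -> str:
            return "Answer: refund issued and customer notified by email."

        act = Activity(
            prompt="Q: Customer wants a refund...\nAnswer: ...\n\nQ: Order 123 arrived broken.\n",
            required_goals=["refund", "notified"],
        )
        print(score_activity(fake_model, act))

The point of the sketch is that the rubric (required_goals, example_format) is written per task and judged per task, closer to how you'd assess a new hire on real work samples, and human subject-matter judges could replace or audit the automated rubric where the activity is too open-ended to check with string matching.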