Throwaway account here. I recently spent a few months as a trainer for a major AI company's project. The well-paid gig mainly involved crafting specialized, reasoning-heavy questions that were supposed to stump the current top models. Most of the trainers had PhDs, and the company's idea was to use our questions to benchmark future AI systems.

It was a real challenge. I managed to come up with a handful of questions that tripped up the models, but it was clear they stumbled for pretty mundane reasons—outdated info or faulty string parsing due to tokenization. A common gripe among the trainers was the project's insistence on questions with clear-cut right/wrong answers. Many of us worked in fields where good research tends to be more nuanced and open to interpretation. I saw plenty of questions from other trainers that only had definitive answers if you bought into specific (and often contentious) theoretical frameworks in psychology, sociology, linguistics, history, and so on.

The AI company people running the project seemed a bit out of their depth, too. Their detailed guidelines for us actually contained some fundamental contradictions that they had missed. (Ironically, when I ran those guidelines by Claude, ChatGPT, and Gemini, they all spotted the issues straight away.)

After finishing the project, I came away even more impressed by how smart the current models can be.