I have a question for those who deeply understand LLMs.

From what I understand, the leap from GPT-2 to GPT-3 was mostly about scaling: more compute, more data. GPT-3 to GPT-4 probably followed the same path.

But in the year and a half since GPT-4, LLMs have gotten significantly better, especially the smaller ones. I'm consistently impressed by models like Claude 3.5 Sonnet, despite us supposedly reaching scaling limits.

What's driving these improvements? Is it thousands of small optimizations in data cleaning, training, and prompting? Or am I just deep enough in tech now that I'm noticing subtle changes more?
Really curious to hear from people who understand the technical internals here.