I recall reading recently that someone went back and trained an RNN at a similar scale to a GPT and got comparable performance on modern hardware (perhaps someone can link me that paper?).

i.e., the innovation in statistical AI isn't in making the algorithms "smarter", it's in finding ways to align the computation with modern GPU hardware -- this has been the story since 2012.

In the end, the function all such algorithms are approximating is a conditional probability. i.e., the "perfect" answer to any prompt would be to ignore training entirely and, at inference time, compute an expectation over all historical data. All training really does is optimally cache a large part of that computation ahead of time.

This is very different from how it's typically sold/understood, in the sense that there's an appearance that at inference time some unbounded computation is going on, i.e., "thinking"/"reasoning"/etc. But at inference time, *for any prompt*, the same fixed amount of computation runs per generated token, regardless of how hard the question is. So the system will appear to reason (etc.) so long as it can sample convincingly from its pre-cached computation.

This means "innovation" here follows a Moore's-law-style S-curve for GPU hardware.
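
To make the "optimally cache" point concrete, here's a toy sketch (my own illustration, with a bigram counter standing in for a real model): the inference-time version scans the whole corpus on every query, while "training" just precomputes the same counts up front -- the two produce identical distributions.

    from collections import Counter, defaultdict

    # Toy stand-in for "all historical data" (my own example).
    corpus = "the cat sat on the mat the cat ate the rat".split()

    def inference_time_predict(context):
        # "Ignore training": scan the entire corpus at query time to get
        # the empirical conditional distribution P(next token | context).
        nexts = Counter(corpus[i + 1] for i in range(len(corpus) - 1)
                        if corpus[i] == context)
        total = sum(nexts.values())
        return {tok: n / total for tok, n in nexts.items()}

    def train(data):
        # "Training": cache the same counts ahead of time.
        table = defaultdict(Counter)
        for prev, nxt in zip(data, data[1:]):
            table[prev][nxt] += 1
        return table

    def cached_predict(table, context):
        nexts = table[context]
        total = sum(nexts.values())
        return {tok: n / total for tok, n in nexts.items()}

    model = train(corpus)
    print(inference_time_predict("the"))  # {'cat': 0.5, 'mat': 0.25, 'rat': 0.25}
    print(cached_predict(model, "the"))   # same distribution, no corpus scan

A GPT obviously compresses and generalises rather than storing raw counts, but the division of labour is the same: the expensive conditional-expectation work is paid for up front, not at inference time.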