My theory is that architecture doesn't matter: convolutional, transformer, or recurrent, as long as you can efficiently train models of the same size, what counts is the dataset.

Similarly, humans achieve roughly the same results when they get the same training, with small variations. What matters is not the brain but the education they receive.

Of course I am exaggerating a bit. I am just saying that there is a multitude of brain and neural-net architectures with similar abilities, and the differentiating factor is the data, not the model.

For years we have seen hundreds of papers proposing sub-quadratic attention. They all failed to get traction; the big labs still use an almost vanilla transformer. At some point a paper declared "mixing is all you need" (MLP-Mixer) as a counterpoint to "attention is all you need": just mix the tokens somehow, and the optimiser adapts to whatever it gets (see the sketch at the end of this comment).

If you think about it, maybe language creates a virtual layer where language operations are performed, and this works similarly in humans and AIs. That's why the architecture doesn't matter: it is running the language OS on top. Similarly for vision.

I place 90% of the merit of AI on language and 10% on the model architecture. Finding intelligence was inevitable; it was hiding in language, which is also how we come to be intelligent. A human raised without language ends up even worse off than a primitive. Intelligence is encoded in software, not hardware, and our language software has more breadth and depth than any one of us can create or contain.
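
To make the MLP-Mixer point concrete, here is a minimal sketch of a Mixer-style block in PyTorch. The class name TokenMixingBlock, the hidden width, and the toy dimensions are my own placeholders, not anything from the paper; the only point is that attention is swapped for a plain MLP applied across the token dimension, and the optimiser learns whatever mixing the data rewards.

    import torch
    import torch.nn as nn

    class TokenMixingBlock(nn.Module):
        """Mixer-style block: no attention, just two MLPs (illustrative sketch)."""
        def __init__(self, num_tokens: int, dim: int, hidden: int = 256):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            # token-mixing MLP: a learned linear map across positions
            self.token_mlp = nn.Sequential(
                nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens)
            )
            self.norm2 = nn.LayerNorm(dim)
            # channel-mixing MLP: the usual per-token feed-forward
            self.channel_mlp = nn.Sequential(
                nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
            )

        def forward(self, x):                            # x: (batch, tokens, dim)
            y = self.norm1(x).transpose(1, 2)            # (batch, dim, tokens)
            x = x + self.token_mlp(y).transpose(1, 2)    # mix information across tokens
            x = x + self.channel_mlp(self.norm2(x))      # mix information across channels
            return x

    x = torch.randn(2, 16, 64)                           # 2 sequences, 16 tokens, 64 channels
    print(TokenMixingBlock(num_tokens=16, dim=64)(x).shape)  # torch.Size([2, 16, 64])

Nothing in the block knows which token should attend to which; the cross-token weights are just parameters the optimiser fills in from the data, which is the sense in which "the optimiser adapts to what it gets."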