This looks like a very interesting paper, one that takes the rare approach of actually trying to understand what the cool new language models are doing at a fundamental level.<p>Would anyone with more knowledge of the relevant mathematics (group theory and so on) care to chime in?