Research suggests that much of the power of the transformer architecture comes from associative recall over long sequences, which it performs without scaling the model dimension. They design state space models that narrow the gap.
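To make "associative recall" concrete: the Zoology posts study a synthetic key-value task where keys and values appear in pairs and, when a key reappears as a query, the model must emit its value. A minimal sketch of that data in plain Python (token ranges and names are illustrative, not the authors' code):

    import random

    def make_recall_example(num_pairs=8, key_vocab=range(100, 200), val_vocab=range(200, 300)):
        """One associative-recall example: a sequence of key/value token
        pairs followed by a query key; the target is that key's value."""
        keys = random.sample(list(key_vocab), num_pairs)          # distinct keys
        vals = random.choices(list(val_vocab), k=num_pairs)
        context = [tok for kv in zip(keys, vals) for tok in kv]   # k1 v1 k2 v2 ...
        query = random.choice(keys)                               # a key seen earlier
        target = vals[keys.index(query)]                          # its paired value
        return context + [query], target

    # Attention can solve this by matching the query token against earlier
    # keys regardless of sequence length; a fixed-size recurrent state must
    # hold every pair, so recall degrades unless the state (model dimension)
    # grows with the number of pairs.
    sequence, answer = make_recall_example()
    print(sequence, "->", answer)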
Overview: https://hazyresearch.stanford.edu/blog/2023-12-11-zoology0-intro

Zoology 2: https://hazyresearch.stanford.edu/blog/2023-12-11-zoology2-based

Monarchs and Butterflies: Towards Sub-Quadratic Scaling in Model Dimension. This post covers scaling in the model dimension, as opposed to the sequence-dimension scaling of the previous posts: https://hazyresearch.stanford.edu/blog/2023-12-11-truly-subquadratic
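The idea behind that third post is replacing dense d x d weight matrices with structured (Monarch/butterfly-style) factors whose cost grows sub-quadratically in d. A rough numpy sketch of a block-diagonal-plus-permutation multiply with O(d^1.5) parameters, meant to show the general shape of such matrices rather than the authors' exact parameterization:

    import numpy as np

    def monarch_like_matmul(x, B1, B2):
        """Apply a Monarch-style structured linear map to x (length d):
        block-diagonal multiply, permute (transpose the chunk grid),
        block-diagonal multiply again. B1, B2 each hold sqrt(d) blocks of
        size sqrt(d) x sqrt(d), so parameters/FLOPs are ~2*d^1.5 instead
        of d^2 for a dense matrix. Illustrative sketch only."""
        b = int(np.sqrt(len(x)))                      # block count == block size
        y = np.einsum('bij,bj->bi', B1, x.reshape(b, b))   # first block-diag multiply
        z = np.einsum('bij,bj->bi', B2, y.T)               # permute, second block-diag
        return z.T.reshape(-1)

    d = 16
    x = np.random.randn(d)
    B1 = np.random.randn(4, 4, 4)   # 2 * 4 * 4 * 4 = 128 params vs 256 for dense 16x16
    B2 = np.random.randn(4, 4, 4)
    print(monarch_like_matmul(x, B1, B2).shape)  # (16,)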