Note that this works within a single sequence of tokens. It might be consistent with the "episodic memory" metaphor if we consider a particular transformer run as the model's experience.

But this might be very different from what people expect from "memory", i.e. the ability to learn vast amounts of information and retrieve it as necessary.

This is more like a refinement of transformer attention: instead of running attention over all tokens (which is very expensive, as the cost is quadratic in sequence length), it selects a subset of token spans and runs fine-grained attention only on those. So it essentially breaks transformer attention into two stages - coarse-grained (k-NN over token spans) and fine-grained (normal attention) - roughly as in the sketch at the end of this comment.

It might be a great thing for long-context situations. But it doesn't make sense when you want millions of different facts to be considered - turning them all into a long context is rather inefficient.
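
To make the two-stage idea concrete, here is a minimal numpy sketch. It is not the actual method from the paper - the span length, mean-pooling of keys, and top-k selection are all assumptions on my part - just an illustration of coarse span selection followed by normal attention over the selected tokens:

    import numpy as np

    def coarse_to_fine_attention(q, K, V, span_len=32, top_k=4):
        """Two-stage attention sketch for a single query vector q.

        Stage 1 (coarse): score mean-pooled span summaries of K and keep
        the top_k best-matching spans (k-NN over spans).
        Stage 2 (fine): run normal softmax attention over only the tokens
        inside the selected spans.
        """
        d = q.shape[-1]
        n = K.shape[0]
        n_spans = (n + span_len - 1) // span_len

        # Coarse stage: one summary key per span (mean of its token keys).
        span_keys = np.stack([K[i * span_len:(i + 1) * span_len].mean(axis=0)
                              for i in range(n_spans)])
        coarse_scores = span_keys @ q
        selected = np.argsort(coarse_scores)[-top_k:]

        # Fine stage: gather tokens of the selected spans and attend normally.
        idx = np.concatenate([np.arange(i * span_len, min((i + 1) * span_len, n))
                              for i in selected])
        scores = (K[idx] @ q) / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V[idx]

    # Toy usage: 1024 cached tokens, but only top_k * span_len of them
    # are touched by the fine-grained (quadratic-cost) attention.
    rng = np.random.default_rng(0)
    K = rng.standard_normal((1024, 64))
    V = rng.standard_normal((1024, 64))
    q = rng.standard_normal(64)
    print(coarse_to_fine_attention(q, K, V).shape)  # (64,)

The point of the split is that the coarse stage is cheap (one comparison per span instead of per token), so the expensive fine-grained attention only ever sees a small, fixed budget of tokens.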