Upshot of the paper -- right now KV caches are kept for every layer of the transformer in LLMs. Why not just the top layer? It would save memory.

Initial result -- the KV caches in the lower layers matter, and output quality suffered without them.

Updated plan -- cull half of the KV layers! This works 'nearly' as well as keeping all of them, with memory and compute savings.

Downside -- triple the training, and worse out-of-band / long-context performance.

This feels to me like a technique you'd use on a particular architecture deployed at the edge, where compute matters and you have a little extra room on performance. Phi-3 on a Raspberry Pi, basically.

Interesting! As always, I wish papers showed sample prompt outputs, not just perplexity numbers. But, here we are.
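To make the memory claim concrete, here's a rough back-of-the-envelope sketch of KV-cache size versus number of cached layers. The model shape is an assumption on my part (Phi-3-mini-like: 32 layers, 32 KV heads, head_dim 96, fp16), not numbers from the paper:

    # Rough KV-cache memory estimate; model dimensions are illustrative assumptions.
    def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                       seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
        """Bytes to store K and V for every cached layer at a given context length."""
        per_layer = 2 * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
        return num_layers * per_layer

    # Hypothetical Phi-3-mini-like shape at a 4k context, fp16.
    full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=96, seq_len=4096)
    half = kv_cache_bytes(num_layers=16, num_kv_heads=32, head_dim=96, seq_len=4096)

    print(f"all layers cached : {full / 2**20:.0f} MiB")   # ~1536 MiB
    print(f"half layers cached: {half / 2**20:.0f} MiB")   # ~768 MiB

Since cache size scales linearly with the number of cached layers, halving the layers halves the cache -- which is exactly the kind of headroom that matters on an edge device with a few GB of RAM.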