DeepSeek Native Sparse Attention

16 points by bandwitch 3 months ago

1 comment

fovc 3 months ago
Sparse attention essentially combines 3 types of attention optimizations:

1. Compression of the key/value vectors into block-level representations to reduce the size of the KV cache

2. Selectively computing uncompressed attention on a subset of tokens, based on the compressed blocks with the highest attention scores

3. Using a sliding window for local attention at full resolution

> Both Full Attention and sparse attention models are pretrained on 270B tokens of 8k-length texts, followed by continued training and supervised fine-tuning on 32k-length texts with YaRN to achieve long-context adaptation. Both models are trained to full convergence to ensure fair comparison.

> our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters

Evaluated on MMLU, MMLU-PRO, CMMLU, BBH, GSM8K, MATH, DROP, MBPP, and HumanEval. NSA outperforms full attention on 7/9.

Beats out H2O, InfLLM, Quest, Exact-Top, and full attention on LongBench.

Perfect retrieval on the 64k needle-in-a-haystack test.

The CoT eval is less convincing, but NSA outperforms full attention on AIME24.

Training speedup of 2-9x vs. FlashAttention.

Decoding speedup of 4-12x vs. full attention ["expected"? Didn't see a comparison to other attention mechanisms]
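
To make the three branches in the comment above concrete, here is a minimal NumPy sketch of a single decode step: attention over compressed KV blocks, full-resolution attention over the top-scoring blocks, and a sliding window. This is an illustration under simplifying assumptions (mean-pooling as the "compression", a plain average instead of NSA's learned gates, arbitrary block/top-k/window sizes), not DeepSeek's implementation.

    # Simplified sketch of NSA-style three-branch sparse attention (single
    # query, single head). Block compression, selection, and window sizes
    # are illustrative assumptions.
    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    def attend(q, K, V):
        """Standard scaled dot-product attention for one query vector."""
        scores = softmax(K @ q / np.sqrt(q.shape[-1]))
        return scores @ V, scores

    def nsa_decode_step(q, K, V, block=16, top_k=2, window=32):
        T, d = K.shape

        # 1) Compression: pool each KV block to one token (stand-in for the
        #    learned block compression) and attend over the compressed sequence.
        n_blocks = T // block
        Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
        Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
        out_cmp, block_scores = attend(q, Kc, Vc)

        # 2) Selection: take the blocks with the highest compressed attention
        #    scores and run full-resolution attention over their tokens.
        sel = np.argsort(block_scores)[-top_k:]
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in sel])
        out_sel, _ = attend(q, K[idx], V[idx])

        # 3) Sliding window: full attention over the most recent tokens.
        out_win, _ = attend(q, K[-window:], V[-window:])

        # Combine the branches; NSA uses learned gates, here a plain average.
        return (out_cmp + out_sel + out_win) / 3.0

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        d, T = 64, 256
        K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
        q = rng.normal(size=d)
        print(nsa_decode_step(q, K, V).shape)  # (64,)

The point of the structure is that only the compressed blocks, the few selected blocks, and the local window are ever touched per query, which is where the decoding speedup over full attention comes from.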