DeepSeek Native Sparse Attention

16 points by bandwitch 3 months ago

1 comment

fovc 3 months ago
Sparse attention essentially combines 3 types of attention optimizations:

1. Compression of the key/value vectors into block-level representations to reduce the size of the KV cache

2. Selectively computing uncompressed attention on a subset of tokens, based on the compressed blocks with the highest attention scores

3. Using a sliding window for local attention at full resolution

> Both Full Attention and sparse attention models are pretrained on 270B tokens of 8k-length texts, followed by continued training and supervised fine-tuning on 32k-length texts with YaRN to achieve long-context adaptation. Both models are trained to full convergence to ensure fair comparison.

> our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters

Evaluated on MMLU, MMLU-PRO, CMMLU, BBH, GSM8K, MATH, DROP, MBPP, and HumanEval. NSA outperforms full attention on 7/9.

Beats out H2O, InfLLM, Quest, Exact-Top, and full attention on LongBench.

Perfect retrieval on the 64k needle-in-a-haystack test.

The CoT eval is less convincing, but NSA outperforms full attention on AIME24.

Training speedup of 2-9x vs. FlashAttention.

Decoding speedup of 4-12x vs. full attention ["expected"? Didn't see a comparison to other attention mechanisms]
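
To make the three branches in the comment above concrete, here is a minimal NumPy sketch of a single decode step: attention over compressed KV blocks, full-resolution attention over the top-scoring blocks, and a sliding window. This is an illustration under simplifying assumptions (mean-pooling as the "compression", a plain average instead of NSA's learned gates, arbitrary block/top-k/window sizes), not DeepSeek's implementation.

    # Simplified sketch of NSA-style three-branch sparse attention (single
    # query, single head). Block compression, selection, and window sizes
    # are illustrative assumptions.
    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    def attend(q, K, V):
        """Standard scaled dot-product attention for one query vector."""
        scores = softmax(K @ q / np.sqrt(q.shape[-1]))
        return scores @ V, scores

    def nsa_decode_step(q, K, V, block=16, top_k=2, window=32):
        T, d = K.shape

        # 1) Compression: pool each KV block to one token (stand-in for the
        #    learned block compression) and attend over the compressed sequence.
        n_blocks = T // block
        Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
        Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
        out_cmp, block_scores = attend(q, Kc, Vc)

        # 2) Selection: take the blocks with the highest compressed attention
        #    scores and run full-resolution attention over their tokens.
        sel = np.argsort(block_scores)[-top_k:]
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in sel])
        out_sel, _ = attend(q, K[idx], V[idx])

        # 3) Sliding window: full attention over the most recent tokens.
        out_win, _ = attend(q, K[-window:], V[-window:])

        # Combine the branches; NSA uses learned gates, here a plain average.
        return (out_cmp + out_sel + out_win) / 3.0

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        d, T = 64, 256
        K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
        q = rng.normal(size=d)
        print(nsa_decode_step(q, K, V).shape)  # (64,)

The point of the structure is that only the compressed blocks, the few selected blocks, and the local window are ever touched per query, which is where the decoding speedup over full attention comes from.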