NSA: Hardware-Aligned and Natively Trainable Sparse Attention

4 points by unignorant 3 months ago

2 comments

crsn 3 months ago
Very f’ing cool (esp. optimistic about repo-level codebase completion) – but just like many other results DeepSeek reports, their preprint leaves me with more questions than answers, unless I’ve misunderstood multiple pieces of it (which of course is possible):

—They report a 9.0× speedup in the forward pass and 6.0× in the backward pass… Why the heck would the backward pass be so much slower? Is it their gating mechanisms needing extra computation in backward passes? Gradient accumulation or KV-cache updates bottlenecking the speedup? FlashAttention (or at least FlashAttention-2) gives near-equal forward/backward efficiency… They claim NSA is tuned for FA2-style blockwise layouts, so which of their (competing) claims is wrong?

—Does NSA actually learn useful sparsity, or does it just get lucky with pretraining? How much of the performance gain comes from pretrained sparsity patterns vs. sparsity inherent to the attention? They themselves say “applying sparsity post-hoc forces models to deviate from their pretrained optimization trajectory… As demonstrated by Chen et al. (2024), [sic] top 20% attention can only cover 70% of the total attention scores, rendering structures like retrieval heads in pretrained models vulnerable to pruning during inference” — yet their ablation isn’t strong enough to tell. A stronger ablation would include (1) a Full Attention → NSA transition test to measure whether NSA can be applied post-hoc without degradation, (2) a visualization of learned sparsity patterns over training epochs, and (3) a test where sparsity constraints are randomly assigned, to see whether NSA actually finds useful structures or just adapts to imposed ones.

—Training transformers with sparse attention is historically unstable — early MoEs like Switch Transformer (which use expert-gating-like mechanisms just like this one) were famous specifically for their collapse issues. How does NSA prevent mode collapse in early training — or really, how do we know it’s not just going to collapse under different (i.e. more common) initialization schemes? If their technique doesn’t have an explicit mechanism for counteracting sparse-expert underutilization, then it’s just as vulnerable to collapse as (e.g.) Switch Transformer — but worse, since sparsity here isn’t just a gating function, it’s the core of the entire attention mechanism…
sidkshatriya 3 months ago
This is a paper by DeepSeek. It would be a good idea to mention that in the title.

TL;DR: This is a very interesting paper about attention calculation in transformers. It shows how attention can be calculated over a large token window without saturating memory and/or GPU arithmetic capacity.

Usually attention is computed over a sliding window of tokens. The window can turn out to be too big because of the quadratic nature of attention, which increases the amount of computation required. There are many papers on how to get some of the benefits of transformers by doing "sparse attention" -- i.e. avoiding some of the quadratic blowup.

The solution in the paper is to first divide every `x` tokens into groups or "blocks". Then:

(1) Capture long-range connections by compressing each block of tokens into a single token.

(2) Select important tokens by choosing only the tokens in the "important" blocks.

(3) Select recent tokens by using a sliding window (like normal transformers).

The compression of a block of tokens into a single token in (1) is done by an MLP that is trained during normal training.

Attention scores can now be computed for an incoming token against the compressed tokens from (1); only the top-k blocks with high attention scores are selected for (2).

Finally, combine the results of attending from the incoming token to (1), (2) and (3) to give the final output token. You get long-range coarse attention, attention to selected blocks, and the usual sliding-window attention. Awesome!

This is also something of an engineering paper, with lots of low-level details.

Question for the authors: why not also do the experiments with MHLA (multi-head latent attention), which is used in DeepSeek V3 and R1?
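
A minimal sketch of that three-branch idea, for a single head and a single query position. Mean-pooling stands in for the paper's learned MLP compression, a plain average stands in for its learned gating, and the block size, top-k, and window size are illustrative choices, not DeepSeek's actual settings:

    # Toy sketch of the compression / selection / sliding-window branches.
    # Not DeepSeek's implementation -- just the structure described above.
    import torch
    import torch.nn.functional as F

    def nsa_like_attention(q, K, V, block_size=8, top_k=2, window=16):
        """q: (d,) query; K, V: (t, d) keys/values for the preceding tokens."""
        t, d = K.shape
        scale = d ** -0.5

        # (1) Compressed branch: pool each block of keys/values into one token
        #     (the paper learns this compression with an MLP instead).
        n_blocks = t // block_size
        Kc = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(1)
        Vc = V[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(1)
        block_scores = (Kc @ q) * scale
        comp_out = F.softmax(block_scores, dim=0) @ Vc

        # (2) Selection branch: reuse the block-level scores to pick the top-k
        #     blocks, then attend over their original (uncompressed) tokens.
        top_blocks = block_scores.topk(min(top_k, n_blocks)).indices.tolist()
        idx = torch.cat([torch.arange(b * block_size, (b + 1) * block_size)
                         for b in top_blocks])
        sel_out = F.softmax((K[idx] @ q) * scale, dim=0) @ V[idx]

        # (3) Sliding-window branch: ordinary attention over the most recent tokens.
        Kw, Vw = K[-window:], V[-window:]
        win_out = F.softmax((Kw @ q) * scale, dim=0) @ Vw

        # Combine the three branches; the paper uses learned gates,
        # here a plain average for illustration.
        return (comp_out + sel_out + win_out) / 3.0

    # Usage: 64 context tokens, head dimension 32.
    torch.manual_seed(0)
    q, K, V = torch.randn(32), torch.randn(64, 32), torch.randn(64, 32)
    print(nsa_like_attention(q, K, V).shape)  # torch.Size([32])

The point of the structure is that only the compressed branch touches every block, while the exact attention is limited to the top-k selected blocks plus the recent window, which is where the savings over full quadratic attention come from.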