DeepSeek's Multi-Head Latent Attention
4 points by the_origami_fox 3 months ago | 1 comment
fspeech 3 months ago
Matrix absorption is unnecessary. What is needed is that the order of multiplication associates in the direction of the absorption, so the cached latent is the last thing multiplied. This, together with the modified RoPE, is what makes the caching work.
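To make the associativity point concrete, here is a minimal NumPy sketch (single head, no RoPE, illustrative shapes and names; these are my assumptions, not DeepSeek's actual code). It checks that re-associating the matmuls gives the same scores as decompressing keys from the latent cache, without ever storing a pre-absorbed weight matrix:

    import numpy as np

    d_model, d_latent, d_head, seq = 64, 16, 32, 10
    rng = np.random.default_rng(0)

    W_UQ = rng.standard_normal((d_model, d_head))   # query up-projection (assumed name)
    W_UK = rng.standard_normal((d_latent, d_head))  # key up-projection (assumed name)
    h_t  = rng.standard_normal(d_model)             # current token's hidden state
    C_kv = rng.standard_normal((seq, d_latent))     # cached compressed KV latents

    q = h_t @ W_UQ                                  # (d_head,)

    # Route 1: decompress keys from the latent cache, then dot with the query.
    K = C_kv @ W_UK                                 # (seq, d_head) -- materializes per-token keys
    scores_decompressed = K @ q                     # (seq,)

    # Route 2: associate the other way -- fold W_UK into the query on the fly,
    # then hit the latent cache directly. No absorbed matrix is ever formed.
    q_latent = W_UK @ q                             # (d_latent,)
    scores_reassociated = C_kv @ q_latent           # (seq,)

    assert np.allclose(scores_decompressed, scores_reassociated)

The position-dependent RoPE rotation does not commute with this reordering, which is why it has to live on a separate, modified branch rather than inside the up-projections.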