DeepSeek's Multi-Head Latent Attention
4 points | by the_origami_fox | 3 months ago | 1 comment
fspeech
3 months ago
Matrix absorption is unnecessary. What is needed is that the order of multiplication associates toward the direction of the absorption; this, together with the modified RoPE, is what makes the caching work.
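
A minimal sketch of the point being made, in numpy (single head, no RoPE, toy dimensions; all names and shapes here are illustrative, not DeepSeek's actual code): because matrix multiplication is associative, the attention logit against a cached latent can be computed by grouping the products toward the latent, giving the same number as precomputing an "absorbed" projection.

```python
import numpy as np

# Hypothetical toy shapes for illustration only.
rng = np.random.default_rng(0)
d_model, d_head, d_latent = 64, 16, 8

W_q  = rng.standard_normal((d_head, d_model))   # query projection (assumed name)
W_UK = rng.standard_normal((d_head, d_latent))  # key up-projection from the latent

x    = rng.standard_normal(d_model)   # current token's hidden state
c_kv = rng.standard_normal(d_latent)  # cached latent for some past token

q = W_q @ x

# (1) Explicit absorption: fold W_UK into the query projection ahead of time,
#     so the logit is computed directly against the cached latent.
W_q_absorbed   = W_UK.T @ W_q                 # (d_latent, d_model)
logit_absorbed = (W_q_absorbed @ x) @ c_kv

# (2) No absorption: keep the matrices separate and just associate the
#     products toward the latent, q^T (W_UK c) == (W_UK^T q)^T c.
logit_assoc = (W_UK.T @ q) @ c_kv

# (3) Naive reconstruction of the full key from the latent, for comparison.
logit_naive = q @ (W_UK @ c_kv)

assert np.allclose(logit_absorbed, logit_assoc)
assert np.allclose(logit_assoc, logit_naive)
print(logit_absorbed, logit_assoc, logit_naive)
```

In all three groupings only the latent c_kv needs to be cached per token; the difference is purely in how the products are associated, which is the claim the comment is making.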