DeepSeek's Multi-Head Latent Attention
4 points | by the_origami_fox | 3 months ago | 1 comment
fspeech
3 months ago
Matrix absorption is unnecessary. What is needed is that the order of multiplication associates toward the direction of the absorption; this, together with the modified RoPE, is what makes the caching work.
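
A minimal sketch of the point being made, in numpy (single head, no RoPE, toy dimensions; all names and shapes here are illustrative, not DeepSeek's actual code): because matrix multiplication is associative, the attention logit against a cached latent can be computed by grouping the products toward the latent, giving the same number as precomputing an "absorbed" projection.

```python
import numpy as np

# Hypothetical toy shapes for illustration only.
rng = np.random.default_rng(0)
d_model, d_head, d_latent = 64, 16, 8

W_q  = rng.standard_normal((d_head, d_model))   # query projection (assumed name)
W_UK = rng.standard_normal((d_head, d_latent))  # key up-projection from the latent

x    = rng.standard_normal(d_model)   # current token's hidden state
c_kv = rng.standard_normal(d_latent)  # cached latent for some past token

q = W_q @ x

# (1) Explicit absorption: fold W_UK into the query projection ahead of time,
#     so the logit is computed directly against the cached latent.
W_q_absorbed   = W_UK.T @ W_q                 # (d_latent, d_model)
logit_absorbed = (W_q_absorbed @ x) @ c_kv

# (2) No absorption: keep the matrices separate and just associate the
#     products toward the latent, q^T (W_UK c) == (W_UK^T q)^T c.
logit_assoc = (W_UK.T @ q) @ c_kv

# (3) Naive reconstruction of the full key from the latent, for comparison.
logit_naive = q @ (W_UK @ c_kv)

assert np.allclose(logit_absorbed, logit_assoc)
assert np.allclose(logit_assoc, logit_naive)
print(logit_absorbed, logit_assoc, logit_naive)
```

In all three groupings only the latent c_kv needs to be cached per token; the difference is purely in how the products are associated, which is the claim the comment is making.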