> Differential attention takes the difference between two softmax attention functions to eliminate attention noise

If I understand correctly, this architecture trades twice as much attention memory for either a higher-quality model or fewer parameters at similar quality.

> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters

This raises a few questions for me:

- Would having only ~62% of the parameters offset the doubled attention memory, leaving a memory profile similar to a traditional transformer?

- Does that trade-off change noticeably between training and inference?
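For concreteness, here is a minimal sketch of how I read the mechanism (simplified, not the authors' code; the actual layer splits the query/key heads and learns λ, which I replace with separate projections and a fixed value here):

```python
# Sketch of differential attention: two softmax attention maps are computed
# and subtracted, which is where the "twice as much attention memory" comes from.
# Projection shapes and the lambda parameterization are my simplifications.
import torch
import torch.nn.functional as F

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8):
    """x: (batch, seq, d_model); W*: (d_model, d_head) projection matrices."""
    d = Wq1.shape[1]
    q1, k1 = x @ Wq1, x @ Wk1
    q2, k2 = x @ Wq2, x @ Wk2
    v = x @ Wv
    # Two full (seq x seq) attention maps are materialized instead of one.
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    # The difference is meant to cancel "noise" that both maps share.
    return (a1 - lam * a2) @ v
```

If that reading is right, the extra cost is concentrated in the attention maps, while the parameter savings come from the rest of the model, which is why I'm unsure how the two net out.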