Scalable-Softmax Is Superior for Attention

2 points by jw1224 3 months ago

2 comments

jw1224 3 months ago
Abstract:

> The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization.

> To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval.

> Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.
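(Not from the paper or the thread: a minimal numpy sketch of the behaviour the abstract describes, useful as a reference while reading the comments. The function names, the value of `s`, and the toy data are illustrative assumptions; the paper learns `s` per attention head.)

```python
import numpy as np

def softmax(z):
    # Standard softmax: normalizes scores into a probability distribution.
    e = np.exp(z - z.max())
    return e / e.sum()

def ssmax(z, s=0.43):
    # Scalable-Softmax as the abstract describes it: scale the logits by
    # s * log(n) before normalizing, where n is the input vector size.
    # s=0.43 is an arbitrary placeholder, not a value taken from the paper.
    n = z.shape[-1]
    return softmax(s * np.log(n) * z)

# One logit stands out ("key information"); the rest are noise.
# As n grows, softmax's maximum probability decays toward zero,
# while SSMax keeps the key position dominant.
rng = np.random.default_rng(0)
for n in (16, 256, 4096, 65536):
    z = rng.normal(size=n)
    z[0] = 5.0
    print(f"n={n:6d}  softmax max={softmax(z).max():.4f}  ssmax max={ssmax(z).max():.4f}")
```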
sidkshatriya 3 months ago

There is a hyperparameter `s` in Scalable-Softmax:

SSMax_i = exp(s · log(n) · z_i) / Σ_j exp(s · log(n) · z_j)   (n is the size of the input vector)

Normal softmax (with temperature):

SoftMax_i = exp(z_i / T) / Σ_j exp(z_j / T)   (T is the temperature)

Here temperature is a hyperparameter. Having a temperature as a hyperparameter does not seem very different to me from having `s` as a hyperparameter. I personally don't understand the benefit of SSMax. During a hyperparameter search you would find the optimal `s`, just as you might find the optimal temperature `T`.
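(Again not from the thread: the two formulations in the comment above, written out as a small Python sketch. Function names and the value of `s` are illustrative assumptions. For a fixed input size n, SSMax coincides with a temperature softmax at T = 1 / (s · log n); the corresponding T shifts as n changes.)

```python
import numpy as np

def softmax_t(z, T):
    # SoftMax_i = exp(z_i / T) / sum_j exp(z_j / T)   (T is the temperature)
    e = np.exp(z / T - (z / T).max())
    return e / e.sum()

def ssmax(z, s):
    # SSMax_i = exp(s * log(n) * z_i) / sum_j exp(s * log(n) * z_j),
    # where n is the size of the input vector z.
    n = z.shape[-1]
    return softmax_t(z, T=1.0 / (s * np.log(n)))

# For a fixed n, SSMax is exactly a temperature softmax with T = 1 / (s * log(n));
# the temperature it corresponds to changes as n changes.
z = np.random.default_rng(0).normal(size=1000)
print(np.allclose(ssmax(z, s=0.43), softmax_t(z, T=1.0 / (0.43 * np.log(1000)))))
```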
There is a hyperparameter `s` in scalable softmax.<p>SSoftMax_i = exp(s log(n) z_i) &#x2F; sum (n is length of embedding).<p>Normal softmax (with temperature)<p>SoftMax_i = exp(z_i &#x2F; T) &#x2F; sum (T is the temperature).<p>Here Temperature is a hyperparameter. Having a temperature as hyperparameter does not seem too different to me than having `s` as a hyperparameter. I personally don&#x27;t understand the benefits of SSoftMax. During hyperparameter search you would find the optimal `s` as you might find the optimal temperature `T`.