
Linear transformers are faster after all

116 points by JoelEinbinder over 1 year ago

6 comments

modeless over 1 year ago
> our linear transformers are somewhat useless, as the positive impact from the speedup seen in long contexts is undermined by the negative impact of degraded learning.

> In a future post, we will explain how to improve the learning of linear transformers

So the techniques here are useless without special secret sauce that they're not disclosing. Yet. Mamba is already out there solving similar problems, but the more the merrier. I hope they publish the useful part soon.
Comment #39040974 not loaded
Comment #39038912 not loaded
SmartestUnknown over 1 year ago
This is not a new algorithm. The same algorithm is described in Figure 4 (Theorem 3.1) of https://arxiv.org/pdf/2310.01655.pdf

(Disclaimer: I am an author on the linked paper)
Comment #39037762 not loaded
Comment #39050235 not loaded
hacketthfk over 1 year ago
I don't understand something: why do they claim to go from O(N*N) to O(N), when all they say they are doing is removing one exponentiation operation, which is O(1)? Where is the O(N) they are removing?
Comment #39039451 not loaded
Comment #39043539 not loaded
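(As context for the complexity question above: the quadratic cost of standard attention comes from materializing the N x N score matrix that the row-wise softmax's exponential requires. Once the exponential is gone, matrix-multiply associativity lets you regroup (QKᵀ)V as Q(KᵀV), so the N x N matrix is never formed. Here is a minimal NumPy sketch, not taken from the article; the elu+1 feature map is a common choice from the linear-attention literature, assumed here for illustration.)

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the exp forces the full N x N score matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])               # (N, N): O(N^2 * d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                                    # O(N^2 * d)

def phi(x):
    """Positive feature map (elu + 1); an assumption of this sketch, not the article's choice."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """No elementwise exp, so (phi(Q) phi(K)^T) V regroups as phi(Q) (phi(K)^T V)."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                                         # (d, d): O(N * d^2)
    Z = Qp @ Kp.sum(axis=0)                               # (N,):   O(N * d)
    return (Qp @ KV) / Z[:, None]                         # O(N * d^2)

rng = np.random.default_rng(0)
N, d = 512, 64
Q, K, V = rng.standard_normal((3, N, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

(So the claim is not that one O(1) exp per entry is removed; it is that removing the exp changes which matrix products may be computed first, dropping the per-token cost from O(N·d) to O(d²).)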
thomasahle over 1 year ago
To be honest this makes me less excited about linear transformers.

Even heavily optimized, they are still (nearly) no better than normal flash attention up to context length 10^4.

And then you haven't even started to account for the degradation in learning.

Maybe if you're doing 100k attention at inference it starts making sense... But then there are other methods you can start using too.
Comment #39048075 not loaded
deepsquirrelnet over 1 year ago
Great writeup and interesting experiments. I can't help but wonder what would happen if you instead made a rectified linear attention. Is that even possible?
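(One way to read "rectified linear attention" is to use ReLU as the feature map in the kernelized linear-attention framing sketched above. A hypothetical sketch; the function name and the eps guard are illustrative, not from the post, and this keeps the O(N·d²) cost since ReLU is applied elementwise.)

```python
import numpy as np

def relu_linear_attention(Q, K, V, eps=1e-6):
    """Linear attention with a ReLU feature map (a 'rectified' reading of the question).

    A row whose rectified query is orthogonal to every rectified key would get a
    zero normalizer, hence the eps guard.
    """
    Qp, Kp = np.maximum(Q, 0.0), np.maximum(K, 0.0)   # rectify queries and keys
    KV = Kp.T @ V                                     # (d, d) key/value summary
    Z = Qp @ Kp.sum(axis=0) + eps                     # per-row normalizer
    return (Qp @ KV) / Z[:, None]
```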
bbertelsen over 1 year ago
What did they use to build this site? I could have sworn I saw what looked like LaTeX when it was loading.
Comment #39037471 not loaded
Comment #39037465 not loaded