> our linear transformers are somewhat useless, as the positive impact from the speedup seen in long contexts is undermined by the negative impact of degraded learning.

> In a future post, we will explain how to improve the learning of linear transformers

So the techniques here are useless without the special secret sauce that they're not disclosing. Yet. Mamba is already out there solving similar problems, but the more the merrier. I hope they publish the useful part soon.
This is not a new algorithm. The same algorithm is described in Figure 4 (Theorem 3.1) of https://arxiv.org/pdf/2310.01655.pdf

(Disclaimer: I am an author on the linked paper.)
I don't understand something: why do they claim they go from O(N*N) to O(N), when all they say they are doing is removing one exponentiation operation, which is O(1) per score? Where does the factor of N they are removing come from?
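As I understand it, the factor of N comes from the order of the matrix products rather than the cost of the exp itself: once the exponential is gone, (Q K^T) V can be reassociated as Q (K^T V), so the N x N score matrix is never materialized. A rough NumPy sketch of that argument (the positive feature map `phi` and the normalization are my own placeholders, not necessarily what the article uses):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes an N x N score matrix, so O(N^2 * d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (N, N)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                           # (N, d)

def linear_attention(Q, K, V):
    """Without the exp, (phi(Q) phi(K)^T) V reassociates to phi(Q) (phi(K)^T V).
    phi(K)^T V is only (d, d), so the whole thing costs O(N * d^2): linear in N."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6              # placeholder positive feature map
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                                          # (d, d), built in O(N * d^2)
    z = Qf @ Kf.sum(axis=0)                                # (N,), normalizer
    return (Qf @ kv) / z[:, None]                          # (N, d)

N, d = 2048, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # (2048, 64) (2048, 64)
```

So removing the exponential is cheap in itself, but it is what makes the reassociation (and hence the O(N) total cost) possible.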
To be honest, this makes me less excited about linear transformers.

Even heavily optimized, they are still (nearly) no better than normal flash attention up to context length 10^4.

And that's before you account for the degradation in learning.

Maybe if you're doing 100k attention at inference it starts making sense... But then there are other methods you can start using too.
Great writeup and interesting experiments. I can't help but wonder what would happen if you instead made a rectified linear attention. Is that even possible?
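Just to make the question concrete, something like the sketch below is what I'm picturing: a ReLU applied directly to the raw scores in place of the softmax. (A naive sketch, and note that on its own it doesn't remove the O(N^2) term; for that the rectification would have to move into a per-token feature map so the products can be reassociated, as in the earlier comment's example.)

```python
import numpy as np

def relu_attention(Q, K, V):
    """Replace softmax with a ReLU over the raw scores, crudely normalized by length.
    Still builds the full N x N score matrix, so it remains O(N^2 * d) as written."""
    N = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (N, N)
    w = np.maximum(scores, 0.0) / N           # rectified, length-normalized weights
    return w @ V                              # (N, d)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
print(relu_attention(Q, K, V).shape)          # (256, 64)
```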