
New technology could blow away GPT-4 and everything like it

102 points by andy_threos_io, about 2 years ago

7 comments

saurabh20n, about 2 years ago
Notes from a quick read of the paper at https://arxiv.org/abs/2302.10866. The popsci title is overreaching; this is a drop-in subquadratic replacement for attention. Could be promising, but it remains to be seen whether it is adopted in practice. skybrian (https://news.ycombinator.com/item?id=35657983) points out a new blog post by the authors, and a previous discussion of an older (March 28th) blog post. Takeaways:

* In standard attention in transformers, cost scales quadratically with sequence length, which restricts model context. This work presents a subquadratic exact operator, allowing it to scale to larger contexts (100k+).

* They introduce an operator called the "Hyena hierarchy", a recurrence over two subquadratic operations: long convolution and element-wise multiplicative gating. Sections 3.1-3.3 define the recurrences, matrices, and filters. This is, importantly, a drop-in replacement for attention.

* Longer context: 100x speedup over FlashAttention at 64k context (if we view FlashAttention as a non-approximate engineering optimization, then this work improves things algorithmically and gains an order of magnitude over that). Associative recall (i.e., just pulling out stored data) shows improvements: experiments on 137k context and vocab sizes of 10-40 (unsure why they have poor recall on short sequences with larger vocabs, but they still outperform the others).

* Comparisons (on relatively small models, but hoping to show a pattern) with RWKV (an attention-free model, trained on 332B tokens) and GPTNeo (trained on 300B tokens), with Hyena trained on 137B tokens. Models are 125M-355M parameters. (Section 4.3)

* On SuperGLUE, zero-shot and 3-shot accuracy is in the same ballpark as GPTNeo (although technically they underperform a bit for zero-shot and overperform a bit for 3-shot). (Tables 4.5 and 4.6)

* Because they can support large (e.g., 100k+) contexts, they can do image classification. They report ballpark-comparable results against others. (Table 4.7)

Might have misread some takeaways; happy to be corrected.
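For intuition only, here is a minimal sketch (not the authors' code) of the recurrence described above: alternate FFT-based long convolutions with element-wise gating, so each step stays subquadratic in sequence length. The shapes, random gates/filters, and function names are illustrative assumptions; in the actual model the gates and filters come from learned projections.

```python
import torch

def fft_long_conv(x, h):
    """Long convolution of x (B, L, D) with a length-L filter h (L, D),
    computed via FFT in O(L log L) instead of O(L^2)."""
    L = x.shape[1]
    n = 2 * L  # zero-pad so the circular FFT convolution becomes a linear one
    Xf = torch.fft.rfft(x, n=n, dim=1)
    Hf = torch.fft.rfft(h, n=n, dim=0)
    y = torch.fft.irfft(Xf * Hf.unsqueeze(0), n=n, dim=1)
    return y[:, :L]  # keep the causal part

def hyena_like_operator(v, gates, filters):
    """Recurrence sketched in the comment above: alternate a long convolution
    with element-wise multiplicative gating, once per 'order' of the operator."""
    z = v
    for g, h in zip(gates, filters):
        z = g * fft_long_conv(z, h)
    return z

# Toy usage with made-up sizes: batch 2, sequence 1024, width 64, order 2.
B, L, D, order = 2, 1024, 64, 2
v = torch.randn(B, L, D)                              # stand-in for an input projection
gates = [torch.randn(B, L, D) for _ in range(order)]  # stand-ins for learned gate projections
filters = [torch.randn(L, D) for _ in range(order)]   # stand-ins for learned long filters
print(hyena_like_operator(v, gates, filters).shape)   # torch.Size([2, 1024, 64])
```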
skybrian, about 2 years ago
Blog post: https://hazyresearch.stanford.edu/blog/2023-03-07-hyena

Previous discussion: https://news.ycombinator.com/item?id=35502187
LoganDark, about 2 years ago
The biggest thing I'm worried about is whether the unlimited context (via convolution) is anything like picking a random set of samples. If I give it a super long context and then tell it to recall something at the very beginning, will it be able to do it? Will it be able to parse the question if it's a very small percentage of the total context, even if the question is at the very end? Or is it just less computationally expensive to process the entire context with this method?
PaulHoule, about 2 years ago
I had so much fun with CNN models just before BERT hit it big. It would be nice to see them make a comeback.
[Comment #35662646 not loaded]
barbariangrunge, about 2 years ago
> At 64,000 tokens, the authors relate, "Hyena speed-ups reach 100x" -- a one-hundred-fold performance improvement.

That's quite the difference
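A rough back-of-the-envelope (mine, not from the article): if attention costs on the order of L^2 and the FFT-based long convolution on the order of L log L, the naive ratio at 64k tokens is much larger than 100x, so the measured figure already absorbs sizeable constant factors and other per-layer work.

```python
import math

L = 64_000
attn_ops = L ** 2            # ~4.1e9, quadratic attention cost model
conv_ops = L * math.log2(L)  # ~1.0e6, FFT long-convolution cost model
print(f"naive ratio: {attn_ops / conv_ops:.0f}x")  # ~4000x
```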
[Comment #35658055 not loaded]
sharemywin, about 2 years ago
I didn't see anything in the article about what the scaling factor was. Less than P^2, but what was it?
[Comment #35657001 not loaded]

[Comment #35657973 not loaded]
galaxytachyon, about 2 years ago
How good is it at scaling? And will it still retain the emergent capabilities of the huge transformer LLMs?

Isn't this basically the bitter lesson again? Small improvements work, but in the long term they won't give the same impressive result?
[Comment #35660342 not loaded]