
New technology could blow away GPT-4 and everything like it

102 points by andy_threos_io, about 2 years ago

7 comments

saurabh20n, about 2 years ago
Notes from a quick read of the paper at https://arxiv.org/abs/2302.10866. The popsci title is overreaching; this is a drop-in subquadratic replacement for attention. Could be promising, but it remains to be seen whether it is adopted in practice. skybrian (https://news.ycombinator.com/item?id=35657983) points out a new blog post by the authors, and a previous discussion of an older (March 28th) blog post. Takeaways:

* In standard attention in transformers, cost scales quadratically with sequence length, which restricts model context. This work presents a subquadratic exact operator, allowing it to scale to larger contexts (100k+).

* They introduce an operator called the "Hyena hierarchy", a recurrence over two subquadratic operations: long convolution and element-wise multiplicative gating. Sections 3.1-3.3 define the recurrences, matrices, and filters. This is, importantly, a drop-in replacement for attention.

* Longer context: 100x speedup over FlashAttention at 64k context (if we view FlashAttention as a non-approximate engineering optimization, then this work improves things algorithmically and gains an order of magnitude over that). Associative recall (i.e., just pulling out stored data) shows improvements: experiments on 137k context and vocab sizes of 10-40 (unsure why they have poor recall on short sequences with larger vocabs, but they still outperform the others).

* Comparisons (on relatively small models, but hoping to show a pattern) with RWKV (an attention-free model, trained on 332B tokens) and GPTNeo (trained on 300B tokens), with Hyena trained on 137B tokens. Models are 125M-355M parameters. (Section 4.3)

* On SuperGLUE, zero-shot and 3-shot accuracy is in the same ballpark as GPTNeo (although technically they underperform a bit for zero-shot and overperform a bit for 3-shot). (Tables 4.5 and 4.6)

* Because they can support large (e.g., 100k+) contexts, they can do image classification. They report ballpark-comparable results against others. (Table 4.7)

Might have misread some takeaways; happy to be corrected.
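For intuition only, here is a minimal sketch (not the authors' code) of the recurrence described above: alternate FFT-based long convolutions with element-wise gating, so each step stays subquadratic in sequence length. The shapes, random gates/filters, and function names are illustrative assumptions; in the actual model the gates and filters come from learned projections.

```python
import torch

def fft_long_conv(x, h):
    """Long convolution of x (B, L, D) with a length-L filter h (L, D),
    computed via FFT in O(L log L) instead of O(L^2)."""
    L = x.shape[1]
    n = 2 * L  # zero-pad so the circular FFT convolution becomes a linear one
    Xf = torch.fft.rfft(x, n=n, dim=1)
    Hf = torch.fft.rfft(h, n=n, dim=0)
    y = torch.fft.irfft(Xf * Hf.unsqueeze(0), n=n, dim=1)
    return y[:, :L]  # keep the causal part

def hyena_like_operator(v, gates, filters):
    """Recurrence sketched in the comment above: alternate a long convolution
    with element-wise multiplicative gating, once per 'order' of the operator."""
    z = v
    for g, h in zip(gates, filters):
        z = g * fft_long_conv(z, h)
    return z

# Toy usage with made-up sizes: batch 2, sequence 1024, width 64, order 2.
B, L, D, order = 2, 1024, 64, 2
v = torch.randn(B, L, D)                              # stand-in for an input projection
gates = [torch.randn(B, L, D) for _ in range(order)]  # stand-ins for learned gate projections
filters = [torch.randn(L, D) for _ in range(order)]   # stand-ins for learned long filters
print(hyena_like_operator(v, gates, filters).shape)   # torch.Size([2, 1024, 64])
```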
skybrian, about 2 years ago
Blog post: https://hazyresearch.stanford.edu/blog/2023-03-07-hyena

Previous discussion: https://news.ycombinator.com/item?id=35502187
LoganDark, about 2 years ago
The biggest thing I'm worried about is whether the unlimited context (via convolution) is anything like picking a random set of samples. If I give it a super long context and then tell it to recall something at the very beginning, will it be able to do it? Will it be able to parse the question if it's a very small percentage of the total context, even if the question is at the very end? Or is it just less computationally expensive to process the entire context with this method?
PaulHoule, about 2 years ago
I had so much fun with CNN models just before BERT hit it big. It would be nice to see them make a comeback.
[Comment #35662646 not loaded]
barbariangrunge, about 2 years ago
> At 64,000 tokens, the authors relate, "Hyena speed-ups reach 100x" -- a one-hundred-fold performance improvement.

That's quite the difference
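A rough back-of-the-envelope (mine, not from the article): if attention costs on the order of L^2 and the FFT-based long convolution on the order of L log L, the naive ratio at 64k tokens is much larger than 100x, so the measured figure already absorbs sizeable constant factors and other per-layer work.

```python
import math

L = 64_000
attn_ops = L ** 2            # ~4.1e9, quadratic attention cost model
conv_ops = L * math.log2(L)  # ~1.0e6, FFT long-convolution cost model
print(f"naive ratio: {attn_ops / conv_ops:.0f}x")  # ~4000x
```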
[Comment #35658055 not loaded]
sharemywin, about 2 years ago
I didn't see anything in the article about what the scaling factor was. Less than P^2, but what was it?
[Comment #35657001 not loaded]

[Comment #35657973 not loaded]
galaxytachyon, about 2 years ago
How good is it at scaling? And will it still retain the emergent capabilities of the huge transformer LLMs?

Isn't this basically the bitter lesson again? Small improvements work, but in the long term they won't give the same impressive result?
[Comment #35660342 not loaded]