
Mixture-of-Depths: Dynamically allocating compute in transformers

281 points by milliondreams · about 1 year ago

13 comments

whimsicalism · about 1 year ago

I think more complicated routing is absolutely going to become more common.

Specifically, I think at some point we are going to move to recursive routing, i.e. pass back through a set of experts again. In the future, 'chain-of-thought' will happen internal to the model recursively.
nl · about 1 year ago

Most important paper of 2024.

The idea that we want models not to have to use the same amount of compute for every token has been around for a while. This is the first compelling mechanism I've seen for doing it.

> Equipped with these new methods, we can sample autoregressively by choosing to route tokens to or around a block based on the router's output, which does not depend on any information from future tokens. We provide empirical evidence that this is a relatively easy auxiliary task that quickly achieves 99% accuracy.

Does anyone else find this a bit surprising?
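A minimal sketch of the mechanism quoted above, under assumed names and shapes (illustrative, not the paper's code): a router scores every token, a fixed top-k of them pass through the block while the rest ride the residual stream, and a small causal predictor is trained to imitate the top-k decision so that autoregressive sampling never needs information from future tokens.

```python
import torch
import torch.nn as nn

class MoDBlockSketch(nn.Module):
    """Illustrative mixture-of-depths-style block (names and sizes are assumptions)."""
    def __init__(self, d_model: int = 64, capacity_frac: float = 0.125):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.router = nn.Linear(d_model, 1)    # scores each token in the sequence
        self.sampler = nn.Linear(d_model, 1)   # per-token "route or skip" head for decode time
        self.capacity_frac = capacity_frac

    def forward(self, x: torch.Tensor):
        b, s, _ = x.shape
        k = max(1, int(s * self.capacity_frac))
        scores = self.router(x).squeeze(-1)          # (b, s)
        topk = scores.topk(k, dim=-1).indices        # which tokens get compute at this layer
        out = x.clone()                              # skipped tokens pass through unchanged
        for i in range(b):                           # gather/scatter kept deliberately simple
            chosen = x[i, topk[i]].unsqueeze(0)
            out[i, topk[i]] = self.block(chosen).squeeze(0)
        # Auxiliary target: 1 if the token made the top-k, else 0. Training `sampler`
        # on this is the kind of "relatively easy auxiliary task" the quoted passage
        # describes: its causal per-token prediction replaces the non-causal top-k
        # when sampling autoregressively.
        target = torch.zeros(b, s, device=x.device)
        target.scatter_(1, topk, 1.0)
        aux_logits = self.sampler(x).squeeze(-1)
        return out, aux_logits, target
```

At decode time one would threshold the sampler's per-token output instead of taking a sequence-level top-k, which is why no future tokens are needed.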
panqueca · about 1 year ago

Simplified Intro Version:

Imagine you have a smart assistant that can understand and process the words you say to it. Usually, this assistant pays equal attention to every word you say, no matter how important or unimportant each word is to the overall meaning of your message.

Now, imagine that we found a way to teach the assistant to be smarter about how it uses its "brain power." Instead of giving equal attention to every word, the assistant learns to focus more on the words that are most important for understanding what you mean. It can even adjust this focus on the fly, paying more attention to different words depending on the context of your message.

To make sure the assistant doesn't get overwhelmed, we also set a limit on how much total "brain power" it can use at any given time. It's like giving the assistant a budget and saying, "You can only spend your brain power on a certain number of words at a time." The assistant then has to decide which words are most important to focus on.

Even with this limit, the assistant is still flexible in how it uses its brain power. It might spend more on certain words and less on others, depending on what you're saying. This means that while we always know the total amount of brain power the assistant is using, it can adapt to different situations and prioritize what's most important.

When we teach the assistant using this method, it not only learns to focus its attention intelligently but also does so very efficiently. It can understand you just as well as an assistant that pays equal attention to every word, but it uses less brain power overall. This makes the assistant much faster at responding to you and processing new information.
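A toy illustration (assumed numbers, not from the paper) of the budget idea in the comment above: each layer may only spend compute on k of the s tokens, so the total cost is fixed and known in advance, but which tokens receive it changes with the input.

```python
import torch

def allocate_budget(importance: torch.Tensor, capacity_frac: float = 0.25) -> torch.Tensor:
    """importance: (seq,) router scores; returns a boolean mask of tokens to process."""
    s = importance.shape[0]
    k = max(1, int(s * capacity_frac))      # the fixed per-layer "brain power" budget
    mask = torch.zeros(s, dtype=torch.bool)
    mask[importance.topk(k).indices] = True
    return mask

seq_a = torch.tensor([0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.7, 0.15])
seq_b = torch.tensor([0.2, 0.95, 0.1, 0.1, 0.85, 0.6, 0.05, 0.4])
print(allocate_budget(seq_a))  # budget of 2 tokens spent on positions 0 and 3
print(allocate_budget(seq_b))  # same budget, spent on positions 1 and 4
```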
mattmcdonagh · about 1 year ago

I wrote up a bit about it here, from what I could piece together:

https://lifeinthesingularity.com/p/googles-breakthroughs-in-ai-design
rughouse · about 1 year ago

It’s very similar to Mixture of Experts. But instead of routing tokens to multiple experts, you "deploy to a single expert which can be dynamically skipped".
macrolime · about 1 year ago

"This is more computationally efficient than performing a full content-based lookup across an entire memory buffer for each step in the future, and could be one step towards drastically increasing the context-length available for making a prediction."

Is this how they get a context window of 10 million tokens? Or are they referring to even longer context windows in the future?
nikvaes · about 1 year ago

After trying to understand and implement some algorithms in RASP [1, 2], my takeaway was that certain functions need a certain number of transformer layers to operate. Following this logic, it should become apparent that the functions learned by transformers can be spread over multiple heads. Repeating these functions might be very valuable for understanding and solving a problem, but current inference does not allow (a set of subsequent) heads to be repeated. This paper indeed seems a promising direction.

[1] https://arxiv.org/pdf/2106.06981.pdf

[2] https://www.youtube.com/watch?v=t5LjgczaS80
edude03 · about 1 year ago

Maybe the only downside to how fast LLMs are moving is that papers come out faster than anyone (not at Google) can train and test the improvements.

I got into deep learning around when ReLU and dropout were hot, and on my consumer 1080 I was able to change one or two lines of code and test the improvements in a few hours. Now, I guess I'll need to wait a few weeks for Mistral et al. to try it out.
yair99dd · about 1 year ago

hu-po does in-depth live-stream reviews of AI papers. Highly recommended; here is his take on the mixture-of-depths paper discussed: https://www.youtube.com/watch?v=Teru_qIdB8Y
maxrumpf · about 1 year ago

The abstract and the rest of the paper don't really match, imo. It's not really allocating more compute to some sequences, but just introducing ~dropout. Might be two sides of the same coin, but it was still a weird read.
kromem · about 1 year ago

Essentially the second law of thermodynamics for neural networks.

Neat!
modeless · about 1 year ago

It's a start, but it's disappointing that half the layers still have to process every token. It seems like we ought to be able to get to 90% or even 99% savings when these models currently allocate the same compute for outputting "the" as they do for outputting the first digit of the answer to a complicated math problem.
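Back-of-envelope arithmetic behind that objection, under an assumed configuration (alternating routed blocks with a 12.5% token capacity; the numbers are illustrative): if half the blocks still process every token, total compute cannot fall much below half of the dense baseline, which is why 90% or 99% savings are out of reach with this layout.

```python
# Fraction of dense-model compute used, given the assumed configuration above.
dense_layers = 0.5      # fraction of blocks that still process every token
routed_layers = 0.5     # fraction of blocks with a router
capacity = 0.125        # fraction of tokens a routed block actually processes

compute_fraction = dense_layers * 1.0 + routed_layers * capacity
print(compute_fraction)  # ~0.56 of dense compute, i.e. roughly 44% saved
```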
barrenko · about 1 year ago
Are we going to hit bullseye?