Tree Attention: Topology-Aware Decoding for Long-Context

79 points by diwank · 9 months ago

5 comments

mjburgess · 9 months ago
I recall reading recently that someone went back and trained an RNN at a similar scale to a GPT and got similar performance on modern hardware (perhaps someone can link me that paper?).

I.e., the innovation in statistical AI isn't in making the algorithms "smarter", it's in finding ways to align the computation with modern GPU hardware -- this has been the story since 2012.

In the end, the function all such algorithms are approximating is a conditional probability. I.e., the perfect answer to any prompt is to ignore training entirely and, at inference time, compute an expectation across all historical data. All training does is, essentially, optimally cache a large part of that computation.

This is very different from how it's typically sold/understood, in the sense that there's an appearance that at inference time some unbounded computation is going on, i.e. "thinking"/"reasoning"/etc. But at inference time, *for any prompt*, the same amount of computation is used, regardless of the question's complexity. So the system will appear to reason (etc.) if it can sample convincingly from its pre-cached computation.

This means "innovation" here follows a Moore's-law S-curve for GPU hardware.
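A toy gloss on the commenter's "training as caching" framing (my own illustration; the corpus, function names, and bigram setup are invented for the example, not anything from the thread or the paper): the "ideal" predictor rescans all historical data per query, while a "trained" model answers from a table precomputed once up front, and both yield the same conditional distribution.

```python
# Toy sketch of "training = caching the conditional-probability computation".
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the rat".split()

def empirical_next(context):
    """Scan all historical data at query time: p(next | context) by counting."""
    counts = Counter(nxt for prev, nxt in zip(corpus, corpus[1:]) if prev == context)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# "Training": precompute those conditionals once, up front.
cached = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    cached[prev][nxt] += 1

def model_next(context):
    """Answer from the precomputed table (the comment's 'cached' computation)."""
    counts = cached[context]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

assert empirical_next("the") == model_next("the")
```

The lookup side does roughly the same amount of work for an easy context as for a hard one, which is the fixed-compute point the comment is making.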
brrrrrm · 9 months ago
How does this approach differ from Nvidia's 2019 writeup on using trees to improve allreduce operations? https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/
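For context on the comparison being asked about, here is a toy sketch (my own, not code from the NCCL post) of the basic tree-versus-sequential trade-off: combining values held by N workers pairwise up a binary tree takes about ceil(log2 N) rounds of parallel combines, whereas passing a running total along a chain of N workers takes N - 1 steps.

```python
import math

def tree_reduce(values):
    """Combine a list of per-worker values pairwise, counting the rounds needed."""
    rounds = 0
    while len(values) > 1:
        # Each adjacent pair is combined in parallel within one round.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

total, rounds = tree_reduce(list(range(16)))
assert total == sum(range(16))
assert rounds == math.ceil(math.log2(16))  # 4 rounds instead of 15 sequential steps
```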
tveita · 9 months ago
The same authors also have a language model at https://github.com/Zyphra/Zamba2, but it's not clear to me if that model is connected to tree attention.

The announcement at https://www.zyphra.com/post/zamba2-small links to this paper, but the paper doesn't actually mention Zamba2 anywhere.
Narhem · 9 months ago
How often do papers like this make it into industry applications/published research? It seems stuck in between the two.
cs702 · 9 months ago
Interesting.

The authors claim this outperforms Ring Attention for distributed computation of self-attention over multiple GPUs.

Distributing the computation is necessary whenever the context is too long for self-attention's computation to fit in a single GPU's available memory.

GitHub link: https://github.com/Zyphra/tree_attention
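The reason attention can be distributed at all is that softmax attention over the full sequence can be recovered exactly from per-shard partial results. Below is a minimal single-process sketch of that combine step (my own illustration with made-up shapes, not code from the Zyphra repo or the paper); it is the building block that both Ring Attention and tree-style schemes reduce over.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention over one shard of keys/values: returns (max score, weighted sum, weight total)."""
    s = K @ q                      # raw scores for this shard
    m = s.max()                    # shard-local max, for numerical stability
    w = np.exp(s - m)              # unnormalized attention weights
    return m, w @ V, w.sum()

def combine(partials):
    """Merge shard partials into the exact full-sequence attention output (log-sum-exp style)."""
    m = max(p[0] for p in partials)
    num = sum(np.exp(p[0] - m) * p[1] for p in partials)
    den = sum(np.exp(p[0] - m) * p[2] for p in partials)
    return num / den

rng = np.random.default_rng(0)
d, n, shards = 16, 1024, 4
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Reference: full attention computed in one place.
s = K @ q
w = np.exp(s - s.max())
ref = (w / w.sum()) @ V

# Sharded: each chunk (as if on its own GPU) produces a partial, then merge.
out = combine([partial_attention(q, Ks, Vs)
               for Ks, Vs in zip(np.split(K, shards), np.split(V, shards))])
assert np.allclose(ref, out)
```

Because this combine is associative, the partials can be merged pairwise up a tree of devices rather than passed around a ring, which is roughly what a topology-aware, tree-shaped reduction exploits.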