Tree Attention: Topology-Aware Decoding for Long-Context

79 points by diwank · 9 months ago

5 comments

mjburgess · 9 months ago
I recall reading recently that someone went back and trained an RNN at a similar scale to a GPT and got similar performance on modern hardware (perhaps someone can link me that paper?).

I.e., the innovation in statistical AI isn't in making the algorithms "smarter", it's in finding ways to align the computation with modern GPU hardware -- this has been the story since 2012.

In the end, the function all such algorithms are approximating is a conditional probability. I.e., the perfect answer to any prompt is to ignore training entirely and, at inference time, compute an expectation across all historical data. All training does is, essentially, optimally cache a large part of that computation.

This is very different from how it's typically sold/understood, in the sense that there's an appearance that at inference time some unbounded computation is going on, i.e. "thinking"/"reasoning"/etc. But at inference time, *for any prompt*, the same amount of computation is used, regardless of the question's complexity. So the system will appear to reason (etc.) if it can sample convincingly from its pre-cached computation.

This means "innovation" here follows a Moore's-law S-curve for GPU hardware.
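A toy gloss on the commenter's "training as caching" framing (my own illustration; the corpus, function names, and bigram setup are invented for the example, not anything from the thread or the paper): the "ideal" predictor rescans all historical data per query, while a "trained" model answers from a table precomputed once up front, and both yield the same conditional distribution.

```python
# Toy sketch of "training = caching the conditional-probability computation".
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the rat".split()

def empirical_next(context):
    """Scan all historical data at query time: p(next | context) by counting."""
    counts = Counter(nxt for prev, nxt in zip(corpus, corpus[1:]) if prev == context)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# "Training": precompute those conditionals once, up front.
cached = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    cached[prev][nxt] += 1

def model_next(context):
    """Answer from the precomputed table (the comment's 'cached' computation)."""
    counts = cached[context]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

assert empirical_next("the") == model_next("the")
```

The lookup side does roughly the same amount of work for an easy context as for a hard one, which is the fixed-compute point the comment is making.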
brrrrrm · 9 months ago
How does this approach differ from Nvidia's 2019 writeup on using trees to improve allreduce operations? https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/
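For context on the comparison being asked about, here is a toy sketch (my own, not code from the NCCL post) of the basic tree-versus-sequential trade-off: combining values held by N workers pairwise up a binary tree takes about ceil(log2 N) rounds of parallel combines, whereas passing a running total along a chain of N workers takes N - 1 steps.

```python
import math

def tree_reduce(values):
    """Combine a list of per-worker values pairwise, counting the rounds needed."""
    rounds = 0
    while len(values) > 1:
        # Each adjacent pair is combined in parallel within one round.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

total, rounds = tree_reduce(list(range(16)))
assert total == sum(range(16))
assert rounds == math.ceil(math.log2(16))  # 4 rounds instead of 15 sequential steps
```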
tveita · 9 months ago
The same authors also have a language model at https://github.com/Zyphra/Zamba2, but it's not clear to me if that model is connected to tree attention.

The announcement at https://www.zyphra.com/post/zamba2-small links to this paper, but the paper doesn't actually mention Zamba2 anywhere.
Narhem · 9 months ago
How often do papers like this make it into industry applications/published research? It seems stuck in between the two.
cs702 · 9 months ago
Interesting.

The authors claim this outperforms Ring Attention for distributed computation of self-attention over multiple GPUs.

Distributing the computation is necessary whenever the context is too long for self-attention's computation to fit in a single GPU's available memory.

GitHub link: https://github.com/Zyphra/tree_attention
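The reason attention can be distributed at all is that softmax attention over the full sequence can be recovered exactly from per-shard partial results. Below is a minimal single-process sketch of that combine step (my own illustration with made-up shapes, not code from the Zyphra repo or the paper); it is the building block that both Ring Attention and tree-style schemes reduce over.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention over one shard of keys/values: returns (max score, weighted sum, weight total)."""
    s = K @ q                      # raw scores for this shard
    m = s.max()                    # shard-local max, for numerical stability
    w = np.exp(s - m)              # unnormalized attention weights
    return m, w @ V, w.sum()

def combine(partials):
    """Merge shard partials into the exact full-sequence attention output (log-sum-exp style)."""
    m = max(p[0] for p in partials)
    num = sum(np.exp(p[0] - m) * p[1] for p in partials)
    den = sum(np.exp(p[0] - m) * p[2] for p in partials)
    return num / den

rng = np.random.default_rng(0)
d, n, shards = 16, 1024, 4
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Reference: full attention computed in one place.
s = K @ q
w = np.exp(s - s.max())
ref = (w / w.sum()) @ V

# Sharded: each chunk (as if on its own GPU) produces a partial, then merge.
out = combine([partial_attention(q, Ks, Vs)
               for Ks, Vs in zip(np.split(K, shards), np.split(V, shards))])
assert np.allclose(ref, out)
```

Because this combine is associative, the partials can be merged pairwise up a tree of devices rather than passed around a ring, which is roughly what a topology-aware, tree-shaped reduction exploits.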