科技回声

A tech news platform built with Next.js, providing global tech news and discussion.


Hardware Design for LLM Inference: Von Neumann Bottleneck

3 points by jdkee, over 1 year ago

1 comment

andy99, over 1 year ago
This is what some of the new dedicated AI chips are designed to overcome. https://www.untether.ai/technology explicitly calls out the issue with the Von Neumann architecture and has cells that combine compute and memory in one place. I'm pretty sure https://groq.com/ has a similar concept.

Some interesting stuff happens when you're memory-bandwidth limited. In particular, parallelism doesn't help, and in an LLM it becomes faster to use quantized 16-bit weights that get converted to float32 when used, because the CPU can convert, multiply, and add 16-bit values faster than memory can move 32-bit values to the CPU.
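To illustrate the trade-off the comment describes, here is a minimal NumPy sketch (my own illustration, not from the thread): weights stored as float16 occupy half the bytes of float32, and are widened to float32 only at compute time. Note that `astype` in NumPy materializes a full float32 copy, so this sketch only demonstrates the storage halving and numerical equivalence; in a real bandwidth-bound kernel the widening would happen per tile, in registers, as the weights stream in.

```python
import numpy as np

def matvec_fp32(w32, x):
    # Baseline: weights already float32, so memory moves 4 bytes per weight.
    return w32 @ x

def matvec_fp16_widen(w16, x):
    # Weights stored as float16 (2 bytes per weight), widened to float32
    # on use. On bandwidth-bound hardware the cheap conversion can beat
    # the doubled memory traffic of storing float32 directly.
    return w16.astype(np.float32) @ x

rng = np.random.default_rng(0)
w32 = rng.standard_normal((1024, 1024), dtype=np.float32)
w16 = w32.astype(np.float16)   # half the bytes in memory
x = rng.standard_normal(1024, dtype=np.float32)

y32 = matvec_fp32(w32, x)
y16 = matvec_fp16_widen(w16, x)

# Storage really is halved...
assert w16.nbytes * 2 == w32.nbytes
# ...and the results agree up to float16 rounding error.
assert np.allclose(y32, y16, rtol=1e-2, atol=0.5)
```

The accuracy loss from rounding weights to 16 bits is small relative to the bandwidth saved, which is why this style of weight quantization is common in LLM inference.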