
PowerInfer: Fast Large Language Model Serving with a Consumer-Grade GPU [pdf]

84 points, by georgehill, over 1 year ago

5 comments

girvo, over 1 year ago
From their GitHub[0]:

> Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.

Impressive result and very exciting if it holds true! The hybridisation idea (preloading known-"hot" neurons to the GPU and leaving "cold" ones on the CPU) is a neat one on the surface.

[0] https://github.com/SJTU-IPADS/PowerInfer
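The hot/cold split girvo describes can be illustrated with a small sketch. This is not PowerInfer's actual code: the partitioning heuristic, the activation-frequency profile, and the gpu_matmul/cpu_matmul callables are hypothetical stand-ins for the system's offline profiling and hybrid execution engine.

    # Illustrative sketch only (not PowerInfer's code): partition a layer's
    # neurons by measured activation frequency, keep the hottest rows on the
    # GPU, and compute the remaining rows on the CPU at inference time.
    import numpy as np

    def split_hot_cold(weight, activation_freq, gpu_rows):
        """Pick the most frequently activated rows up to a GPU row budget."""
        order = np.argsort(activation_freq)[::-1]          # hottest rows first
        hot_idx, cold_idx = order[:gpu_rows], order[gpu_rows:]
        return weight[hot_idx], weight[cold_idx], hot_idx, cold_idx

    def hybrid_forward(x, w_hot, w_cold, hot_idx, cold_idx,
                       gpu_matmul=np.dot, cpu_matmul=np.dot):
        """Run hot rows on the GPU path, cold rows on the CPU path, then merge."""
        out = np.empty(len(hot_idx) + len(cold_idx), dtype=x.dtype)
        out[hot_idx] = gpu_matmul(w_hot, x)    # preloaded weights, fast path
        out[cold_idx] = cpu_matmul(w_cold, x)  # stays in host memory
        return out

In the real system the hot path would be a GPU kernel over preloaded weights and the cold path a CPU kernel, with a predictor deciding which neurons activate at all; the sketch only shows the bookkeeping of the split.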
LoganDark, over 1 year ago
> PowerInfer's source code is publicly available at https://github.com/SJTU-IPADS/PowerInfer

---

Just curious - PowerInfer seems to market itself by running very large models (40B, 70B) on something like a 4090. If I have, say, a 3060 12GB, and I want to run something like a 7B or 13B, can I expect the same speedup of around 10x? Or does this only help that much for models that wouldn't already fit in VRAM?
chsasank, over 1 year ago
This is basically a fork of llama.cpp. I created a PR to see the diff and added my comments on it: https://github.com/ggerganov/llama.cpp/pull/4543

One thing that caught my interest is this line from their readme:

> PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers.

Apple's Metal/M3 is perfect for this because the CPU and GPU share memory, so no data transfers are needed. Check out mlx from Apple: https://github.com/ml-explore/mlx
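A minimal sketch of the unified-memory point, using mlx's per-operation stream/device selection. The API usage follows mlx's unified-memory documentation but should be treated as an assumption, and the shapes and variable names are made up:

    # On Apple silicon both arrays live in the same unified memory; only the
    # compute device differs per operation, so there is no CPU<->GPU copy.
    import mlx.core as mx

    w_hot = mx.random.normal((4096, 4096))   # "hot" weights
    w_cold = mx.random.normal((4096, 4096))  # "cold" weights
    x = mx.random.normal((4096,))

    y_hot = mx.matmul(w_hot, x, stream=mx.gpu)    # run on the GPU
    y_cold = mx.matmul(w_cold, x, stream=mx.cpu)  # run on the CPU
    mx.eval(y_hot, y_cold)                        # force the lazy graph

Whether this recovers PowerInfer's speedup on Apple silicon is a separate question: unified memory removes the transfer cost, not the gap in compute throughput between the CPU and GPU.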
Animats, over 1 year ago
That's an impressive result. LLMs should soon be much cheaper to run.
Havoc, over 1 year ago
Really clever trick. GPU/CPU splits are currently painfully slow, so this may just make them more bearable.