
PowerInfer: Fast Large Language Model Serving with a Consumer-Grade GPU [pdf]

84 points, by georgehill, over 1 year ago

5 comments

girvo, over 1 year ago
From their GitHub[0]:

> Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.

Impressive result and very exciting if it holds true! The hybridisation idea (preloading known-"hot" neurons to the GPU and leaving "cold" ones on the CPU) is a neat one on the surface.

[0] https://github.com/SJTU-IPADS/PowerInfer
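The hot/cold split girvo describes can be illustrated with a small sketch. This is not PowerInfer's actual code: the partitioning heuristic, the activation-frequency profile, and the gpu_matmul/cpu_matmul callables are hypothetical stand-ins for the system's offline profiling and hybrid execution engine.

    # Illustrative sketch only (not PowerInfer's code): partition a layer's
    # neurons by measured activation frequency, keep the hottest rows on the
    # GPU, and compute the remaining rows on the CPU at inference time.
    import numpy as np

    def split_hot_cold(weight, activation_freq, gpu_rows):
        """Pick the most frequently activated rows up to a GPU row budget."""
        order = np.argsort(activation_freq)[::-1]          # hottest rows first
        hot_idx, cold_idx = order[:gpu_rows], order[gpu_rows:]
        return weight[hot_idx], weight[cold_idx], hot_idx, cold_idx

    def hybrid_forward(x, w_hot, w_cold, hot_idx, cold_idx,
                       gpu_matmul=np.dot, cpu_matmul=np.dot):
        """Run hot rows on the GPU path, cold rows on the CPU path, then merge."""
        out = np.empty(len(hot_idx) + len(cold_idx), dtype=x.dtype)
        out[hot_idx] = gpu_matmul(w_hot, x)    # preloaded weights, fast path
        out[cold_idx] = cpu_matmul(w_cold, x)  # stays in host memory
        return out

In the real system the hot path would be a GPU kernel over preloaded weights and the cold path a CPU kernel, with a predictor deciding which neurons activate at all; the sketch only shows the bookkeeping of the split.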
LoganDark, over 1 year ago
> PowerInfer's source code is publicly available at https://github.com/SJTU-IPADS/PowerInfer

---

Just curious - PowerInfer seems to market itself by running very large models (40B, 70B) on something like a 4090. If I have, say, a 3060 12GB, and I want to run something like a 7B or 13B, can I expect the same speedup of around 10x? Or does this only help that much for models that wouldn't already fit in VRAM?
chsasank, over 1 year ago
This is basically a fork of llama.cpp. I created a PR to see the diff and added my comments on it: https://github.com/ggerganov/llama.cpp/pull/4543

One thing that caught my interest is this line from their readme:

> PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers.

Apple's Metal/M3 is perfect for this because the CPU and GPU share memory, so no data transfers are needed. Check out mlx from Apple: https://github.com/ml-explore/mlx
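A minimal sketch of the unified-memory point, using mlx's per-operation stream/device selection. The API usage follows mlx's unified-memory documentation but should be treated as an assumption, and the shapes and variable names are made up:

    # On Apple silicon both arrays live in the same unified memory; only the
    # compute device differs per operation, so there is no CPU<->GPU copy.
    import mlx.core as mx

    w_hot = mx.random.normal((4096, 4096))   # "hot" weights
    w_cold = mx.random.normal((4096, 4096))  # "cold" weights
    x = mx.random.normal((4096,))

    y_hot = mx.matmul(w_hot, x, stream=mx.gpu)    # run on the GPU
    y_cold = mx.matmul(w_cold, x, stream=mx.cpu)  # run on the CPU
    mx.eval(y_hot, y_cold)                        # force the lazy graph

Whether this recovers PowerInfer's speedup on Apple silicon is a separate question: unified memory removes the transfer cost, not the gap in compute throughput between the CPU and GPU.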
Animats, over 1 year ago
That's an impressive result. LLMs should soon be much cheaper to run.
Havoc, over 1 year ago
Really clever trick. GPU/CPU splits are currently painfully slow, so this may just make them more bearable.