High-Speed Large Language Model Serving on PCs with Consumer-Grade GPUs

417 points by dataminer over 1 year ago

16 comments

phh over 1 year ago
Took me a while to understand what their "hot" and "cold" neurons meant, since in most ML I do there is no such notion, and their paper doesn't directly define it (or I missed it).

After some thought, it does make sense for ReLU, because half of the function is constant: you can say a neuron is "cold" if its ReLU-ed output is often 0. So I checked whether ReLU is common in LLMs; the original LLaMA doesn't use it. And after (re-)reading the GitHub page, this actually only works on ReLU models. It turns out there is a group of people "fine-tuning" (I would rather call it re-training, since you start by breaking the model?) models to use ReLU to allow for that sparsity: https://huggingface.co/SparseLLM

So this is sadly not applicable to just any model you can find on the internet, but it sounds like great progress anyway. Possibly this might shift the trade-offs back toward bigger models with "less ideal" activations. I'm also curious about the legal impact (since US and EU rules refer to a model's FLOPs/number of parameters... how do you compute that with sparsity? Do you average?).

I think a possible avenue for future research in this area is keeping the original activation (like LLaMA keeping SwiGLU), but using quantization to define "hot" and "cold" neurons via saturation regions (for example, saying that below -1 at 8 bits this activation is effectively -infinity, and thus the neuron is cold).
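To make that hot/cold definition concrete, here is a minimal sketch (an illustration under the ReLU assumption above, not PowerInfer's actual code) of classifying the neurons of one ReLU FFN layer by how often they fire over a calibration set; the hot_fraction threshold and the function names are assumptions:

    import torch

    def hot_cold_split(ffn_up: torch.nn.Linear, calib_inputs: torch.Tensor,
                       hot_fraction: float = 0.2):
        """calib_inputs: (num_tokens, d_model) activations feeding this FFN layer."""
        with torch.no_grad():
            acts = torch.relu(ffn_up(calib_inputs))     # (num_tokens, d_ffn)
            fire_rate = (acts > 0).float().mean(dim=0)  # fraction of tokens where each neuron fires
        n_hot = int(hot_fraction * fire_rate.numel())
        hot_idx = torch.topk(fire_rate, n_hot).indices  # most frequently firing neurons -> "hot"
        mask = torch.ones(fire_rate.numel(), dtype=torch.bool)
        mask[hot_idx] = False
        cold_idx = mask.nonzero(as_tuple=True)[0]       # everything else -> "cold"
        return hot_idx, cold_idx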
127 over 1 year ago
Running uncensored Mixtral on this would be really nice, at more than 3-bit quantization on a 4090.
Const-me over 1 year ago
Since they mentioned they're working on Mistral-7B, I'd like to note that my GPU-only implementation of Mistral uses slightly over 5 GB of VRAM: https://github.com/Const-me/Cgml

It runs pretty well on most consumer-grade GPUs, but so far it only supports Windows.
brucethemoose2 over 1 year ago
This is super cool.

For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on the GPU, then just splitting the model down the middle) is relatively simple. But it's interesting that there even *is* so much "activation sparsity" to take advantage of; the traditional thinking in ML is that memory access is very random.

Hopefully the "cold" neurons eventually get offloaded to the IGP instead?

Also, it's curious that they are considering a Metal kernel. I thought the performance advantage came from the hybrid memory pool... it seems like that would only help old AMD Macs, unless I'm missing something?
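For contrast, a rough sketch of the layer-level splitting described above (in the spirit of llama.cpp's --n-gpu-layers option, as opposed to PowerInfer's neuron-level hot/cold split); illustrative Python, not taken from either project:

    import torch

    def forward_layer_offload(blocks: torch.nn.ModuleList, x: torch.Tensor,
                              n_gpu_layers: int) -> torch.Tensor:
        # First n_gpu_layers transformer blocks run on the GPU, the rest on the CPU.
        # Real implementations place weights once at load time; .to() per call is only for brevity.
        for i, block in enumerate(blocks):
            device = "cuda" if i < n_gpu_layers else "cpu"
            x = block.to(device)(x.to(device))
        return x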
jupp0r over 1 year ago
From my understanding, in this implementation some knowledge about the model itself is needed to determine which parts to place in system memory vs. GPU memory. Can this ideally be computed automatically, or will future models have some sort of interface for placement algorithms like this to help automate it? If the algorithm needs to be adapted for each model architecture, it's going to be a lot of work to maintain this project.
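One way such placement could plausibly be automated, independent of architecture, is an offline profiling pass: measure per-neuron activation frequency on sample data, then greedily fill a GPU memory budget with the hottest neurons. A hypothetical sketch (the function name and placement-map format are assumptions, not an interface PowerInfer actually exposes):

    import numpy as np

    def build_placement(fire_rate: np.ndarray, bytes_per_neuron: int,
                        gpu_budget_bytes: int) -> dict:
        order = np.argsort(-fire_rate)                    # hottest neurons first
        n_gpu = min(order.size, gpu_budget_bytes // bytes_per_neuron)
        return {"gpu": order[:n_gpu].tolist(),            # preload these weights into VRAM
                "cpu": order[n_gpu:].tolist()}            # keep the rest in system RAM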
EwanG over 1 year ago
The important stuff from the readme (if you're not looking to tinker with it directly):

We have tested PowerInfer on the following platforms:

- x86-64 CPU (with AVX2 instructions) on Linux
- x86-64 CPU and NVIDIA GPU on Linux
- Apple M chips on macOS (as we do not optimize for Mac, the performance improvement is not significant now)

And new features coming soon:

- Mistral-7B model
- Metal backend for sparse inference on macOS
peter_d_sherman over 1 year ago
> "This distribution indicates that a small subset of neurons, termed *hot neurons*, are consistently activated across inputs, while the majority, *cold neurons*, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers."

Brilliant!
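A rough sketch of what that hybrid split could look like for a single ReLU FFN layer, assuming the hot rows of the up-projection are resident in VRAM and the cold rows in system RAM. Illustrative only; the sparse predictors mentioned elsewhere in the thread, which let cold neurons be skipped rather than always computed, are omitted here:

    import torch

    def hybrid_ffn(x, w_up_hot, w_up_cold, w_down, hot_idx, cold_idx):
        # x: (n_tokens, d_model) on CPU; w_up_hot: (n_hot, d_model) on "cuda";
        # w_up_cold: (n_cold, d_model) on CPU; w_down: (d_model, d_ffn) on CPU.
        h = torch.zeros(x.shape[0], w_down.shape[1])
        h[:, hot_idx] = torch.relu(x.to("cuda") @ w_up_hot.T).cpu()   # hot neurons: GPU
        h[:, cold_idx] = torch.relu(x @ w_up_cold.T)                  # cold neurons: CPU
        return h @ w_down.T                                           # down-projection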
modeless over 1 year ago
Everyone compares against llama.cpp because it's easy mode. llama.cpp is slow! Everyone should know this. They should compare against exllamav2 or other optimized implementations.
superkuh over 1 year ago
This will be really cool once there's the ability to generate the sparse predictor files for arbitrary models, rather than just the four they've done it for. Looking through the page and code, it doesn't seem like the tools to do that step are included. Guess I'll wait on this one a bit. Hopefully these features will eventually be merged back into llama.cpp as options, since this is based on the normal llama.cpp code (i.e., not just using the ggml matrix lib).
causality0 over 1 year ago
All the "consumer-grade GPUs" terminology makes it seem like you could run it on a variety of models, but like *so many* of these posts, is this a 4090 exclusive?
nextaccountic over 1 year ago
> Hybrid CPU/GPU Utilization: Seamlessly integrates memory/computation capabilities of CPU and GPU for a balanced workload and faster processing.

Does this mean that it runs on both the CPU and GPU at the same time, and is faster than a CPU-only or a GPU-only implementation on the same device?

Edit: when running on integrated GPUs, can this benefit from the improved communication between CPU and GPU?
PoignardAzur over 1 year ago
This sounds like it uses the same techniques as the ones described in the "LLM in a Flash" paper posted yesterday? If so, cool to see an implementation of these techniques running models on non-Apple GPUs.
robwwilliams over 1 year ago
Scale-free network topology enables a crude but effective split of neurons into hot and cold classes—hot neurons at home on the GPU and larger numbers of cold neurons that benefit from more memory on the CPU. Clever!
ComputerGuru over 1 year ago
It’s not too much faster than exllama2 with flash attention, no?
ekianjo over 1 year ago
How much of a speed increase do we get on CPU-only configurations? Has anyone tested it in such cases?
coder543 over 1 year ago
"Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)