High-Speed Large Language Model Serving on PCs with Consumer-Grade GPUs

417 points by dataminer over 1 year ago

16 comments

phh over 1 year ago
Took me a while to understand what their "hot" and "cold" neurons meant, since in most ML I do there is no such notion, and their paper doesn't directly define it (or I missed it).

After some thought, it does make sense for ReLU, because half of the function is constant: you can say a neuron is "cold" if its ReLU-ed output is often 0. So I checked whether ReLU is common in LLMs; the original LLaMA doesn't use it. And after (re-)reading the GitHub page, this actually only works on ReLU models. It turns out there is a group of people "fine-tuning" (I would rather call it re-training, since you start by breaking the model?) models to use ReLU to allow for that sparsity: https://huggingface.co/SparseLLM

So this is sadly not applicable to just any model you can find on the internet, but it sounds like great progress anyway. Possibly this might shift the trade-offs back toward bigger models with "less ideal" activations. I'm also curious about the legal impact (since US and EU rules refer to a model's FLOPs/number of parameters... how do you compute that with sparsity? Do you average?).

I think a possible avenue for future research in this area is keeping the original activation (like LLaMA keeping SwiGLU), but using quantization to define "hot" and "cold" neurons via saturation regions (for example, saying that below -1 at 8 bits this activation is effectively -infinity, and thus the neuron is cold).
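To make that hot/cold definition concrete, here is a minimal sketch (an illustration under the ReLU assumption above, not PowerInfer's actual code) of classifying the neurons of one ReLU FFN layer by how often they fire over a calibration set; the hot_fraction threshold and the function names are assumptions:

    import torch

    def hot_cold_split(ffn_up: torch.nn.Linear, calib_inputs: torch.Tensor,
                       hot_fraction: float = 0.2):
        """calib_inputs: (num_tokens, d_model) activations feeding this FFN layer."""
        with torch.no_grad():
            acts = torch.relu(ffn_up(calib_inputs))     # (num_tokens, d_ffn)
            fire_rate = (acts > 0).float().mean(dim=0)  # fraction of tokens where each neuron fires
        n_hot = int(hot_fraction * fire_rate.numel())
        hot_idx = torch.topk(fire_rate, n_hot).indices  # most frequently firing neurons -> "hot"
        mask = torch.ones(fire_rate.numel(), dtype=torch.bool)
        mask[hot_idx] = False
        cold_idx = mask.nonzero(as_tuple=True)[0]       # everything else -> "cold"
        return hot_idx, cold_idx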
127 over 1 year ago
Running uncensored Mixtral on this would be really nice, at more than 3-bit quantization on a 4090.
Const-me over 1 year ago
Since they mentioned they're working on Mistral-7B, I'd like to note that my GPU-only implementation of Mistral uses slightly over 5 GB of VRAM: https://github.com/Const-me/Cgml

It runs pretty well on most consumer-grade GPUs, but so far it only supports Windows.
brucethemoose2 over 1 year ago
This is super cool.

For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on the GPU, then just splitting the model down the middle) is relatively simple. But it's interesting that there even *is* so much "activation sparsity" to take advantage of; the traditional thinking in ML is that memory access is very random.

Hopefully the "cold" neurons eventually get offloaded to the IGP instead?

Also, it's curious that they are considering a Metal kernel. I thought the performance advantage came from the hybrid memory pool... it seems like that would only help old AMD Macs, unless I'm missing something?
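For contrast, a rough sketch of the layer-level splitting described above (in the spirit of llama.cpp's --n-gpu-layers option, as opposed to PowerInfer's neuron-level hot/cold split); illustrative Python, not taken from either project:

    import torch

    def forward_layer_offload(blocks: torch.nn.ModuleList, x: torch.Tensor,
                              n_gpu_layers: int) -> torch.Tensor:
        # First n_gpu_layers transformer blocks run on the GPU, the rest on the CPU.
        # Real implementations place weights once at load time; .to() per call is only for brevity.
        for i, block in enumerate(blocks):
            device = "cuda" if i < n_gpu_layers else "cpu"
            x = block.to(device)(x.to(device))
        return x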
jupp0r over 1 year ago
From my understanding, in this implementation some knowledge about the model itself is needed to determine which parts to place in system memory vs. GPU memory. Can this ideally be computed automatically, or will future models have some sort of interface for placement algorithms like this to help automate it? If the algorithm needs to be adapted for each model architecture, it's going to be a lot of work to maintain this project.
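One way such placement could plausibly be automated, independent of architecture, is an offline profiling pass: measure per-neuron activation frequency on sample data, then greedily fill a GPU memory budget with the hottest neurons. A hypothetical sketch (the function name and placement-map format are assumptions, not an interface PowerInfer actually exposes):

    import numpy as np

    def build_placement(fire_rate: np.ndarray, bytes_per_neuron: int,
                        gpu_budget_bytes: int) -> dict:
        order = np.argsort(-fire_rate)                    # hottest neurons first
        n_gpu = min(order.size, gpu_budget_bytes // bytes_per_neuron)
        return {"gpu": order[:n_gpu].tolist(),            # preload these weights into VRAM
                "cpu": order[n_gpu:].tolist()}            # keep the rest in system RAM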
EwanG over 1 year ago
The important stuff from the readme (if you're not looking to tinker with it directly):

We have tested PowerInfer on the following platforms:

- x86-64 CPU (with AVX2 instructions) on Linux
- x86-64 CPU and NVIDIA GPU on Linux
- Apple M chips on macOS (as we do not optimize for Mac, the performance improvement is not significant now)

And new features coming soon:

- Mistral-7B model
- Metal backend for sparse inference on macOS
peter_d_sherman over 1 year ago
> "This distribution indicates that a small subset of neurons, termed *hot neurons*, are consistently activated across inputs, while the majority, *cold neurons*, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers."

Brilliant!
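A rough sketch of what that hybrid split could look like for a single ReLU FFN layer, assuming the hot rows of the up-projection are resident in VRAM and the cold rows in system RAM. Illustrative only; the sparse predictors mentioned elsewhere in the thread, which let cold neurons be skipped rather than always computed, are omitted here:

    import torch

    def hybrid_ffn(x, w_up_hot, w_up_cold, w_down, hot_idx, cold_idx):
        # x: (n_tokens, d_model) on CPU; w_up_hot: (n_hot, d_model) on "cuda";
        # w_up_cold: (n_cold, d_model) on CPU; w_down: (d_model, d_ffn) on CPU.
        h = torch.zeros(x.shape[0], w_down.shape[1])
        h[:, hot_idx] = torch.relu(x.to("cuda") @ w_up_hot.T).cpu()   # hot neurons: GPU
        h[:, cold_idx] = torch.relu(x @ w_up_cold.T)                  # cold neurons: CPU
        return h @ w_down.T                                           # down-projection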
modeless over 1 year ago
Everyone compares against llama.cpp because it's easy mode. llama.cpp is slow! Everyone should know this. They should compare against exllamav2 or other optimized implementations.
superkuh over 1 year ago
This will be really cool once there's the ability to generate the sparse predictor files for arbitrary models, rather than just the four they've done it for. Looking through the page and code, it doesn't seem like the tools to do that step are included. Guess I'll wait on this one a bit. Hopefully these features will eventually be merged back into llama.cpp as options, since this is based on the normal llama.cpp code (i.e., not just using the ggml matrix lib).
causality0 over 1 year ago
All the "consumer-grade GPUs" terminology makes it seem like you could run it on a variety of models, but like *so many* of these posts, is this a 4090 exclusive?
nextaccountic over 1 year ago
> Hybrid CPU/GPU Utilization: Seamlessly integrates memory/computation capabilities of CPU and GPU for a balanced workload and faster processing.

Does this mean that it runs on both the CPU and GPU at the same time, and is faster than a CPU-only or a GPU-only implementation on the same device?

Edit: when running on integrated GPUs, can this benefit from the improved communication between CPU and GPU?
PoignardAzur over 1 year ago
This sounds like it uses the same techniques as the ones described in the "LLM in a Flash" paper posted yesterday? If so, cool to see an implementation of these techniques running models on non-Apple GPUs.
robwwilliams over 1 year ago
Scale-free network topology enables a crude but effective split of neurons into hot and cold classes—hot neurons at home on the GPU and larger numbers of cold neurons that benefit from more memory on the CPU. Clever!
ComputerGuru over 1 year ago
It’s not too much faster than exllama2 with flash attention, no?
ekianjo over 1 year ago
How much of a speed increase do we get on CPU-only configurations? Has anyone tested it in such cases?
coder543 over 1 year ago
"Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)