Quantized Llama models with increased speed and a reduced memory footprint

508 points by egnehots, 7 months ago

18 comments

tveita, 7 months ago
So SpinQuant learns a rotation for activations and weights that, to my understanding, "smears" the outlier weights out so you don't get extreme values in any one weight.

Random anecdote warning - in the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search in a decent amount of high-dimensional vectors.

I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.

Somewhere along the way I found this paper [1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.

As it turns out, the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.

[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf
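
For readers who want to see that trick concretely, here is a minimal sketch of the random-rotation baseline described above (illustrative NumPy, not the ITQ authors' code, and all sizes are made up): rotate the vectors with a random orthogonal matrix, keep only the sign of each coordinate, and scan the packed bits with Hamming distance to shortlist candidates before exact re-ranking.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(dim: int) -> np.ndarray:
    # QR decomposition of a Gaussian matrix gives a random orthogonal rotation.
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

def binarize(x: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    # Rotate, keep only the sign bit per dimension, and pack 8 bits per byte
    # so a linear scan over the index stays small and cache-friendly.
    return np.packbits(x @ rotation > 0, axis=1)

# Toy data: 100k 128-dimensional vectors and a single query.
db = rng.normal(size=(100_000, 128)).astype(np.float32)
query = rng.normal(size=(1, 128)).astype(np.float32)

rot = random_rotation(128)
db_codes = binarize(db, rot)       # (100_000, 16) uint8 -> 16 bytes per vector
query_code = binarize(query, rot)

# Hamming distance = popcount of XOR; shortlist, then re-rank the shortlist exactly.
hamming = np.unpackbits(db_codes ^ query_code, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:100]
best = candidates[np.argsort(np.linalg.norm(db[candidates] - query, axis=1))]
print(best[:10])
```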

nisten, 7 months ago
It's pretty interesting that the new SpinQuant method did not manage to be better than good old nf4bit QLoRA training (Tim Dettmers really cooked with that one).

Really appreciate that Meta published both results + model quants and didn't just make some bs claim about a new sota quant like most other bigger companies would've done.
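
For context, NF4 ("4-bit NormalFloat") is the data type introduced with QLoRA, and the easiest way to see it in action is loading a model in 4-bit through bitsandbytes via Hugging Face Transformers. A hedged sketch, assuming transformers and bitsandbytes are installed; the model ID is just a placeholder for whatever checkpoint you have access to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder; any causal LM works

# NF4 weight quantization as used by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # matmuls run in bf16, weights stay 4-bit
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```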

theanonymousone, 7 months ago
May I ask if anyone has successfully used 1B and 3B models in production and if yes, in what use cases? I seem to be failing even in seemingly simpler tasks such as word translation or zero-shot classification. For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline :/
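
One workaround that tends to make small models usable in a pipeline is to never trust the free-form output: cap the generated tokens and map whatever comes back onto a closed label set. A rough sketch using llama-cpp-python; the GGUF path and the label set are placeholders, and this is just one way to do it:

```python
from llama_cpp import Llama

LABELS = ["positive", "negative", "neutral"]  # placeholder label set

llm = Llama(model_path="llama-3.2-3b-instruct-q4.gguf", n_ctx=2048, verbose=False)

def classify(text: str) -> str:
    prompt = (
        "Classify the sentiment of the text as exactly one of: "
        + ", ".join(LABELS)
        + ".\nAnswer with a single word and nothing else.\n\n"
        + f"Text: {text}"
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4,      # hard cap: no room for an explanation
        temperature=0.0,
    )
    answer = out["choices"][0]["message"]["content"].strip().lower()
    # Post-process: map whatever came back onto the allowed labels.
    for label in LABELS:
        if label in answer:
            return label
    return "unknown"

print(classify("The battery life is fantastic."))
```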

formalsystem, 7 months ago
Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training and ARM kernels in this blog post. If you have any questions about quantization or performance more generally, feel free to let me know!
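
For anyone who wants to try torchao on their own model, a minimal post-training weight-only quantization pass looks roughly like the sketch below. API names can shift between torchao releases, and this is not the exact recipe used for the official Llama quants, so treat it as illustrative:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Swap the nn.Linear weights for int8 weight-only quantized versions in place.
quantize_(model, int8_weight_only())
```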

philipkglass, 7 months ago
These quantized models show much less degradation compared to a "vanilla post-training quantization", but there are a bunch of PTQ schemes that people have already applied to Llama models [1]. I didn't see any details about the vanilla PTQ they used as a baseline. Has it been written about elsewhere?

[1] https://ollama.com/library/llama3.2/tags
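
"Vanilla" post-training quantization usually means round-to-nearest with a per-channel absmax scale and no rotations or calibration data. A toy sketch of that baseline for a single weight matrix, in illustrative NumPy rather than anything Meta actually ran:

```python
import numpy as np

def absmax_quantize_int8(w: np.ndarray):
    """Per-output-channel round-to-nearest int8 quantization of a weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4096, 4096)).astype(np.float32)
q, scale = absmax_quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).mean()
print(f"mean absolute quantization error: {err:.6f}")
```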

yuvalr1, 7 months ago
Looking at how to deploy 1B and 3B Llama models on Android for inference. Some posts online recommend using Termux (an amazing app) to get an emulated shell and then install as if it's Linux, using ollama for example. However, this forces you into a manual installation process, and most people don't know what Termux is and would be afraid to install it from F-Droid.

Maybe someone can recommend a way to deploy Llama to Android without Termux, maybe even something that could be fully implemented inside an app?

I'm currently looking into compiling llama.cpp for Android and bundling it inside an app. Is that a viable path? Would love to hear from someone who tried something similar.

cmsj, 7 months ago
It really bugs me that every time I see posts about new models, there is never any indication of how much VRAM one needs to actually run them.

ed, 7 months ago
Oh cool! I've been playing with quantized Llama 3B for the last week (4-bit SpinQuant). The code for SpinQuant has been public for a bit.

It's pretty adept at most natural language tasks ("summarize this") and performance on iPhone is usable. It's even decent at tool use once you get the chat template right.

But it struggles with JSON and HTML syntax (correctly escaping characters), and isn't great at planning, which makes it a bad fit for most agentic uses.

My plan was to let Llama communicate with more advanced AIs, using natural language to offload tool use to them, but very quickly Llama goes rogue and starts doing things you didn't ask it to, like trying to delete data.

Still - the progress Meta has made here is incredible, and it seems we'll have capable on-device agents in the next generation or two.

Evidlo, 7 months ago
Why don't they actually say what the size of the model is in GB?

That and average inference times on common hardware are what I'm curious about.
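
A back-of-the-envelope estimate is easy to do yourself: parameter count times bits per weight, plus some overhead for quantization scales and the layers kept in higher precision. The numbers below are rough illustrations, not official figures:

```python
def approx_model_size_gb(params_billions: float, bits_per_weight: float,
                         overhead: float = 1.1) -> float:
    """Rough size: params * bits / 8, padded ~10% for scales, embeddings, etc."""
    return params_billions * 1e9 * bits_per_weight / 8 * overhead / 1e9

for params in (1.0, 3.0):
    for bits in (16, 8, 4):
        print(f"{params:.0f}B @ {bits:>2}-bit ~ {approx_model_size_gb(params, bits):.2f} GB")
```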

itsTyrion, 6 months ago
Wait, so I can get incorrect information and text summaries with things added or cut off even faster, and on mobile now? That's amazing.

nikolayasdf123, 7 months ago
What's your opinion on LlamaStack?

For me it has been nothing short of a bad experience: it is way over-engineered, of poor quality, and just plain does not work, and the maintainers are questionable. I would rather call HuggingFace Python code for inference, or anything else.

Is ExecuTorch any better?

Tepix, 7 months ago
From TFA:

> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B

No you did not. There is no source (in this case: training data) included. Stop changing the meaning of "open source", Meta!

justanotheratom, 7 months ago
Any pointers on how to fine-tune this on my dataset, then package and run it in my Swift iOS app?

behnamoh, 7 months ago
Does anyone know why the most common method to speed up inference is quantization? I keep hearing about all sorts of new methods, but nearly none of them are implemented in practice (except for flash attention).
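
The usual explanation is that autoregressive decoding is memory-bandwidth bound: every generated token has to stream essentially all of the weights from memory, so shrinking the bytes per weight raises the throughput ceiling almost linearly without changing the model architecture. A rough upper-bound calculation with illustrative numbers:

```python
def decode_tokens_per_sec_ceiling(params_billions: float, bits_per_weight: float,
                                  mem_bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: one full pass over the weights per generated token."""
    bytes_per_token = params_billions * 1e9 * bits_per_weight / 8
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token

# e.g. a phone-class SoC with ~50 GB/s memory bandwidth running a 3B model
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: <= {decode_tokens_per_sec_ceiling(3, bits, 50):.0f} tokens/s")
```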

EliBullockPapa, 7 months ago
Anyone know a nice iOS app to run these locally?

arnaudsm, 7 months ago
How do they compare to their original quants on ollama like q4_K_S?

newfocogi, 7 months ago
TLDR: Quantized versions of the Llama 3.2 1B and 3B models with "competitive accuracy" to the original versions (meaning some degraded performance; plots included in the release notes).

mmaunder, 7 months ago
[flagged]