Regarding this bit at the end:

> I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS

If I'm reading this right, they're trying to reimplement cuBLAS within CUDA itself. I'm guessing the next step would be removing the CUDA dependency and going directly with Vulkan or Metal compute shaders. Am I correct?
I think it's a good idea for everyone to download and be able to run an LLM locally, even on minimum-spec hardware, as a pseudo-backup of a large chunk of human knowledge.
There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional source changes; both GNU and Intel make these substitutions with the correct flags.

The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel compiler will even do this without prompting. In fact, it appears to apply only a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.

Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP directives to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.

Which is to say: let's not belittle this old Fortran 77 function. Yes, it is old, and it does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details and hand the job off to the compiler. If you don't like that approach, you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.
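To make the OpenMP point concrete, here is a minimal sketch in C++ of the same triple loop with a parallel directive (in Fortran 77 the equivalent is an `!$omp parallel do` placed above the outer loop). This is only an illustration, not the reference BLAS code:

```cpp
#include <cstddef>

// Column-major C += A*B (caller zeroes C for a plain product), following the
// BLAS layout convention. The pragma parallelizes over the columns of C; if
// the code is built without -fopenmp the pragma is simply ignored.
void sgemm_naive(std::size_t m, std::size_t n, std::size_t k,
                 const float *A, const float *B, float *C) {
#pragma omp parallel for
  for (std::ptrdiff_t j = 0; j < (std::ptrdiff_t)n; ++j)
    for (std::size_t l = 0; l < k; ++l)
      for (std::size_t i = 0; i < m; ++i)
        C[i + j * m] += A[i + l * m] * B[l + j * k];
}
```

Built with something like `-O3 -fopenmp -mavx2 -mfma -funroll-loops`, GCC will vectorize and unroll the inner loop along much the same lines the article does by hand.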
- https://www.cs.utexas.edu/users/flame/laff/pfhp/index.html (e.g. here: https://www.cs.utexas.edu/users/flame/laff/pfhp/week2-blocking-for-registers.html)

- https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0

might be of interest
Strange title. On my first read of it I thought the author was arguing that the model is now faster on CPU than on GPU. It would be much nicer if they had titled this something closer to "Performance Improvements for LLaMA on CPU".
> I like to define my subroutines using a modern language like C++, which goes 47 gigaflops. This means C++ is three orders of a magnitude faster than Python. That's twenty years of progress per Moore's law.

This is great. I love the idea of measuring performance differences in “years of Moore’s law.”

Twenty years puts the delta in an easy to understand framework.
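As a sanity check on that framing, assuming one doubling every two years (my assumption, not a figure from the article):

```latex
2^{20/2} = 2^{10} = 1024 \approx 10^{3}
```

So a three-orders-of-magnitude gap (roughly 0.05 GFLOPS of Python versus 47 GFLOPS of C++) does indeed correspond to about twenty years of doublings.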
> You don't need a large computer to run a large language model

While running TinyLlama does indeed count as running a language model, I'm skeptical that its capabilities match what most people would consider a baseline requirement to be useful.

Running a 10-parameter model is also "technically" running an LM, and I can do that by hand with a piece of paper. That doesn't mean "you don't need a computer to run an LM"...

I'm not sure where an LM becomes an LLM, but I personally think it's more about capability than parameter count.

I don't *realllly* believe you can do a lot of useful LLM work on a Pi.
Pixar uses CPUs...

I wonder if we'll end up in a situation like rendered movies, where big studios like Pixar use CPUs (not GPUs) to render their films due to cost/performance (and access to larger amounts of RAM).

https://news.ycombinator.com/item?id=25616372
As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I'm curious how they pulled off such a massive improvement.

But as someone who routinely estimates picojoules per flop at $DAY_JOB: there's simply no way this is energy-efficient. That is not even physically possible with a CPU.
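To put rough, hedged numbers on that (spec-sheet figures and the article's ~810 GFLOPS kernel number, not measurements of my own):

```latex
\frac{253\,\text{W}}{810 \times 10^{9}\,\text{FLOP/s}} \approx 310\,\text{pJ/FLOP}
\quad \text{(i9-14900K near its 253 W PL2, fp32)}
\qquad
\frac{450\,\text{W}}{82 \times 10^{12}\,\text{FLOP/s}} \approx 5.5\,\text{pJ/FLOP}
\quad \text{(RTX 4090 at TDP, fp32)}
```

That is roughly a 60x gap in energy per flop, so a CPU can be fast and convenient here without being anywhere near as energy-efficient.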
Regarding AMD Zen 4 with AVX-512:

"Here we see that, despite only being twice the price, the 7995WX x86 ISA offers 7x more raw compute power than the M2 Ultra ARM ISA, and nearly the same token generation speed, which is likely thanks to its 384mb L3 cache. When I bought this chip, I had to expand support in llama.cpp for bfloat16 and AVX512 before I could fully test its capabilities. My work means you can now run LLaMA 2.8x faster on Zen4 than you could before."
Super nice story on the matmul optimization that gave 810 GFLOPS for 512x512. Thanks for the write-up and the contributions to llama.cpp and the community more broadly.
> One important thing to know if you're considering buying a Mac Studio is that, like the Windows Executive, XNU does a really good job keeping your desktop stable, and that means protecting your system from you. It takes me 45 seconds on Mac Studio to compile the Cosmo monorepo, due to all these safety features; but if I fork bombed it, I'd be surprised if Netflix skipped a single frame.

Clearly nobody actually tried this, because on XNU, if you fork bomb the system, it reliably goes down every single time. There are no "safety features" here, just extra overhead when spawning processes.
From the example: "--temp 0 turns off the random number generator (we don't want improvisation for a spam filter)"

I've been thinking for a while about how many applications of LLMs need this adjustment and aren't getting it.
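For anyone wondering what that flag changes under the hood, here is a minimal sketch of a typical sampling step (an illustration, not llamafile's actual code): at temperature 0 the sampler reduces to an argmax over the logits, so the same prompt always yields the same output.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of temperature-based token sampling. With temp <= 0 we take the
// argmax, i.e. greedy decoding with no randomness at all.
int sample_token(const std::vector<float>& logits, float temp, std::mt19937& rng) {
  auto max_it = std::max_element(logits.begin(), logits.end());
  if (temp <= 0.0f)
    return static_cast<int>(max_it - logits.begin());  // deterministic

  // Softmax with temperature: lower temp sharpens the distribution,
  // higher temp flattens it toward uniform.
  std::vector<float> weights(logits.size());
  for (std::size_t i = 0; i < logits.size(); ++i)
    weights[i] = std::exp((logits[i] - *max_it) / temp);
  std::discrete_distribution<int> dist(weights.begin(), weights.end());
  return dist(rng);
}
```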
That's interesting, because I built a simple ANN library, was playing around with GPU acceleration, and came to a similar conclusion as this article.

To be fair, my ANN library was faster (up to 2x) with GPU acceleration in some scenarios where the ANN was shallow (as opposed to deep, with many hidden layers). I suspect the gain was only marginal because, the way my library is set up, it has to load all the values into the GPU from RAM for each forward and backpropagation pass of each layer during training. I believe there is a way to keep the data resident in GPU memory, but it's a lot more challenging to do, especially in a modular, fully portable way (which was one of the goals of my library).

But anyway, even the 2x best-case figure seemed disappointing. In my mind, I expected at least a 10x speed improvement... And I was surprised that the CPU version was actually slightly faster in the scenario I was testing at the time, which was a relatively deep network. It makes sense, since the layers cannot be parallelized: the input of one layer depends on the output of the previous one. So the more layers you have, the more serial bottlenecks you have, and the less you benefit from GPU acceleration... And unfortunately, deep networks also happen to be the ones that tend to perform best for a lot of use cases.
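A minimal sketch of the structure being described (hypothetical code, not the actual library): each layer's input is the previous layer's output, so layers have to run one after another, and in the GPU path described above every layer additionally pays for host/device copies on every pass.

```cpp
#include <vector>

struct Layer {
  std::vector<std::vector<float>> w;  // weights, kept in host RAM

  std::vector<float> forward(const std::vector<float>& in) const {
    // On the GPU path this is where the per-layer upload / kernel launch /
    // download round trip happens; keeping weights resident on the device
    // would at least avoid re-uploading them every training pass.
    std::vector<float> out(w.size(), 0.0f);
    for (std::size_t i = 0; i < w.size(); ++i)
      for (std::size_t j = 0; j < in.size(); ++j)
        out[i] += w[i][j] * in[j];
    return out;
  }
};

std::vector<float> forward_pass(const std::vector<Layer>& layers, std::vector<float> x) {
  for (const auto& layer : layers)
    x = layer.forward(x);  // serial data dependency: layers cannot run in parallel
  return x;
}
```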
It's fascinating to me that, coming up on a year after Sapphire Rapids became available in the public cloud, developers are still targeting AVX-512 when they should be targeting VNNI and AMX.

https://github.com/ggerganov/llama.cpp/issues/2555
This is great work. I've always thought it would be great if running LLMs could be commoditized for regular, average-Joe hardware. I had thought llamafile was like a Dockerfile for llama.cpp, but it looks like that's a misconception?

Will definitely be giving this a try.
A way of thinking about what's inside any of the top LLMs right now: even if they never learn another single fact, even if they get ridiculously out of date as a result, even if they are even more riddled with errors and prone to biases than we know them to be, even if they are as prone to hallucinations as we know they are and never develop the capacity to cure themselves of this, they are still more knowledgeable, and capable of a more reasoned response, despite their capacity for error, to more questions than any single human being who has ever lived.
Nice to see such speedups for CPUs. Are these changes available as a branch or pull request in llama.cpp itself? I'd like to make use of them in that form if possible (as I'm used to using that).
If I'm reading the post correctly, Llamafile is faster than llama.cpp, despite the author upstreaming some of the changes. What's the reason for this?
Has Justine written anywhere about her disassembly setup?

> I configured Emacs so I can push a button, and the disassembly for the C++ code I'm working on will pop up on the screen in a few milliseconds.

I assume it's something project-specific, rather than being able to get the disassembly for an arbitrary section of code?

It seems very handy, so I'd love to see the implementation (I couldn't find anything googling).
> It's clearly optimal since my CPU is listed as only being capable of going 780 gigaflops

780 GFLOPS is the iGPU spec. Is this a valid comparison?

https://nanoreview.net/en/cpu/intel-core-i9-14900k
> the Raspberry Pi

Odd how there were no Mistral 7B benchmarks for the Pi 5 in that table (I doubt anyone is seriously considering using TinyLlama for anything at all), so I went and re-tested it myself on the Pi 5 8GB.

llamafile 0.7: 52 predicted, 150 cached, 430ms per token, 2.32 tokens per second

llama.cpp + OpenBLAS: 36 predicted, 124 cached, 381ms per token, 2.62 tokens per second

It does inch closer to the speed you get with BLAS acceleration, which is quite impressive, but in practical terms the Pi 5 is so heavily limited by its memory-throughput bottleneck that it saturates the required compute with just 3 threads. So while fancier kernels will make it more efficient, they won't save you from that fundamental bandwidth limit. The Pi Foundation messed up going with a 32-bit memory bus, simple as.
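Rough numbers on that ceiling, assuming a ~4 GB Q4 Mistral 7B file and the Pi 5's 32-bit LPDDR4X-4267 interface (approximations on my part, not measurements):

```latex
\underbrace{4267\,\text{MT/s} \times 4\,\text{B}}_{\text{theoretical peak}} \approx 17\,\text{GB/s},
\qquad
\underbrace{2.3\,\text{tok/s} \times 4\,\text{GB}}_{\text{weights streamed per second}} \approx 9\,\text{GB/s}
```

Since token generation has to stream essentially the whole model for every token, that is already a large fraction of what the Pi 5 can sustain in practice, which is why extra threads or better kernels stop helping.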
Is there an overview somewhere of the progress made on the software side for training and inference of LLMs? It feels like we've squeezed 10-100x more out of the hardware since LLaMA appeared. This crazy progress will probably saturate, though, as we reach theoretical limits, no?
Is it easy to find where the matvecs are, in LLaMA (if you are someone who is curious and wants to poke around at the “engine” without understanding the “transmission,” so to speak)? I was hoping to mess around with this for Stable Diffusion, but it seemed like they were buried under quite a few layers of indirection. Which is entirely reasonable, the goal is to ship software, not satisfy people who’d just want to poke things and see what happens, haha.
Multithreading support in llama.cpp is probably still pretty busted, assuming it uses the same underlying NN inference code as whisper.cpp: https://github.com/ggerganov/whisper.cpp/issues/200#issuecomment-1484025515
Definitely wild that we're in the timeline where you can run a 1.1B-parameter model on a Raspberry Pi, but it's still tough to justify because the 1.1B is kinda useless compared to the beefier models. Sick for home builds/hobbyists though. I might wanna get one of the new Pis just to try this out.
Any performance benchmarks against Intel's IPEX-LLM [0] or others?

[0] https://github.com/intel-analytics/ipex-llm
"As for disk speed, dd if=/dev/zero of=/tmp/output bs=128k count=50k; rm -f /tmp/output reports 1.6 GB/s which is 3.6x slower than my Mac Studio, and 3x slower than my Intel (which has the same M.2 stick). I'm told that Intel and Apple are just better at this, but I wish I understood why. "<p>Can anyone here answer why this is?
Does anyone else see llamafile running through Wine on Linux?

Edit: After the download I did a simple `chmod +x llava-v1.5-7b-q4.llamafile; ./llava-v1.5-7b-q4.llamafile`
I know this post is focused specifically on *CPU* performance, but the section on the Mac Studio seems to deliberately avoid mentioning that machine's GPU, let alone benchmarking against it. I think it would have been interesting to see a straightforward comparison of the compute performance and memory bandwidth (as measured by prompt processing and token generation speeds, respectively) achievable with reasonable optimization effort on the CPU vs. the GPU when they're attached to the same memory subsystem.
It would be good to see some independent verification of this claim. HN has previously [1] fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model to way below the size of the model, which should have failed a basic smell test and was indeed debunked shortly after. Justine Tunney appears to enjoy extreme superstar status here, and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptical position to reach fixation (to begin with, what other LLM developments even hit upvote numbers like the +1300ish there or the +712 here at the time of writing?).

[1] https://news.ycombinator.com/item?id=35393284
Re: funding

My friend suggested nominating Justine for her open-source contributions in an internal Microsoft programme (the winner gets $10k). They did not even want to add her to the list of potential nominees because her software is not used at MSFT. It speaks volumes about the corporate culture and shows what they really think of OSS support.