Regarding this bit at the end:

> I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS

If I'm reading this right, they're trying to reimplement cuBLAS within CUDA itself. I'm guessing the next step would be removing the CUDA dependency and going directly with Vulkan or Metal compute shaders. Am I correct?
I think it's a good idea for everyone to download and be able to run an LLM locally, even on minimum-spec hardware, as a pseudo-backup of a large chunk of human knowledge.
There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional source changes; both GNU and Intel make these substitutions with the correct flags.

The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel compiler will even do this without prompting. In fact, it appears to apply only a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.

Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP directives to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.

Which is to say: let's not belittle this old Fortran 77 function. Yes, it is old, and it does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details and hand the job off to the compiler. If you don't like that approach, you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.
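To make the OpenMP point concrete, here is a minimal sketch in C++ of the same triple loop with a parallel directive (in Fortran 77 the equivalent is an `!$omp parallel do` placed above the outer loop). This is only an illustration, not the reference BLAS code:

```cpp
#include <cstddef>

// Column-major C += A*B (caller zeroes C for a plain product), following the
// BLAS layout convention. The pragma parallelizes over the columns of C; if
// the code is built without -fopenmp the pragma is simply ignored.
void sgemm_naive(std::size_t m, std::size_t n, std::size_t k,
                 const float *A, const float *B, float *C) {
#pragma omp parallel for
  for (std::ptrdiff_t j = 0; j < (std::ptrdiff_t)n; ++j)
    for (std::size_t l = 0; l < k; ++l)
      for (std::size_t i = 0; i < m; ++i)
        C[i + j * m] += A[i + l * m] * B[l + j * k];
}
```

Built with something like `-O3 -fopenmp -mavx2 -mfma -funroll-loops`, GCC will vectorize and unroll the inner loop along much the same lines the article does by hand.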
- https://www.cs.utexas.edu/users/flame/laff/pfhp/index.html (e.g. here: https://www.cs.utexas.edu/users/flame/laff/pfhp/week2-blocking-for-registers.html)

- https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0

might be of interest
Strange title. On my first read of it I thought the author was arguing that the model is now faster on CPU than on GPU. It would be much nicer if they had titled this something closer to "Performance Improvements for LLaMA on CPU".
> I like to define my subroutines using a modern language like C++, which goes 47 gigaflops. This means C++ is three orders of a magnitude faster than Python. That's twenty years of progress per Moore's law.

This is great. I love the idea of measuring performance differences in “years of Moore’s law.”

Twenty years puts the delta in an easy to understand framework.
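As a sanity check on that framing, assuming one doubling every two years (my assumption, not a figure from the article):

```latex
2^{20/2} = 2^{10} = 1024 \approx 10^{3}
```

So a three-orders-of-magnitude gap (roughly 0.05 GFLOPS of Python versus 47 GFLOPS of C++) does indeed correspond to about twenty years of doublings.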
> You don't need a large computer to run a large language model

While running TinyLlama does indeed count as running a language model, I'm skeptical that its capabilities match what most people would consider a baseline requirement to be useful.

Running a 10-parameter model is also "technically" running an LM, and I can do that by hand with a piece of paper. That doesn't mean "you don't need a computer to run an LM"...

I'm not sure where an LM becomes an LLM, but I personally think it's more about capability than parameter count.

I don't *realllly* believe you can do a lot of useful LLM work on a Pi.
Pixar uses CPUs...

I wonder if we'll end up in a situation like rendered movies, where big studios like Pixar use CPUs (not GPUs) to render their films due to cost/performance (and access to larger amounts of RAM).

https://news.ycombinator.com/item?id=25616372
As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I'm curious how they pulled off such a massive improvement.

But as someone who routinely estimates picojoules per flop at $DAY_JOB: there's simply no way this is energy-efficient. That is not even physically possible with a CPU.
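To put rough, hedged numbers on that (spec-sheet figures and the article's ~810 GFLOPS kernel number, not measurements of my own):

```latex
\frac{253\,\text{W}}{810 \times 10^{9}\,\text{FLOP/s}} \approx 310\,\text{pJ/FLOP}
\quad \text{(i9-14900K near its 253 W PL2, fp32)}
\qquad
\frac{450\,\text{W}}{82 \times 10^{12}\,\text{FLOP/s}} \approx 5.5\,\text{pJ/FLOP}
\quad \text{(RTX 4090 at TDP, fp32)}
```

That is roughly a 60x gap in energy per flop, so a CPU can be fast and convenient here without being anywhere near as energy-efficient.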
Regarding AMD Zen 4 with AVX-512:

"Here we see that, despite only being twice the price, the 7995WX x86 ISA offers 7x more raw compute power than the M2 Ultra ARM ISA, and nearly the same token generation speed, which is likely thanks to its 384mb L3 cache. When I bought this chip, I had to expand support in llama.cpp for bfloat16 and AVX512 before I could fully test its capabilities. My work means you can now run LLaMA 2.8x faster on Zen4 than you could before."
Super nice story on the matmul optimization that gave 810 GFLOPS for 512x512. Thanks for the write-up and the contributions to llama.cpp and the community more broadly.
> One important thing to know if you're considering buying a Mac Studio is that, like the Windows Executive, XNU does a really good job keeping your desktop stable, and that means protecting your system from you. It takes me 45 seconds on Mac Studio to compile the Cosmo monorepo, due to all these safety features; but if I fork bombed it, I'd be surprised if Netflix skipped a single frame.

Clearly nobody actually tried this, because on XNU, if you fork bomb the system, it reliably goes down every single time. There are no "safety features" here, just extra overhead when spawning processes.
From the example: "--temp 0 turns off the random number generator (we don't want improvisation for a spam filter)"

I've been thinking for a while about how many applications of LLMs need this adjustment and aren't getting it.
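For anyone wondering what that flag changes under the hood, here is a minimal sketch of a typical sampling step (an illustration, not llamafile's actual code): at temperature 0 the sampler reduces to an argmax over the logits, so the same prompt always yields the same output.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of temperature-based token sampling. With temp <= 0 we take the
// argmax, i.e. greedy decoding with no randomness at all.
int sample_token(const std::vector<float>& logits, float temp, std::mt19937& rng) {
  auto max_it = std::max_element(logits.begin(), logits.end());
  if (temp <= 0.0f)
    return static_cast<int>(max_it - logits.begin());  // deterministic

  // Softmax with temperature: lower temp sharpens the distribution,
  // higher temp flattens it toward uniform.
  std::vector<float> weights(logits.size());
  for (std::size_t i = 0; i < logits.size(); ++i)
    weights[i] = std::exp((logits[i] - *max_it) / temp);
  std::discrete_distribution<int> dist(weights.begin(), weights.end());
  return dist(rng);
}
```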
That's interesting, because I built a simple ANN library, was playing around with GPU acceleration, and came to a similar conclusion as this article.

To be fair, my ANN library was faster (up to 2x) with GPU acceleration in some scenarios where the ANN was shallow (as opposed to deep, with many hidden layers). I suspect the gain was only marginal because, the way my library is set up, it has to load all the values into the GPU from RAM for each forward and backpropagation pass of each layer during training. I believe there is a way to keep the data resident in GPU memory, but it's a lot more challenging to do, especially in a modular, fully portable way (which was one of the goals of my library).

But anyway, even the 2x best-case figure seemed disappointing. In my mind, I expected at least a 10x speed improvement... And I was surprised that the CPU version was actually slightly faster in the scenario I was testing at the time, which was a relatively deep network. It makes sense, since the layers cannot be parallelized: the input of one layer depends on the output of the previous one. So the more layers you have, the more serial bottlenecks you have, and the less you benefit from GPU acceleration... And unfortunately, deep networks also happen to be the ones that tend to perform best for a lot of use cases.
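A minimal sketch of the structure being described (hypothetical code, not the actual library): each layer's input is the previous layer's output, so layers have to run one after another, and in the GPU path described above every layer additionally pays for host/device copies on every pass.

```cpp
#include <vector>

struct Layer {
  std::vector<std::vector<float>> w;  // weights, kept in host RAM

  std::vector<float> forward(const std::vector<float>& in) const {
    // On the GPU path this is where the per-layer upload / kernel launch /
    // download round trip happens; keeping weights resident on the device
    // would at least avoid re-uploading them every training pass.
    std::vector<float> out(w.size(), 0.0f);
    for (std::size_t i = 0; i < w.size(); ++i)
      for (std::size_t j = 0; j < in.size(); ++j)
        out[i] += w[i][j] * in[j];
    return out;
  }
};

std::vector<float> forward_pass(const std::vector<Layer>& layers, std::vector<float> x) {
  for (const auto& layer : layers)
    x = layer.forward(x);  // serial data dependency: layers cannot run in parallel
  return x;
}
```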
It's fascinating to me that, coming up on a year after Sapphire Rapids became available in the public cloud, developers are still targeting AVX-512 when they should be targeting VNNI and AMX.

https://github.com/ggerganov/llama.cpp/issues/2555
This is great work. I've always thought it would be great if running LLMs could be commoditized for regular, average-Joe hardware. I had thought llamafile was like a Dockerfile for llama.cpp, but it looks like that's a misconception?

Will definitely be giving this a try.
A way of thinking about what's inside any of the top LLMs right now: even if they never learn another single fact, even if they get ridiculously out of date as a result, even if they are even more riddled with errors and prone to biases than we know them to be, even if they are as prone to hallucinations as we know they are and never develop the capacity to cure themselves of this, they are still more knowledgeable, and capable of a more reasoned response, despite their capacity for error, to more questions than any single human being who has ever lived.
Nice to see such speedups for CPUs. Are these changes available as a branch or pull request in llama.cpp itself? I'd like to make use of them in that form if possible (as I'm used to using that).
If I'm reading the post correctly, Llamafile is faster than llama.cpp, despite the author upstreaming some of the changes. What's the reason for this?
Has Justine written anywhere about her disassembly setup?

> I configured Emacs so I can push a button, and the disassembly for the C++ code I'm working on will pop up on the screen in a few milliseconds.

I assume it's something project-specific, rather than being able to get the disassembly for an arbitrary section of code?

It seems very handy, so I'd love to see the implementation (I couldn't find anything googling).
> It's clearly optimal since my CPU is listed as only being capable of going 780 gigaflops

780 GFLOPS is the iGPU spec. Is this a valid comparison?

https://nanoreview.net/en/cpu/intel-core-i9-14900k
> the Raspberry Pi

Odd how there were no Mistral 7B benchmarks for the Pi 5 in that table (I doubt anyone is seriously considering using TinyLlama for anything at all), so I went and re-tested it myself on the Pi 5 8GB.

llamafile 0.7: 52 predicted, 150 cached, 430ms per token, 2.32 tokens per second

llama.cpp + OpenBLAS: 36 predicted, 124 cached, 381ms per token, 2.62 tokens per second

It does inch closer to the speed you get with BLAS acceleration, which is quite impressive, but in practical terms the Pi 5 is so heavily limited by its memory-throughput bottleneck that it saturates the required compute with just 3 threads. So while fancier kernels will make it more efficient, they won't save you from that fundamental bandwidth limit. The Pi Foundation messed up going with a 32-bit memory bus, simple as.
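Rough numbers on that ceiling, assuming a ~4 GB Q4 Mistral 7B file and the Pi 5's 32-bit LPDDR4X-4267 interface (approximations on my part, not measurements):

```latex
\underbrace{4267\,\text{MT/s} \times 4\,\text{B}}_{\text{theoretical peak}} \approx 17\,\text{GB/s},
\qquad
\underbrace{2.3\,\text{tok/s} \times 4\,\text{GB}}_{\text{weights streamed per second}} \approx 9\,\text{GB/s}
```

Since token generation has to stream essentially the whole model for every token, that is already a large fraction of what the Pi 5 can sustain in practice, which is why extra threads or better kernels stop helping.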
Is there an overview somewhere of the progress made on the software side for training and inference of LLMs? It feels like we've squeezed 10-100x more out of the hardware since LLaMA appeared. This crazy progress will probably saturate, though, as we reach theoretical limits, no?
Is it easy to find where the matvecs are, in LLaMA (if you are someone who is curious and wants to poke around at the “engine” without understanding the “transmission,” so to speak)? I was hoping to mess around with this for Stable Diffusion, but it seemed like they were buried under quite a few layers of indirection. Which is entirely reasonable, the goal is to ship software, not satisfy people who’d just want to poke things and see what happens, haha.
Multithreading support in llama.cpp is probably still pretty busted, assuming it uses the same underlying NN inference code as whisper.cpp: https://github.com/ggerganov/whisper.cpp/issues/200#issuecomment-1484025515
Definitely wild that we're in the timeline where you can run a 1.1B-parameter model on a Raspberry Pi, but it's still tough to justify because the 1.1B is kinda useless compared to the beefier models. Sick for home builds/hobbyists though. I might wanna get one of the new Pis just to try this out.
Any performance benchmarks against Intel's IPEX-LLM [0] or others?

[0] https://github.com/intel-analytics/ipex-llm
"As for disk speed, dd if=/dev/zero of=/tmp/output bs=128k count=50k; rm -f /tmp/output reports 1.6 GB/s which is 3.6x slower than my Mac Studio, and 3x slower than my Intel (which has the same M.2 stick). I'm told that Intel and Apple are just better at this, but I wish I understood why. "<p>Can anyone here answer why this is?
Does anyone else see llamafile running through Wine on Linux?

Edit: After the download I did a simple `chmod +x llava-v1.5-7b-q4.llamafile; ./llava-v1.5-7b-q4.llamafile`
I know this post is focused specifically on *CPU* performance, but the section on the Mac Studio seems to deliberately avoid mentioning that machine's GPU, let alone benchmarking against it. I think it would have been interesting to see a straightforward comparison of the compute performance and memory bandwidth (as measured by prompt processing and token generation speeds, respectively) achievable with reasonable optimization effort on the CPU vs. the GPU when they're attached to the same memory subsystem.
It would be good to see some independent verification of this claim. HN has previously [1] fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model to way below the size of the model, which should have failed a basic smell test and was indeed debunked shortly after. Justine Tunney appears to enjoy extreme superstar status here, and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptical position to reach fixation (to begin with, what other LLM developments even hit upvote numbers like the +1300ish there or the +712 here at the time of writing?).

[1] https://news.ycombinator.com/item?id=35393284
Re: funding

My friend suggested nominating Justine for her open-source contributions in an internal Microsoft programme (the winner gets $10k). They did not even want to add her to the list of potential nominees because her software is not used at MSFT. It speaks volumes about the corporate culture and shows what they really think of OSS support.