A few things.<p>1) Quantization. Llama-2 can be quantized, meaning the weights are stored as low-precision integers (typically 4 or 8 bits) instead of 16- or 32-bit floats. That shrinks the model enough to fit in ordinary RAM and, just as importantly, cuts the amount of memory that has to be read per generated token, which is what dominates CPU inference in environments like yours where no GPU is available (a minimal sketch of the idea is below).<p>2) Grouped-query attention. The larger Llama-2 models use grouped-query attention, in which several query heads share a single key/value head. That shrinks the KV cache and the memory traffic during generation, which again helps most when you are running on a CPU.<p>3) Optimized implementations. CPU inference engines such as llama.cpp/GGML are heavily optimized, with SIMD kernels, cache-friendly data layouts, and quantization-aware matrix multiplication routines.<p>Even the 70B version of Llama-2 can be run this way, though on a typical consumer CPU it manages only a token or two per second; the 7B and 13B models are what feel responsive without a GPU.
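To make point 1 concrete, here is a minimal sketch of block-wise symmetric int8 quantization in Python with NumPy. It is illustrative only: the real GGML/GGUF formats use 4-, 5-, and 8-bit block layouts of their own, and the function names here (quantize_int8, dequantize_int8) are made up for the example.

```python
# Illustrative block-wise int8 quantization. GGML-style formats use 4/5/8-bit
# blocks with per-block scales; this is a simplified sketch, not the real format.
import numpy as np

def quantize_int8(weights, block_size=32):
    """Quantize a 1-D float32 array to int8 with one float scale per block."""
    pad = (-len(weights)) % block_size
    blocks = np.pad(weights, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                        # avoid division by zero
    q = np.round(blocks / scales).astype(np.int8)    # 1 byte per weight
    return q, scales.astype(np.float32)

def dequantize_int8(q, scales, n):
    """Reconstruct approximate float32 weights from int8 values and scales."""
    return (q.astype(np.float32) * scales).reshape(-1)[:n]

w = np.random.randn(4096).astype(np.float32)         # stand-in for one weight row
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s, len(w))
print("fp32 bytes:", w.nbytes, "| int8+scales bytes:", q.nbytes + s.nbytes)
print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```

The integer weights plus a handful of per-block scales take roughly a quarter of the fp32 footprint; the same idea at 4 bits is what takes a 7B model from about 13 GB in fp16 down to roughly 4 GB on disk and in RAM.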
> Just ran Llama-2 (without a GPU) and it gave me coherent responses in 3 minutes (which is extremely fast for no GPU). How does this work?<p>It should be much faster with llama.cpp. My old-ish laptop CPU (AMD 4900HS) can ingest a big prompt reasonably quickly and then stream text fast enough to (slowly) read.<p>If you have any kind of dGPU, even a small laptop one, prompt ingestion is dramatically faster.<p>Try the latest Kobold release: <a href="https://github.com/LostRuins/koboldcpp">https://github.com/LostRuins/koboldcpp</a><p>But to answer your question: the GGML CPU implementation is very good, and generating the response is an inherently serial, token-by-token process that is bound more by RAM bandwidth than by compute, so a CPU with decent memory bandwidth can keep up (a rough estimate of that ceiling is sketched below).
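To put a number on "RAM speed bound": during decoding, each generated token has to stream essentially all of the quantized weights through the memory bus once, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. The sizes and bandwidth figures below are ballpark assumptions for a 4-bit model on a dual-channel DDR4 laptop, not measurements.

```python
# Back-of-the-envelope ceiling for CPU decode speed, assuming each token
# requires one full pass over the quantized weights (memory-bound regime).
def tokens_per_second_ceiling(model_size_gb, mem_bandwidth_gbs):
    """Upper bound on tokens/s if memory traffic were the only cost."""
    return mem_bandwidth_gbs / model_size_gb

# Ballpark figures (assumptions, not benchmarks):
MODEL_7B_Q4 = 4.0     # ~GB for a 4-bit quantized 7B model
MODEL_70B_Q4 = 40.0   # ~GB for a 4-bit quantized 70B model
LAPTOP_DDR4 = 40.0    # ~GB/s practical dual-channel DDR4 bandwidth

print(f"7B  @ 4-bit: ~{tokens_per_second_ceiling(MODEL_7B_Q4, LAPTOP_DDR4):.0f} tok/s ceiling")
print(f"70B @ 4-bit: ~{tokens_per_second_ceiling(MODEL_70B_Q4, LAPTOP_DDR4):.0f} tok/s ceiling")
```

Real throughput lands below these ceilings, but the scaling matches what people see in practice: small quantized models stream at readable speed on a laptop CPU, while 70B is painfully slow without a GPU or very fast RAM.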