
Ask HN: How is Llama-2 optimized so much?

3 points by gorenb almost 2 years ago
Just ran Llama-2 (without a GPU) and it gave me coherent responses in 3 minutes (which is extremely fast for no GPU). How does this work?

3 comments

frankacter almost 2 years ago
A couple of things.

1) Quantization. Llama-2 can be quantized, meaning the weights are stored as low-precision integers instead of floating-point numbers. This makes the model much smaller and easier to fit in memory, and it also makes inference faster on a CPU in environments like yours where no GPU is available.

2) Grouped-query attention. The larger Llama-2 variants use grouped-query attention, in which several query heads share a single key/value head. This shrinks the key/value cache and cuts memory traffic, which particularly helps CPU inference.

3) Optimized implementation. CPU implementations of Llama-2 (such as llama.cpp) are heavily tuned for speed, using efficient algorithms, cache-friendly data structures, and vectorized (SIMD) code.

Taken together, these let quantized Llama-2 models generate text at usable speeds on a CPU alone.
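The quantization point can be illustrated with a toy example. The sketch below does symmetric 8-bit quantization of a single weight matrix: store integers plus one scale factor. This is only the core idea; llama.cpp's real formats (e.g. Q4_K) are block-wise and considerably more elaborate, and the matrix size here is made up.

```python
import numpy as np

# Toy symmetric int8 quantization of one weight matrix.
rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 1024)).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
q = np.round(weights / scale).astype(np.int8)  # 1 byte per weight
dequant = q.astype(np.float32) * scale         # approximate reconstruction

print(weights.nbytes // q.nbytes)              # -> 4 (4x smaller than float32)
print(bool(np.abs(weights - dequant).max() < scale))  # -> True (error < one scale step)
```

The 4x saving is exactly why a model that would not fit in RAM as float32 can run quantized, and smaller weights also mean less memory bandwidth per generated token.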
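Grouped-query attention can likewise be sketched in a few lines: a group of query heads shares one key/value head, so the K/V cache is several times smaller than in full multi-head attention. The head counts and dimensions below are invented for illustration and are much smaller than Llama-2's actual configuration.

```python
import numpy as np

# Toy grouped-query attention: 8 query heads share 2 key/value heads.
n_q_heads, n_kv_heads, seq, d_head = 8, 2, 128, 64
group = n_q_heads // n_kv_heads  # 4 query heads per K/V head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, d_head))
k = rng.standard_normal((n_kv_heads, seq, d_head))
v = rng.standard_normal((n_kv_heads, seq, d_head))

out = np.empty_like(q)
for h in range(n_q_heads):
    kv = h // group  # each query head uses its group's shared K/V head
    scores = q[h] @ k[kv].T / np.sqrt(d_head)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out[h] = probs @ v[kv]

# The K/V cache holds n_kv_heads instead of n_q_heads heads per layer:
print(n_q_heads // n_kv_heads)  # -> 4x smaller K/V cache
```

Since generation speed on a CPU is dominated by how much data must be streamed from RAM per token, shrinking the K/V cache translates fairly directly into faster decoding for long contexts.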
brucethemoose2 almost 2 years ago
> Just ran Llama-2 (without a GPU) and it gave me coherent responses in 3 minutes (which is extremely fast for no GPU). How does this work?

It should be much faster with llama.cpp. My old-ish laptop CPU (AMD 4900HS) can ingest a big prompt reasonably quickly and then stream text fast enough to (slowly) read.

If you have any kind of dGPU, even a small laptop one, prompt ingestion is dramatically faster.

Try the latest Kobold release: https://github.com/LostRuins/koboldcpp

But to answer your question, the GGML CPU implementation is very good, and actually generating the response is somewhat serial, and more RAM-speed bound than compute bound.
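The "RAM-speed bound" point lends itself to a back-of-envelope estimate: generating one token requires reading roughly every weight once, so tokens per second is bounded above by memory bandwidth divided by model size. The figures below (a 7B-parameter model at ~4 bits per weight, ~50 GB/s of dual-channel laptop memory bandwidth) are illustrative assumptions, not measurements.

```python
# Back-of-envelope: why CPU token generation is memory-bandwidth bound.
model_bytes = 7e9 * 0.5   # assumed: 7B params at ~4 bits/weight (Q4 quantization)
bandwidth = 50e9          # assumed: ~50 GB/s dual-channel laptop DDR4/DDR5

tokens_per_sec = bandwidth / model_bytes
print(round(tokens_per_sec, 1))  # -> 14.3 tokens/s upper bound
```

This also shows why quantization helps decoding speed and not just memory footprint: halving the bytes per weight roughly doubles the bandwidth-limited token rate, regardless of how fast the CPU's arithmetic units are.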
gorenb almost 2 years ago
Thank you for the responses, frankacter and brucethemoose2.