Performance of llama.cpp on Apple Silicon A-series

100 points, by mobilio, over 1 year ago

7 comments

eminence32, over 1 year ago
I've been playing around a lot with llama.cpp recently, and it's making me re-think my predictions for the future...

Given how big these models are (and the steep cost for GPUs to load them), I had been thinking that most people would interact with them via some hosted API (like what OpenAI is offering) or via some product like Bard or Copilot which offloads inference to some big cloud datacenter.

But given how well some of these models perform on the CPU when quantized down to 4, 6, or 8 bits, I'm starting to think that there will be quite a few interesting applications for fully local inference on relatively modest hardware.
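For a concrete sense of what that kind of fully local, quantized inference looks like, here is a minimal sketch. It assumes the llama-cpp-python bindings and a 4-bit GGUF file already on disk; the model path and prompt are placeholders, not anything from the post.

```python
# Minimal local inference with a 4-bit quantized model (llama-cpp-python assumed).
# The GGUF path below is a placeholder; any quantized model downloaded locally works.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical 4-bit quantized model
    n_ctx=2048,       # context window
    n_threads=8,      # CPU threads used for inference
    n_gpu_layers=0,   # 0 = pure CPU, no GPU offload
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```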
johnklos, over 1 year ago
What's interesting is how there's so much emphasis on high-end video cards which are prohibitively expensive for most people, yet many of the newer models, when quantized, run perfectly well on CPUs. Instead of chasing speed with money, seeing what can run decently on available hardware will end up having a much bigger potential impact on a greater number of people.

As an experiment, I've been running llama.cpp on an old 2012 AMD Bulldozer system, which most people consider to be AMD's equivalent of Intel's Pentium 4, with 64 gigs of memory, and with newer models it's surprisingly usable, if not entirely practical. It's much more usable, in my opinion, than spending energy trying to get everything to fit into more modest GPUs' smaller amounts of VRAM.

It certainly shows that people shouldn't be dissuaded from playing around just because they have an older GPU and/or a GPU without much VRAM.
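One rough way to see how far a CPU-only box like that can be pushed is to time the same generation at several thread counts. A sketch, again assuming llama-cpp-python; the model path, prompt, and thread counts are arbitrary placeholders.

```python
# Rough CPU-only throughput check across thread counts (llama-cpp-python assumed).
# Results vary wildly with hardware; this only illustrates the measurement.
import time
from llama_cpp import Llama

MODEL = "./model.Q4_K_M.gguf"   # placeholder path to a quantized model
PROMPT = "Write a haiku about old CPUs."

for threads in (4, 8, 16):
    llm = Llama(model_path=MODEL, n_threads=threads, n_gpu_layers=0, verbose=False)
    start = time.time()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.time() - start
    tokens = out["usage"]["completion_tokens"]
    print(f"{threads} threads: {tokens / elapsed:.1f} tokens/s")
```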
cgearhart, over 1 year ago
What's the definition of "prompt processing" vs "token generation"?

Is that separately comparing the time it takes to preprocess the input prompt (prompt_length / pp_token_rate = time_to_first_token), and then the token generation rate is the time for each successive token?

I also see something about bs (batch size). Is batching relevant for a locally run model? (Usually you only have one prompt at a time, right?)
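That reading matches how the two numbers are usually combined: prompt processing covers the prefill of the whole input, and token generation covers each new output token. A back-of-the-envelope sketch with made-up rates (not figures from the post):

```python
# Illustrative arithmetic only: how prompt-processing (pp) and token-generation (tg)
# rates translate into latency. All numbers are made-up placeholders.
prompt_tokens = 512    # length of the input prompt
new_tokens = 128       # tokens to generate
pp_rate = 200.0        # prompt-processing speed, tokens/s (prefill, processed in batches)
tg_rate = 20.0         # token-generation speed, tokens/s (decoded one token at a time)

time_to_first_token = prompt_tokens / pp_rate   # 2.6 s of prefill before any output
generation_time = new_tokens / tg_rate          # 6.4 s to produce the new tokens
print(f"time to first token: {time_to_first_token:.1f} s")
print(f"total time: {time_to_first_token + generation_time:.1f} s")
```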
carterschonwald, over 1 year ago
Llama.cpp and other "inference at the edge" tools are really amazing pieces of engineering.
yieldcrv, over 1 year ago
Love that. I've been using Mistral 7B on my M1 and I thought it was tolerable, but it turned out I wasn't utilizing Metal, and now it's amazing.

8x7B nowadays.

As long as Metal is used on an iPhone, I could see it working well there too. I use 5-bit quantization on my laptop, but 4-bit seems very practical.
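For reference, the switch the commenter is describing comes down to whether model layers are offloaded to the GPU. A sketch assuming a Metal-enabled build of the llama-cpp-python bindings on Apple Silicon; the model path is a placeholder.

```python
# Sketch: CPU-only vs. Metal offload via llama-cpp-python on Apple Silicon
# (requires a build with Metal support; the GGUF path is a placeholder).
from llama_cpp import Llama

# CPU only: no layers offloaded to the GPU.
cpu_llm = Llama(model_path="./mistral-7b.Q5_K_M.gguf", n_gpu_layers=0)

# Metal: offload as many layers as possible (-1 means "all").
metal_llm = Llama(model_path="./mistral-7b.Q5_K_M.gguf", n_gpu_layers=-1)

print(metal_llm("Say hi.", max_tokens=16)["choices"][0]["text"])
```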
Havoc, over 1 year ago
Apple's stinginess with RAM in phones may come back to bite them on LLMs.
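The rough arithmetic behind that worry, with bit-widths and RAM figures that are illustrative assumptions rather than Apple specs:

```python
# Back-of-the-envelope weight footprint vs. phone RAM (illustrative assumptions only).
def model_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB, ignoring KV cache and runtime overhead."""
    return params_billion * bits_per_weight / 8

for params in (3, 7, 13):
    print(f"{params}B at ~4.5 bits/weight ≈ {model_gb(params, 4.5):.1f} GB")
# A 7B model needs roughly 4 GB for weights alone, which is a large share
# of a phone with 6-8 GB of total RAM.
```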
m3kw9, over 1 year ago
Testing the performance of an LLM without testing its quality isn't really practical in the real world, because if it's fast but the output is gibberish, the speed won't matter.

There should be 10-20 inputs and outputs that are tested for correctness, or something like that, in addition to t/s as a reference.
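A minimal sketch of the kind of combined check being suggested: a handful of prompts with expected answers, scored for correctness while also measuring tokens/s. It assumes llama-cpp-python; the model path and the two test cases are placeholders (in practice you would use the 10-20 cases the commenter suggests).

```python
# Tiny combined quality + throughput check (all test cases are illustrative placeholders).
import time
from llama_cpp import Llama

llm = Llama(model_path="./model.Q4_K_M.gguf", n_gpu_layers=0, verbose=False)

# "expect" is a substring the answer should contain.
cases = [
    {"prompt": "What is the capital of France? Answer in one word.", "expect": "Paris"},
    {"prompt": "What is 12 * 12? Answer with the number only.", "expect": "144"},
]

correct, total_tokens, total_time = 0, 0, 0.0
for case in cases:
    start = time.time()
    out = llm(case["prompt"], max_tokens=32)
    total_time += time.time() - start
    total_tokens += out["usage"]["completion_tokens"]
    correct += case["expect"].lower() in out["choices"][0]["text"].lower()

print(f"accuracy: {correct}/{len(cases)}, throughput: {total_tokens / total_time:.1f} tokens/s")
```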