Performance of llama.cpp on Apple Silicon A-series

100 points, by mobilio, over 1 year ago

7 comments

eminence32, over 1 year ago
I've been playing around a lot with llama.cpp recently, and it's making me re-think my predictions for the future...

Given how big these models are (and the steep cost for GPUs to load them), I had been thinking that most people would interact with them via some hosted API (like what OpenAI is offering) or via some product like Bard or Copilot which offloads inference to some big cloud datacenter.

But given how well some of these models perform on the CPU when quantized down to 4, 6, or 8 bits, I'm starting to think that there will be quite a few interesting applications for fully local inference on relatively modest hardware.
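For a concrete sense of what that kind of fully local, quantized inference looks like, here is a minimal sketch. It assumes the llama-cpp-python bindings and a 4-bit GGUF file already on disk; the model path and prompt are placeholders, not anything from the post.

```python
# Minimal local inference with a 4-bit quantized model (llama-cpp-python assumed).
# The GGUF path below is a placeholder; any quantized model downloaded locally works.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical 4-bit quantized model
    n_ctx=2048,       # context window
    n_threads=8,      # CPU threads used for inference
    n_gpu_layers=0,   # 0 = pure CPU, no GPU offload
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```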
johnklos, over 1 year ago
What's interesting is how there's so much emphasis on high-end video cards which are prohibitively expensive for most people, yet many of the newer models, when quantized, run perfectly well on CPUs. Instead of chasing speed with money, seeing what can run decently on available hardware will end up having a much bigger potential impact on a greater number of people.

As an experiment, I've been running llama.cpp on an old 2012 AMD Bulldozer system, which most people consider to be AMD's equivalent of Intel's Pentium 4, with 64 gigs of memory, and with newer models it's surprisingly usable, if not entirely practical. It's much more usable, in my opinion, than spending energy trying to get everything to fit into more modest GPUs' smaller amounts of VRAM.

It certainly shows that people shouldn't be dissuaded from playing around just because they have an older GPU and/or a GPU without much VRAM.
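One rough way to see how far a CPU-only box like that can be pushed is to time the same generation at several thread counts. A sketch, again assuming llama-cpp-python; the model path, prompt, and thread counts are arbitrary placeholders.

```python
# Rough CPU-only throughput check across thread counts (llama-cpp-python assumed).
# Results vary wildly with hardware; this only illustrates the measurement.
import time
from llama_cpp import Llama

MODEL = "./model.Q4_K_M.gguf"   # placeholder path to a quantized model
PROMPT = "Write a haiku about old CPUs."

for threads in (4, 8, 16):
    llm = Llama(model_path=MODEL, n_threads=threads, n_gpu_layers=0, verbose=False)
    start = time.time()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.time() - start
    tokens = out["usage"]["completion_tokens"]
    print(f"{threads} threads: {tokens / elapsed:.1f} tokens/s")
```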
cgearhart, over 1 year ago
What's the definition of "prompt processing" vs "token generation"?

Is that separately comparing the time it takes to preprocess the input prompt (prompt_length / pp_token_rate = time_to_first_token), and then the token generation rate is the time for each successive token?

I also see something about bs (batch size). Is batching relevant for a locally run model? (Usually you only have one prompt at a time, right?)
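That reading matches how the two numbers are usually combined: prompt processing covers the prefill of the whole input, and token generation covers each new output token. A back-of-the-envelope sketch with made-up rates (not figures from the post):

```python
# Illustrative arithmetic only: how prompt-processing (pp) and token-generation (tg)
# rates translate into latency. All numbers are made-up placeholders.
prompt_tokens = 512    # length of the input prompt
new_tokens = 128       # tokens to generate
pp_rate = 200.0        # prompt-processing speed, tokens/s (prefill, processed in batches)
tg_rate = 20.0         # token-generation speed, tokens/s (decoded one token at a time)

time_to_first_token = prompt_tokens / pp_rate   # 2.6 s of prefill before any output
generation_time = new_tokens / tg_rate          # 6.4 s to produce the new tokens
print(f"time to first token: {time_to_first_token:.1f} s")
print(f"total time: {time_to_first_token + generation_time:.1f} s")
```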
carterschonwald, over 1 year ago
Llama.cpp and other "inference at the edge" tools are really amazing pieces of engineering.
yieldcrv, over 1 year ago
Love that. I've been using Mistral 7B on my M1 and I thought it was tolerable, but it turned out I wasn't utilizing Metal, and now it's amazing.

8x7B nowadays.

As long as Metal is used on an iPhone, I could see it working well there too. I use 5-bit quantization on my laptop, but 4-bit seems very practical.
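For reference, the switch the commenter is describing comes down to whether model layers are offloaded to the GPU. A sketch assuming a Metal-enabled build of the llama-cpp-python bindings on Apple Silicon; the model path is a placeholder.

```python
# Sketch: CPU-only vs. Metal offload via llama-cpp-python on Apple Silicon
# (requires a build with Metal support; the GGUF path is a placeholder).
from llama_cpp import Llama

# CPU only: no layers offloaded to the GPU.
cpu_llm = Llama(model_path="./mistral-7b.Q5_K_M.gguf", n_gpu_layers=0)

# Metal: offload as many layers as possible (-1 means "all").
metal_llm = Llama(model_path="./mistral-7b.Q5_K_M.gguf", n_gpu_layers=-1)

print(metal_llm("Say hi.", max_tokens=16)["choices"][0]["text"])
```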
Havoc, over 1 year ago
Apple's stinginess with RAM in phones may come back to bite them on LLMs.
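The rough arithmetic behind that worry, with bit-widths and RAM figures that are illustrative assumptions rather than Apple specs:

```python
# Back-of-the-envelope weight footprint vs. phone RAM (illustrative assumptions only).
def model_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB, ignoring KV cache and runtime overhead."""
    return params_billion * bits_per_weight / 8

for params in (3, 7, 13):
    print(f"{params}B at ~4.5 bits/weight ≈ {model_gb(params, 4.5):.1f} GB")
# A 7B model needs roughly 4 GB for weights alone, which is a large share
# of a phone with 6-8 GB of total RAM.
```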
m3kw9, over 1 year ago
Testing the performance of an LLM without testing its quality isn't really practical in the real world, because if it's fast but the output is gibberish, the speed won't matter.

There should be 10-20 inputs and outputs that are tested for correctness, or something like that, in addition to t/s as a reference.
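A minimal sketch of the kind of combined check being suggested: a handful of prompts with expected answers, scored for correctness while also measuring tokens/s. It assumes llama-cpp-python; the model path and the two test cases are placeholders (in practice you would use the 10-20 cases the commenter suggests).

```python
# Tiny combined quality + throughput check (all test cases are illustrative placeholders).
import time
from llama_cpp import Llama

llm = Llama(model_path="./model.Q4_K_M.gguf", n_gpu_layers=0, verbose=False)

# "expect" is a substring the answer should contain.
cases = [
    {"prompt": "What is the capital of France? Answer in one word.", "expect": "Paris"},
    {"prompt": "What is 12 * 12? Answer with the number only.", "expect": "144"},
]

correct, total_tokens, total_time = 0, 0, 0.0
for case in cases:
    start = time.time()
    out = llm(case["prompt"], max_tokens=32)
    total_time += time.time() - start
    total_tokens += out["usage"]["completion_tokens"]
    correct += case["expect"].lower() in out["choices"][0]["text"].lower()

print(f"accuracy: {correct}/{len(cases)}, throughput: {total_tokens / total_time:.1f} tokens/s")
```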