Vision Now Available in Llama.cpp

544 points by redman25 5 days ago

21 comments

dust42 4 days ago

To add some numbers: on an MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get

    25 t/s prompt processing
    63 t/s token generation

Overall processing time per image is ~15 seconds, no matter what size the image is. The small 4B model already gives very decent output, describing different images pretty well.

Steps to reproduce:

    git clone https://github.com/ggml-org/llama.cpp.git
    cd llama.cpp
    cmake -B build
    cmake --build build --config Release -j 12 --clean-first
    # download model and mmproj files...
    build/bin/llama-server \
      --model gemma-3-4b-it-Q4_K_M.gguf \
      --mmproj mmproj-model-f16.gguf

Then open http://127.0.0.1:8080/ for the web interface.

Note: if you are not using -hf, you must include the --mmproj switch, otherwise the web interface gives an error message that the model does not support multimodal input.

I used the official ggml-org/gemma-3-4b-it-GGUF quants; I expect the unsloth quants from danielhanchen to be a bit faster.
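Once the server is up, the same endpoint the web UI talks to can be scripted. A minimal sketch, assuming the server's OpenAI-compatible /v1/chat/completions endpoint accepts base64 image_url parts (photo.jpg is a hypothetical file):

    # Encode an image and ask the model about it (a sketch, not official docs)
    IMG=$(base64 < photo.jpg | tr -d '\n')
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d "{\"messages\": [{\"role\": \"user\", \"content\": [
            {\"type\": \"text\", \"text\": \"Describe this image.\"},
            {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,$IMG\"}}
          ]}]}"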
danielhanchen 4 days ago

It works super well!

You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.

I made some quants with vision support - literally run:

    ./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1
    ./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1
    ./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1
    ./llama.cpp/llama-mtmd-cli -hf unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1

Then load an image with /image image.png inside the chat, and chat away!

EDIT: -ngl -1 is no longer needed for Metal backends (CUDA still needs it), since llama.cpp now auto-offloads to the GPU by default. -1 means all layers are offloaded to the GPU.
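For reference, a session then looks something like this (a sketch; cat.png is a hypothetical file):

    ./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL
    > /image cat.png
    > What breed is this cat?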
ngxson 4 days ago

We also support the SmolVLM series, which delivers light-speed responses thanks to its mini size!

This is perfect for a real-time home video surveillance system. That's one of the ideas for my next hobby project!

    llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
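A sketch of that surveillance idea: grab a frame periodically and ask the model about it. Everything here is an assumption (a Linux v4l2 webcam at /dev/video0, ffmpeg installed, llama-server running a SmolVLM model on port 8080):

    while true; do
      # capture a single webcam frame (assumed Linux/v4l2 setup)
      ffmpeg -y -loglevel error -f v4l2 -i /dev/video0 -frames:v 1 frame.jpg
      IMG=$(base64 < frame.jpg | tr -d '\n')
      curl -s http://127.0.0.1:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d "{\"messages\": [{\"role\": \"user\", \"content\": [
              {\"type\": \"text\", \"text\": \"Is anyone at the door? Answer yes or no.\"},
              {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,$IMG\"}}
            ]}]}"
      sleep 5
    done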
simonw 5 days ago

This is the most useful documentation I've found so far for understanding how this works: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd#multimodal-support-in-llamacpp
banana_giraffe 4 days ago

I used this to create keywords and descriptions for a bunch of photos from a recent trip, using Gemma 3 4B. Works impressively well, including doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.

Very nice for something that's self-hosted.
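A batch version of this workflow might look like the following sketch (it assumes llama-mtmd-cli accepts --image and -p for one-shot runs, as the older llava-cli did; paths are hypothetical):

    for f in photos/*.jpg; do
      ./build/bin/llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF \
        --image "$f" \
        -p "Give five keywords and a one-sentence description of this photo." \
        > "${f%.jpg}.txt"   # one text file of keywords per photo
    done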
simonw 4 days ago

llama.cpp offers compiled releases for multiple platforms. This release has the new vision features: https://github.com/ggml-org/llama.cpp/releases/tag/b5332

On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:

    unzip llama-b5332-bin-macos-arm64.zip
    cd build/bin
    sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib

Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370):

    ./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99

Or start the localhost 8080 web server (with a UI and API) like this:

    ./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99

I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/
thenthenthen 4 days ago

What has changed, in layman's terms? I tried llama.cpp a few months ago and it could already do image description, etc.
nico 5 days ago

How does this compare to using a multimodal model like Gemma 3 via Ollama?

Any benefit on a Mac with Apple silicon? Any experiences someone could share?
gitroom 4 days ago
Man, the ngl abbreviation gets me every time too. Kinda cool seeing all the tweaks folks do to make this stuff run faster on their Macs. You think models hitting these speed boosts will mean more people start playing with vision stuff at home?
dr_kiszonka 4 days ago

Are there any tools that leverage vision for UI development?

Use case: I am working on a hobby project that uses TS/React for the frontend. I can use local or cloud LLMs in VS Code, but even those with vision require that I take a screenshot and paste it into a chat. Ideally, I would want it all automated until some stop criterion is met (even if only n iterations). But even an extension that would screenshot a preview and paste it into the chat (triggered by a keyboard shortcut) would be a big time-saver.
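The keyboard-shortcut version is close to a one-liner on macOS, where screencapture is built in; the rest is a sketch that assumes a local llama-server on port 8080 and reuses the request shape shown earlier in the thread:

    screencapture -x /tmp/preview.png   # -x: capture silently, no camera sound
    IMG=$(base64 < /tmp/preview.png | tr -d '\n')
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d "{\"messages\": [{\"role\": \"user\", \"content\": [
            {\"type\": \"text\", \"text\": \"Critique this React UI and suggest concrete fixes.\"},
            {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,$IMG\"}}
          ]}]}"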
a_e_k 4 days ago

This is excellent. I've been pulling and rebuilding periodically, and watching the commit notes as they (mostly ngxson, I think) first added more vision models, each with its own CLI program, then unified those under a single CLI program and deprecated the standalone ones, all while fixing bugs and improving the image processing. I'd been hoping that meant they'd eventually add support to the server again, and now it's here! Thanks!
gryfft 5 days ago

Seems like another step change. The first time I ran a local LLM on my phone and carried on a fairly coherent conversation, I imagined edge inference would take off really quickly, at least with e.g. personal assistant/"digital waifu" business cases. I wonder what the next wave of apps built on Llama.cpp and its downstream technologies will do to the global economy in the next three months.
yieldcrv 4 days ago

Finally! Open-source multimodal is so far behind the closed-source options that people don't even try to benchmark it.

They're still doing text and math tests on every new model because it's so bad.
behnamoh 4 days ago

Didn't llama.cpp use to have vision support last year or so?
jacooper 4 days ago

Is it possible to run multimodal LLMs using the Vulkan backend? I have a ton of 4GB GPUs lying around that only support Vulkan.
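For what it's worth, building with the Vulkan backend is a one-flag change (GGML_VULKAN is the current CMake option; whether a given vision model fits in 4GB is untested here):

    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release -j
    ./build/bin/llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF -ngl 99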
buyucu 4 days ago

It was really sad when vision support was removed a while back. It's great to see it restored. Many thanks to everyone involved!
mrs6969 4 days ago

So image processing is there, but image generation isn't?

Just trying to understand. Awesome work so far.
bsaul 4 days ago

Great news! Side note: does vision include the ability to read a PDF?
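The model consumes images rather than PDF files, so the usual workaround is to rasterize pages first. A minimal sketch, assuming poppler's pdftoppm and the one-shot --image flag mentioned elsewhere in the thread (output naming can vary by page count):

    pdftoppm -png -r 150 doc.pdf page    # writes page-1.png, page-2.png, ...
    ./build/bin/llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF \
      --image page-1.png -p "Transcribe the text on this page."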
nurettin 4 days ago

Didn't we already have vision via LLaVA?
nikolayasdf123 4 days ago

Finally! Very important use case. Glad they added it!
babuloseo 4 days ago

Someone ELI5 please, or give a TL;DR.