科技回声 (Tech Echo)

A tech news platform built with Next.js, offering global tech news and discussion.

© 2025 科技回声. All rights reserved.

FastVLM: Efficient vision encoding for vision language models

367 points | by nhod | 3 days ago

19 comments

insane_dreamer | 3 days ago
As the father of a young child whose optic nerves are highly deteriorated (compression) and is expected to lose his sight (when exactly is unknown; based on original projections he should be blind by now, but an experimental treatment run in a trial at the NIH (KEEP FUNDING SCIENCE) has stabilized his sight), I'm overjoyed with the advances being made in VLMs. I can now envision a future where even if he loses his sight he'll be able to interact with the world around him, go to college, have a fulfilling career (he loves science and engineering, and is talented for his young age), etc.
nikolayasdf123 | 3 days ago
2GB for the 0.5B smallest model. It does not make sense for each app to download this. Apple must have plans to pre-load these models at the OS level and expose an SDK for all apps to call them locally. Exciting times!

Opened an issue for them to confirm this: https://github.com/apple/ml-fastvlm/issues/7
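As a back-of-envelope check on the 2GB figure above, on-disk model size scales roughly as parameter count times bytes per parameter (a sketch that ignores tokenizer files and other overhead):

```python
def model_size_gb(params_billions: float, bytes_per_param: int) -> float:
    """Rough on-disk size in GB: the 1e9 params-per-billion and
    1e9 bytes-per-GB factors cancel out."""
    return params_billions * bytes_per_param

print(model_size_gb(0.5, 4))  # fp32: 2.0 GB, consistent with the ~2GB figure
print(model_size_gb(0.5, 2))  # fp16: 1.0 GB
print(model_size_gb(0.5, 1))  # int8: 0.5 GB
```

So the ~2GB download suggests fp32 weights; a quantized release would shrink it considerably.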
liamwire | 3 days ago
It feels like this is the required level of speed-up needed re. time-to-first-token to make continuous vision useful for on-device applications like an assistant that can see and take action on your screen, ala the original Apple Intelligence demos. It’s very impressive seeing the app in the repo and I’m excited to build it tonight and play around.
Aeroi | 3 days ago
I built/am building a realtime voice+vision app called Sen. It's currently live in beta and streams frames over WebRTC. It's fast and smart, but I'm super curious to see how these models do as we get closer to the metal. I can see these running on-device in the future with super fast TTFB.
d3k | 3 days ago
Very nice! I wish they were more keen to contribute to the AI/ML community and publish the weights and model definition on HuggingFace. Funny enough, I just saw a similar demo today that uses a freely available VLM: https://github.com/ngxson/smolvlm-realtime-webcam
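The client side of a realtime-webcam demo like the one linked above typically boils down to posting base64-encoded frames to a locally served VLM with an OpenAI-compatible chat endpoint (e.g. llama.cpp's `llama-server`). A minimal sketch of that request shape, with the model name and image bytes as placeholders:

```python
import base64

def build_vision_request(jpeg_bytes: bytes,
                         prompt: str = "Describe what you see.") -> dict:
    """Build an OpenAI-style chat payload with one text part and one
    base64 data-URL image part (placeholder model name)."""
    image_b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": "local-vlm",  # placeholder; depends on what the server loads
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }

req = build_vision_request(b"\xff\xd8 fake jpeg bytes")
print(req["messages"][0]["content"][0]["text"])
```

In a real loop you would capture a camera frame, encode it as JPEG, and POST this payload to the server's `/v1/chat/completions` endpoint once per interval.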
porphyra | 3 days ago
It seems that the future of robotics is VLA models. Even Tesla FSD is an end-to-end VLA model. Efficient vision encoding will be a huge part of making robots safe and responsive.
labadal | 3 days ago
I'm absolutely thrilled that there is an effort to make models smaller and run with fewer resources, instead of blindly throwing more resources at the problem and expecting it to get solved.
nine_k | 3 days ago
With that, a really helpful aid for blind people can be made, running just on their phone, fed from a camera in their eyeglasses. Somebody who could not move around without an assistant could become autonomous in daily life.
lynx97 | 3 days ago
I wonder, can I convert/run this with llama.cpp? It being LLaVA-based seems promising.
BryanLegend | 3 days ago
Seems like the main thing holding these new minds back is being able to see well. Breakthroughs like this will fix that.
nikolayasdf123 | 3 days ago
Distributing this heavy compute and moving it close to the device, where 1. the source data originates and 2. the decision and output about the analysis are made, is the way to go. Super low latency, no network traffic, privacy, less overhead in the cloud. This is amazing.
adamsiem | 3 days ago
Anyone using vision to parse screenshots? QVQ was too slow. Will give this a shot.
nikolayasdf123 | 3 days ago
Google and the cloud LLM providers must be gritting their teeth now! haha
buyucu | 3 days ago
where is my gguf?
turnsout | 3 days ago
Apple has gotten a slow start in the LLM world, but they have the only long term strategy that makes sense. They’re going to dominate the 2030s.
vessenes | 3 days ago
Um wow. The on-device realtime videos are worth a watch, and compelling. Looking forward to this being deployed and widely adopted. Getting much faster time to first token opens up a ton of features and usability benefits.
vFunct | 3 days ago
Can it fill a wine glass to the rim?
simianparrot | 3 days ago
I have a feeling that feeding Tesseract the image every 1 second would be significantly faster and take far less space and processing power. Haven't tested it yet, but given how fast Tesseract is on large images, it wouldn't surprise me.
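The once-per-second OCR idea above could be sketched as a simple polling loop. The frame source and OCR engine are injected here so the loop stays engine-agnostic; in practice one might plug in `PIL.ImageGrab.grab` and `pytesseract.image_to_string` (both assumptions, not something the commenter specified):

```python
import time

def poll_ocr(grab_frame, ocr, interval: float = 1.0,
             iterations: int = 3, sleep=time.sleep) -> list:
    """Grab a frame and run OCR on it every `interval` seconds,
    returning the recognized text for each iteration."""
    results = []
    for _ in range(iterations):
        results.append(ocr(grab_frame()))
        sleep(interval)
    return results

# Dry run with stand-ins, just to show the shape of the loop:
texts = poll_ocr(lambda: "fake-frame", lambda f: f.upper(), sleep=lambda s: None)
print(texts)  # ['FAKE-FRAME', 'FAKE-FRAME', 'FAKE-FRAME']
```

Note that Tesseract only extracts text, whereas a VLM also describes non-textual content, so the two are only comparable for the screen-reading use case.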
kamranjon | 3 days ago
Apple out here playing 5d chess, installing neural cores in their hardware and writing crazy efficient vision models to run on em. Cool stuff.