
FastVLM: Efficient vision encoding for vision language models

367 points, by nhod, 8 days ago

19 comments

insane_dreamer, 8 days ago
As the father of a young child whose optic nerves are highly deteriorated (compression) and who is expected to lose his sight, I'm overjoyed with the advances being made in VLMs. (When exactly is unknown; based on the original projections he should be blind by now, but an experimental treatment run in a trial at the NIH has stabilized his sight. KEEP FUNDING SCIENCE.) I can now envision a future where, even if he loses his sight, he'll be able to interact with the world around him, go to college, and have a fulfilling career (he loves science and engineering, and is talented for his young age).
nikolayasdf123, 8 days ago
2GB for the 0.5B (smallest) model. It doesn't make sense for each app to download this; Apple must have plans to pre-load these models at the OS level and expose an SDK for all apps to call them locally. Exciting times!

I opened an issue for them to confirm this: https://github.com/apple/ml-fastvlm/issues/7
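The 2GB figure is consistent with a back-of-envelope weights-only estimate: 0.5B parameters at fp32 is 2GB, and quantization shrinks it from there. A quick sketch (the precision list is illustrative, not FastVLM's actual export formats):

```python
# Rough on-disk size of a 0.5B-parameter model at common precisions.
PARAMS = 0.5e9

def model_size_gb(params: float, bytes_per_param: float) -> float:
    """Weights-only estimate; real checkpoints add some overhead."""
    return params * bytes_per_param / 1e9

for name, bytes_per in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: {model_size_gb(PARAMS, bytes_per):.2f} GB")
# fp32: 2.00 GB / fp16: 1.00 GB / int8: 0.50 GB / int4: 0.25 GB
```

Which is part of why sharing one OS-level copy (plus quantized variants) beats every app bundling its own.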
liamwire, 8 days ago
It feels like this is the level of speed-up in time-to-first-token needed to make continuous vision useful for on-device applications, like an assistant that can see and take action on your screen, à la the original Apple Intelligence demos. The app in the repo is very impressive, and I'm excited to build it tonight and play around.
Aeroi, 8 days ago
I built (and am still building) a realtime voice+vision app called Sen. It's currently live in beta and streams frames over WebRTC. It's fast and smart, but I'm super curious to see how these models do as we get closer to the metal. I can see them running on-device in the future with super fast TTFB.
d3k, 8 days ago
Very nice! I wish they were more keen to contribute to the AI/ML community and publish the weights and model definition on Hugging Face. Funnily enough, I just saw a similar demo today that uses a freely available VLM: https://github.com/ngxson/smolvlm-realtime-webcam
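The linked demo (and the realtime apps mentioned above) boils down to a capture-describe loop. A minimal sketch, with the camera and VLM backends left as swappable callables, since the actual backend (a llama.cpp server, FastVLM, SmolVLM, etc.) is an assumption about your setup:

```python
import time
from typing import Callable, Iterator, Optional

def describe_frames(
    capture_frame: Callable[[], bytes],   # returns one encoded frame, e.g. JPEG bytes
    describe: Callable[[bytes], str],     # any image->text backend (local VLM, HTTP endpoint)
    interval_s: float = 1.0,
    max_frames: Optional[int] = None,
) -> Iterator[str]:
    """Grab a frame, ask the model to describe it, repeat roughly once per interval."""
    done = 0
    while max_frames is None or done < max_frames:
        start = time.monotonic()
        yield describe(capture_frame())
        done += 1
        # Sleep off whatever is left of the interval, so a slow model
        # naturally degrades to "as fast as possible" instead of queueing.
        remaining = interval_s - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
```

Plugging in a webcam grab for `capture_frame` and a local VLM call for `describe` reproduces the demo's shape; a faster time-to-first-token directly shrinks the usable interval.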
porphyra, 8 days ago
It seems that the future of robotics is vision-language-action (VLA) models. Even Tesla FSD is an end-to-end VLA model. Efficient vision encoding will be a huge part of making robots safe and responsive.
nine_k, 8 days ago
With that, a really helpful aid for blind people could be made, running entirely on their phone, fed by a camera in their eyeglasses. Someone who could not move around without an assistant could become autonomous in daily life.
labadal, 7 days ago
I'm absolutely thrilled that there's an effort to make models smaller and run them with fewer resources, instead of blindly throwing more resources at the problem and expecting it to get solved.
lynx97, 8 days ago
I wonder, can I convert/run this with llama.cpp? It being LLaVA-based seems promising.
BryanLegend, 8 days ago
Seems like the main thing holding these new minds back is being able to see well. Breakthroughs like this will fix that.
nikolayasdf123, 8 days ago
Distributing this heavy compute and moving it close to the device where (1) the source data originates and (2) decisions and outputs about the analysis are made is the way to go: super low latency, no network traffic, privacy, and less overhead in the cloud. This is amazing.
adamsiem, 8 days ago
Is anyone using vision to parse screenshots? QVQ was too slow. I'll give this a shot.
nikolayasdf123, 8 days ago
Google and the cloud LLM providers must be gritting their teeth now! Haha.
buyucu, 8 days ago
Where is my GGUF?
turnsout, 8 days ago
Apple has gotten off to a slow start in the LLM world, but they have the only long-term strategy that makes sense. They're going to dominate the 2030s.
vessenes, 8 days ago
Um wow. The on-device realtime videos are worth a watch, and compelling. Looking forward to this being deployed and widely adopted. Getting much faster time to first token opens up a ton of features and usability benefits.
vFunct, 8 days ago
Can it fill a wine glass to the rim?
simianparrot, 8 days ago
I have a feeling that feeding Tesseract the image every second would be significantly faster and take far less space and processing power. I haven't tested it yet, but given how fast Tesseract is on large images, it wouldn't surprise me.
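That hunch is cheap to check empirically. A minimal timing harness that works for either backend; `pytesseract.image_to_string` for the OCR side, or any VLM call, are assumptions about what you would plug in:

```python
import time
from typing import Any, Callable

def best_latency(recognize: Callable[[Any], str], image: Any, runs: int = 5) -> float:
    """Best-of-N wall-clock seconds for one image->text call.

    Best-of-N filters out warm-up and scheduler noise; pass the same
    image each run so only the recognizer varies.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        recognize(image)
        best = min(best, time.perf_counter() - start)
    return best
```

Comparing the OCR call against the VLM call on the same screenshot settles the speed question, though note OCR only answers "what text is on screen", not the open-ended questions a VLM can handle.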
kamranjon, 8 days ago
Apple out here playing 5D chess: installing neural cores in their hardware and writing crazy-efficient vision models to run on them. Cool stuff.