As the father of a young child whose optic nerves are severely deteriorated (compression) and who is expected to lose his sight (when exactly is unknown; based on the original projections he should be blind by now, but an experimental treatment run in a trial at the NIH (KEEP FUNDING SCIENCE) has stabilized it), I'm overjoyed by the advances being made in VLMs. I can now envision a future where, even if he loses his sight, he'll be able to interact with the world around him, go to college, have a fulfilling career (he loves science and engineering, and is talented for his young age), and so on.
2 GB for the smallest 0.5B model. It doesn't make sense for each app to download this; Apple must have plans to pre-load these models at the OS level and expose an SDK so all apps can call them locally. Exciting times!

I opened an issue asking them to confirm this: <a href="https://github.com/apple/ml-fastvlm/issues/7">https://github.com/apple/ml-fastvlm/issues/7</a>
It feels like this is the level of time-to-first-token speed-up needed to make continuous vision useful for on-device applications, like an assistant that can see and act on your screen, à la the original Apple Intelligence demos. The app in the repo looks very impressive, and I'm excited to build it tonight and play around.
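For concreteness, time-to-first-token is easy to measure once a model sits behind a streaming endpoint. Below is a rough Python sketch, assuming an OpenAI-compatible server streaming SSE chunks at localhost:8080; the URL and payload are placeholders, not anything from the FastVLM repo, and it's a text-only request for brevity (for a VLM you'd attach an image, which is where the vision-encoding cost shows up).

    import json
    import time

    import requests  # pip install requests

    # Assumed local OpenAI-compatible server (placeholder URL).
    URL = "http://localhost:8080/v1/chat/completions"

    def time_to_first_token(prompt: str) -> float:
        """Return seconds from sending the request until the first streamed token arrives."""
        payload = {
            "stream": True,
            "max_tokens": 32,
            "messages": [{"role": "user", "content": prompt}],
        }
        start = time.perf_counter()
        with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                # SSE chunks look like: b"data: {...json...}" or b"data: [DONE]"
                if not line or not line.startswith(b"data: "):
                    continue
                chunk = line[len(b"data: "):]
                if chunk == b"[DONE]":
                    break
                delta = json.loads(chunk)["choices"][0].get("delta", {})
                if delta.get("content"):
                    return time.perf_counter() - start
        raise RuntimeError("stream ended before any token arrived")

    if __name__ == "__main__":
        print(f"TTFT: {time_to_first_token('Describe this scene.') * 1000:.0f} ms")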
I built (and am still building) a realtime voice+vision app called Sen; it's currently live in beta and streams frames over WebRTC. It's fast and smart, but I'm super curious to see how these models do as we get closer to the metal. I can see them running on-device in the future with super fast TTFB.
Very nice! I wish they were more keen to contribute to the AI/ML community and publish the weights and model definition on HuggingFace.
Funnily enough, I saw a similar demo just today that uses a freely available VLM: <a href="https://github.com/ngxson/smolvlm-realtime-webcam">https://github.com/ngxson/smolvlm-realtime-webcam</a>
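For anyone who wants to try something similar on their own machine, here's a minimal Python sketch in the spirit of that demo. It assumes a llama.cpp-style server with an OpenAI-compatible /v1/chat/completions endpoint running at localhost:8080 and a vision-capable model already loaded; the port, prompt, and one-frame-per-second cadence are all assumptions, not details from either repo.

    import base64
    import time

    import cv2        # pip install opencv-python
    import requests   # pip install requests

    # Assumed local llama.cpp-style server (placeholder URL).
    SERVER = "http://localhost:8080/v1/chat/completions"

    def describe_frame(frame) -> str:
        # JPEG-encode the frame and wrap it as a data URL, the format
        # OpenAI-compatible vision endpoints accept for image input.
        ok, jpg = cv2.imencode(".jpg", frame)
        if not ok:
            raise RuntimeError("JPEG encoding failed")
        data_url = "data:image/jpeg;base64," + base64.b64encode(jpg.tobytes()).decode()
        payload = {
            "max_tokens": 64,
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe what you see in one sentence."},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }],
        }
        resp = requests.post(SERVER, json=payload, timeout=60)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    if __name__ == "__main__":
        cap = cv2.VideoCapture(0)          # default webcam
        try:
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                print(describe_frame(frame))
                time.sleep(1)              # roughly one description per second
        finally:
            cap.release()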
It seems that the future of robotics is VLA (vision-language-action) models. Even Tesla FSD is an end-to-end VLA model. Efficient vision encoding will be a huge part of making robots safe and responsive.
With that, a really helpful aid for blind people could be built, running entirely on their phone and fed from a camera in their eyeglasses. Somebody who couldn't move around without an assistant could become autonomous in daily life.
I'm absolutely thrilled that there is an effort to make models smaller and run them with fewer resources, instead of blindly throwing more resources at the problem and expecting it to be solved.
Distributing this heavy compute and moving it close to the device, where (1) the data originates and (2) the decision and output from the analysis are acted on, is the way to go: super low latency, no network traffic, privacy, and less overhead in the cloud. This is amazing.
Um, wow. The on-device realtime videos are worth a watch, and compelling. Looking forward to this being deployed and widely adopted. A much faster time to first token opens up a ton of features and usability benefits.
I have a feeling that feeding Tesseract the image every second would be significantly faster and take far less space and processing power. I haven't tested it yet, but given how fast Tesseract is on large images, it wouldn't surprise me.
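If anyone wants to sanity-check that hunch, a quick timing harness is easy to throw together. This is only a sketch: it assumes the pytesseract wrapper and the Tesseract binary are installed, and frame.png is a placeholder for whatever screenshot or camera frame you want to test.

    import time

    import pytesseract          # pip install pytesseract (needs the tesseract binary on PATH)
    from PIL import Image       # pip install pillow

    IMAGE_PATH = "frame.png"    # placeholder: any screenshot or camera frame

    def time_ocr(path: str, runs: int = 5) -> None:
        img = Image.open(path)
        # Warm-up run so one-time startup cost doesn't skew the numbers.
        pytesseract.image_to_string(img)
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            text = pytesseract.image_to_string(img)
            times.append(time.perf_counter() - start)
        print(f"avg {sum(times) / runs * 1000:.1f} ms over {runs} runs")
        print(f"first 80 chars of output: {text[:80]!r}")

    if __name__ == "__main__":
        time_ocr(IMAGE_PATH)

Worth noting the comparison only holds for OCR-style workloads: Tesseract gives you the text in the frame, while a VLM can answer open-ended questions about it.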