科技回声 (Tech Echo)

A tech news platform built with Next.js, offering global tech news and discussion.

© 2025 科技回声. All rights reserved.

FastVLM: Efficient vision encoding for vision language models

367 points | by nhod | 3 days ago

19 comments

insane_dreamer | 3 days ago
As the father of a young child whose optic nerves are highly deteriorated (compression) and is expected to lose his sight (when exactly is unknown; based on original projections he should be blind by now, but an experimental treatment run in a trial at the NIH (KEEP FUNDING SCIENCE) has stabilized his sight), I'm overjoyed with the advances being made in VLMs. I can now envision a future where even if he loses his sight he'll be able to interact with the world around him, go to college, have a fulfilling career (he loves science and engineering, and is talented for his young age), etc.
nikolayasdf123 | 3 days ago
2GB for the 0.5B smallest model. It does not make sense for each app to download this. Apple must have plans to pre-load these models at the OS level and expose an SDK for all apps to call them locally. Exciting times!

Opened an issue for them to confirm this: https://github.com/apple/ml-fastvlm/issues/7
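As a back-of-envelope check on the 2GB figure above, on-disk model size scales roughly as parameter count times bytes per parameter (a sketch that ignores tokenizer files and other overhead):

```python
def model_size_gb(params_billions: float, bytes_per_param: int) -> float:
    """Rough on-disk size in GB: the 1e9 params-per-billion and
    1e9 bytes-per-GB factors cancel out."""
    return params_billions * bytes_per_param

print(model_size_gb(0.5, 4))  # fp32: 2.0 GB, consistent with the ~2GB figure
print(model_size_gb(0.5, 2))  # fp16: 1.0 GB
print(model_size_gb(0.5, 1))  # int8: 0.5 GB
```

So the ~2GB download suggests fp32 weights; a quantized release would shrink it considerably.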
liamwire | 3 days ago
It feels like this is the required level of speed-up needed re. time-to-first-token to make continuous vision useful for on-device applications like an assistant that can see and take action on your screen, ala the original Apple Intelligence demos. It’s very impressive seeing the app in the repo and I’m excited to build it tonight and play around.
Aeroi | 3 days ago
I built/am building a realtime voice+vision app called Sen. It's currently live in beta and streams frames over WebRTC. It's fast and smart, but I'm super curious to see how these models do as we get closer to the metal. I can see these running on-device in the future with super fast TTFB.
d3k | 3 days ago
Very nice! I wish they were more keen to contribute to the AI/ML community and publish the weights and model definition on HuggingFace. Funny enough, I just saw a similar demo today that uses a freely available VLM: https://github.com/ngxson/smolvlm-realtime-webcam
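The client side of a realtime-webcam demo like the one linked above typically boils down to posting base64-encoded frames to a locally served VLM with an OpenAI-compatible chat endpoint (e.g. llama.cpp's `llama-server`). A minimal sketch of that request shape, with the model name and image bytes as placeholders:

```python
import base64

def build_vision_request(jpeg_bytes: bytes,
                         prompt: str = "Describe what you see.") -> dict:
    """Build an OpenAI-style chat payload with one text part and one
    base64 data-URL image part (placeholder model name)."""
    image_b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": "local-vlm",  # placeholder; depends on what the server loads
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }

req = build_vision_request(b"\xff\xd8 fake jpeg bytes")
print(req["messages"][0]["content"][0]["text"])
```

In a real loop you would capture a camera frame, encode it as JPEG, and POST this payload to the server's `/v1/chat/completions` endpoint once per interval.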
porphyra | 3 days ago
It seems that the future of robotics is VLA models. Even Tesla FSD is an end-to-end VLA model. Efficient vision encoding will be a huge part of making robots safe and responsive.
labadal | 3 days ago
I'm absolutely thrilled that there is an effort to make models smaller and run with fewer resources, instead of blindly throwing more resources at the problem and expecting it to get solved.
nine_k | 3 days ago
With that, a really helpful aid for blind people can be made, running just on their phone, fed from a camera in their eyeglasses. Somebody who could not move around without an assistant could become autonomous in daily life.
lynx97 | 3 days ago
I wonder, can I convert/run this with llama.cpp? It being LLaVA-based seems promising.
BryanLegend | 3 days ago
Seems like the main thing holding these new minds back is being able to see well. Breakthroughs like this will fix that.
nikolayasdf123 | 3 days ago
Distributing this heavy compute and moving it close to the device, where 1. the source data originates and 2. the decision and output about the analysis are made, is the way to go. Super low latency, no network traffic, privacy, less overhead in the cloud. This is amazing.
adamsiem | 3 days ago
Anyone using vision to parse screenshots? QVQ was too slow. Will give this a shot.
nikolayasdf123 | 3 days ago
Google and the cloud LLM providers must be gritting their teeth now! haha
buyucu | 3 days ago
where is my gguf?
turnsout | 3 days ago
Apple has gotten a slow start in the LLM world, but they have the only long term strategy that makes sense. They’re going to dominate the 2030s.
vessenes | 3 days ago
Um wow. The on-device realtime videos are worth a watch, and compelling. Looking forward to this being deployed and widely adopted. Getting much faster time to first token opens up a ton of features and usability benefits.
vFunct | 3 days ago
Can it fill a wine glass to the rim?
simianparrot | 3 days ago
I have a feeling that feeding Tesseract the image every 1 second would be significantly faster and take far less space and processing power. Haven't tested it yet, but given how fast Tesseract is on large images, it wouldn't surprise me.
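The once-per-second OCR idea above could be sketched as a simple polling loop. The frame source and OCR engine are injected here so the loop stays engine-agnostic; in practice one might plug in `PIL.ImageGrab.grab` and `pytesseract.image_to_string` (both assumptions, not something the commenter specified):

```python
import time

def poll_ocr(grab_frame, ocr, interval: float = 1.0,
             iterations: int = 3, sleep=time.sleep) -> list:
    """Grab a frame and run OCR on it every `interval` seconds,
    returning the recognized text for each iteration."""
    results = []
    for _ in range(iterations):
        results.append(ocr(grab_frame()))
        sleep(interval)
    return results

# Dry run with stand-ins, just to show the shape of the loop:
texts = poll_ocr(lambda: "fake-frame", lambda f: f.upper(), sleep=lambda s: None)
print(texts)  # ['FAKE-FRAME', 'FAKE-FRAME', 'FAKE-FRAME']
```

Note that Tesseract only extracts text, whereas a VLM also describes non-textual content, so the two are only comparable for the screen-reading use case.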
kamranjon | 3 days ago
Apple out here playing 5d chess, installing neural cores in their hardware and writing crazy efficient vision models to run on em. Cool stuff.