Visual Reasoning Is Coming Soon

122 points by softwaredoug about 1 month ago

11 comments

AIPedant about 1 month ago

This seems to ignore the mixed record of video generation models:

> For visual reasoning practice, we can do supervised fine-tuning on sequences similar to the marble example above. For instance, to understand more about the physical world, we can show the model sequential pictures of Slinkys going down stairs, or basketball players shooting 3-pointers, or people hammering birdhouses together.... But where will we get all this training data? For spatial and physical reasoning tasks, we can leverage computer graphics to generate synthetic data. This approach is particularly valuable because simulations provide a controlled environment where we can create scenarios with known outcomes, making it easy to verify the model's predictions. But we'll also need real-world examples. Fortunately, there's an abundance of video content online that we can tap into. While initial datasets might require human annotation, soon models themselves will be able to process videos and their transcripts to extract training examples automatically.

Almost every video generator makes constant "folk physics" errors and doesn't understand object permanence. DeepMind's Veo 2 is very impressive but still struggles with object permanence and qualitatively nonsensical physics: https://x.com/Norod78/status/1894438169061269750

Humans do not learn these things by pure observation (newborns understand object permanence, and I suspect this is the case for all vertebrates). I doubt transformers are capable of learning it as robustly, even if trained on all of YouTube. There will always be "out of distribution" physical nonsense involving mistakes humans (or lizards) would never make, even if they've never seen the specific objects.
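The synthetic-data idea quoted in the comment above is concrete enough to sketch. Below is a toy, illustrative example (my own, not from the article) of a "simulation with a known outcome": a tiny drop simulation that emits a frame sequence plus an automatically verifiable label.

```python
# Minimal sketch of the quoted synthetic-data idea: a toy 2D simulation
# produces frame sequences whose outcome is known and can be checked
# automatically. Illustrative only; a real pipeline would use a proper
# renderer and physics engine.
import numpy as np

def simulate_drop(x0: float, y0: float, steps: int = 8, dt: float = 0.1):
    """Drop a ball from (x0, y0) under gravity; return frames and the final pixel."""
    g = 9.8
    frames = []
    y, vy = y0, 0.0
    for _ in range(steps):
        frame = np.zeros((32, 32), dtype=np.uint8)   # blank 32x32 "image"
        row = min(31, max(0, int(31 - y)))           # map height to a pixel row
        col = min(31, max(0, int(x0)))
        frame[row, col] = 255                        # draw the ball
        frames.append(frame)
        vy -= g * dt                                 # integrate simple physics
        y = max(0.0, y + vy * dt)                    # floor at y = 0
    return frames, {"final_row": row, "final_col": col}  # known outcome = label

# One supervised example: a sequence of frames plus verifiable ground truth.
frames, label = simulate_drop(x0=16, y0=30)
print(len(frames), label)
```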
nkingsy about 1 month ago

The example of the cat and detective hat shows that even with the latest update, it isn't "editing" the image. The generated cat is younger, with bigger, brighter eyes, more "perfect" ears.

I found that when editing images of myself, the result looked weird, like a funky version of me. For the cat, it looks "more attractive" I guess, but for humans (and I'd imagine for a cat looking at the edited cat with a keen eye for cat faces), the features often don't work together when changed slightly.
Tiberium about 1 month ago

It's sad that they used 4o's image generation feature for the cat example, which does some diffusion or something else that results in the whole image changing. They should've instead used Gemini 2.0 Flash's image generation feature (or at least mentioned it!), which, even if far lower quality and resolution (max of 1024x1024, but Gemini will try to match the resolution of the original image, so you can get something like 681x1024), is much, much better at leaving the untouched parts of the image actually "untouched".

Here's the best out of a few attempts for a really similar prompt, more detailed since Flash is a much smaller model: "Give the cat a detective hat and a monocle over his right eye, properly integrate them into the photo." You can see how the rest of the image is practically untouched to the naked human eye: https://ibb.co/zVgDbqV3

Honestly, Google has been really good at catching up in the LLM race, and their modern models like 2.0 Flash and 2.5 Pro are among the best (or the best) in their respective areas. I hope they'll scale up their image generation feature to base it on 2.5 Pro (or maybe 3 Pro by the time they do it) for higher quality and prompt adherence.

If you want, you can give 2.0 Flash image gen a try for free (with generous limits) at https://aistudio.google.com/prompts/new_chat, just select it in the model selector on the right.
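For anyone who would rather script the Gemini 2.0 Flash image-editing experiment described above than use AI Studio, here is a minimal sketch using the google-genai Python SDK. The model identifier and SDK surface are taken from Google's published quickstart and may have changed since, so treat them as assumptions.

```python
# Sketch: ask Gemini 2.0 Flash image generation to edit a photo.
# Assumes the `google-genai` package and Pillow; model name may differ by release.
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")  # hypothetical placeholder key
cat = Image.open("cat.jpg")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp-image-generation",  # assumed identifier
    contents=[
        "Give the cat a detective hat and a monocle over his right eye, "
        "properly integrate them into the photo.",
        cat,
    ],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# Save the first image part returned by the model.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("cat_detective.png")
```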
uaas about 1 month ago

> Rather watch than read? Hey, I get it - sometimes you just want to kick back and watch! Check out this quick video where I walk through everything in this post

Hm, no, I've never had this thought.
rel_ic about 1 month ago

The inconsistency of an optimistic blog post ending with a picture of a terminator robot makes me think this author isn't taking themself seriously enough. Or - the author is the terminator robot?
porphyra about 1 month ago

I think one reason that humans are so good at understanding images is that our eyes see video rather than still images. Video lets us see "cause and effect" by showing what happens after something. It also allows us to grasp the 3D structure of things, since we will almost always see everything from multiple angles. So long as we just feed a big bunch of stills into training these models, they will struggle to understand how things affect one another.
District5524 about 1 month ago

The first caption of the cat picture may be a bit misleading for those who are not sure how this works: "The best a traditional LLM can do when asked to give it a detective hat and monocle." The role of the traditional LLM in creating a picture is quite minimal (if any LLM is used at all); it might just tweak the prompt a bit for the diffusion model. It was definitely not the LLM that created the picture: https://platform.openai.com/docs/guides/image-generation 4o image generation is surely a bit different, but I don't really have that kind of more precise technical information (there must indeed be a specialized transformer model involved, linking tokens to pixels: https://openai.com/index/introducing-4o-image-generation/)
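To make the hand-off described above concrete, here is a minimal sketch of the classic pipeline where an LLM only rewrites the request and a separate diffusion model renders it. It assumes the openai Python SDK; the model names are illustrative, and this is not how 4o's native image generation works.

```python
# Sketch of the "LLM hands a prompt to a diffusion model" pipeline.
# Assumes the `openai` Python SDK; model names are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

user_request = "Give my cat a detective hat and a monocle."

# Step 1: the LLM only rewrites the request into a detailed image prompt.
rewrite = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Rewrite the user's request as a detailed prompt for an "
                    "image generation model. Output only the prompt."},
        {"role": "user", "content": user_request},
    ],
)
image_prompt = rewrite.choices[0].message.content

# Step 2: a separate diffusion model actually creates the picture.
image = client.images.generate(model="dall-e-3", prompt=image_prompt,
                               size="1024x1024")
print(image.data[0].url)  # URL of the generated image
```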
CSMastermind about 1 month ago

What's interesting to me is how many of these advancements are just obvious next steps for these tools. Chain of thought, tree of thought, mixture of experts, etc. are things you'd come up with in the first 10 minutes of thinking about improving LLMs.

Of course, the devil's always in the details, and there have been real non-obvious advancements at the same time.
anxoo about 1 month ago

"I set a plate on a table, and glass next to it. I set a marble on the plate. Then I pick up the marble, drop it in the glass. Then I turn the glass upside down and set it on the plate. Then, I pick up the glass and put it in the microwave. Where is the marble?"

The author claims that visual reasoning will help the model solve this problem, noting that GPT-4o got the question right after making a mistake at the beginning of the response. I asked GPT-4o, Claude 3.7, and Gemini 2.5 Pro Experimental, which all answered 100% correctly.

The author also demonstrates trying to do "visual reasoning" with GPT-4o, notes that the model got it wrong, then handwaves it away by saying the model wasn't trained for visual reasoning.

"Visual reasoning" is a tweet-worthy thought that the author completely fails to justify.
KTibow about 1 month ago

I've seen some speculate that o3 is already using visual reasoning and that's what made it a breakthrough model.
thierrydamiba about 1 month ago

Excellent write-up.

The example you used to demonstrate is well done.

Such a simple way to describe the current issues.