LLMs can see and hear without any training

210 points by T-A, 21 days ago

17 comments

vessenes 21 days ago
I’ve read the paper and the skeptical comments here, to wit: it’s just an actor/critic pipeline by another name.

I’ll bite and say this is actually interesting — and the paper title is misleading.

What they’ve done here is hooked up a text-only LLM to multimodal critics, given it (mostly) an image diffusion generation task, and asked it to improve its prompting of the multimodal generation by getting a set of scores back.

This definitely works, based on their outputs. Which is to say, LLMs can, zero shot, with outside tool feedback, iteratively improve their prompting using only that tooling feedback.

Why is this interesting? Well, this did not work in the GPT-3 era; it seems to do so now. I see this as an interesting line to be added in the ‘model capabilities’ box as our models get larger and more sophisticated — the LLMs can perform some sort of internally guided search against a black box generator and use a black box scorer to improve at inference time.

That’s pretty cool. It’s also generalizable, and I think is worth keeping in mind on the stack of possible approaches for, say, agentic coding, that you can use a critic to not just ‘improve’ generated output, but most likely do some guided search through output space.
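A minimal sketch of the loop that comment describes, for concreteness. The three callables (llm_revise, generate_image, critic_score) are hypothetical stand-ins for the text-only LLM, the black-box generator, and the black-box multimodal critic; this is not the paper's code, just the shape of the idea.

    def refine_prompt(task, llm_revise, generate_image, critic_score, steps=5):
        """Iteratively improve a generation prompt using only scalar critic feedback."""
        prompt = task                              # start from the raw task description
        best_prompt, best_score = prompt, float("-inf")
        history = []                               # (prompt, score) pairs shown back to the LLM

        for _ in range(steps):
            image = generate_image(prompt)         # black-box generator, e.g. a diffusion model
            score = critic_score(task, image)      # black-box critic returns a single number
            history.append((prompt, score))
            if score > best_score:
                best_prompt, best_score = prompt, score
            # The LLM never sees the image: it only gets the task, its past prompts,
            # and their scores, and proposes a revised prompt for the next round.
            prompt = llm_revise(task, history)

        return best_prompt, best_score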
EncomLab 21 days ago
My photoresistor nightlight can "see" that it is dark and it "knows" to turn on the light - not only does it not have training, it does not have any code!

And if you think that is amazing, my bi-metallic strip thermostat "feels" the temperature and then modifies the environment because it "knows" if it's hot to turn on the A/C, and if it's cold to turn on the heat - no training or code!

All of this AI stuff is just unbelievably incredible - what a brave new world (of word games)!
nico 21 days ago
To people curious or skeptical if this could be called "seeing" or "hearing", I recommend listening to the Batman podcast episode on NPR (https://www.npr.org/2015/01/23/379134306/batman-pt-1).

Through the story and experience of a blind man, they end up getting into the question of what it means to see.

The podcast is pretty straightforward, but it does end up showing that defining "seeing" is a philosophical question, rather than a simple obvious answer.
scribu 21 days ago
This seems to be a system to generate better prompts to be fed into a base multimodal model.

Interesting, but title is definitely clickbait.
underdeserver 21 days ago
Paper: https://arxiv.org/pdf/2501.18096
viraptor 21 days ago
That looks like a classic Actor/Critic setup, yet it's not mentioned even once in the paper. Am I missing some large difference here?
JoBrad 21 days ago
Exactly how little training is "without any"? I'm assuming that companies haven't been spending billions trying to train LLMs to better understand things when they can do it without any training.
qgin 20 days ago
Emergent capabilities have been one of the wildest developments in software. Most traditional programmers learn quickly, and with great pain, that the computer only does what you explicitly program it to do, no more, no less, and that unintended behavior is a bug (and if you’re lucky, an accidental feature).

But the idea that entire abilities just emerge from scale… I still have a hard time accepting it.
robocop_legacy 21 days ago
I think there is potentially a powerful method here. Specifically, the optimal context for a given task can be saved and a meta-learner can be trained to map the task to the context. This would allow fine-tuning a model for some specific task without retraining the LLM. For example, generating an SEM image of some material with a specified porosity and grain size.
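A rough sketch of what that idea could look like, assuming a hypothetical embed function (any sentence encoder) and treating the "meta-learner" as a simple nearest-neighbour lookup over saved contexts; nothing here comes from the paper.

    import numpy as np

    class ContextMemory:
        """Maps a new task to the saved context of the most similar previously seen task."""

        def __init__(self, embed):
            self.embed = embed                   # hypothetical text -> vector encoder
            self.task_vecs, self.contexts = [], []

        def add(self, task, context):
            # Remember the optimized context (e.g. a refined prompt) found for this task.
            self.task_vecs.append(np.asarray(self.embed(task), dtype=float))
            self.contexts.append(context)

        def lookup(self, task):
            # Return the stored context whose task embedding is most similar (cosine).
            q = np.asarray(self.embed(task), dtype=float)
            sims = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
                    for v in self.task_vecs]
            return self.contexts[int(np.argmax(sims))]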
v01rt 21 days ago
"without training" *describes transfer learning with an actor / critic approach*
TheCoreh 21 days ago
Is the LLM essentially playing "Wordle" with an external system that rates the quality of its output, gradually climbing the score ladder until it produces good results?
sega_sai 21 days ago
The paper certainly contradicts my expectation from the title. I.e. it does not present an LLM that can generate images without any access to images before.
jagged-chisel 21 days ago
Computers can receive input without any programming. Not sure what’s interesting here.
alex1138 21 days ago
I just remember Zuck's comments about AI and how the idea of it dooming our species is a bit silly, etc.

This is the wrong approach to take. At minimum you have to say things like "well yes we're always on the lookout for this kind of thing". With him? Not a care in the world
gitroom 20 days ago
pretty cool seeing models get a bit smarter each time - always makes me wonder how much of this is luck vs real skill tbh
3rdworldeng 21 days ago
Find me Jose Monkey will do that too :-)
v-rt 21 days ago
"without training" *describes transfer learning*