I’ve read the paper and the skeptical comments here, to wit: it’s just an actor/critic pipeline by another name.<p>I’ll bite and say this is actually interesting, and that the paper title is misleading.<p>What they’ve done here is hook up a text-only LLM to multimodal critics, give it (mostly) an image diffusion generation task, and ask it to improve its prompting of the multimodal generator based on the scores it gets back.<p>This definitely works, based on their outputs. Which is to say, LLMs can, zero-shot, iteratively improve their prompting using only feedback from outside tools.<p>Why is this interesting? Well, this did not work in the GPT-3 era; it seems to work now. I see it as an interesting line to add to the ‘model capabilities’ box as our models get larger and more sophisticated: the LLM can perform some sort of internally guided search against a black-box generator, using a black-box scorer to improve at inference time.<p>That’s pretty cool. It’s also generalizable, and I think it’s worth keeping on the stack of possible approaches for, say, agentic coding: a critic can be used not just to ‘improve’ generated output, but most likely to do some guided search through output space.
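To make that concrete, here is a minimal sketch of the loop as I read it, not the authors’ actual implementation: the LLM proposes prompts, an opaque generator renders them, opaque critics return scalar scores, and the LLM conditions on the (prompt, score) history. The `propose`, `generate`, and `score` callables are hypothetical stand-ins for the text-only LLM, the diffusion model, and the multimodal critics.

```python
from typing import Callable, List, Tuple

def critic_guided_search(
    task: str,
    propose: Callable[[str, List[Tuple[str, float]]], str],  # text-only LLM: (task, history) -> new prompt
    generate: Callable[[str], bytes],                         # black-box generator: prompt -> image
    score: Callable[[str, bytes], float],                     # black-box critic(s): (task, image) -> score
    rounds: int = 5,
) -> Tuple[str, float]:
    """Refine a prompt iteratively, using only scalar feedback from the critics."""
    best_prompt, best_score = task, float("-inf")
    history: List[Tuple[str, float]] = []  # (prompt, score) pairs the LLM conditions on
    for _ in range(rounds):
        prompt = propose(task, history)   # LLM proposes a (hopefully better) prompt
        image = generate(prompt)          # opaque generator, e.g. a diffusion model
        s = score(task, image)            # opaque scorer; only this number flows back
        history.append((prompt, s))
        if s > best_score:
            best_prompt, best_score = prompt, s
    return best_prompt, best_score
```

Swap `generate` and `score` for, say, a code interpreter and a test suite and you get the agentic-coding variant of the same search.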
My photoresistor nightlight can "see" that it is dark and it "knows" to turn on the light - not only does it not have training, it does not have any code!<p>And if you think that is amazing, my bi-metallic strip thermostat "feels" the temperature and then modifies the environment because it "knows" if it's hot to turn on the A/C, and if it's cold to turn on the heat - no training or code!<p>All of this AI stuff is just unbelievably incredible - what a brave new world (of word games)!
To people curious or skeptical about whether this could be called “seeing” or “hearing”, I recommend listening to the Batman podcast episode on NPR (<a href="https://www.npr.org/2015/01/23/379134306/batman-pt-1" rel="nofollow">https://www.npr.org/2015/01/23/379134306/batman-pt-1</a>)<p>Through the story and experience of a blind man, it ends up getting into the question of what it means to see.<p>The podcast is pretty straightforward, but it does end up showing that defining “seeing” is a philosophical question rather than something with a simple, obvious answer.
Exactly how little training counts as "without any"? Presumably companies wouldn't have spent billions training LLMs to better understand things if the models could do it without any training.
Emergent capabilities have been one of the wildest developments in software. Most traditional programmers learn quickly, and with great pain, that the computer does only what you explicitly program it to do, no more, no less, and that unintended behavior is a bug (or, if you’re lucky, an accidental feature).<p>But the idea that entire abilities just emerge from scale… I still have a hard time accepting it.
I think there is potentially a powerful method here. Specifically, the optimal context for a given task can be saved, and a meta-learner can be trained to map tasks to contexts. This would allow fine-tuning a model for some specific task without retraining the LLM. For example, generating an SEM image of some material with a specified porosity and grain size.
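A minimal sketch of that idea, with everything here hypothetical: store the best context found for each task, embed tasks with some encoder, and use nearest-neighbour retrieval as a crude stand-in for the trained meta-learner.

```python
from typing import Callable, List, Sequence, Tuple

class ContextBank:
    """Store the best context found for each task; retrieve by task similarity.

    embed() is a hypothetical task encoder; nearest-neighbour lookup is a crude
    stand-in for a trained meta-learner mapping tasks to contexts."""

    def __init__(self, embed: Callable[[str], Sequence[float]]):
        self.embed = embed
        self.entries: List[Tuple[Sequence[float], str]] = []  # (task embedding, optimal context)

    def save(self, task: str, optimal_context: str) -> None:
        # Record the context that worked best for this task (e.g. found via a critic loop).
        self.entries.append((self.embed(task), optimal_context))

    def lookup(self, task: str) -> str:
        # Return the stored context of the most similar known task.
        q = self.embed(task)
        def sq_dist(v: Sequence[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(q, v))
        return min(self.entries, key=lambda e: sq_dist(e[0]))[1]
```

So you might save the prompt/context that once produced the right porosity and grain size, then look it up for similar SEM requests instead of searching from scratch.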
Is the LLM essentially playing "Wordle" with an external system that rates the quality of its output, gradually climbing the score ladder until it produces good results?
The paper certainly contradicts my expectation from the title; that is, it does not present an LLM that can generate images without ever having had access to images.
I just remember Zuck's comments about AI and how the idea of it dooming our species is a bit silly, etc.<p>This is the wrong approach to take. At minimum you have to say things like "well, yes, we're always on the lookout for this kind of thing". With him? Not a care in the world.