> The central intuition in using T5 is that extremely large language models, by virtue of their sheer size alone, may still learn useful representations despite the fact that they are not explicitly trained with any text/image task in mind. [...] Therefore, the central question being addressed by this choice is whether or not a massive language model trained on a massive dataset independent of the task of image generation is a worthwhile trade-off for a non-specialized text encoder. The Imagen authors bet on the side of the large language model, and it is a bet that seems to pay off well.<p>The way out of this dilemma is to fine-tune T5 on the caption dataset instead of keeping it frozen. The paper notes that they don't fine-tune, but it doesn't provide any ablation or other justification for that choice. I wonder whether fine-tuning would help.
> is trained on hundreds of millions of images and their associated captions<p>So how do you get access to hundreds of millions of images and use them to create derivative works? Did they get consent from millions of authors?<p>Or is something like that only available to the rich, with lawyers on tap?<p>I mean, I can imagine that if a nobody wanted to do something like this, they'd be bankrupted by having to deal with all the photographers/artists spotting a tiny sliver of their art in an image produced by the model.<p>Furthermore, would something like this work with music? For instance, train the model on every song on Spotify and then generate songs from a prompt like "Get me a Bach symphony played on sticks with someone rapping like Dr Dre with a lisp."
Or does the music industry have enough money to bully anyone out of doing that?
Is there a comparison of Imagen and Parti anywhere? I realize the paper came out yesterday, but maybe other people remember what "autoregressive" means better than I do.
I have shown Imagen (and DALL-E 2) to a number of people now (non-tech, just everyday friends, family, co-workers), and I have been pretty stunned by the response I get from most of them:<p>"Meh, that's kinda cool? I guess?" or "What am I looking at?"..."Ok? So a computer made it? That seems neat"<p>Meanwhile, I am still trying to pick my jaw up off the floor from two months ago. But the responses have been so muted and shoulder-shrugging that I think either I am missing something or they are. Even when I really drill in, practically shaking them, "DO YOU NOT UNDERSTAND THAT THIS IS AN ORIGINAL IMAGE CONSTRUCTED ENTIRELY BY AN AI?!?!", people just seem to see it as a party trick at best.
> Imagen, released just last month, can generate high-quality, high-resolution images given only a description of a scene<p>“Released”? What? Papers are published. Websites are published. Tools are “released.”<p>Where has Imagen been released?
Wait, this isn't about the line of intelligent xeroxographic laser printers developed by Imagen Corporation in 1981, supporting the Impress printer language?<p><a href="https://tug.org/TUGboat/tb02-2/tb03imagen.pdf" rel="nofollow">https://tug.org/TUGboat/tb02-2/tb03imagen.pdf</a><p><a href="https://www.openprinting.org/driver/imagen" rel="nofollow">https://www.openprinting.org/driver/imagen</a>