Interesting to me that this one can draw legible text. DALLE models seem to generate weird glyphs that only look like text. The examples they show here have perfectly legible characters and correct spelling. The difference between this and DALLE makes me suspicious / curious. I wish I could play with this model.
I know that some monstrous majority of cognitive processing is visual, hence the attention these visually creative models are rightfully getting, but personally I am much more interested in auditory information and would love to see a promptable model for music. Was just listening to "Land Down Under" from Men At Work. Would love to be able to prompt for another artist I have liked: "Tricky playing Land Down Under." I know of various generative music projects, going back decades, and would appreciate pointers, but as far as I am aware we are still some ways from Imagen/Dalle for music?
Interesting discovery they made<p>> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.<p>There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.
Would be fascinated to see the DALL-E output for the same prompts as the ones used in this paper. If you've got DALL-E access and can try a few, please put links as replies!
Can anybody give me short high-level explanation how the model achieves these results? I'm especially interested in the image synthesis, not the language parsing.<p>For example, what kind of source images are used for the snake made of corn[0]? It's baffling to me how the corn is mapped to the snake body.<p>[0] <a href="https://gweb-research-imagen.appspot.com/main_gallery_images/corn-snake-on-farm.jpg" rel="nofollow">https://gweb-research-imagen.appspot.com/main_gallery_images...</a>
As someone who has a layman's understanding of neural networks, and who did some neural network programming ~20 years ago before the real explosion of the field, can someone point to some resources where I can get a better understanding about how this magic works?<p>I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing. Just looking for more information about how the software actually works, even if there are big chunks of it that are "this is beyond your understanding without taking some in-depth courses".
>While we leave an in-depth empirical analysis of social and cultural biases to future work, our small scale internal assessments reveal several limitations that guide our decision not to release our model at this time.<p>Some of the reasoning:<p>>Preliminary assessment also suggests Imagen encodes several social biases and stereotypes, including an overall bias towards generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes. Finally, even when we focus generations away from people, our preliminary analysis indicates Imagen encodes a range of social and cultural biases when generating images of activities, events, and objects. We aim to make progress on several of these open challenges and limitations in future work.<p>Really sad that breakthrough technologies are going to be withheld due to our inability to cope with the results.
One thing that no one predicted in AI development was how good it would become at some completely unexpected tasks while being not so great at the ones we supposed/hoped it would be good at.<p>AI was expected to grow like a child: somehow blurting out things that would show some increasing understanding on a deep level, but with poor syntax.<p>In fact we get the exact opposite. AI is creating texts that are syntactically correct and very decently articulated, and pictures that are insanely good.<p>And these texts and images are created from a text prompt?! There is no way to interface with the model other than by freeform text. That is so weird to me.<p>Yet it doesn’t feel intelligent at all at first. You can’t ask it to draw “a chess game with a puzzle where white mates in 4 moves”.<p>Yet sometimes GPT makes very surprising inferences, and it starts to feel like there is something going on at a deeper level.<p>DeepMind’s AlphaXxx models are more in line with how I expected things to go: software that gets good at expert tasks that we as humans are too limited to handle.<p>Where it’s headed, we don’t know. But I bet it’s going to be difficult to tell the “intelligence” from the “varnish”.
It’s terrifying that all of these models are one Colab notebook away from unleashing unlimited, disastrous imagery on the internet. At least some companies are starting to realize this and are not releasing the source code. However, they always manage to write a scientific paper and blog post detailing the exact process to create the model, so it will eventually be recreated by a third party.<p>Meanwhile, Nvidia sees no problem with yeeting out StyleGAN and models that allow real humans to be realistically turned into animated puppets in 3D space. The inevitable end result of these scientific achievements will be orders of magnitude worse than deepfakes.<p>Oh, or a panda wearing sunglasses, in the desert, digital art.
I apologize in advance for the elitist-sounding tone. In my defense, the people I’m calling elite I have nothing to do with; I’m certainly not talking about myself.<p>Without a fairly deep grounding in this stuff it’s hard to appreciate how far ahead Brain and DM are.<p>Neither OpenAI nor FAIR <i>ever has the top score on anything unless Google delays publication</i>. And short of FAIR? D2 lacrosse. There are exceptions to such a brash generalization, NVIDIA’s group comes to mind, but it’s a very good rule of thumb (or of your whole face, the next time you are tempted to doze behind the wheel of a Tesla).<p>There are two big reasons for this:<p>- the talent wants to work with the other talent, and through a combination of foresight and deep pockets Google got that exponent on their side right around the time NVIDIA cards started breaking ImageNet. Winning the Hinton bidding war clinched it.<p>- the current approach of “how many Falcon Heavy launches’ worth of TPU can I throw at the same basic masked attention with residual feedback and a cute Fourier coloring” inherently favors deep pockets, and obviously MSFT, sorry, OpenAI has that, but deep pockets also non-linearly scale outcomes when you’ve got in-house hardware for mixed-precision multiplies.<p>Now clearly we’re nowhere close to Maxwell’s Demon on this stuff, and sooner or later some bright spark is going to break the logjam of needing 10-100MM in compute to squeeze a few points out of a language benchmark. But the incentives are weird here: who, exactly, does it serve for us plebs to be able to train these things from scratch?
I have to wonder how much releasing these models will "poison the well" and fill the internet with AI generated images that make training an improved model difficult. After all if every 9/10 "oil painted" image online starts being from these generative models it'll become increasingly difficult to scrape the web and to learn from real world data in a variety of domains. Essentially once these things are widely available the internet will become harder to scrape for good data and models will start training on their own output. The internet will also probably get worse for humans since search results will be completely polluted with these "sort of realistic" images which can ultimately be spit out at breakneck speed by smashing words from a dictionary together...
Generating at 64x64px and then upscaling probably gives the model a substantial performance boost (training speed/convergence) compared to working at 256x256 or 1024x1024 like DALL-E 2. Perhaps that cascaded approach to AI-generated art is the future (toy sketch below).
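A minimal sketch of what that cascade could look like, with made-up stand-in modules (the DummyDenoiser, the update rule, and the random text embedding are all placeholders, not Imagen's actual architecture or sampler):

```python
# Toy cascaded text-to-image pipeline: 64x64 base model + two super-resolution
# stages. Everything here is a hypothetical stand-in for illustration only.
import torch
import torch.nn.functional as F

class DummyDenoiser(torch.nn.Module):
    """Hypothetical stand-in for a text-conditioned diffusion denoiser."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = torch.nn.Conv2d(in_channels, 3, kernel_size=3, padding=1)

    def forward(self, x, text_emb):
        # A real model would inject text_emb (e.g. via cross-attention); ignored here.
        return self.net(x)

def sample(denoiser, shape, text_emb, low_res=None, steps=4):
    """Toy reverse-diffusion loop: start from noise and iteratively refine.
    A super-resolution stage additionally conditions on the upsampled output
    of the previous stage."""
    x = torch.randn(shape)
    cond = None
    if low_res is not None:
        cond = F.interpolate(low_res, size=shape[-2:], mode="bilinear",
                             align_corners=False)
    for _ in range(steps):
        inp = x if cond is None else torch.cat([x, cond], dim=1)
        x = x - 0.1 * denoiser(inp, text_emb)  # placeholder update rule
    return x

text_emb = torch.randn(1, 512)       # stand-in for a frozen text encoder output
base = DummyDenoiser(in_channels=3)  # 64x64 base model
sr_a = DummyDenoiser(in_channels=6)  # 64 -> 256 super-resolution
sr_b = DummyDenoiser(in_channels=6)  # 256 -> 1024 super-resolution

x64 = sample(base, (1, 3, 64, 64), text_emb)
x256 = sample(sr_a, (1, 3, 256, 256), text_emb, low_res=x64)
x1024 = sample(sr_b, (1, 3, 1024, 1024), text_emb, low_res=x256)
print(x1024.shape)  # torch.Size([1, 3, 1024, 1024])
```

The point is just that the expensive base model only ever sees 64x64 tensors, while the cheaper super-resolution stages condition on the upsampled output of the stage before them.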
I thought I was doing well after not being overly surprised by DALL-E 2 or Gato. How am I still not calibrated on this stuff? I know I am meant to be the one who constantly argues that language models already have sophisticated semantic understanding, and that you don't need visual senses to learn grounded world knowledge of this sort, but come on, you don't get to just throw T5 in a multimodal model as-is and have it work better than multimodal transformers! VLM[1] at least added fine-tuned <i>internal components</i>.<p>Good lord we are screwed. And yet somehow I bet even this isn't going to kill off the <i>they're just statistical interpolators</i> meme.<p>[1] <a href="https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model" rel="nofollow">https://www.deepmind.com/blog/tackling-multiple-tasks-with-a...</a>
Is there a way to try this out? DALL-E 2 also had amazing demos, but the limitations became apparent once real people had a chance to run their own queries.
The big thing I’m noticing over DALL-E is that it seems to be better at relative positioning. In an MKBHD video about DALL-E, it would get the elements but not always in the right order. I know Google curated some specific images, but it seems to be doing a better job there.
Does it do partial image reconstruction like DALL-E 2, where you cut out part of an existing image and the neural network fills it back in (rough sketch of the idea below)?<p>I believe this type of content generation will be the next big thing, or at least one of them. But people will want some customization to make their pictures “unique” and to fix AI’s lack of creativity and its various other shortcomings, plus edit out the remaining lapses in logic/object separation (of which there are some even in the given examples).<p>Still, being able to create arbitrary stock photos is really useful, and I bet these will flood small / low-budget projects.
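For what it's worth, the usual trick behind that kind of fill-in is to let the model iteratively generate the whole canvas while clamping the known (unmasked) pixels back to the original at every step. A toy sketch of the idea, with a made-up denoise_step placeholder (real diffusion inpainting also re-noises the known region to match the current step, which this skips):

```python
# Toy illustration of mask-based fill-in ("inpainting") with an iterative
# generative model. `denoise_step` is a hypothetical placeholder, not the
# actual DALL-E 2 or Imagen method.
import torch

def inpaint(image, mask, denoise_step, steps=50):
    """image: (1, 3, H, W) original; mask: (1, 1, H, W) with 1 = regenerate."""
    x = torch.randn_like(image)
    for _ in range(steps):
        x = denoise_step(x)                # model refines the whole canvas...
        x = mask * x + (1 - mask) * image  # ...but known pixels are pasted back
    return x

# Usage with a do-nothing stand-in for the model:
img = torch.rand(1, 3, 64, 64)
msk = torch.zeros(1, 1, 64, 64)
msk[..., 16:48, 16:48] = 1.0               # square region to fill in
out = inpaint(img, msk, denoise_step=lambda x: 0.9 * x)
print(out.shape)  # torch.Size([1, 3, 64, 64])
```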
Really impressive. If we are able to generate such detailed images, is there anything similar for text to music? I would have thought that it would be simpler to achieve than text to image.
When will there be a "DALL-E for porn"? Or is this domain also claimed by Puritans and morality gatekeepers? The most in-demand text-to-image use case is porn.
Probably just a frontend coding mistake, and not an error in the model, but in the interactive example if you select:<p>"A photo of a
Shiba Inu dog
Wearing a (sic) sunglasses
And black leather jacket
Playing guitar
In a garden"<p>The Shiba Inu is not playing a guitar.
Off topic, but this caught my attention:<p>“In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.”<p>I work for a big org myself, and I’ve wondered what it is exactly that makes people in big orgs so bad at saying things.
It really does look better than DALL-E, at least from the images on the site. Hard to believe how quickly progress is being made to lucid dreaming while awake.
All of these AI findings are cool in theory. But until it's accessible to some decent number of people/customers, it's basically useless fluff.<p>You can tell me those pictures are generated by an AI and I might believe it, but until real people can actually test it... it's easy enough to fake. This page isn't even the remotest bit legit judging by the URL; it looks nicely put together and that's about it. They could have easily put this together with a graphic designer to fake it.<p>Let me be clear: I'm not actually saying it's fake. Just that all of these new "cool" things are more or less theoretical if nothing is getting released.
I was reading a relatively recent machine-learning paper from some elite source, and after multiple repetitions of bragging and puffery, in the middle of the paper the charts showed that they had beaten the score of a high-ranking algorithm in their specific domain, moving the best consistent result from 86% accuracy to 88%, somewhere around there. My response: they got a lot of attention within their world by beating the previous score, no matter how small the improvement was; it was a "winner take all" competition against other teams close to them; accuracy of less than 90% is really of questionable value in a lot of real-world problems; and it was an enormous amount of math and effort for this team to make that small improvement.<p>What I see is a semi-poverty mindset among very smart people who appear to be treated in a way such that the winners get promoted and everyone else is fired. This sort of ML analysis is useful for massive data sets at scale, where 90% is a lot of accuracy, but not at all for the small sets of real-world, human-scale problems where each result may matter a lot. The years of training that these researchers had to go through to participate in this apparently ruthless environment are certainly like a lottery ticket, if you are in fact in a game where everyone but the winner has to find a new line of work. I think their masters live in Redmond, if I recall... not looking it up at the moment.
Used some of the same prompts and generated results with open source models; the model I am using fails on long prompts but does well on short and descriptive prompts. Results:<p><a href="https://imgur.com/gallery/6qAK09o" rel="nofollow">https://imgur.com/gallery/6qAK09o</a>
Interesting and cool technology, but I can't ignore that every high-quality AI art application is always closed, and I don't buy the ethics excuse for that. The same was said for GPT, yet I see nothing but creativity coming out from its users nowadays.
Would it be bad to release this with a big warning and flashing gifs letting people know of the issues it has and note that they are working to resolve them / ask for feedback / mention difficulties related to resolving the issues they identified?
Nice to see another company making progress in the area. I'd love to see more examples of different artistic styles though, my favorite DALL-E images are the ones that look like drawings.
I find it a bit disturbing that they talk about the social impact of totally imaginary pictures of raccoons.<p>Of course, working in a golden lab at Google may twist your views on society.
Primarily Indian origin authors on both the DALL-E and this research paper. Just found that impressive considering they make up 1% of the population in the US.
It seems to have the same "adjectives bleed into everything problem" that Dall-E does.<p>Their slider with examples at the top showed a prompt along the lines of "a chrome plated duck with a golden beak confronting a turtle in a forest" and the resulting image was perfect - except the turtle had a golden shell.
What's the limiting factor for model replication by others? Amount of compute? Model architecture? Quality/quantity of training data? Would really appreciate insights on the subject.
Also, almost 40 years ago, the name of a laser printer capable of 200 dpi.<p>Almost there, the Apple Laserwriter nailed it at 300 dpi.<p>Sometimes sneaked an issue of the "SF-Lovers Digest" in between code printouts.
This would generate great music videos for Bob Dylan songs. I'm thinking Gates of Eden, ".. upon four legged forest cloud, the cowboy angel rides" :D
I'm curious why all of these tools seem to be almost tailored toward making meme images?<p>The kind of early-2010s, over-the-top description of something that's ridiculous.
Next phase of all this: image to 3D-printable template files compatible with various market-available printers.<p>Print me a raccoon in a leather jacket riding a skateboard.
I get the impression that maybe DALL-E 2 produces slightly more diverse images? Compare Figure 2 in this paper with Figures 18-20 in the DALL-E 2 paper.
Seeing the artificial restrictions on this model as well as on DALL-E 2, I can't help but ask myself why the porn industry isn't driving its own research. Given the size of that industry and the sheer abundance of training material, it seems just a matter of time until you can create photorealistic images of yourself with your favourite celebrity for a small fee. Is there anything I am missing? Can you only do this kind of research at Google or OpenAI scale?
Metaculus, a mass forecasting site, has steadily brought forward its prediction date for a weakly general AI. Jaw-dropping advances like this only increase my confidence in that prediction. "The future is now, old man."<p><a href="https://www.metaculus.com/questions/3479/date-weakly-general-ai-system-is-devised/" rel="nofollow">https://www.metaculus.com/questions/3479/date-weakly-general...</a>
Is there anything at all, besides the training images and labels, that would stop this from generating a convincing response to "A surveillance camera image of Jared Kushner, Vladimir Putin, and Alexandria Ocasio-Cortez naked on a sofa. Jeffrey Epstein is nearby, snorting coke off the back of Elvis"?
This competitor might be better at respecting spatial prepositions and at photorealism, but on a quick look I find the images more uncanny.
DALL-E has, IMHO, better camera POV/distance and is able to make artistic/dreamy/beautiful images. I haven't yet seen this Google model be competitive on art and uncanniness.
However, progress is great and I might be wrong.
Note that there was a close model in 2021 that was ignored by all:
<a href="https://paperswithcode.com/sota/text-to-image-generation-on-coco" rel="nofollow">https://paperswithcode.com/sota/text-to-image-generation-on-...</a> (on this benchmark)
Also, what is the score of DALL-E 2?
Hey I also wrote a neural net that generates perfect images. Here's a static site about it. With images it definitely generated! Can you use it? Is there a source? Hah, of course not, because ethics!