
Imagen, a text-to-image diffusion model

988 points by keveman almost 3 years ago

72 comments

ALittleLight almost 3 years ago
Interesting to me that this one can draw legible text. DALLE models seem to generate weird glyphs that only look like text. The examples they show here have perfectly legible characters and correct spelling. The difference between this and DALLE makes me suspicious / curious. I wish I could play with this model.
jonahbenton almost 3 years ago
I know that some monstrous majority of cognitive processing is visual, hence the attention these visually creative models are rightfully getting, but personally I am much more interested in auditory information and would love to see a promptable model for music. Was just listening to "Land Down Under" from Men At Work. Would love to be able to prompt for another artist I have liked: "Tricky playing Land Down Under." I know of various generative music projects, going back decades, and would appreciate pointers, but as far as I am aware we are still some ways from Imagen/Dalle for music?
visarga almost 3 years ago
Interesting discovery they made:

> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.

There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.
benwikler almost 3 years ago
Would be fascinated to see the DALL-E output for the same prompts as the ones used in this paper. If you've got DALL-E access and can try a few, please put links as replies!
geonic almost 3 years ago
Can anybody give me a short, high-level explanation of how the model achieves these results? I'm especially interested in the image synthesis, not the language parsing.

For example, what kind of source images are used for the snake made of corn[0]? It's baffling to me how the corn is mapped to the snake body.

[0] https://gweb-research-imagen.appspot.com/main_gallery_images/corn-snake-on-farm.jpg
hn_throwaway_99 almost 3 years ago
As someone who has a layman's understanding of neural networks, and who did some neural network programming ~20 years ago before the real explosion of the field, can someone point to some resources where I can get a better understanding of how this magic works?

I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing. Just looking for more information about how the software actually works, even if there are big chunks of it that are "this is beyond your understanding without taking some in-depth courses".
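[Editor's note] The high-level answer to the question above is denoising diffusion: a network is trained to predict the noise that was mixed into an image, and generation runs that process in reverse, starting from pure noise. A minimal NumPy sketch of the forward (noising) process follows; the toy 1-D "image", schedule values, and variable names are illustrative choices, not taken from the Imagen paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "image" standing in for a real 64x64x3 image.
x0 = np.sin(np.linspace(0, 2 * np.pi, 64))

T = 1000                               # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative fraction of signal kept

def add_noise(x0, t):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Early steps are barely noisy; late steps are almost pure noise.
xt_early, _ = add_noise(x0, 10)
xt_late, _ = add_noise(x0, 999)

# Training: a U-Net sees (xt, t, text embedding) and must predict eps.
# Sampling runs in reverse, repeatedly subtracting the predicted noise.
```

The text prompt enters only as a conditioning signal to the noise-prediction network; everything else is this simple noising/denoising loop repeated many times.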
daenz almost 3 years ago
> While we leave an in-depth empirical analysis of social and cultural biases to future work, our small scale internal assessments reveal several limitations that guide our decision not to release our model at this time.

Some of the reasoning:

> Preliminary assessment also suggests Imagen encodes several social biases and stereotypes, including an overall bias towards generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes. Finally, even when we focus generations away from people, our preliminary analysis indicates Imagen encodes a range of social and cultural biases when generating images of activities, events, and objects. We aim to make progress on several of these open challenges and limitations in future work.

Really sad that breakthrough technologies are going to be withheld due to our inability to cope with the results.
andybak almost 3 years ago
Great. Now even if I do get a Dall-E 2 invite I'll still feel like I'm missing out!
throwaway743 almost 3 years ago
https://github.com/lucidrains/imagen-pytorch
d--b almost 3 years ago
One thing that no one predicted in AI development was how good it would become at some completely unexpected tasks while being not so great at the ones we supposed/hoped it would be good at.

AI was expected to grow like a child: somehow blurting out things that showed increasing understanding on a deep level, but with poor syntax.

In fact we got the exact opposite. AI is creating texts that are syntactically correct and very decently articulated, and pictures that are insanely good.

And these texts and images are created from a text prompt?! There is no way to interface with the model other than by freeform text. That is so weird to me.

Yet it doesn't feel intelligent at all at first. You can't ask it to draw "a chess game with a puzzle where white mates in 4 moves".

Yet sometimes GPT makes very surprising inferences. And it starts to feel like there is something going on at a deeper level.

DeepMind's AlphaXxx models are more in line with how I expected things to go: software that gets good at expert tasks that we as humans are too limited to handle.

Where it's headed, we don't know. But I bet it's going to be difficult to tell the "intelligence" from the "varnish".
beeskneecaps almost 3 years ago
It's terrifying that all of these models are one Colab notebook away from unleashing unlimited, disastrous imagery on the internet. At least some companies are starting to realize this and are not releasing the source code. However, they always manage to write a scientific paper and blog post detailing the exact process to create the model, so it will eventually be recreated by a third party.

Meanwhile, Nvidia sees no problem with yeeting StyleGAN and models that allow real humans to be realistically turned into animated puppets in 3D space. The inevitable end result of these scientific achievements will be orders of magnitude worse than deepfakes.

Oh, or a panda wearing sunglasses, in the desert, digital art.
benreesman almost 3 years ago
I apologize in advance for the elitist-sounding tone. In my defense, the people I'm calling elite I have nothing to do with; I'm certainly not talking about myself.

Without a fairly deep grounding in this stuff it's hard to appreciate how far ahead Brain and DM are.

Neither OpenAI nor FAIR ever has the top score on anything unless Google delays publication. And short of FAIR? D2 lacrosse. There are exceptions to such a brash generalization, NVIDIA's group comes to mind, but it's a very good rule of thumb. Or your whole face, the next time you are tempted to doze behind the wheel of a Tesla.

There are two big reasons for this:

- The talent wants to work with the other talent, and through a combination of foresight and deep pockets Google got that exponent on their side right around the time NVIDIA cards started breaking ImageNet. Winning the Hinton bidding war clinched it.

- The current approach of "how many Falcon Heavy launches' worth of TPU can I throw at the same basic masked attention with residual feedback and a cute Fourier coloring" inherently favors deep pockets, and obviously MSFT, sorry, OpenAI has that, but deep pockets also non-linearly scale outcomes when you've got in-house hardware for multiply-mixed precision.

Now clearly we're nowhere close to Maxwell's demon on this stuff, and sooner or later some bright spark is going to break the logjam of needing 10-100MM in compute to squeeze a few points out of a language benchmark. But the incentives are weird here: who, exactly, does it serve for us plebs to be able to train these things from scratch?
qz_kb almost 3 years ago
I have to wonder how much releasing these models will "poison the well" and fill the internet with AI-generated images that make training an improved model difficult. After all, if 9 out of every 10 "oil painted" images online start being from these generative models, it'll become increasingly difficult to scrape the web and to learn from real-world data in a variety of domains. Essentially, once these things are widely available, the internet will become harder to scrape for good data and models will start training on their own output. The internet will also probably get worse for humans, since search results will be completely polluted with these "sort of realistic" images, which can ultimately be spit out at breakneck speed by smashing words from a dictionary together...
minimaxir almost 3 years ago
Generating at 64x64px and then upscaling probably gives the model a substantial performance boost (training speed/convergence) compared with working at 256x256 or 1024x1024 like DALL-E 2. Perhaps that approach to AI-generated art is the future.
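[Editor's note] The cascade described above can be pictured as a pipeline: a base text-conditional model samples at 64x64, then super-resolution stages grow the image to 256 and 1024. The sketch below only illustrates the data flow; the stage functions are stand-ins I wrote (random data and nearest-neighbour upsampling in place of real diffusion models), not Imagen's actual code.

```python
import numpy as np

def base_sample(size=64):
    # Stand-in for the 64x64 text-conditional diffusion sampler,
    # where the expensive, semantically hard sampling happens.
    return np.random.default_rng(0).random((size, size, 3))

def sr_stage(img, out_size):
    # Stand-in for a text-conditional super-resolution diffusion stage.
    # A real stage denoises at the higher resolution, conditioned on the
    # low-res image and the text embedding; here we just upsample.
    factor = out_size // img.shape[0]
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

img = base_sample(64)      # base generation at 64x64
img = sr_stage(img, 256)   # 64 -> 256
img = sr_stage(img, 1024)  # 256 -> 1024
print(img.shape)           # (1024, 1024, 3)
```

The efficiency argument is visible in the shapes: the base model works on 64*64 = 4,096 pixels per step instead of 1024*1024 = 1,048,576, and the upscalers only have to add detail, not invent composition.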
Veedrac almost 3 years ago
I thought I was doing well after not being overly surprised by DALL-E 2 or Gato. How am I still not calibrated on this stuff? I know I am meant to be the one who constantly argues that language models already have sophisticated semantic understanding, and that you don't need visual senses to learn grounded world knowledge of this sort, but come on, you don't get to just throw T5 into a multimodal model as-is and have it work better than multimodal transformers! VLM[1] at least added fine-tuned internal components.

Good lord, we are screwed. And yet somehow I bet even this isn't going to kill off the "they're just statistical interpolators" meme.

[1] https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
jandrese almost 3 years ago
Is there a way to try this out? DALL-E 2 also had amazing demos, but the limitations became apparent once real people had a chance to run their own queries.
manchmalscott almost 3 years ago
The big thing I'm noticing over DALL-E is that it seems to be better at relative positioning. In an MKBHD video about DALL-E, it would get the elements but not always in the right order. I know Google curated some specific images, but it seems to be doing a better job there.
armchairhacker almost 3 years ago
Does it do partial image reconstruction like DALL-E 2? Where you cut out part of an existing image and the neural network fills it back in.

I believe this type of content generation will be the next big thing, or at least one of them. But people will want some customization to make their pictures "unique" and fix AI's lack of creativity and other various shortcomings. Plus edit out the remaining lapses in logic/object separation (of which there are some even in the given examples).

Still, being able to create arbitrary stock photos is really useful, and I bet these will flood small / low-budget projects.
endisneigh almost 3 years ago
I give it a few years before Google makes stock images irrelevant.
y04nn almost 3 years ago
Really impressive. If we are able to generate such detailed images, is there anything similar for text-to-music? I would have thought it would be simpler to achieve than text-to-image.
tomatowurst almost 3 years ago
When will there be a "DALL-E for porn"? Or is this domain also claimed by Puritans and morality gatekeepers? The most in-demand text-to-image use case is porn.
codemonkey-zeta almost 3 years ago
Probably just a frontend coding mistake, and not an error in the model, but in the interactive example if you select:

"A photo of a Shiba Inu dog Wearing a (sic) sunglasses And black leather jacket Playing guitar In a garden"

the Shiba Inu is not playing a guitar.
hahajk almost 3 years ago
Off topic, but this caught my attention:

"In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access."

I work for a big org myself, and I've wondered what it is exactly that makes people in big orgs so bad at saying things.
neolander almost 3 years ago
It really does look better than DALL-E, at least from the images on the site. Hard to believe how quickly progress is being made to lucid dreaming while awake.
Jyaif almost 3 years ago
Jesus Christ. Unlike DALL-E 2, it gets the details right. It also can generate text. The quality is insanely good. This is absolutely mental.
ShakataGaNai almost 3 years ago
All of these AI findings are cool in theory. But until it's accessible to some decent number of people/customers, it's basically useless fluff.

You can tell me those pictures are generated by an AI and I might believe it, but until real people can actually test it... it's easy enough to fake. This page isn't even the remotest bit legit by the URL; it looks nicely put together and that's about it. Could have easily put this together with a graphic designer to fake it.

Let me be clear: I'm not actually saying it's fake. Just that all of these new "cool" things are more or less theoretical if nothing is getting released.
colinmhayes almost 3 years ago
I wondered why all the pictures at the top had sunglasses on, then I saw a couple with eyes. Still some work to do on this one.
mistrial9 almost 3 years ago
Reading a relatively recent machine learning paper from some elite source: after multiple repetitions of bragging and puffery, in the middle of the paper, the charts show that they had beaten the score of a high-ranking algorithm in their specific domain, moving the best consistent result from roughly 86% accuracy to 88%. My response: they got a lot of attention within their world by beating the previous score, no matter how small the improvement; it was a "winner take all" competition against other teams close to them; accuracy of less than 90% is really of questionable value in a lot of real-world problems; and it took this team an enormous amount of math and effort to make that small improvement.

What I see is a semi-poverty mindset among very smart people who appear to be treated such that the winners get promoted and everyone else is fired. This sort of ML analysis is useful for massive data sets at scale, where 90% is a lot of accuracy, but not for the small sets of real-world, human-scale problems where each result may matter a lot. The years of training these researchers had to go through, to participate in this apparently ruthless environment, are certainly like a lottery ticket, if you are in fact in a game where everyone but the winner has to find a new line of work. I think their masters live in Redmond, if I recall; not looking it up at the moment.
discmonkey almost 3 years ago
For people complaining that they can't play with the model... I work at Google and I also can't play with the model :'(
dr_dshiv almost 3 years ago
How the fck are things advancing so fast? Is it about to level off …or extend to new domains? What’s a comparable set of technical advances?
FargaColora almost 3 years ago
This looks incredible, but I do notice that all the images are of a similar theme. Specifically, there are no human figures.
fortran77 almost 3 years ago
> At this time we have decided not to release code or a public demo.

Oh well.
rishabhjain almost 3 years ago
Used some of the same prompts and generated results with open source models. The model I am using fails on long prompts but does well on short and descriptive prompts. Results:

https://imgur.com/gallery/6qAK09o
davikr almost 3 years ago
Interesting and cool technology, but I can't seem to ignore that every high-quality AI art application is always closed, and I don't buy the ethics excuse for that. The same was said for GPT, yet I see nothing but creativity coming out from its users nowadays.
spyremeown almost 3 years ago
Jesus, this is so awesome. I think it’s the first AI that really makes me have that “wow” sensation.
alimov almost 3 years ago
Would it be bad to release this with a big warning and flashing gifs letting people know of the issues it has, noting that they are working to resolve them, asking for feedback, and mentioning the difficulties related to resolving the issues they identified?
londons_explore almost 3 years ago
> Figure 2: Non-cherry picked Imagen samples

Hooray! Non-cherry-picked samples should be the norm.
the__alchemist almost 3 years ago
I'll be skeptical until I see it in action, as opposed to pre-selected results.
sexy_panda almost 3 years ago
Would I have to implement this myself, or is there something ready to run?
addajones almost 3 years ago
This is absolutely amazingly insane. Wow.
shannifin almost 3 years ago
Nice to see another company making progress in the area. I'd love to see more examples of different artistic styles, though; my favorite DALL-E images are the ones that look like drawings.
wiz21c almost 3 years ago
I find it a bit disturbing that they talk about the social impact of totally imaginary pictures of raccoons.

Of course, working in a golden lab at Google may twist your views on society.
bergenty almost 3 years ago
Primarily Indian-origin authors on both the DALL-E and this research paper. Just found that impressive considering they make up 1% of the population in the US.
davelondon almost 3 years ago
I&#x27;M SQUEEZING MY PAPER!
ComputerGuru almost 3 years ago
It seems to have the same "adjectives bleed into everything" problem that DALL-E does.

Their slider with examples at the top showed a prompt along the lines of "a chrome-plated duck with a golden beak confronting a turtle in a forest", and the resulting image was perfect, except the turtle had a golden shell.
Reiden almost 3 years ago
What's the limiting factor for model replication by others? Amount of compute? Model architecture? Quality/quantity of training data? Would really appreciate insights on the subject.
B1FF_PSUVM almost 3 years ago
"Imagen" was also, almost 40 years ago, the name of a laser printer capable of 200 dpi. Almost there; the Apple LaserWriter nailed it at 300 dpi.

Sometimes I sneaked an issue of the "SF-Lovers Digest" in between code printouts.
faizshah almost 3 years ago
What's the best open-source or pre-trained text-to-image model?
t0mk almost 3 years ago
This would generate great music videos for Bob Dylan songs. I'm thinking Gates of Eden: ".. upon four legged forest cloud, the cowboy angel rides" :D
syspec almost 3 years ago
I'm curious why all of these tools seem to be almost tailored toward making meme images? The kind of early-2010s, over-the-top description of something that's ridiculous.
rhacker almost 3 years ago
Next phase of all this: image to 3D-printable template files compatible with various market-available printers.

Print me a raccoon in a leather jacket riding a skateboard.
ma2rten almost 3 years ago
I get the impression that maybe DALL-E 2 produces slightly more diverse images? Compare Figure 2 in this paper with Figures 18-20 in the DALL-E 2 paper.
causi almost 3 years ago
One thing I find particularly fascinating is that all the elements of the resulting image have a relatively cohesive art style.
braingenious almost 3 years ago
This is super cool and I want to play with it.
xyzal almost 3 years ago
Tangentially related question: what is the best (~latest?) such network uploaded to Colab that one can toy with?
butz almost 3 years ago
Looking at the example pictures, it seems this model has trouble putting sunglasses on a raccoon.
octocop almost 3 years ago
Would be awesome to see a side-by-side comparison with DALL-E, generating from the same text.
planb almost 3 years ago
Seeing the artificial restrictions on this model as well as on DALL-E 2, I can't help but ask myself why the porn industry isn't driving its own research. Given the size of that industry and the sheer abundance of training material, it seems just a matter of time until you can create photo-realistic images of yourself with your favourite celebrity for a small fee. Is there anything I am missing? Can you only do this kind of research at Google or OpenAI scale?
james-redwood almost 3 years ago
Metaculus, a mass forecasting site, has steadily brought forward the predicted date for a weakly general AI. Jaw-dropping advances like this only increase my confidence in this prediction. "The future is now, old man."

https://www.metaculus.com/questions/3479/date-weakly-general-ai-system-is-devised/
Mo3 almost 3 years ago
Is the source in the public domain already?
SnowHill9902 almost 3 years ago
Could there exist a quine for this?
jeffbee almost 3 years ago
Is there anything at all, besides the training images and labels, that would stop this from generating a convincing response to "A surveillance camera image of Jared Kushner, Vladimir Putin, and Alexandria Ocasio-Cortez naked on a sofa. Jeffrey Epstein is nearby, snorting coke off the back of Elvis"?
anoncow almost 3 years ago
Another ad for DALL-E.
iuppiter almost 3 years ago
so cool
ml_basics almost 3 years ago
Why is this seemingly official Google blog post on this random non-Google domain?
SemanticStrengh almost 3 years ago
This competitor might be better at respecting spatial prepositions and photorealism, but on a quick look I find the images more uncanny. DALL-E has, IMHO, better camera POV/distance and is able to make artistic/dreamy/beautiful images. I haven't yet seen this Google model be competitive for art and uncanniness. However, progress is great and I might be wrong.
SemanticStrengh almost 3 years ago
Note that there was a close model in 2021, ignored by all: https://paperswithcode.com/sota/text-to-image-generation-on-coco (on this benchmark). Also, what is the score of DALL-E 2?
SemanticStrengh almost 3 years ago
Does it outperform DALL-E 2?
unholiness almost 3 years ago
Certificate is expired, anyone have a mirror?
marcodiego almost 3 years ago
Ok. Now, how about the legality of it generating socially unacceptable images like child porn?
lxe almost 3 years ago
Hey, I also wrote a neural net that generates perfect images. Here's a static site about it. With images it definitely generated! Can you use it? Is there a source? Hah, of course not, because ethics!
xnx almost 3 years ago
OpenAI really thought they had done something with DALL-E, then Google's all "hold my beer".