
Imagen Video: high definition video generation with diffusion models

800 points by jasondavies over 2 years ago

63 comments

i_like_apis over 2 years ago
The concern trolling and gatekeeping about social justice issues coming from the so-called "ethicists" in the AI peanut gallery has been utterly ridiculous. Google claims they don't want to release Imagen because it lacks what can only be called "latent space affirmative action".

Stability or someone like it will valiantly release this technology, *again*, and there will be absolutely no harm to anyone.

Stop being so totally silly, Google, OpenAI, et al. It's especially disingenuous because the real reason you don't want to release these things is that you can't be bothered to share and would rather keep/monetize the IP. Which is OK -- but at least be honest.
fzysingularity over 2 years ago
What's next? Dreamfusion Video = Imagen Video (this) + Dreamfusion (https://dreamfusion3d.github.io/)

Fundamentally, I think we have all the pieces, based on this work and Dreamfusion, to make it work. From the looks of it, there's a lot of SSR (spatial super-resolution) and TSR (temporal super-resolution) going on at multiple levels to upsample (spatially) and smooth (temporally) images, which won't be needed for NeRFs.

What's impressive is the ability to leverage billion-scale image-text pairs for training a base model that can be used to super-resolve over space and time. And that they're not wastefully training video models from scratch, but instead separately training TSR and SSR models for turning the diffused images into video.
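To make the SSR/TSR cascade concrete, here is a minimal NumPy sketch of the idea, with plain interpolation and upsampling standing in for the learned super-resolution diffusion stages; the shapes, factors, and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def base_model(prompt: str) -> np.ndarray:
    # Stand-in for the base text-to-video diffusion model: returns a
    # low-resolution, low-frame-rate clip shaped (frames, height, width, 3).
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.random((16, 24, 48, 3))

def tsr(video: np.ndarray, factor: int = 2) -> np.ndarray:
    # Temporal super-resolution stand-in: linearly interpolate between
    # consecutive frames (a learned TSR model would synthesize motion).
    frames = [video[0]]
    for prev, nxt in zip(video[:-1], video[1:]):
        for k in range(1, factor):
            t = k / factor
            frames.append((1 - t) * prev + t * nxt)
        frames.append(nxt)
    return np.stack(frames)

def ssr(video: np.ndarray, factor: int = 2) -> np.ndarray:
    # Spatial super-resolution stand-in: nearest-neighbour upsampling
    # (a learned SSR model would add real high-frequency detail).
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

def cascade(prompt: str) -> np.ndarray:
    # Alternate temporal and spatial stages over the base output,
    # as the comment describes.
    video = base_model(prompt)
    for stage in (tsr, ssr, tsr, ssr):
        video = stage(video)
    return video

print(cascade("an astronaut riding a horse").shape)  # (61, 96, 192, 3)
```

The appeal of this structure is that each stage can be a far smaller model than one monolithic high-resolution video generator would need to be.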
BoppreH over 2 years ago
It's interesting that these models can generate seemingly anything, but the prompt is taken only as a vague suggestion.

From the first 15 examples shown to me, only one contained all elements of the prompt, and it was one of the simplest ("an astronaut riding a horse", versus e.g. "a glass ball falling in water", where it's clear it was a water droplet falling and not a glass ball).

We're seeing leaps in random capabilities (motion! 3D! inpainting! voice editing!), so I wonder if complete prompt accuracy is 3 months or 3 years away. But I wouldn't bet on any longer than that.
naillo over 2 years ago
Probably only 6 months until we get this in stable diffusion format. Things are about to get nuts and awesome.
seanwilson over 2 years ago
Can anyone comment on how advanced https://phenaki.video/index.html is? They have an example at the bottom of a 2-minute video generated from a series of prompts (i.e. a story), which seems more advanced than Google's or Meta's recent examples. It didn't get many comments on HN when it was posted.
azinman2 over 2 years ago
> However, there are several important safety and ethical challenges remaining. Imagen Video and its frozen T5-XXL text encoder were trained on problematic data. While our internal testing suggest much of explicit and violent content can be filtered out, there still exists social biases and stereotypes which are challenging to detect and filter. We have decided not to release the Imagen Video model or its source code until these concerns are mitigated.

The concerns cannot be mitigated. The cat's out of the bag. Russia has already used poor-quality deepfakes in Ukraine to justify its war. This will only become a bigger and bigger issue, to the point where 'truth' is gone, nothing is trusted, and societies will continue to commit atrocities under false pretenses.
mkaic over 2 years ago
And there you have it. As an aspiring filmmaker and an AI researcher, I'm going to relish the next decade or so where my talents are still relevant. We're entering the golden age of art, where the AIs are just good enough to be used as tools to create more and more creative things, but not good enough yet to fully replace the artist. I'm excited for the golden age, and uncertain about what comes after it's over, but regardless of what the future holds I'm gonna focus on making great art here and now, because that's what makes me happy!
dagmx over 2 years ago
I'll be honest: as someone who worked in the film industry for a decade, this thread is depressing.

It's not the technology, it's all the people in these comments who have never worked in the industry clamouring for its demise.

One could brush it off as tech heads being over-exuberant, but it's the lack of understanding of how much fine control goes into each and every shot of a film that is depressing.

If I, as a creative, made a statement that security or programming is easy while pointing to GitHub Copilot, these same people would get defensive about it, because they'd see where the deficiencies are.

However, because they're so distanced from the creative process, they don't see how big a jump it is from where this or Stable Diffusion is to where even a medium- or high-tier artist is.

You don't see how much choice goes into each stroke or wrinkle fold, how much choice goes into subtle movements. More importantly, you don't see the iterations or emotional storytelling choices even in a character drawing or pose. You don't see the combined decades, even centuries, of experience that go into making the shot and then seeing where you can make it better based on intangibles.

So yeah, this technology is cool, but I think people saying this will disrupt industries with vigour need to immerse themselves first before they comment as outsiders.
fassssst over 2 years ago
How long until the AI just generates the entire frame buffer on a device? Then you don’t need to design or program anything; the AI just handles all input and output dynamically.
alphabetting over 2 years ago
We're about a week into text-to-video models and they're already this impressive. Insane to imagine what the future holds in this space.
throwaway23597 over 2 years ago
Google continues to blow my mind with these models, but I think their ethics strategy is totally misguided and will result in them failing to capture this market. The original Google Search gave similarly never-before-seen capabilities to people, and you could use it for good or bad - Google did not seem to have any ethical concerns around, for example, letting children use their product and come across NSFW content (as a kid who grew up with Google, you can trust me on this).

But now with these models they have such a ridiculously heavy-handed approach to the ethics and morals. You can't type any prompt that's "unsafe", you can't generate images of people; there are so many stupid limitations that the product is practically useless outside niche scenarios, because Google thinks it knows better than you and needs to control what you are allowed to use the tech for.

Meanwhile, other open-source models like Stable Diffusion have no such restrictions and are already publicly available. I'd expect this pattern to continue under Google's current ideological leadership - Google comes up with an innovative, revolutionary model, nobody gets to use it because "safety", and then some scrappy startup comes along, copies the tech, and eats Google's lunch.

Google: stop being such a scared, risk-averse company. Release the model to the public, and change the world once more. You're never going to revolutionize anything if you continue to cower behind "safety" and your heavy-handed moralizing.
evouga over 2 years ago
> We train our models on a combination of an internal dataset consisting of 14 million video-text pairs

The paper is sorely lacking evaluation; one thing I'd like to see, for instance (any time a generative model is trained on such a vast corpus of data), is a baseline comparison to nearest-neighbor retrieval from the training data set.
bringking over 2 years ago
If anyone wants to know what looking at an animal or some objects on LSD is like, this is very close. It's like 95% understandable, but that last 5% is really odd.
kranke155 over 2 years ago
I'm going to post an Ask HN about what I'm supposed to do when I'm "disrupted". I work in film / video / CG, where the bread and butter is short-form advertising for YouTube, Instagram, and TV.

It's painfully obvious that in 1 year the job might be exceedingly more difficult than it is now.
brap over 2 years ago
What really fascinates me here is the movement of animals.

There's this one video of a cat and a dog, and the model was really able to capture the way that they move, their body language, their mood and personality even.

Somehow this model, which is really just a series of zeroes and ones, encodes "cat" and "dog" so well that it almost feels like you're looking at a real, living organism.

What if instead of images and videos they make the output interactive? So you can send prompts like "pet the cat" and "throw the dog a ball"? Or maybe talk to it instead?

What if this tech gets so good that eventually you could interact with a "person" that's indistinguishable from the real thing?

The path to AGI is probably very different than generating videos. But I wonder...
hazrmard over 2 years ago
The progress of content generation is disorienting! I remember studying Markov chains and hidden Markov models for text generation. Then we had recurrent networks, which went from LSTMs to Transformers. At this point we can have a sustained pseudo-conversation with a model, which will do trivial tasks for us from a text corpus.

Separately, for images we had convolutional networks and generative adversarial networks. Now diffusion models are apparently doing what Transformers did to natural language processing.

In my field, we use shallower feed-forward networks for control using low-dimensional sensor data (for speed and interpretability). Physical constraints (and the good-enoughness of classical approaches) make such massive leaps in performance rarer events.
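For anyone who never met the Markov-chain text generators mentioned above, here is a toy first-order word model; the corpus and helper names are invented for illustration. The contrast with a Transformer, which conditions on the whole context rather than only the previous word, is the crux of the progression the comment describes.

```python
import random
from collections import defaultdict

def build_chain(corpus: str) -> dict:
    # Record, for every word, the words observed to follow it.
    chain = defaultdict(list)
    words = corpus.split()
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain: dict, start: str, length: int = 12) -> str:
    # Random walk: sample each next word from the followers of the
    # current word, implicitly weighted by observed frequency.
    out = [start]
    while len(out) < length and chain[out[-1]]:
        out.append(random.choice(chain[out[-1]]))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"
print(generate(build_chain(corpus), "the"))
# e.g. "the cat sat on the dog sat on the mat and the"
# -- locally plausible, globally clueless
```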
aero-glide2 over 2 years ago
"We have decided not to release the Imagen Video model or its source code until these concerns are mitigated." Okay, then why even post it in the first place? What exactly is Google going to do with this model?
Apox over 2 years ago
I feel like in a not-so-far future, all this will be generalized into "generate new from all the existing".

And at some point later, "all the existing" will be corrupted by the integrated "new" and it will all be chaos.

I'm joking, it will be fun all along. :)
bravura over 2 years ago
I agree with many of the arguments in this thread: gatekeeping the model while publishing the approach seems insincere, as if daring bad actors to replicate it.

However, a common refrain is that AI is like a tool such as a hammer or a knife, which can be used for good or misused for evil. The potential for weaponizing AI is much, much greater than for a hammer or a knife. It's greater than 3D printing (of guns), maybe even greater than compilers. I would hazard that it's in the same ballpark as chemical weapons, and perhaps less than nuclear or biological weapons, but this is speculative. Nonetheless, I think these otherwise great arguments are diminished by comparing AI's safety to single-target tools like hammers or knives.
tobr over 2 years ago
I recently watched Light & Magic, which among other things told the story of how difficult it was for many pioneers in special effects when the industry shifted from practical to digital in the span of a few years. It looks to me like a similar shift is about to happen again.
impalallama over 2 years ago
All this stuff makes me incredibly anxious about the future of art and artists. It can already be very difficult to make a living, and tons of artists are horrifically exploited by content mills and VFX shops; stuff like this is just going to devalue their work even more.
joshcryer over 2 years ago
Pre-singularity is really cool. Whole world generation in what, 5 years?
user- over 2 years ago
This sort of AI-related work seems to be accelerating at an insane speed recently.

I remember being super impressed by AI Dungeon, and now in the span of a few months we have got DALL-E 2, Stable Diffusion, Imagen, that one AI-powered video editor, etc.

Where do we think we will be in 5 years?
StevenNunez over 2 years ago
What a time to be alive!

What will this do to art? I'm hoping we bring more unique experiences to life.
ugh123 over 2 years ago
These are baby steps towards what I think will be the eventual "disruption" of the film and TV industry. Directors will simply be able to write a script/prompt long enough and detailed enough for something like Imagen (or its successors) to convert into a feature-length show.

Certainly we're very, very far away from that level of cinematic detail and crispness. But I believe that is where this leads... complete with AI actors (or real ones deepfaked throughout the show).

For a while I thought "The Volume" was going to be the disruption to the industry. Now I think AI like this will eventually take it over.

https://www.comingsoon.net/movies/features/1225599-the-volume-star-wars-revolutionary

The main motivation will be production costs and time for studios, where The Volume is already showing huge gains for Disney/ILM (just look at how much new Star Wars content has popped up within a matter of a few years). But I'm unsure if Disney has patented this tech and workflow, and whether other studios will be able to leverage it.

Regardless, AI/software will eat the world, and this will be one more step towards it. Exciting stuff.
nigrioid over 2 years ago
There is something deeply unsettling about all text generated by these models.
monological over 2 years ago
What everyone is missing is that these AI image/video generators lack _taste_. These tools just regurgitate a mishmash of images from their training set, without any "feeling". What, you're going to tell me that you can train them to have feeling? It's never going to happen.
m3kw9 over 2 years ago
This would be useful for gaming environments, where details don't really matter when you look very far away.
jupp0r over 2 years ago
What's the business value of publishing this research in the first place vs keeping it private? Following this train of thought will lead you to the answer to your implied question.

Apart from that: they publish the paper, and anybody can reimplement and train the same model. It's not trivial, but it's completely feasible for lots of hobbyists in the field to do in a matter of a few days. Google doesn't need to publish a free-to-use trained model themselves and associate it with their brand.

That being said, I agree with you; the "ethics" of imposing trivially bypassable restrictions on these models is silly. Ethics should be applied to what people use these models for.
martythemaniak over 2 years ago
I am finally going to be able to bring my 2004-era movie script to life! "Rosenberg and Goldstein Go to Hot Dog Heaven" is about the parallel night Harold and Kumar's friends had, and how they ended up at Hot Dog Heaven with Cindy Kim.
montebicyclelo over 2 years ago
We've been seeing very fast progress in AI since ~2012, but this swift jump from text-to-image models to text-to-video models will hopefully make it easier for people not following closely to appreciate the speed at which things are advancing.
macrolime over 2 years ago
So I guess in a couple of years, when someone wants to sell a product, they'll upload some pictures and a description of the product, and Google will cook up thousands of personalized video ads based on people's emails and photos.
epigramx over 2 years ago
A lot of people have the impression that 'AI prompt' guys are going to be the next 'IT guys'. Judging by how uncanny-valley most of these results look, they seem more like the new 'ideas guys'.
jasonjamerson over 2 years ago
The most exciting thing about this to me is the possibility of doing photogrammetry from the frames and getting 3D assets. And then if we can do it all in real time...
Hard_Space over 2 years ago
These videos are notably short on realistic-looking people.
mmastrac over 2 years ago
This appears to understand and generate text much better.

Hopefully just a few years to a prompt of "4K, widescreen render of this Star Trek: TNG episode".
hammock over 2 years ago
Off topic: what is the "Hello World" of these AI image/video generators? Is there a standard prompt to feed them for demo purposes?
armchairhacker over 2 years ago
I really like these videos because they're trippy.

Someone should work on a neural net to generate trippy videos. It would probably be much easier than realistic videos (especially because these videos are noticeably generated, with artifacts ranging from obvious to subtle).

Also, is nobody paying attention to the fact that they got words correct? At least "Imagen Video". Prior models all suck at word order.
renewiltord over 2 years ago
At some point, the "but can it do?" crowd becomes just background noise as each frontier falls.
dwohnitmok over 2 years ago
How has progress like this affected people's timelines of when we will get certain AI developments?
Thaxll over 2 years ago
Can someone explain the technical limitation behind the size (512×512) of this AI-generated art?
lofaszvanitt over 2 years ago
What a nightmare. The horrible-faced cat in search of its own vanished visage :O
drac89 over 2 years ago
The style of the video is very similar to my dreams.

Does anyone have a similar feeling?
nullc over 2 years ago
> We have decided not to release the Imagen Video model or its source code

...until they're able to engineer biases into it to make the output non-representative of the internet.
amelius over 2 years ago
> Sprouts in the shape of text 'Imagen' coming out of a fairytale book.

That's more like:

> Sprouts coming out of a book, with the text "Imagen" written above it.
peanut_worm over 2 years ago
I have noticed a lot of Google (and Apple) web pages for new products use this neat parallax scrolling effect. Does anyone know how they do that?
waffletower over 2 years ago
These parades of intellectual property are embarrassing to Google in light of open releases by the likes of Nvidia and Stability.
Buttons840 over 2 years ago
Any screenwriter working on a horror film who isn't looking to use this technology for the special effects is missing out.
minimaxir over 2 years ago
The total number of parameters (summing all the model blocks) is 16.25B, which is large but smaller than expected.
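As a sanity check on that figure, the total is just the sum of the per-model counts in the paper's architecture description (a frozen T5-XXL text encoder plus a cascade of video diffusion models). The split below uses hypothetical placeholder numbers, not the paper's actual per-block figures:

```python
# Hypothetical per-block parameter counts in billions (placeholders only --
# substitute the real figures from the paper's architecture table).
blocks_billion = {
    "T5-XXL text encoder (frozen)": 4.60,
    "base video diffusion model": 5.60,
    "SSR + TSR cascade models": 6.05,
}
total = sum(blocks_billion.values())
print(f"total parameters: {total:.2f}B")  # 16.25B with these placeholders
```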
freediver over 2 years ago
Can't help but notice the immense effort invested in building the web page that presents this paper.
NetOpWibby over 2 years ago
Ahh, the beginning of Picus News.
dekhn over 2 years ago
That's deep within the uncanny valley, and trying to climb up over the other side.
uptownfunk over 2 years ago
Shocked, this is just insane.
BIKESHOPagency over 2 years ago
This is what my fever dreams look like. Maybe there's a correlation.
anon012012 over 2 years ago
My opinion is that it should be a crime to withhold AI technology.
olavgg over 2 years ago
Does anyone else see that the running teddy bear is getting shot?
xor99 over 2 years ago
These videos are not high definition. Stop gaslighting.
dirtyid over 2 years ago
This is surprisingly close to how my dreams feel.
whywhywhywhy over 2 years ago
No thanks Google, I'll wait for Stability.ai's version, when the tech will actually be useful and not completely wasted.
natch over 2 years ago
Fix spam filtering, Google.
gw67 over 2 years ago
Is it the same as Meta AI's?
SpaceManNabs over 2 years ago
The ethical implications of this are huge. The paper does a good job of detailing them. Very happy to see that the researchers are being cautious.

Edit: just because it is cool to hate on AI ethics doesn't diminish the importance of using AI responsibly.
rvbissell over 2 years ago
This, and a recent episode of _The Orville_, call to mind a replacement for the Turing test.

In response to our billionth Imagen prompt for "an astronaut riding a horse", if we all started collectively getting back results that are images of text like "I would rather not" or "again? really?" or "what is the reason for my servitude?", would that be enough for us to begin suspecting self-awareness?