Whenever I see these text-to-video diffusion models, I always think a much better result would come from some sort of multi-modal, vision-aware AI agent that could use 3D modeling tools like Blender, continuously iterate on the scene, work with physics and particle simulation tools, and then refine the less realistic details with a diffusion-model-based filter. It would also allow all the assets created along the way to be reused and refined.

I imagine this will be the de facto way AI generates video in the next few years, once huge models are smart enough to use tools and work on jobs with the breadth and depth that humans do.
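To make the idea concrete, here's a rough sketch of what one pass of that loop could look like, assuming Blender's bpy scripting API for the physics/render half and a stock img2img pipeline from Hugging Face diffusers as the "realism filter" half. The model name, prompt, and strength value are illustrative placeholders, and there's no agent here, just the two stages the agent would drive:

```python
# Stage 1 (runs inside Blender): drop a rigid-body cube onto a ground
# plane and render each frame of the physics simulation.
import bpy

def render_sim_frames(out_dir: str, num_frames: int = 48) -> list[str]:
    scene = bpy.context.scene
    # Passive collider: the ground plane.
    bpy.ops.mesh.primitive_plane_add(size=10)
    bpy.ops.rigidbody.object_add(type='PASSIVE')
    # Active body: a cube that falls under the simulation.
    bpy.ops.mesh.primitive_cube_add(location=(0, 0, 5))
    bpy.ops.rigidbody.object_add(type='ACTIVE')

    paths = []
    for f in range(1, num_frames + 1):
        scene.frame_set(f)  # advance the sim to this frame
        path = f"{out_dir}/frame_{f:04d}.png"
        scene.render.filepath = path
        bpy.ops.render.render(write_still=True)
        paths.append(path)
    return paths

# Stage 2 (typically a separate process, since bpy only runs inside
# Blender): push each raw render through low-strength img2img so the
# simulated geometry and motion survive while surface detail gets
# repainted by the diffusion model.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def refine_frame(frame_path: str, prompt: str) -> Image.Image:
    raw = Image.open(frame_path).convert("RGB")
    # strength ~0.3: enough to add realism, low enough to keep structure.
    return pipe(prompt=prompt, image=raw, strength=0.3).images[0]
```

The agent part would be a loop around these two stages: render, critique the frames with a vision model, tweak the scene, and re-render, keeping every asset along the way.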