Diffusion models explained simply

166 points by onnnon 3 days ago

16 comments

One of the key intuitions: If you take a natural image and add random noise, you will get a different random noise image every time you do this. However, all of these (different!) random noise images will be lined up in the direction perpendicular to the natural image manifold.

So you will always know where to go to restore the original image: the shortest distance to the natural image manifold.

How do all these random images end up perpendicular to the manifold? High-dimensional statistics, and the fact that the natural image manifold has much lower dimension than the overall space.
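
A quick numerical check of the near-perpendicularity claim, as a minimal sketch rather than a proof. It assumes the manifold is locally flat, with its tangent space spanned by the first `manifold_dim` coordinate axes (Gaussian noise is rotation invariant, so the choice of axes should not matter). The function name `tangent_fraction` and the dimensions used are illustrative choices, not anything from the article.

```python
import math
import random


def tangent_fraction(dim, manifold_dim, rng):
    """Fraction of a Gaussian noise vector's length that lies along the manifold.

    Assumption: the tangent space of the manifold is spanned by the first
    `manifold_dim` coordinate axes of a `dim`-dimensional space.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    tangent_sq = sum(x * x for x in noise[:manifold_dim])
    total_sq = sum(x * x for x in noise)
    return math.sqrt(tangent_sq / total_sq)


rng = random.Random(0)
for dim in (64, 1024, 16384):
    # Average over a few noise draws to smooth out sampling variation.
    fractions = [tangent_fraction(dim, 8, rng) for _ in range(20)]
    print(dim, round(sum(fractions) / len(fractions), 3))
```

The printed fraction should shrink roughly like sqrt(manifold_dim / dim), so in a space with as many dimensions as an image has pixels, almost all of the noise points away from the manifold.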

cubefox 2 days ago

It's nice that this contains a comparison between the diffusion models that are used for image models and the autoregressive models that are used for LLMs.

But recently (2024 NeurIPS paper of the year) there was a new paper on autoregressive image modelling that apparently outperforms diffusion models: https://arxiv.org/abs/2404.02905

The innovation is that it doesn't predict image patches (like older autoregressive image models) but somehow does some sort of "next scale" or "next resolution" prediction.

In the past, autoregressive image models did not perform as well as diffusion models, which meant that most image models used diffusion. Now it seems autoregressive techniques have a strict advantage over diffusion models. Another advantage is that they can be integrated with autoregressive LLMs (multimodality), which is not possible with diffusion image models. In fact, the recent GPT-4o image generation is autoregressive according to OpenAI. I wonder whether diffusion models still have a future now.

fisian 2 days ago

I found this course very helpful if you're interested in a bit of math (but all very well explained): https://diffusion.csail.mit.edu/

It is short, with good lecture notes, and has hands-on examples that are very approachable (with solutions available if you get stuck).

ActorNightly 2 days ago

The thing to understand about any model architecture is that there isn't really anything special about one or the other - as long as the process is differentiable, ML can learn it.

You can build an image generator that basically renders each word on one line in an image, and then uses a transformer architecture to morph the image of the words into what the words are describing.

The only big difference is really efficiency, but we are just taking stabs in the dark at this point - there is work that Google is doing that eventually is going to result in the most optimal model for a certain type of task.

bcherry 2 days ago

"The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material." - Michelangelo

porphyra 2 days ago

Meanwhile, if you want diffusion models explained with math for a graduate student, there's Tony Duan's Diffusion Models From Scratch [1].

[1] https://www.tonyduan.com/diffusion/index.html

user14159265 3 days ago

https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

noodletheworld 2 days ago

Mmm… how is a model with a fixed size, let’s say 512x512 (i.e. 64x64 latent or whatever), able to output coherent images at a larger size, let’s say 1024x1024?

Not in a “kind of like this” kind of way: PyTorch vector pipelines can’t take arbitrary-sized inputs at runtime, right?

If your input has shape [x, y, z] you cannot pass [2x, 2y, 2z] into it.

Not… “it works but not very well”; like, it cannot execute the pipeline if the input dimensions aren’t exactly what they were when training.

Right? Isn’t that how it works?

So, is the image chunked into fixed patches and fed through in parts? Or something else?

For example, this toy implementation [1] resizes the input image to match the expected input, and always emits an output of a specific fixed size.

Which is what you would expect; but it also points to tools like Stable Diffusion working in a way that is distinctly different from what the trivial explanations tend to say?

[1] - https://github.com/uygarkurt/UNet-PyTorch/blob/main/inference.py
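
One fact that may bear on this question, offered as a partial observation and not as a description of how Stable Diffusion or the linked toy implementation actually works: a convolution's weights have a fixed size that does not depend on the image size, so the same weights can slide over inputs of different shapes. The sketch below is plain Python with a hand-written `conv2d_valid` helper (a hypothetical name, not a PyTorch function) and a made-up 3x3 kernel standing in for learned weights. Layers such as fully connected layers or fixed-length positional embeddings would not share this property.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) with no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Stand-in for learned weights: 9 numbers, regardless of image size.
kernel = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

# Two synthetic "images" of different sizes, processed by the same kernel.
small = [[(r * c) % 7 for c in range(8)] for r in range(8)]
large = [[(r * c) % 7 for c in range(16)] for r in range(16)]

out_small = conv2d_valid(small, kernel)
out_large = conv2d_valid(large, kernel)
print(len(out_small), len(out_small[0]))  # 6 6
print(len(out_large), len(out_large[0]))  # 14 14
```

Whether the output stays coherent at a size the model never saw in training is a separate question that this sketch does not address.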

intalentive 2 days ago

This explanation is intuitive: https://www.youtube.com/watch?v=zc5NTeJbk-k

My takeaway is that diffusion "samples all the tokens at once", incrementally, rather than getting locked in to a particular path, as in auto-regression, which can only look backward. The upside is global context, the downside is fixed-size output.

kmitz 3 days ago

Thanks, I was looking for an article like this, with a focus on the differences between generative AI techniques. My guess is that since LLMs and image generation became mainstream at the same time, most people don't have the slightest idea they are based on fundamentally different technologies.

bicepjai 2 days ago

> CLASSIFIER-FREE GUIDANCE … During inference, you run once with a caption and once without, and blend the predictions (magnifying the difference between those two vectors). That makes sure the model is paying a lot of attention to the caption.

Why is this sentence true? “That makes sure the model is paying a lot of attention to the caption.”

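The blend described in the quoted passage can be sketched with a common formulation of classifier-free guidance: start from the caption-free prediction and move along the difference between the two predictions, scaled by a guidance weight. The function name, the weight values, and the three-element vectors below are illustrative assumptions; in a real model the vectors would be the network's noise predictions.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend two predictions: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Made-up predictions without and with the caption.
eps_uncond = [0.10, -0.20, 0.05]
eps_cond = [0.30, -0.10, 0.05]

for scale in (0.0, 1.0, 7.5):
    print(scale, classifier_free_guidance(eps_uncond, eps_cond, scale))
```

A scale of 0 returns the caption-free prediction and a scale of 1 returns the captioned one. A scale above 1 extrapolates past the captioned prediction along the direction in which the caption changed the output, which is one way to read "paying a lot of attention to the caption". Components the caption did not change (the third element here) stay where they were.
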
JoeDaDude 2 days ago

Coincidentally, I was just watching this explanation earlier today:

How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile

https://www.youtube.com/watch?v=1CIpzeNxIhU

IncreasePosts 2 days ago

Are there any diffusion models for text? I'd imagine they'd be very fast, if the whole result can be processed simultaneously, instead of outputting a linear series of tokens that each depend on the last.

petermcneeley 2 days ago

This page is full of text. I am guessing the author (Sean Goedecke) is a language-based thinker.

cubefox 3 days ago

That's a nice high-level explanation: short and easy to understand.

jdthedisciple 2 days ago

Not to be that guy but an article on diffusion models with only one image ... and that too just noise?