I had to dive deep into this to gain an intuition for it. Basically, diffusion models work by de-noising a noise sample step by step until the result is something that looks like the training data (like an image of a frog).

To train a diffusion model, you take the image of the frog and add noise to it one step at a time. You then teach a neural network such as a U-Net to remove the noise from any given step back to the previous (less noisy) step. With millions of samples, the model eventually learns to do this.

Flow matching is similar, but the key difference is that training involves simply taking an image sample and smoothly interpolating it straight to the noise. You aren’t adding noise “step by step” but rather literally running interpolate(image, noise, step_count) to generate an array of intermediate steps on the way to noise.

The flow matching neural network is then trained to produce a vector field that knows precisely how much to nudge each dimension of a sample at any given time step so that it looks like a sample at the previous time step. This much is very similar to DDIM diffusion models.

However, what’s being predicted is the vector field rather than the de-noising operation.

Flow matching lets you perform de-noising with an off-the-shelf, highly efficient ordinary differential equation (ODE) solver, because those natively work with vector fields. And since the whole process is fully deterministic, you can jam your noise sample in and let the ODE solver do the rest.

The flow matching approach produces better output because it approximates a linear interpolation process in reverse, rather than a step-by-step removal of noise. This means the network is somewhat more likely to produce output that matches the data distribution, all else being equal.
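Here’s a minimal sketch of what that training loop might look like in PyTorch. The model(x_t, t) network is an assumption on my part (a U-Net-like net that also takes the time step), as are the tensor shapes; the key lines are the straight-line interpolation and the constant velocity target along it:

    import torch

    def training_step(model, optimizer, image):
        # image: a batch of training samples, shape (B, C, H, W)
        noise = torch.randn_like(image)
        t = torch.rand(image.shape[0], 1, 1, 1)    # random time in [0, 1], broadcasts over (C, H, W)
        x_t = (1 - t) * image + t * noise          # straight-line interpolation: image at t=0, noise at t=1
        target_velocity = noise - image            # constant velocity along the straight path
        pred_velocity = model(x_t, t)              # network predicts the vector field at (x_t, t)
        loss = torch.mean((pred_velocity - target_velocity) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Note that nothing here is iterative: each training step picks a single random point on the straight line between an image and a noise sample, which is exactly the interpolate(image, noise, ...) idea above.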
Training flow matching networks is apparently harder, though, because vector fields tend to accumulate errors, so you need to make careful use of regularization to ensure the network doesn’t go too wild.
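For completeness, here’s a sketch of the deterministic sampling side. A plain fixed-step Euler loop stands in for a proper off-the-shelf ODE solver (in practice you could hand the learned vector field to something like torchdiffeq’s odeint); it assumes the same model(x, t) as above:

    @torch.no_grad()
    def sample(model, shape, num_steps=50):
        # Start from pure noise at t = 1 and integrate the learned
        # vector field backward to t = 0 (the data distribution).
        x = torch.randn(shape)
        dt = 1.0 / num_steps
        for i in range(num_steps, 0, -1):
            t = torch.full((shape[0], 1, 1, 1), i * dt)
            v = model(x, t)    # predicted velocity at (x, t)
            x = x - dt * v     # one Euler step toward the data
        return x

Since there’s no added noise during sampling, the same starting sample always maps to the same output, which is what makes the off-the-shelf ODE solver tooling a drop-in fit.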