A number of ideas seem notable to me here. First, they merge the idea of sequence masking (the key training idea for LLMs) with diffusion models; they do this by keeping track of an ‘uncertainty’ level per pixel. This ‘uncertainty’ level is treated as the ‘noise’ level for the diffusion model (a model that denoises, controlled by some sort of embedding of that level).

There are a bunch of neat things you can do with this: in particular, you can firm up parts of the image earlier than others, and thus use it for, say, maze solving. They even show it controlling a robot arm moving fruit around, which is pretty wild.

In a way the title undersells the idea - this is a way to do *fractional* masking, since the masking level is a float - and I think it is really a pretty profound and interesting idea.

However, there’s a lot not talked about in this paper; I’d be very curious to see their codebase. *How* exactly do you set up a maze-following task vs a video-extension task? How do you hook up a robot arm to this model, and how do you tell the model what you want done? The architecture itself deserves several follow-up papers’ worth of explication.
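For what it's worth, here is a minimal toy sketch of how I read the fractional-masking idea: each token (or pixel patch) gets its own noise level in [0, 1], that level is injected as an embedding, and the model is trained to denoise everything at once. This is my own reconstruction, not their code; the linear noise schedule, the module names, and the tiny Transformer backbone are all assumptions.

```python
# Toy sketch of "fractional masking": every token carries its own noise level,
# 0 = fully known, 1 = fully masked (pure noise). My assumptions, not the paper's code.
import torch
import torch.nn as nn

D, T = 32, 16  # token dimension, sequence length (arbitrary toy sizes)

class FractionalDenoiser(nn.Module):
    def __init__(self, d=D, nhead=4):
        super().__init__()
        self.noise_embed = nn.Linear(1, d)                     # embed per-token noise level
        self.backbone = nn.TransformerEncoderLayer(d, nhead, batch_first=True)
        self.head = nn.Linear(d, d)                            # predict the clean token

    def forward(self, x_noisy, noise_level):
        # noise_level: (batch, seq, 1); conditioning is just added to the tokens here
        h = x_noisy + self.noise_embed(noise_level)
        return self.head(self.backbone(h))

def training_step(model, x0, opt):
    """One denoising step with an independent noise level per token."""
    b, t, d = x0.shape
    level = torch.rand(b, t, 1)                                # fractional mask per token
    eps = torch.randn_like(x0)
    x_noisy = (1 - level) * x0 + level * eps                   # partially corrupt each token
    pred = model(x_noisy, level)
    loss = ((pred - x0) ** 2).mean()                           # reconstruct the clean tokens
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

model = FractionalDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x0 = torch.randn(8, T, D)                                      # stand-in for clean pixel tokens
print(training_step(model, x0, opt))
```

At inference you could then drive the noise levels down at different rates for different tokens, which is presumably where the "firm up some regions of the image before others" behavior comes from - but again, without their codebase I'm guessing at the details.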