Can someone more knowledgeable than me help me understand a few points about this article?<p>It claims to be diffusion-based, but the two main differences from an approach like Stable Diffusion are that (1) it uses a single denoising step instead of the traditional ~1000, and (2) it directly predicts the value z^y instead of a noise direction. According to the paper's analyses, both of these changes help on the studied tasks. But isn't that just how supervised learning has always worked? Aside from using a larger model, this doesn't seem very different from "traditional" depth-estimation methods that don't claim any connection to diffusion.<p>It also claims zero-shot abilities, but they fine-tune the denoising model f_theta, conditioning it by concatenating the latent image to its input, with a loss on the latent label. So their evaluation datasets may be out-of-distribution, but I don't see how that's zero-shot. Asking ChatGPT to output a depth map for a given image would be zero-shot, because (to my knowledge) it hasn't been trained to do that.
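To make my first question concrete, here is a toy sketch of the two training objectives as I understand them. Everything here is a placeholder I made up (the linear "denoiser", the noise schedule, the variable names); it's only meant to show that once you fix t = T and regress directly onto z^y, the diffusion objective collapses into ordinary supervised L2 regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: z_x is the latent of the conditioning image,
# z_y the latent of the depth label, and f_theta a linear "denoiser"
# conditioned by concatenating the image latent to its input.
d = 8
z_x = rng.standard_normal(d)
z_y = rng.standard_normal(d)
W = rng.standard_normal((d, 2 * d)) * 0.1

def f_theta(z_noisy, z_cond):
    # Denoiser applied to [noisy latent ; image latent].
    return W @ np.concatenate([z_noisy, z_cond])

# --- Standard diffusion training (DDPM-style, ~1000 steps) ---
# Sample a timestep, noise the label latent, predict the noise.
T = 1000
t = rng.integers(1, T)
alpha_bar = 1.0 - t / T  # toy noise schedule, not the real one
eps = rng.standard_normal(d)
z_t = np.sqrt(alpha_bar) * z_y + np.sqrt(1.0 - alpha_bar) * eps
loss_noise_pred = np.mean((f_theta(z_t, z_x) - eps) ** 2)

# --- Single-step, direct prediction (as I read the article) ---
# t is pinned to T, so the "noisy" input is pure noise and carries
# no label signal; the objective is plain regression onto z_y.
z_T = rng.standard_normal(d)
loss_direct = np.mean((f_theta(z_T, z_x) - z_y) ** 2)

print(loss_noise_pred, loss_direct)
```

The second loss is exactly a supervised regression from image latent to label latent, which is why I'm asking what remains "diffusion" about it beyond the pretrained weights.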