The most interesting thing about diffusion LMs, and one that tends to be missed, is their ability to edit early tokens.<p>We know that the early tokens in an autoregressive sequence disproportionately bias the outcome. I would go as far as to say that part of the magic of reasoning models is that they generate so much text they can kinda get around this.<p>However, diffusion seems like a much better way to solve this problem.
I don't get where the author is coming from with the idea that a diffusion-based LLM would hallucinate less.<p>> dLLMs can generate certain important portions first, validate it, and then continue the rest of the generation.<p>If you pause the animation in the linked tweet (not the one on the page), you can see that the intermediate versions are full of, well, baloney.<p>(And anyone who has messed around with diffusion-based image generation knows those models are perfectly happy to hallucinate.)
I'm personally happy to see effort in this space simply because I think it's an interesting set of tradeoffs (compute ∝ accuracy) - a departure from the fixed per-token compute budget required now.<p>It brings up interesting questions, like: what's the equivalence between smaller diffusion models, which consume more compute because they run a greater number of diffusion steps, and larger traditional LLMs, which essentially take a single step per token? How effective is decoupling the context window size from the diffusion window size? Is there an optimum ratio?
There is a disproportionate skepticism toward autoregressive models and a disproportionate optimism about alternative paradigms, based on the completely unverifiable idea that LLMs, when predicting the next token, don't already model, in their activation states, the gist of what they are going to say, similar to what humans do. That's funny because many times you can observe in truly high-quality replies that the first tokens only made sense <i>in the perspective</i> of what comes later.
Interestingly, that animation at the end <i>mainly</i> proceeds from left to right, with just some occasional exceptions.<p>So I followed the link, and gave the model this bit of conversation starter:<p>> <i>You still go mostly left to right.</i><p>The denoising animation it generated went like this:<p>> [Yes] [.] [MASK] [MASK] [MASK] ... [MASK]<p>and proceeded by deletion of the mask elements on the right one by one, leaving just the "Yes.".<p>:)
I think these models would get interesting at extreme scale. Generate a novel in 40 iterations on a rack of GPUs.<p>At some point in the future, you will be able to autogen a 10M line codebase in a few seconds on a giant GPU cluster.
The animation on the page looks an awful lot like autoregressive inference in that virtually all of the tokens are predicted in order? But I guess it doesn't have to do that in the general case?
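Right - it doesn't have to. Masked-diffusion LMs like LLaDA typically use a confidence-based unmasking schedule: at each denoising step the model scores every still-masked position, and only the most confident positions get committed, so decoding order depends on the data rather than being left-to-right. A toy sketch of that schedule (a fixed table of proposals stands in for the actual model, which would re-score everything with a transformer each step):

```python
MASK = "[MASK]"

def denoise_step(tokens, proposals, k):
    """Commit the k most confident proposals among still-masked slots.

    proposals: position -> (token, confidence), a stand-in for one
    forward pass of the model over the whole block.
    """
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    for i in sorted(masked, key=lambda i: -proposals[i][1])[:k]:
        tokens[i] = proposals[i][0]
    return tokens

block = [MASK] * 4
# Hypothetical model output: high confidence at both ends of the block.
proposals = {0: ("Yes", 0.9), 1: ("we", 0.2), 2: ("do", 0.3), 3: (".", 0.8)}
denoise_step(block, proposals, k=2)
# -> ["Yes", "[MASK]", "[MASK]", "."]
```

With these (made-up) confidences, positions 0 and 3 fill in first, so the visible order is out-of-sequence. If the model happens to be most confident about the leftmost masked token at every step - plausible for a model trained on left-to-right text - the animation degenerates to something that looks autoregressive.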
That got me thinking that it would be nice to have something like ComfyUI to work with diffusion-based LLMs. Apply LoRAs, use multiple inputs, have multiple outputs.<p>Something akin to ComfyUI but for LLMs would open up a world of possibilities.
This is the Hugging Face page: <a href="https://huggingface.co/papers/2502.09992" rel="nofollow">https://huggingface.co/papers/2502.09992</a>
This was a very cool paper about using diffusion language models and beam search: <a href="https://arxiv.org/html/2405.20519v1" rel="nofollow">https://arxiv.org/html/2405.20519v1</a><p>Just looking at all of the amazing tools and workflows that people have made with ComfyUI and stuff makes me wonder what we could do with diffusion LMs. It seems diffusion models are much more easily hackable than LLMs.
How do diffusion LLMs decide how long the output should be? Normal LLMs generate a stop token and then halt. Do diffusion LLMs just output a fixed block of tokens and truncate the output that comes after a stop token?
I guess the biggest limitation of this approach is that the max output length is fixed before generation starts, unlike with autoregressive LLMs, which can keep generating forever.
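As far as I can tell, yes - one common convention (I believe LLaDA does something like this, though I'm not certain of the details) is to denoise a fixed-length block and treat everything from the first EOS token onward as padding. A minimal sketch of that truncation, with a hypothetical token name:

```python
EOS = "<eos>"  # hypothetical end-of-sequence token

def truncate_at_eos(block):
    """Keep tokens up to (but excluding) the first EOS; the rest of
    the fixed-length block is discarded as padding."""
    out = []
    for tok in block:
        if tok == EOS:
            break
        out.append(tok)
    return out

truncate_at_eos(["Yes", ".", EOS, "pad", "pad"])  # -> ["Yes", "."]
```

So the model can produce short answers inside a long block, but it can't exceed the block length without some outer loop stitching blocks together.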
See also this recent post about Mercury-Coder from Inception Labs. There's a "diffusion effect" toggle for their chat interface but I have no idea if that's an accurate representation of the model's diffusion process or just some randomly generated characters showing what the diffusion process looks like<p><a href="https://news.ycombinator.com/item?id=43187518">https://news.ycombinator.com/item?id=43187518</a><p><a href="https://www.inceptionlabs.ai/news" rel="nofollow">https://www.inceptionlabs.ai/news</a>
I know the r-word is coming back in vogue, but it was still unpleasant to see it in the middle of an otherwise technical blog post. Ah well.<p>Diffusion LMs are interesting and I'm looking forward to seeing how they develop, but from playing around with that model, it's GPT-2 level. I suspect it will need to be significantly scaled up before we can meaningfully compare it to the autoregressive paradigm.