Not sure if I would trade off speed for accuracy.

Yes, it's incredibly boring to wait for the AI agents in IDEs to finish their job. I get distracted and open YouTube. Once I gave a prompt so big and complex to Cline that it spent 2 straight hours writing code.

But after those 2 hours I spent 16 more tweaking and fixing all the stuff that wasn't working. I now realize I should have done things incrementally, even when I have a pretty good idea of the final picture.

I've been more and more using only the "thinking" models: o3 in ChatGPT, and Gemini / Claude in IDEs. They're slower, but usually get it right.

But at the same time I am open to the idea that speed can unlock new ways of using the tooling. It would still be awesome to basically just have a conversation with my IDE while I am manually testing the app. Or to combine really fast models like this one with a "thinking" background one that runs for seconds/minutes and tries to catch the bugs left behind.

I guess only giving it a try will tell.
If the benchmarks aren't lying, Mercury Coder Small is as smart as 4o-mini and costs the same, but is an order of magnitude faster when outputting (unclear if the pre-output delay is notably different). Pretty cool. However, I'm under the impression that 4o-mini was superseded by 4.1-mini and 4.1-nano for all use cases (correct me if I'm wrong). Unfortunately they didn't publish comparisons with the 4.1 line, which feels like an attempt to manipulate the optics. Or am I misreading this?

Btw, why call it "coder"? The 4o-mini level of intelligence is for extracting structured data and basic summaries, definitely not for coding.
There are some open-weight attempts at this around too: https://old.reddit.com/r/LocalLLaMA/search?q=diffusion&restrict_sr=on

Saw another on Twitter in the past few days that looked like a better contender to Mercury; it doesn't look like it got posted to LocalLLaMA, and I can't find it now. Very exciting stuff.
It fails the MU Puzzle¹ by violating the rules:

To transform the string "AB" to "AC" using the given rules, follow these steps:

1. *Apply Rule 1*: Add "C" to the end of "AB" (since it ends in "B").
   - Result: "ABC"
2. *Apply Rule 4*: Remove the substring "CC" from "ABC".
   - Result: "AC"

Thus, the series of transformations is:

- "AB" → "ABC" (Rule 1)
- "ABC" → "AC" (Rule 4)

This sequence successfully transforms "AB" to "AC".

¹ https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-mu-puzzle/
It's nice to see a team doing something different.

The cost[1] is US$1.00 per million output tokens and US$0.25 per million input tokens. By comparison, Gemini 2.5 Flash Preview charges US$0.15 per million tokens for text input and US$0.60 per million for (non-thinking) output[2].

Hmmm... at those prices they need to focus on markets where speed is especially important, e.g. high-frequency trading, transcription/translation services and hardware/IoT alerting!

1. https://files.littlebird.com.au/Screenshot-2025-05-01-at-9.36.46-am.png

2. https://files.littlebird.com.au/pb-IQYUdv6nQo.png
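Back-of-the-envelope math for an arbitrary example request (10k input / 1k output tokens, at the prices above):

    # Per-request cost at the listed prices (input $/1M, output $/1M).
    prices = {
        "Mercury Coder Small": (0.25, 1.00),
        "Gemini 2.5 Flash (non-thinking)": (0.15, 0.60),
    }
    in_tok, out_tok = 10_000, 1_000
    for name, (p_in, p_out) in prices.items():
        print(f"{name}: ${in_tok * p_in / 1e6 + out_tok * p_out / 1e6:.4f}")
    # Mercury Coder Small: $0.0035
    # Gemini 2.5 Flash (non-thinking): $0.0021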
I just tried giving it a coding snippet that has a bug. ChatGPT & Claude found the bug instantly. Mercury fails to find it even after several reprompts (it's hallucinating). On the upside, it is significantly faster. That's promising, since the edge ChatGPT and Claude have lies in the prolonged time and energy they've spent building training infrastructure, tooling, datasets, etc. to pump out models with high task performance.
Ok. My go-to puzzle is this:

You have 2 minutes to cool down a cup of coffee to the lowest temp you can. You have two options:

1. Add cold milk immediately, then let it sit for 2 mins.
2. Let it sit for 2 mins, then add the cold milk.

Which one cools the coffee to the lowest temperature, and why?

And Mercury gets this right, while as of right now ChatGPT 4o gets it wrong.

So that's pretty impressive.
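A quick Newton's-law-of-cooling sanity check, with made-up numbers (200 ml of 90°C coffee, 50 ml of 5°C milk, 20°C room, cooling constant k = 0.005/s, mixing treated as a simple weighted average):

    import math

    k, t, T_room = 0.005, 120, 20.0   # cooling constant (1/s), seconds, ambient C
    coffee_ml, T_coffee = 200, 90.0
    milk_ml, T_milk = 50, 5.0

    def cool(T, seconds):
        # Newton's law of cooling: temperature decays exponentially toward ambient.
        return T_room + (T - T_room) * math.exp(-k * seconds)

    def mix(T):
        # Weighted-average temperature after stirring in the cold milk.
        return (coffee_ml * T + milk_ml * T_milk) / (coffee_ml + milk_ml)

    print(cool(mix(T_coffee), t))   # option 1: milk first, then wait -> ~49.1
    print(mix(cool(T_coffee, t)))   # option 2: wait, then milk       -> ~47.7

Option 2 comes out lower because the undiluted coffee stays hotter for the whole 2 minutes and so sheds heat faster.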
It's kind of weird to think that in a coding assistant, an LLM is regularly asked to produce a valid block of code top to bottom, or repeat a section of code with changes, when that's not what we do. (There are other intuitively odd things about this, like the amount of compute spent generating 'easy' tokens, e.g. repeating unchanged code.) Some of that might be that models are just weird and intuition doesn't apply. But maybe the way we do it (jumping around, correcting as we go, etc.) is legitimately an efficient use of effort, and a model could do its job better, with less effort, or both if it too used some approach other than generating the whole sequence start-to-finish.

There's already stuff in the wild moving in that direction without completely rethinking how models work. Cursor and now other tools seem to have models for 'next edit', not just 'next word typed'. Agents can edit a thing and then edit it again (in response to lints or whatever else); approaches based on tools and prompting like that can be iterated on without the level of resources needed to train a model. You could also imagine post-training a model specifically to be good at producing edit sequences, so it can actually 'hit backspace' or replace part of what it's written if it becomes clear it wasn't right, or if two parts of the output 'disagree' and need to be reconciled.

From a quick search it looks like https://arxiv.org/abs/2306.05426 in 2023 discussed backtracking LLMs, and https://arxiv.org/html/2410.02749v3 / https://github.com/upiterbarg/lintseq trained models on synthetic edit sequences. There is probably more out there with some digging. (Not the same topic, but the search also turned up https://arxiv.org/html/2504.20196 from this Monday(!) about automatic prompt improvement for an internal code-editing tool at Google.)
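To make "emit edit sequences instead of whole files" concrete, here's a hypothetical encoding (the linked papers define their own formats; this is just an illustration):

    from dataclasses import dataclass

    @dataclass
    class Edit:
        start: int   # first line to replace
        end: int     # one past the last line to replace
        text: list   # replacement lines ([] deletes the range)

    def apply_edits(lines, edits):
        # Apply back-to-front so earlier line offsets stay valid.
        for e in sorted(edits, key=lambda e: e.start, reverse=True):
            lines[e.start:e.end] = e.text
        return lines

    src = ["def add(a, b):", "    return a - b"]
    print("\n".join(apply_edits(src, [Edit(1, 2, ["    return a + b"])])))

A model post-trained to emit ops like these gets 'hit backspace' for free: a later edit can overwrite an earlier one.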
Looks interesting, and my intuition is that code is a good application for diffusion LLMs, especially if they get support for "constrained generation", as there's already plenty of tooling around code (linters and so on).

Something I don't see explored in their presentation is the ability of the model to recover from errors / correct itself. SotA LLMs shine at this; a few back-and-forths with Sonnet / Gemini Pro / etc. really solve most problems nowadays.
Anybody able to get the "View Technical Report" button at the bottom to do anything? I was curious to glean more details, but it doesn't work on either of my devices.

I'm curious what level of detail they're comfortable publishing around this, or are they going full secret mode?
There are so many models. Every single day half a dozen new models land, and even more papers.

It feels like models are becoming fungible, apart from the hyperscaler frontier models from OpenAI, Google, Anthropic, et al.

I suppose VCs won't be funding many more "labs"-type companies or "we have a model" as the core value prop companies? Unless it has a tight application loop or is truly unique?

Disregarding team composition, research background, and specific problem domain: if you were starting an AI company today, what part of the stack would you focus on? Foundation models, AI/ML infra, tooling, the application layer, ...?

Where does the value accrue? What are the most important problems to work on?
I would be interested to see how people apply this as a coding assistant. Its application to solutioning seems very strong to me, particularly vibe coding, and potentially agentic coding. One of my main gripes with LLM-assisted coding is that getting output which catches all the scenarios I envision takes multiple attempts at refining my prompt, each requiring regeneration of the output. Iterations are slow and often painful.

With the speed this can generate its solutions, you could have it loop: attempt the solution, feed itself the output (including any errors found), and go again until it builds the "correct" solution, roughly like the sketch below.
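A rough sketch of that loop; generate_code() is a placeholder for whatever model API you use, and you'd want to sandbox the execution step:

    import subprocess, sys, tempfile

    def generate_code(prompt: str) -> str:
        return "print('hello')"   # placeholder -- swap in a real model call

    def refine(prompt: str, attempts: int = 5):
        feedback = ""
        for _ in range(attempts):
            code = generate_code(prompt + feedback)
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
            result = subprocess.run([sys.executable, f.name],
                                    capture_output=True, text=True, timeout=30)
            if result.returncode == 0:
                return code   # ran cleanly; call it "correct" for the sketch
            feedback = f"\n\nPrevious attempt failed with:\n{result.stderr}"
        return None           # give up after too many round trips

The faster the model, the more of these round trips fit inside one human attention span, which is where 1000 tokens/sec could actually matter.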
This sounds like a neat idea, but it seems like bad timing. OpenAI just released token-based image generation that beats the best diffusion image generators. If diffusion isn't even the best at generating images, I don't know if I'm going to spend a lot of time evaluating it for text.

Speed is great, but it doesn't seem like other text-model trends will work out of the box, like reasoning. So you have to get dLLMs up to the quality of a regular autoregressive LLM, and then you need to innovate more to catch up to reasoning models, just to match the current state of the art. It's possible they'll get there, but I'm not optimistic.
I just tried it, and it was able to perfectly generate a piece of code I needed for a 12-month rolling graph based on a list of invoices, and it seemed a bit easier and faster than ChatGPT.
> Mercury is up to 10x faster than frontier speed-optimized LLMs. Our models run at over 1000 tokens/sec on NVIDIA H100s, a speed previously possible only using custom chips.

This means on custom chips (Cerebras, Graphcore, etc.) we might see 10k-100k tokens/sec? Amazing stuff!

Also of note, it's funny how text generation started with autoregression/tokens and diffusion seems to perform better, while image generation went the opposite way.
I'd hope that with diffusion, it would be able to go back and forth between parts of the output, fixing issues with a part it had previously generated. This would not be possible with a purely sequential model.

However,

> Prompt: Write a sentence with ten words which has exactly as many r's in the first five words as in the last five
>
> Response: Rapidly running, rats rush, racing, racing.
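The constraint is trivial to script, which makes the miss easy to see (the reply has six words, not ten):

    resp = "Rapidly running, rats rush, racing, racing."
    words = resp.replace(",", "").rstrip(".").split()
    print(len(words))                                     # 6 -- not ten words
    print(sum(w.lower().count("r") for w in words[:5]),   # r's in first five
          sum(w.lower().count("r") for w in words[-5:]))  # r's in last five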
This is awesome for the future of autocomplete. Current models aren't fast enough to give useful suggestions at the speed that I type, but this certainly is.

That said, token-based models are currently fast enough for most real-time chat applications, so I wonder what other use cases there will be where speed is greatly prioritized over smarts. Perhaps trading on Trump tweets?
Would have been nice if, along with this demo video[1] comparing the speed of 3 models, they had shared the artifacts as well, so we could compare quality.

[1] https://framerusercontent.com/assets/cWawWRJn8gJqqCGDsGb2gN0pyU.mp4
This is genius! There are tradeoffs between diffusion and autoregressive models in image generation, so why not try diffusion models in text generation? Excited to see where this ends up, and I wouldn't be surprised if we saw some of these types of models appear in future updates to popular families like Llama or Qwen.
Related paper discussing diffusion models from 2 months ago: https://arxiv.org/abs/2502.09992
It seems that with this technique you could not possibly do "chain of thought". That technique seems unique to the autoregressive architecture. Right?
1000+ tokens/sec on H100s, a 5–10x speedup over typical autoregressive models, and without needing exotic hardware like Groq or Cerebras. Impressive.
Interesting approach. However, I never thought of autoregression as being _the_ current issue with language modeling. If anything, it seems the community was generally surprised by just how far next-"token" prediction took us. Remember back when we did char-generating RNNs and were impressed they could make almost coherent sentences?

Diffusion is an alternative, but I am having a hard time understanding the whole "built-in error correction" bit; that sounds like marketing BS. Both approaches replicate probability distributions, which will be naturally error-prone because of variance.
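For anyone curious what non-autoregressive sampling looks like in the abstract, here's a toy sketch of confidence-based parallel denoising. The "error correction" framing presumably refers to positions not being frozen left-to-right; to be clear, this is an assumption-laden illustration, not Mercury's published algorithm:

    import random

    VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
    MASK = "_"

    def toy_denoiser(seq):
        # Stand-in for a trained model: propose (token, confidence)
        # for every currently masked position.
        return {i: (random.choice(VOCAB), random.random())
                for i, tok in enumerate(seq) if tok == MASK}

    def generate(length=8, steps=4):
        seq = [MASK] * length
        for step in range(1, steps + 1):
            proposals = toy_denoiser(seq)
            # Commit only the most confident proposals each step; the rest
            # stay masked and get re-predicted with more context next pass.
            # (A fuller sampler could also re-mask committed tokens it has
            # lost confidence in -- the claimed self-correction.)
            quota = length * step // steps - (length - len(proposals))
            best = sorted(proposals.items(), key=lambda kv: -kv[1][1])
            for i, (tok, _) in best[:quota]:
                seq[i] = tok
        return " ".join(seq)

    print(generate())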
The linked page only compares to very old and very small models. But the pricing is higher even than the latest Gemini Flash 2.5 model, which performs *far* better than anything they compare to.
this convo has me rethinking how much speed actually matters vs just getting stuff right - you think most problems are just about better habits or purely tooling upgrades at this point?
Super happy to see something like this getting traction. As someone who is trying to reduce their carbon footprint, I sometimes feel bad about asking any model to do something trivial. With something like this, perhaps the guilt will lessen.