
Mercury: Commercial-scale diffusion language model

385 points, by HyprMusic, 17 days ago

32 comments

inerte, 17 days ago

Not sure I would trade off accuracy for speed.

Yes, it's incredibly boring to wait for the AI agents in IDEs to finish their job. I get distracted and open YouTube. Once I gave a prompt so big and complex to Cline that it spent 2 straight hours writing code.

But after those 2 hours I spent 16 more tweaking and fixing all the stuff that wasn't working. I now realize I should have done things incrementally, even when I have a pretty good idea of the final picture.

I've been more and more using only the "thinking" models: o3 in ChatGPT, and Gemini / Claude in IDEs. They're slower, but they usually get it right.

But at the same time I am open to the idea that speed can unlock new ways of using the tooling. It would still be awesome to basically just have a conversation with my IDE while I am manually testing the app. Or to combine really fast models like this one with a "thinking" background one that runs for seconds/minutes but tries to catch the bugs left behind.

I guess only giving it a try will tell.
dmos62, 17 days ago

If the benchmarks aren't lying, Mercury Coder Small is as smart as 4o-mini and costs the same, but is an order of magnitude faster when outputting (unclear if the pre-output delay is notably different). Pretty cool. However, I'm under the impression that 4o-mini was superseded by 4.1-mini and 4.1-nano for all use cases (correct me if I'm wrong). Unfortunately they didn't publish comparisons with the 4.1 line, which feels like an attempt to manipulate the optics. Or am I misreading this?

Btw, why call it "coder"? A 4o-mini level of intelligence is for extracting structured data and basic summaries, definitely not for coding.
g-mork, 17 days ago

There are some open-weight attempts at this around too: https://old.reddit.com/r/LocalLLaMA/search?q=diffusion&restrict_sr=on

Saw another on Twitter in the past few days that looked like a better contender to Mercury; it doesn't look like it got posted to LocalLLaMA, and I can't find it now. Very exciting stuff.
m-hodges, 17 days ago

It fails the MU Puzzle¹ by violating the rules:

To transform the string "AB" to "AC" using the given rules, follow these steps:

1. *Apply Rule 1*: Add "C" to the end of "AB" (since it ends in "B"). Result: "ABC"

2. *Apply Rule 4*: Remove the substring "CC" from "ABC". Result: "AC"

Thus, the series of transformations is: "AB" → "ABC" (Rule 1); "ABC" → "AC" (Rule 4).

This sequence successfully transforms "AB" to "AC".

¹ https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-mu-puzzle/
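The violation is easy to check mechanically: Rule 4 only fires when "CC" actually occurs, and "ABC" contains no "CC". A minimal sketch of the check, assuming the two rules exactly as quoted above (the full rule set is in the linked post):

```python
# A minimal check of the two quoted rules; rule numbering follows the
# comment above, and the full puzzle is in the linked post.

def apply_rule_1(s: str) -> str | None:
    """If the string ends in 'B', a 'C' may be appended."""
    return s + "C" if s.endswith("B") else None

def apply_rule_4(s: str) -> str | None:
    """'CC' may be deleted -- but only if 'CC' actually occurs."""
    return s.replace("CC", "", 1) if "CC" in s else None

step1 = apply_rule_1("AB")   # 'ABC' -- legal
step2 = apply_rule_4(step1)  # None  -- 'ABC' contains no 'CC'
print(step1, step2)          # ABC None: the model's second step is invalid
```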
schappim, 17 days ago

It's nice to see a team doing something different.

The cost[1] is US$1.00 per million output tokens and US$0.25 per million input tokens. By comparison, Gemini 2.5 Flash Preview charges US$0.15 per million tokens for text input and $0.60 (non-thinking) per million output tokens[2].

Hmmm... at those prices they need to focus on markets where speed is especially important, e.g. high-frequency trading, transcription/translation services, and hardware/IoT alerting!

1. https://files.littlebird.com.au/Screenshot-2025-05-01-at-9.36.46-am.png

2. https://files.littlebird.com.au/pb-IQYUdv6nQo.png
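For a rough sense of what those prices mean, here is a back-of-the-envelope comparison; the token volumes below are made up purely for illustration:

```python
# Cost comparison using the per-million-token prices quoted above (USD);
# the workload numbers are hypothetical.

PRICES = {
    "Mercury":          {"input": 0.25, "output": 1.00},
    "Gemini 2.5 Flash": {"input": 0.15, "output": 0.60},  # non-thinking output
}

input_tokens, output_tokens = 5_000_000, 1_000_000  # hypothetical monthly usage

for model, p in PRICES.items():
    cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    print(f"{model}: ${cost:.2f}")
# Mercury: $2.25, Gemini 2.5 Flash: $1.35 -- Gemini is cheaper at equal volume,
# so Mercury's pitch has to rest on speed.
```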
vlovich123, 17 days ago

I just tried giving it a coding snippet that has a bug. ChatGPT & Claude found the bug instantly; Mercury fails to find it even after several reprompts (it's hallucinating). On the upside, it is significantly faster. That's promising, since the edge for ChatGPT and Claude is the prolonged time and energy they've spent building training infrastructure, tooling, datasets, etc. to pump out models with high task performance.
jonplackett, 17 days ago

OK. My go-to puzzle is this:

You have 2 minutes to cool down a cup of coffee to the lowest temp you can. You have two options:

1. Add cold milk immediately, then let it sit for 2 mins.

2. Let it sit for 2 mins, then add the cold milk.

Which one cools the coffee to the lowest temperature, and why?

Mercury gets this right, while as of right now ChatGPT-4o gets it wrong. So that's pretty impressive.
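The answer drops out of Newton's law of cooling: heat loss is proportional to the gap between coffee and room temperature, so the coffee sheds more heat while it is still hot. A rough sanity check with made-up temperatures, masses, and cooling constant:

```python
# Sanity check of the puzzle under Newton's law of cooling; all constants
# here are illustrative, not measured values.
import math

T_ENV, K, MINUTES = 20.0, 0.3, 2.0        # ambient temp (C), cooling constant, wait
COFFEE, MILK = (90.0, 0.25), (5.0, 0.05)  # (temperature C, mass kg)

def cool(temp, minutes):
    """Closed-form Newton cooling: T(t) = T_env + (T0 - T_env) * exp(-k t)."""
    return T_ENV + (temp - T_ENV) * math.exp(-K * minutes)

def mix(a, b):
    """Mass-weighted mixing temperature of two liquids."""
    return ((a[0] * a[1] + b[0] * b[1]) / (a[1] + b[1]), a[1] + b[1])

milk_first = cool(mix(COFFEE, MILK)[0], MINUTES)
milk_last  = mix((cool(COFFEE[0], MINUTES), COFFEE[1]), MILK)[0]
print(f"milk first: {milk_first:.1f}C, milk last: {milk_last:.1f}C")
# ~50.6C vs ~49.5C: adding milk last wins, because the undiluted coffee
# cools faster while the gap to room temperature is large.
```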
twotwotwo, 17 days ago

It's kind of weird to think that in a coding assistant, an LLM is regularly asked to produce a valid block of code top to bottom, or repeat a section of code with changes, when that's not what we do. (There are other intuitively odd things about this, like the amount of compute spent generating 'easy' tokens, e.g. repeating unchanged code.) Some of that might be that models are just weird and intuition doesn't apply. But maybe the way we do it--jumping around, correcting as we go, etc.--is legitimately an efficient use of effort, and a model could do its job better, with less effort, or both, if it too used some approach other than generating the whole sequence start-to-finish.

There's already stuff in the wild moving in that direction without completely rethinking how models work. Cursor and now other tools seem to have models for 'next edit', not just 'next word typed'. Agents can edit a thing and then edit it again (in response to lints or whatever else); approaches based on tools and prompting like that can be iterated on without the level of resources needed to train a model. You could also imagine post-training a model specifically to be good at producing edit sequences, so it can actually 'hit backspace' or replace part of what it's written if it becomes clear it wasn't right, or if two parts of the output 'disagree' and need to be reconciled.

From a quick search, it looks like https://arxiv.org/abs/2306.05426 in 2023 discussed backtracking LLMs, and https://arxiv.org/html/2410.02749v3 / https://github.com/upiterbarg/lintseq trained models on synthetic edit sequences. There is probably more out there with some digging. (Not the same topic, but the search also turned up https://arxiv.org/html/2504.20196 from this Monday(!) about automatic prompt improvement for an internal code-editing tool at Google.)
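As a toy illustration of what edit-sequence output could look like (this is not the lintseq format; the Edit type and operations are invented for the sketch):

```python
# Toy sketch: the model emits a sequence of edits rather than a whole file,
# so it can effectively 'hit backspace' on an earlier line.
from dataclasses import dataclass

@dataclass
class Edit:
    op: str         # "insert", "replace", or "delete"
    line: int       # 0-based line index
    text: str = ""  # payload for insert/replace

def apply_edits(source: list[str], edits: list[Edit]) -> list[str]:
    for e in edits:
        if e.op == "insert":
            source.insert(e.line, e.text)
        elif e.op == "replace":
            source[e.line] = e.text  # redo a line that turned out wrong
        elif e.op == "delete":
            del source[e.line]
    return source

draft = ["def add(a, b):", "    return a - b"]
fixed = apply_edits(draft, [Edit("replace", 1, "    return a + b")])
print("\n".join(fixed))
```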
NitpickLawyer, 17 days ago

Looks interesting, and my intuition is that code is a good application for diffusion LLMs, especially if they get support for "constrained generation", as there's already plenty of tooling around code (linters and so on).

Something I don't see explored in their presentation is the model's ability to recover from errors / correct itself. SotA LLMs shine at this; a few back-and-forths with Sonnet / Gemini Pro / etc. really solve most problems nowadays.
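One simple way existing tooling could gate a fast model's output is rejection sampling against a linter: resample until the candidate lints clean. A minimal sketch, where `generate` is a hypothetical stand-in for a model call and the pyflakes wiring is an assumption, not Mercury's API:

```python
# Lint-gated generation sketch; `generate` is a hypothetical model call.
import subprocess
import tempfile

def generate(prompt: str) -> str:
    raise NotImplementedError  # call your fast model of choice here

def lints_clean(code: str) -> bool:
    """Run pyflakes over the candidate; exit code 0 means no findings."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    return subprocess.run(["pyflakes", f.name], capture_output=True).returncode == 0

def constrained_generate(prompt: str, attempts: int = 5) -> str | None:
    # At 1000+ tokens/sec, resampling until the linter passes is cheap.
    for _ in range(attempts):
        candidate = generate(prompt)
        if lints_clean(candidate):
            return candidate
    return None
```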
freeqaz, 17 days ago

Is anybody able to get the "View Technical Report" button at the bottom to do anything? I was curious to glean more details, but it doesn't work on either of my devices.

I'm curious what level of detail they're comfortable publishing around this, or are they going full secret mode?
echelon, 17 days ago

There are so many models. Every single day half a dozen new models land, and even more papers.

It feels like models are becoming fungible, apart from the hyperscaler frontier models from OpenAI, Google, Anthropic, et al.

I suppose VCs won't be funding many more "labs"-type companies or "we have a model"-as-the-core-value-prop companies? Unless it has a tight application loop or is truly unique?

Disregarding team composition, research background, and specific problem domain: if you were starting an AI company today, what part of the stack would you focus on? Foundation models, AI/ML infra, tooling, the application layer, ...?

Where does the value accrue? What are the most important problems to work on?
jtonz, 17 days ago

I would be interested to see how people apply this as a coding assistant. To me, its application in solutioning seems very strong, particularly for vibe coding and potentially agentic coding. One of my main gripes with LLM-assisted coding is that getting output which catches all the scenarios I envision takes multiple attempts at refining my prompt, each requiring regeneration of the output. Iterations are slow and often painful.

With the speed this can generate its solutions, you could have it loop: attempt the solution, feed itself the output (including any errors found), and go again until it builds the "correct" solution, as sketched below.
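That loop is straightforward to sketch. Everything below uses the standard library except `generate`, a hypothetical stand-in for a fast diffusion-model API call:

```python
# Iterative repair loop: run the candidate, feed errors back, regenerate.
import subprocess
import sys
import tempfile

def generate(prompt: str) -> str:
    raise NotImplementedError  # hypothetical fast model call

def run(code: str) -> tuple[bool, str]:
    """Execute the candidate and capture any error output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run([sys.executable, f.name], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

def solve(task: str, max_iters: int = 10) -> str | None:
    prompt = task
    for _ in range(max_iters):  # fast generation makes many laps cheap
        code = generate(prompt)
        ok, errors = run(code)
        if ok:
            return code
        prompt = f"{task}\n\nYour last attempt failed with:\n{errors}\nFix it."
    return None
```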
parsimo2010, 17 days ago

This sounds like a neat idea, but it seems like bad timing. OpenAI just released token-based image generation that beats the best diffusion image generators. If diffusion isn't even the best at generating images, I don't know if I'm going to spend a lot of time evaluating it for text.

Speed is great, but it doesn't seem like other text-model trends will work out of the box, like reasoning. So you have to get dLLMs up to the quality of a regular autoregressive LLM, and then you need to innovate more to catch up to reasoning models, just to match the current state of the art. It's possible they'll get there, but I'm not optimistic.
jakeinsdca, 17 days ago

I just tried it, and it was able to perfectly generate a piece of code I needed for a 12-month rolling graph based on a list of invoices. It seemed a bit easier and faster than ChatGPT.
moralestapia, 17 days ago

> Mercury is up to 10x faster than frontier speed-optimized LLMs. Our models run at over 1000 tokens/sec on NVIDIA H100s, a speed previously possible only using custom chips.

Does this mean that on custom chips (Cerebras, Graphcore, etc.) we might see 10k-100k tokens/sec? Amazing stuff!

Also of note: funny how text generation started with autoregression/tokens and diffusion seems to perform better, while image generation went the opposite way.
agnishom, 17 days ago

I'd hope that with diffusion, the model would be able to go back and forth between parts of the output to fix issues with a part it had previously generated. This would not be possible with a purely sequential model.

However:

> Prompt: Write a sentence with ten words which has exactly as many r's in the first five words as in the last five

> Response: Rapidly running, rats rush, racing, racing.
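The quoted response is checkable in a few lines; it misses the word-count constraint before the r-counts even come into play:

```python
# Verifying the quoted response against the prompt's constraints.
response = "Rapidly running, rats rush, racing, racing."
words = response.split()

def count_r(ws: list[str]) -> int:
    return sum(w.lower().count("r") for w in ws)

print(len(words))                              # 6 -- not the required 10 words
print(count_r(words[:5]), count_r(words[5:]))  # 5 vs 1 -- counts don't match either
```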
pants2, 17 days ago

This is awesome for the future of autocomplete. Current models aren't fast enough to give useful suggestions at the speed that I type, but this certainly is.

That said, token-based models are currently fast enough for most real-time chat applications, so I wonder what other use cases there will be where speed is greatly prioritized over smarts. Perhaps trading on Trump tweets?
tzury, 17 days ago

It would have been nice if, alongside this demo video[1] comparing the speed of three models, they had shared the artifacts as well, so we could compare quality.

[1] https://framerusercontent.com/assets/cWawWRJn8gJqqCGDsGb2gN0pyU.mp4
kittikitti, 17 days ago

This is genius! There are tradeoffs between diffusion and autoregressive models in image generation, so why not use diffusion models in text generation? Excited to see where this ends up, and I wouldn't be surprised if we saw these types of models appear in future updates to popular families like Llama or Qwen.
StriverGuy, 17 days ago

Related paper discussing diffusion language models, from 2 months ago: https://arxiv.org/abs/2502.09992
mlsu, 17 days ago

It seems that with this technique you could not possibly do "chain of thought". That technique seems unique to the autoregressive architecture. Right?
badmonster, 17 days ago

1000+ tokens/sec on H100s, a 5-10x speedup over typical autoregressive models, and without needing exotic hardware like Groq or Cerebras. Impressive.
carterschonwald, 17 days ago

I actually just tried it, and I'm very impressed. Or at least it produces reasonable code to start with for nontrivial systems.
byearthithatius, 17 days ago

Interesting approach. However, I never thought of autoregression as being _the_ current issue with language modeling. If anything, it seems the community was generally surprised by just how far next-"token" prediction took us. Remember back when we did char-generating RNNs and were impressed they could make almost-coherent sentences?

Diffusion is an alternative, but I'm having a hard time understanding the whole "built-in error correction" claim, which sounds like marketing BS. Both approaches replicate probability distributions, which will be naturally error-prone because of variance.
strangescript, 17 days ago

Speed is great, but you have to set the bar a little higher than last year's tiny models.
ZeroTalent, 17 days ago

Look into groq.com, guys. Some good models at similar speed to Inception Labs.
jph00, 17 days ago

The linked page only compares to very old and very small models. But the pricing is higher even than the latest Gemini 2.5 Flash model, which performs *far* better than anything they compare to.
good-luck86523, 17 days ago

Everyone will just switch to LibreOffice and Hetzner. High-tech US service-industry exports are cooked.
stats111, 17 days ago

Can't use the Mercury name, sir. It's a bank!
gitroom, 17 days ago

This convo has me rethinking how much speed actually matters vs. just getting stuff right. Do you think most problems at this point come down to better habits or purely to tooling upgrades?
mackepacke, 17 days ago
Nice
marcyb5st, 17 days ago

Super happy to see something like this getting traction. As someone trying to reduce my carbon footprint, I sometimes feel bad about asking any model to do something trivial. With something like this, perhaps the guilt will lessen.