Mercury Coder: Diffusion-Powered Code LLMs at Warp Speed

11 min read

A deep dive into Inception Labs’ discrete-diffusion language-model family: why it’s *really* different, how it works under the hood, and what it means now that open-source diffusion LMs are popping up everywhere.

1. TL;DR

Mercury Coder Mini (2 B) and Small (7 B) junk the left-to-right decoding loop and instead denoise many tokens in parallel. On one NVIDIA H100 they emit 1 109 tok/s and 737 tok/s respectively, around ten times quicker than GPT-4o Mini, Claude 3.5 Haiku or Codestral, while matching their pass@1 on HumanEval and MBPP.

Don’t forget to check out the Ask That Llama section below!

2. Why this paper matters

Builders bleed most on latency, not model size. By turning generation into a coarse-to-fine denoising game, Mercury shows that diffusion, long king in images, finally wins at commercial text scale. It’s the first public demo to break the 1 000 tok/s barrier for code without quality loss.

3. How diffusion text generation works (quick recap)

  1. Forward process: gradually replace clean tokens with a special “noise/mask” symbol until the sequence is fully blanked.
  2. Reverse process: at each timestep the Transformer sees the noisy sequence and a timestep embedding, then predicts the original tokens for all masked positions at once.
  3. Repeat for roughly 20-30 steps; the sequence sharpens from gibberish to polished code.

Because we batch the whole context, the GPU stays 100% busy and the autoregressive choke-point disappears. The backbone is a plain Transformer, so everything you already know (RoPE, Flash-Attention, LoRA) still plugs in.
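The three steps above can be sketched in a few lines. This is a toy illustration of the mechanics, not Mercury's actual code: the "model" here is a stand-in oracle that simply knows the clean sequence, where a real denoiser would emit vocabulary logits for every slot.

```python
import random

MASK = "<MASK>"

def corrupt(tokens, survive_prob, rng):
    """Forward process: independently blank each token with prob 1 - alpha_t."""
    return [tok if rng.random() < survive_prob else MASK for tok in tokens]

def denoise_step(noisy, predict):
    """Reverse process: fill every masked position in one parallel pass."""
    guesses = predict(noisy)          # one forward pass over the whole sequence
    return [g if tok == MASK else tok for tok, g in zip(noisy, guesses)]

rng = random.Random(0)
clean = ["def", "add", "(", "a", ",", "b", ")", ":"]

# Corrupt heavily (alpha_t = 0.2, so roughly 80% of tokens get masked) ...
noisy = corrupt(clean, survive_prob=0.2, rng=rng)

# ... then recover with the stand-in oracle predictor.
oracle = lambda seq: clean
restored = denoise_step(noisy, oracle)
assert restored == clean
```

The key point is in `denoise_step`: all masked positions are resolved by a single call, which is where the parallelism comes from.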

3 b. Under the hood – training & inference cookbook

| Piece | Mercury’s recipe |
|---|---|
| Objective | Discrete-diffusion loss (replaces next-token CE) with noise-level weighting |
| Architecture | Vanilla Transformer + time embeddings + adaptive LayerNorm |
| Context window | Native 32 k; interpolation stretches to 128 k |
| Sampling schedule | 12, 20 or 30 denoise passes picked on the fly to juggle load vs quality |
| Serving stack | Fused CUDA kernels, dynamic batching, KV-cache paging; delivers the quoted 1 109 tok/s wall-clock |
| API surface | `/chat/completions` clone; just change the base URL |
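Because the API surface mirrors `/chat/completions`, calling it looks like any OpenAI-compatible client. The base URL and model name below are placeholders I made up for illustration; check the provider's docs for the real values.

```python
import json

# Hypothetical endpoint and model id -- NOT the real values.
BASE_URL = "https://api.example-inference.com/v1"
payload = {
    "model": "mercury-coder-small",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
}

# The request body is exactly what an OpenAI-style server expects;
# only the base URL changes.  (Actually sending it needs an HTTP
# client plus an API key, omitted here.)
body = json.dumps(payload)
print(body[:60])
```

Swapping providers then amounts to pointing an existing OpenAI client at a different base URL, which is the whole appeal of the clone-the-API approach.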

4. Key contributions

  • First 7 B-scale diffusion LLM with public benchmarks
  • Hard wall-clock win: > 1 K tok/s, not just “fewer steps on paper”
  • External validation: #1 speed, #2 quality on Copilot Arena’s live leaderboard

5. Results snapshot (speed ↔ quality)

| Model (~7 B) | HumanEval | MBPP | Speed (tok/s) |
|---|---|---|---|
| Mercury Mini | 88.0% | 77.1% | 1 109 |
| GPT-4o Mini | 88.0% | 74.6% | 59 |
| Claude 3.5 Haiku | 86.0% | 78.0% | 61 |
| Codestral 2501 | 85.0% | 72.2% | 171 |

Benchmarked with 1 000 → 1 000 I/O tokens on a single H100.

6. Pros & cons

| 👍 What shines | 🤔 What to watch |
|---|---|
| Ten-fold throughput cuts serving cost; perfect for IDE autocomplete & agent loops | ~20+ denoise steps hurt CPU/edge deployments; GPUs only for now |
| Transformer-compatible → painless LoRA, RLHF, retrieval tricks | Training recipe opaque; weights closed (for now) |
| Third-party latency & quality audits | Broader reasoning still trails GPT-4-class giants |

7. Why the 🧪 experimental tag?

  • Brand-new algorithm class: best practices still forming
  • API in flux: sampling presets, prices and endpoints may shift
  • Closed weights: you’re tied to their cloud for now
  • Narrow eval: coding focus; safety & general-chat alignment WIP

Use it for prototypes, but keep a fallback AR model in production.

You can skip this section and scroll down if you’re not into nerd math.

8. Under-the-hood deep dive (math and intuition; I recommend reading the paper alongside this section)

8.1 The objects we play with

| Symbol | What it is | Plain-English meaning |
|---|---|---|
| $x_0$ | $(x_0^{(1)},\dots,x_0^{(L)})$ | the clean ground-truth sequence of length $L$ |
| $\langle\text{MASK}\rangle$ | special token | plays the role of “noise” for text |
| $z_t$ | noisy sequence at step $t$ | mixture of clean tokens and masks |
| $T$ | total steps (12, 20 or 30) | how many refinement passes we will do |
| $\beta_t$ | scalar in $(0,1)$ | probability of losing a token at step $t$ |
| $\alpha_t$ | $\prod_{s=1}^t (1-\beta_s)$ | probability a token survives up to step $t$ |
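The last row of the table, $\alpha_t = \prod_{s=1}^t (1-\beta_s)$, is easy to compute directly. The linear $\beta$ ramp below is a made-up illustrative schedule, not the one Mercury uses:

```python
# Survival probability alpha_t = prod_{s<=t} (1 - beta_s).
# The beta schedule here is an illustrative linear ramp.
T = 20
betas = [0.01 + 0.04 * t / T for t in range(1, T + 1)]

alphas = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b          # each step multiplies in one more survival factor
    alphas.append(prod)

# alpha_t shrinks monotonically: tokens become ever more likely to be masked.
assert all(a2 < a1 for a1, a2 in zip(alphas, alphas[1:]))
assert 0.0 < alphas[-1] < alphas[0] < 1.0
```

Whatever schedule you pick, $\alpha_t$ is a product of factors below one, so it can only decrease as $t$ grows.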

8.2 Forward process $q$: how we add noise

For each position ii we either keep the original token or replace it with a mask.

$$
q\bigl(z_t^{(i)} \mid x_0^{(i)}\bigr)
=
\begin{cases}
x_0^{(i)} & \text{with prob. } \alpha_t, \\[6pt]
\langle\text{MASK}\rangle & \text{with prob. } 1-\alpha_t
\end{cases}
$$

What this says

  • After $t$ ticks of the corruption clock, each token is independently blanked out with probability $1-\alpha_t$.
  • The bigger the step index, the more blank tokens you expect to see.
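You can verify the "independently blanked with probability $1-\alpha_t$" claim empirically. A sketch with an arbitrary $\alpha_t = 0.7$, chosen just for illustration:

```python
import random

rng = random.Random(42)
alpha_t = 0.7        # survival probability at some step t (illustrative value)
L = 10_000           # long sequence so the empirical rate is stable

# Apply the forward rule to every position independently.
z_t = ["tok" if rng.random() < alpha_t else "<MASK>" for _ in range(L)]

mask_rate = z_t.count("<MASK>") / L
# The empirical masking rate should sit near 1 - alpha_t = 0.3.
assert abs(mask_rate - (1 - alpha_t)) < 0.03
```

Each position is its own coin flip, which is why the forward process needs no neural network at all: corruption is pure sampling.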

8.3 Reverse model $p_\theta$: how we remove noise

A Transformer $f_\theta$ receives the current noisy sequence $z_t$ plus a learned embedding of the step index $t$.
It outputs a full-vocabulary logit vector for every position, turning them into a categorical distribution

$$
p_\theta(x_0 \mid z_t, t)
=
\operatorname{Cat}\!\Bigl(
x_0;\,
\operatorname{softmax}\!\bigl(f_\theta(z_t,\,e_t)\bigr)
\Bigr).
$$

What this means

  • Given the partially masked sentence and “how fuzzy” it currently is (the timestep), the network predicts what the original clean token was at every index.
  • All positions are predicted in one shot, not left-to-right.
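Mechanically, "one shot" means the model emits an $L \times V$ grid of logits and every row becomes its own categorical distribution. A sketch with random logits standing in for $f_\theta(z_t, e_t)$:

```python
import math
import random

rng = random.Random(0)
L, V = 5, 8   # toy sequence length and vocabulary size

# Stand-in for f_theta(z_t, e_t): one logit vector per position,
# produced in a single parallel pass (no left-to-right dependency).
logits = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(L)]

def softmax(row):
    m = max(row)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

probs = [softmax(row) for row in logits]           # Cat(x0; softmax(...)) per slot
preds = [max(range(V), key=lambda v: p[v]) for p in probs]

# Every position gets a full distribution and a prediction simultaneously.
assert len(preds) == L
assert all(abs(sum(p) - 1.0) < 1e-9 for p in probs)
```

Contrast with autoregression, where only the single next position gets a distribution per forward pass.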

8.4 Training loss: make the predictions match the truth

$$
\mathcal{L}(\theta)
=
\mathbb{E}_{x_0,t}
\Bigl[
-\gamma(t)\,
\log p_\theta\!\bigl(x_0 \mid z_t, t\bigr)
\Bigr],
\qquad
\gamma(t)\;\propto\;\beta_t\bigl(1-\alpha_t\bigr).
$$

Line-by-line explanation

  1. Draw a clean sentence $x_0$ from the data set.
  2. Pick a timestep $t$ uniformly (or with a schedule).
  3. Corrupt $x_0$ into $z_t$ using the forward rule.
  4. Ask the model to reconstruct $x_0$ from $z_t$.
  5. Penalise the negative log-probability of every correct token, but scale it by $\gamma(t)$.

Why the weight $\gamma(t)$?

  • Very small $t$: the task is almost trivial (hardly any masks) so we down-weight it.
  • Very large $t$: the task is hopeless (all masks) so we also down-weight it.
  • Middle $t$: the model learns the most, so we give these steps the highest weight.
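Computing $\gamma(t) \propto \beta_t(1-\alpha_t)$ is a one-liner once you have the schedules. The linear $\beta$ ramp below is again an assumed illustrative schedule; the exact shape of the weight curve depends on the schedule you choose, but the earliest steps always get tiny weight because almost nothing is masked there ($1-\alpha_t \approx 0$):

```python
# gamma(t) is proportional to beta_t * (1 - alpha_t).
# Illustrative linear beta ramp, not Mercury's actual schedule.
T = 20
betas = [0.01 + 0.04 * t / T for t in range(1, T + 1)]

alphas, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alphas.append(prod)

gammas = [b * (1 - a) for b, a in zip(betas, alphas)]

# Early steps are heavily down-weighted: hardly any tokens are masked,
# so reconstructing them teaches the model almost nothing.
assert gammas[0] < gammas[T // 2]
```

In practice the normalisation constant behind the "$\propto$" drops out of gradient-based training, so the unnormalised weights above are all you need.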

8.5 Sampling: turning pure noise into fluent text

  1. Initial state:
    $z_T = (\langle\text{MASK}\rangle,\dots,\langle\text{MASK}\rangle)$.
  2. Loop for $t = T,\dots,1$:
    1. Run $f_\theta(z_t, e_t)$ to get logits for every slot.
    2. Pick the arg-max (or sample) at each masked position.
    3. Drop those predictions into the sequence to form $z_{t-1}$.
  3. Return the fully denoised sequence $z_0$.

At every pass we update all positions, so the cost is proportional to $T$ (≈20) instead of the length $L$ (hundreds or thousands).

8.6 Why fewer passes can still beat left-to-right

  • Autoregressive decoding does one token per forward pass → $L$ passes.
  • Diffusion does $T$ passes on the whole sequence → roughly 20 passes.
  • Each pass is heavier, but GPUs love large matrix multiplies more than many small ones, so utilisation jumps from ~40 % to ~90 %.

Net outcome on an H100: more than ten-fold increase in tokens per second.
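The back-of-envelope arithmetic behind that claim is worth writing out. The "4× heavier per pass" figure below is an assumption for illustration, not a measured number:

```python
# Pass counts for generating a 1 000-token completion.
L = 1_000        # autoregressive: one forward pass per token
T = 20           # diffusion: one pass per denoise step, whole sequence each time

pass_ratio = L / T           # diffusion needs 50x fewer passes

# Even if each diffusion pass costs, say, 4x an AR pass (bigger matmuls,
# every position live at once), the wall-clock advantage survives.
heavier = 4.0
effective_speedup = pass_ratio / heavier

assert pass_ratio == 50
assert effective_speedup > 10    # consistent with the >10x H100 figure
```

The speedup scales with output length: the longer the completion, the larger $L/T$ grows while $T$ stays fixed.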

8.7 Self-conditioning trick

After each step we cache the logits and feed them (concatenated) back into the next step.
Think of it as giving the model a sketch of its previous guess so it can refine instead of restarting.
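In tensor terms, self-conditioning just widens each position's input by the vocabulary size. A toy sketch (the embedding function and sizes are stand-ins, not Mercury's internals):

```python
import random

rng = random.Random(1)
L, V, d = 4, 6, 8   # toy sizes: sequence length, vocab size, hidden width

def embed(z_t):
    """Stand-in token embedding: one d-dim vector per position."""
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in z_t]

z_t = ["<MASK>"] * L
prev_logits = [[0.0] * V for _ in range(L)]   # zeros on the very first step

# Self-conditioning: concatenate last step's logits onto each position's
# embedding, so the model refines its previous guess instead of restarting.
conditioned = [e + p for e, p in zip(embed(z_t), prev_logits)]

assert len(conditioned) == L
assert all(len(row) == d + V for row in conditioned)
```

On subsequent steps `prev_logits` holds the cached outputs of the previous pass, giving the network a sketch of its own last guess for free.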

8.8 Autoregression as a special case

If you let $\beta_t \to 0$ (almost no masking) and set $T = L$ (one step per position), the process reduces to standard left-to-right language modelling:
mask a single future slot, predict it, move on.
Classical autoregression is just the infinite-step, zero-noise corner of this broader diffusion family.
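The reduction can be made concrete with a degenerate schedule. The stand-in "prediction" below just copies the ground truth; the point is the control flow, which becomes exactly one resolved position per pass:

```python
# Degenerate schedule: T = L steps, exactly one position resolved per pass.
# The parallel denoiser then behaves like ordinary left-to-right decoding.
clean = ["print", "(", "'hi'", ")"]
L = len(clean)

seq = ["<MASK>"] * L
passes = 0
for i in range(L):          # step t resolves position i only
    seq[i] = clean[i]       # stand-in for "predict this single slot"
    passes += 1

assert passes == L          # one forward pass per token: autoregressive cost
assert seq == clean
```

So the diffusion framing strictly generalises autoregression: shrink the per-step mask budget to one token and you recover the classical loop, cost and all.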

9. Competitor spotlight – Mercury vs Gemini Diffusion

| Feature | Mercury Mini | Gemini Diffusion |
|---|---|---|
| Speed (H100 / TPU-v5e) | 1 109 tok/s | 1 479 tok/s (lab) |
| Focus | Code & agents | General + code |
| API | Public, OpenAI-compatible | Wait-list |
| Sizes | 2 B / 7 B | ≈7–10 B (est.) |
| Open weights | ✗ | ✗ |

Gemini is faster on paper, but Mercury is the diffusion LM you can call and fine-tune today.

10. Open-source diffusion LMs you can self-host

| Model | Size | What you get | Licence |
|---|---|---|---|
| LLaDA-8B | 8 B | Base & Instruct checkpoints | MIT |
| DiffuLLaMA-7B | 7 B | Continual-PT LLaMA-2 + LoRA | Apache-2.0 |
| BD3-LM | 1.3–6.7 B | Variable block sizes | Apache-2.0 |
| DiffuGPT / DiffuLLaMA-LoRA | 125 M–7 B | Retrofit adapters | Apache-2.0 |

Speeds hover in the 100–300 tok/s band on an A100: great for experimentation, slower than Mercury.

11. So why get excited about Mercury if OSS options already exist?

  1. Order-of-magnitude speed jump: 1 K+ tok/s dwarfs today’s OSS diffusion LMs
  2. Serious system engineering: kernels, KV-paging, auto-step scheduling turn theory into wall-clock wins
  3. Third-party validation: Copilot Arena & Artificial Analysis rank Mercury #1 in latency
  4. Vertical focus: trained for code, supports fill-in-the-middle, already ships IDE plug-ins
  5. Bridge from lab to prod: SLAs, on-prem, familiar API while diffusion tooling matures

12. Why this belongs on your watch-list

Diffusion LMs just crossed from neat research to real-world latency killers. Mercury proves parallel denoising can outrun every mainstream AR trick, and Gemini’s numbers show Big Tech smells the same opportunity. Whether you’re building an IDE copilot, chain-of-thought agent or multimodal RAG stack, watching Mercury (and the OSS projects chasing it) could hand you a ten-fold latency dividend the moment open weights or bigger checkpoints drop.

Ping me if you benchmark Mercury or any OSS diffusion LMs. I’d love to swap notes and plug the fastest one into my ML pipelines!



Ask That Llama!

Try these prompts on your favorite LLM to explore more about this topic by yourself:

🤔 1. Does diffusion in text even make sense?

Language is intrinsically sequential, so aren’t autoregressors the “natural” fit? What hidden advantages (or blind spots) does a parallel denoiser reveal?

🧠 2. Chain-of-thought impact

When a model rewrites every token in bulk, does its reasoning trace become sharper, blurrier, or just different? Explain which internal signals you’d probe to judge “quality of thought.”

⚠️ 3. Three big drawbacks

Identify the top-three pain points that still keep diffusion LMs off the main production path, and suggest one experiment to tackle each.

4. Beyond autoregression toward AGI

If AGI demands more than next-token guessing and diffusion feels closer to an energy-based view, how might text denoisers sidestep the classic pitfalls of energy models (mode dropping, slow sampling) while scaling toward general intelligence?