1. TL;DR
Mercury Coder Mini (2 B) and Small (7 B) junk the left-to-right decoding loop and instead denoise many tokens in parallel. On one NVIDIA H100 they spit out 1 109 tok/s and 737 tok/s respectively, roughly ten times faster than GPT-4o Mini, Claude 3.5 Haiku or Codestral, while matching their pass@1 on HumanEval and MBPP.
Don’t forget to check out the Ask That Llama section below!
2. Why this paper matters
Builders bleed most on latency, not model size. By turning generation into a coarse-to-fine denoising game, Mercury shows that diffusion, long the king of image generation, finally wins at commercial text scale. It’s the first public demo to break the 1 000 tok/s barrier for code without a quality loss.
3. How diffusion text generation works (quick recap)
- Forward process: gradually replace clean tokens with a special “noise/mask” symbol until the sequence is fully blanked.
- Reverse process: at each timestep the Transformer sees the noisy sequence and a timestep embedding, then predicts the original tokens for all masked positions at once.
- Repeat for roughly 20–30 steps; the sequence sharpens from gibberish to polished code.
Because we batch over the whole context, the GPU stays close to 100 % busy and the autoregressive choke-point disappears. The backbone is a plain Transformer, so everything you already know (RoPE, Flash-Attention, LoRA) still plugs in.
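To make the control flow concrete, here is a toy sketch of a parallel denoising loop. This is not Mercury's actual sampler: the "model" is a random stand-in, and the unmask-a-fraction-per-step schedule is my own simplifying assumption; it only illustrates how every pass touches all positions at once.

```python
import numpy as np

MASK = -1          # sentinel id standing in for the [MASK] token
VOCAB = 16         # toy vocabulary size
LENGTH = 12        # sequence length
STEPS = 4          # denoise passes (Mercury uses roughly 12-30)

rng = np.random.default_rng(0)

def toy_model(x_t, t):
    """Stand-in for the Transformer: returns logits for every position.

    A real model would condition on the noisy sequence and a timestep
    embedding; here we emit random logits just to show the control flow.
    """
    return rng.normal(size=(LENGTH, VOCAB))

def denoise(steps=STEPS):
    x = np.full(LENGTH, MASK)                 # start fully masked
    for t in range(steps, 0, -1):
        logits = toy_model(x, t)              # ONE forward pass, all positions
        preds = logits.argmax(axis=-1)        # greedy pick per slot
        masked = x == MASK
        # unmask a fraction of the remaining masks each pass (coarse-to-fine)
        k = int(np.ceil(masked.sum() / t))
        idx = np.flatnonzero(masked)[:k]
        x[idx] = preds[idx]
    return x

out = denoise()
print(out)   # no MASK sentinels remain after the final pass
```

Note how the loop runs a constant number of passes regardless of sequence length: that is the whole speed argument.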
3 b. Under the hood – training & inference cookbook
| Piece | Mercury’s recipe |
|---|---|
| Objective | Discrete-diffusion loss (replaces next-token CE) with noise-level weighting |
| Architecture | Vanilla Transformer + time embeddings + adaptive LayerNorm |
| Context window | Native 32 k; interpolation stretches to 128 k |
| Sampling schedule | 12, 20 or 30 denoise passes picked on the fly to juggle load vs quality |
| Serving stack | Fused CUDA kernels, dynamic batching, KV-cache paging; delivers the quoted 1 109 tok/s wall-clock |
| API surface | /chat/completions clone; just change the base URL |
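Since the surface is a /chat/completions clone, calling it should look like any OpenAI-style request. The base URL and model name below are placeholders I made up, not values from the paper; check Mercury's docs for the real ones. The sketch only builds and prints the payload rather than sending it.

```python
import json

# Placeholder values -- consult Mercury's documentation for the real ones.
BASE_URL = "https://your-mercury-endpoint/v1"   # assumption, not verified
payload = {
    "model": "mercury-coder-mini",              # assumption, not verified
    "messages": [
        {"role": "user",
         "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
}

# With an OpenAI-compatible server, sending it would be e.g.:
#   requests.post(f"{BASE_URL}/chat/completions", json=payload,
#                 headers={"Authorization": "Bearer <API_KEY>"})
print(json.dumps(payload, indent=2))
```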
4. Key contributions
- First 7 B-scale diffusion LLM with public benchmarks
- Hard wall-clock win: > 1 K tok/s, not just “fewer steps on paper”
- External validation: #1 speed, #2 quality on Copilot Arena’s live leaderboard
5. Results snapshot (speed ↔ quality)
| Model (~7 B) | HumanEval | MBPP | Speed (tok/s) |
|---|---|---|---|
| Mercury Mini | 88.0% | 77.1% | 1 109 |
| GPT-4o Mini | 88.0% | 74.6% | 59 |
| Claude 3.5 Haiku | 86.0% | 78.0% | 61 |
| Codestral 2501 | 85.0% | 72.2% | 171 |
Benchmarked with 1 000 input / 1 000 output tokens on a single H100.
6. Pros & cons
| 👍 What shines | 🤔 What to watch |
|---|---|
| Ten-fold throughput cuts serving cost; perfect for IDE autocomplete & agent loops | ~20+ denoise steps hurt CPU/edge deployments; GPUs only for now |
| Transformer-compatible → painless LoRA, RLHF, retrieval tricks | Training recipe opaque; weights closed (for now) |
| Third-party latency & quality audits | Broader reasoning still trails GPT-4-class giants |
7. Why the 🧪 experimental tag?
- Brand-new algorithm class: best practices still forming
- API in flux: sampling presets, prices and endpoints may shift
- Closed weights: you’re tied to their cloud for now
- Narrow eval: coding focus; safety & general-chat alignment WIP
Use it for prototypes, but keep a fallback AR model in production.
8. Under-the-hood deep dive (math and intuition; I recommend reading the paper alongside this section)
You can skip this section and scroll down if you’re not into nerd math.
8.1 The objects we play with
| Symbol | What it is | Plain-English meaning |
|---|---|---|
| $x_0$ | the clean ground-truth sequence of length $L$ | the finished code or text we want back |
| $[\text{MASK}]$ | special token | plays the role of “noise” for text |
| $x_t$ | noisy sequence at step $t$ | mixture of clean tokens and masks |
| $T$ | total steps (12, 20 or 30) | how many refinement passes we will do |
| $\beta_t$ | scalar in $(0,1)$ | probability of losing a token at step $t$ |
| $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$ | cumulative keep-rate | probability a token survives up to step $t$ |
8.2 Forward process: how we add noise
For each position $i$ we either keep the original token or replace it with a mask:

$$q(x_t \mid x_0) = \prod_{i=1}^{L} q(x_t^i \mid x_0^i), \qquad q(x_t^i \mid x_0^i) = \begin{cases} \bar\alpha_t & \text{if } x_t^i = x_0^i \\ 1 - \bar\alpha_t & \text{if } x_t^i = [\text{MASK}] \end{cases}$$

What this says
- After $t$ ticks of the corruption clock, each token is independently blanked out with probability $1-\bar\alpha_t$.
- The bigger the step index $t$, the more blank tokens you expect to see.
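The forward rule fits in a few lines of numpy. This is a toy sketch under my own assumption of a constant per-step mask rate; real schedules vary the rate, but the keep-or-blank logic is the same.

```python
import numpy as np

MASK = -1
rng = np.random.default_rng(7)

def alpha_bar(t, beta=0.15):
    """Survival probability after t corruption steps (constant mask rate beta)."""
    return (1.0 - beta) ** t

def corrupt(x0, t):
    """Forward process: independently blank each token with prob 1 - alpha_bar(t)."""
    keep = rng.random(len(x0)) < alpha_bar(t)
    return np.where(keep, x0, MASK)

x0 = np.arange(10)
xt = corrupt(x0, t=5)
print(xt)   # each entry is either the original token or the -1 mask
```

Surviving tokens are untouched; there is no "partially noisy" token, which is what makes the discrete case simpler than image diffusion.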
8.3 Reverse model: how we remove noise
A Transformer $f_\theta$ receives the current noisy sequence $x_t$ plus a learned embedding of the step index $t$.
It outputs a full-vocabulary logit vector for every position, turning each one into a categorical distribution:

$$p_\theta(x_0^i \mid x_t) = \mathrm{softmax}\big(f_\theta(x_t, t)\big)_i$$

What this means
- Given the partially masked sentence and “how fuzzy” it currently is (the timestep), the network predicts what the original clean token was at every index.
- All positions are predicted in one shot, not left-to-right.
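The "one shot" part is just a softmax applied row-wise over per-position logits. A minimal sketch (the logits here are random placeholders for a real model's output):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

L, V = 6, 10                                   # toy sequence length and vocab size
logits = np.random.default_rng(1).normal(size=(L, V))

probs = softmax(logits)        # one categorical distribution per position
guess = probs.argmax(axis=-1)  # all L positions decoded in a single shot
print(guess.shape)             # (6,)
```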
8.4 Training loss: make the predictions match the truth

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\,t,\,x_t}\Big[\, w(t) \sum_{i\,:\,x_t^i = [\text{MASK}]} -\log p_\theta(x_0^i \mid x_t) \Big]$$

Line-by-line explanation
- Draw a clean sentence $x_0$ from the data set.
- Pick a timestep $t$ uniformly (or with a schedule).
- Corrupt $x_0$ into $x_t$ using the forward rule.
- Ask the model to reconstruct $x_0$ from $x_t$.
- Penalise the negative log-probability of every correct token, but scale it by $w(t)$.

Why the weight $w(t)$?
- Very small $t$: the task is almost trivial (hardly any masks) so we down-weight it.
- Very large $t$: the task is hopeless (all masks) so we also down-weight it.
- Middle $t$: the model learns the most, so we give these steps the highest weight.
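The loss above is just a masked, weighted cross-entropy. A toy numpy sketch (corruption rate, weight value and the random "model output" are all placeholders of mine, not Mercury's recipe):

```python
import numpy as np

MASK = -1
rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def diffusion_loss(x0, xt, logits, w_t):
    """Weighted masked cross-entropy: only blanked positions contribute."""
    probs = softmax(logits)
    masked = np.flatnonzero(xt == MASK)
    nll = -np.log(probs[masked, x0[masked]] + 1e-12)  # NLL of the true tokens
    return w_t * nll.sum()

L, V = 8, 20
x0 = rng.integers(0, V, size=L)                 # clean sequence
xt = np.where(rng.random(L) < 0.5, MASK, x0)    # toy corruption, ~50% masked
logits = rng.normal(size=(L, V))                # stand-in for model output
print(diffusion_loss(x0, xt, logits, w_t=1.0))
```

Note that positions the forward process left clean contribute nothing; the model is only graded on what it had to reconstruct.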
8.5 Sampling: turning pure noise into fluent text
- Initial state: $x_T = ([\text{MASK}], \dots, [\text{MASK}])$.
- Loop for $t = T, T-1, \dots, 1$:
  - Run $f_\theta(x_t, t)$ to get logits for every slot.
  - Pick the arg-max (or sample) at each masked position.
  - Drop those predictions into the sequence to form $x_{t-1}$.
- Return the fully denoised sequence $x_0$.
At every pass we update all positions, so the cost is proportional to the step count $T$ (≈ 20) instead of the length $L$ (hundreds or thousands).
8.6 Why fewer passes can still beat left-to-right
- Autoregressive decoding does one token per forward pass → $L$ passes for a length-$L$ output.
- Diffusion does $T$ passes over the whole sequence → roughly 20 passes.
- Each pass is heavier, but GPUs love large matrix multiplies more than many small ones, so utilisation jumps from ~40 % to ~90 %.
Net outcome on an H100: a more than ten-fold increase in tokens per second.
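The pass-counting argument fits in a few lines of arithmetic. The relative-cost factor below is an illustrative assumption of mine, not a measured number:

```python
# Back-of-the-envelope pass counting (illustrative numbers, not measurements).
L = 1000      # output tokens to generate
T = 20        # denoise passes

ar_passes = L        # autoregressive: one forward pass per token
diff_passes = T      # diffusion: T passes, each over all L positions

# Suppose one diffusion pass costs c times one AR pass; the speedup is L / (c * T).
c = 2.5              # assumed relative cost (big matmuls keep the GPU fed)
speedup = ar_passes / (c * diff_passes)
print(speedup)       # 20.0
```

Even if each diffusion pass were several times heavier than an AR step, the 50x reduction in pass count leaves a healthy wall-clock margin.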
8.7 Self-conditioning trick
After each step we cache the logits and feed them (concatenated) back into the next step.
Think of it as giving the model a sketch of its previous guess so it can refine instead of restarting.
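A minimal sketch of the cache-and-re-feed pattern, with a random projection standing in for the real network (the concatenation layout and dimensions are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
L, V, D = 6, 10, 16    # toy length, vocab size, feature dim

def toy_model(x_emb, prev_logits):
    """Stand-in network: input is the token features concatenated with the
    previous step's logits (the 'sketch' of the last guess)."""
    inp = np.concatenate([x_emb, prev_logits], axis=-1)   # (L, D + V)
    W = rng.normal(size=(D + V, V)) * 0.1                 # random projection
    return inp @ W                                        # new logits (L, V)

x_emb = rng.normal(size=(L, D))
logits = np.zeros((L, V))               # first pass: no previous guess yet
for t in range(3, 0, -1):
    logits = toy_model(x_emb, logits)   # cache and re-feed: self-conditioning
print(logits.shape)                     # (6, 10)
```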
8.8 Autoregression as a special case
If you let $\beta_t \to 0$ (almost no masking) and set $T = L$ (one step per position), the process reduces to standard left-to-right language modelling:
mask a single future slot, predict it, move on.
Classical autoregression is just the infinite-step, zero-noise corner of this broader diffusion family.
9. Competitor spotlight – Mercury vs Gemini Diffusion
| Feature | Mercury Mini | Gemini Diffusion |
|---|---|---|
| Speed (H100 / TPU-v5e) | 1 109 tok/s | 1 479 tok/s (lab) |
| Focus | Code & agents | General + code |
| API | Public, OpenAI-compatible | Wait-list |
| Sizes | 2 B / 7 B | ≈7–10 B (est.) |
| Open weights | ✘ | ✘ |
Gemini is faster on paper, but Mercury is the diffusion LM you can call and fine-tune today.
10. Open-source diffusion LMs you can self-host
| Model | Size | What you get | Licence |
|---|---|---|---|
| LLaDA-8B | 8 B | Base & Instruct checkpoints | MIT |
| DiffuLLaMA-7B | 7 B | Continual-PT LLaMA-2 + LoRA | Apache-2.0 |
| BD3-LM | 1.3–6.7 B | Variable block sizes | Apache-2.0 |
| DiffuGPT / DiffuLLaMA-LoRA | 125 M–7 B | Retrofit adapters | Apache-2.0 |
Speeds hover in the 100–300 tok/s band on an A100: great for experimentation, but slower than Mercury.
11. So why get excited about Mercury if OSS options already exist?
- Order-of-magnitude speed jump: 1 K+ tok/s dwarfs today’s OSS diffusion LMs
- Serious systems engineering: kernels, KV-paging and auto-step scheduling turn theory into wall-clock wins
- Third-party validation: Copilot Arena & Artificial Analysis rank Mercury #1 in latency
- Vertical focus: trained for code, supports fill-in-the-middle, already ships IDE plug-ins
- Bridge from lab to prod: SLAs, on-prem options and a familiar API while diffusion tooling matures
12. Why this belongs on your watch-list
Diffusion LMs just crossed from neat research to real-world latency killers. Mercury proves parallel denoising can outrun every mainstream AR trick, and Gemini’s numbers show Big Tech smells the same opportunity. Whether you’re building an IDE copilot, chain-of-thought agent or multimodal RAG stack, watching Mercury (and the OSS projects chasing it) could hand you a ten-fold latency dividend the moment open weights or bigger checkpoints drop.
Ping me if you benchmark Mercury or any OSS diffusion LMs. I’d love to swap notes and plug the fastest one into my ML pipelines!
Ask That Llama!
Try these prompts on your favorite LLM to explore more about this topic by yourself:
🤔 1. Does diffusion in text even make sense?
🧠 2. Chain-of-thought impact
⚠️ 3. Three big drawbacks
4. Beyond autoregression toward AGI