One llama.cpp Flag Turns MTP From Dead Weight to 68% Faster

Multi-Token Prediction (MTP) landed in llama.cpp a few weeks ago and initially looked disappointing on Qwen3.6-27B. Early tests showed it losing to plain DFlash on long outputs — roughly 2× slower on workloads above 1000 tokens. Most people set it aside.

One flag changed that verdict: --spec-draft-p-min.

Ian Paterson ran a systematic sweep on a 3090 Ti (24 GiB VRAM) with Qwen3.6-27B Q4, testing plain autoregressive, DFlash, and MTP across four output lengths. The difference between MTP's first pass and second pass is almost entirely explained by adding --spec-draft-p-min 0.75 alongside n_max=6.

What the flag does

--spec-draft-p-min sets a minimum acceptance probability threshold for draft tokens. Without it, MTP speculatively generates and then rejects tokens at a rate that collapses throughput on longer contexts. With a threshold of 0.75, low-confidence drafts are skipped before they waste cycles. The flag landed in vanilla llama.cpp via PR #22397, merged April 28.
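To make the threshold concrete, here is a toy Python sketch of a draft loop with a confidence cutoff. It is not llama.cpp's implementation. The function name `draft_length` and the probability values are made up for illustration, and it assumes drafting stops at the first token whose confidence falls below the threshold.

```python
def draft_length(draft_probs, n_max, p_min=0.0):
    """Count how many draft tokens are proposed before the cutoff.

    draft_probs: the draft head's confidence for each successive token
                 (hypothetical values, for illustration only).
    n_max:       hard cap on draft tokens per step.
    p_min:       stop drafting at the first token below this confidence.
    """
    count = 0
    for p in draft_probs[:n_max]:
        if p < p_min:
            break  # low-confidence token: stop drafting here
        count += 1
    return count


# Hypothetical confidences for six consecutive draft tokens.
probs = [0.97, 0.91, 0.62, 0.40, 0.35, 0.20]

print(draft_length(probs, n_max=6))              # no threshold: all 6 are drafted
print(draft_length(probs, n_max=6, p_min=0.75))  # threshold: stops after 2
```

In this toy case the four low-confidence drafts are never generated, which is the work the flag is meant to save.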

The numbers

Output tokens	Autoregressive	DFlash	MTP (`n_max=6 --spec-draft-p-min 0.75`)
100	28.9 tok/s	46.9 tok/s	44.6 tok/s
500	29.1 tok/s	37.0 tok/s	39.1 tok/s
1000	29.1 tok/s	30.2 tok/s	44.4 tok/s
2000	29.0 tok/s	30.1 tok/s	48.9 tok/s

At 2000 output tokens, MTP finishes in 91 seconds of wall-clock time, against 112 seconds for DFlash and 107 seconds for plain autoregressive. The gap only appears above roughly 900 output tokens; below that, autoregressive's faster prefill still wins on wall clock.

VRAM sits at 20.4–20.9 GiB depending on load state. A 24 GiB card fits with headroom.

The catch

MTP costs you on short outputs. At 100 tokens, DFlash at 46.9 tok/s is the fastest option, and MTP at 44.6 tok/s is marginally behind. If most of your requests are short (chat completions, quick summaries), MTP will not help and may slightly hurt time to first token (TTFT).

The crossover point is around 900 tokens of output. Agentic workflows that generate long reasoning chains or code files cross that threshold routinely. Short interactive chat does not.
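If you can estimate output length per request, the crossover can be written as a simple routing rule. This is a minimal sketch: `pick_mode` and `CROSSOVER_TOKENS` are hypothetical names, and the 900-token value is the approximate figure from this benchmark, not a constant that will hold for every model or card.

```python
CROSSOVER_TOKENS = 900  # approximate crossover from this benchmark; tune for your workload


def pick_mode(expected_output_tokens, crossover=CROSSOVER_TOKENS):
    """Return 'mtp' for long generations and 'dflash' for short ones."""
    return "mtp" if expected_output_tokens > crossover else "dflash"


print(pick_mode(200))   # short chat reply
print(pick_mode(2000))  # long code file or reasoning chain
```

How you act on the result, for example which server instance receives the request, depends on your own setup.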

How to replicate

Assuming you have llama.cpp built with CUDA and a GGUF of Qwen3.6-27B Q4:

./llama-server \
  -m qwen3.6-27b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --speculative-ngram \
  --draft-max 6 \
  --spec-draft-p-min 0.75

The --spec-draft-p-min flag requires a build from late April or newer (post PR #22397). Earlier builds will error on that flag. Run ./llama-server --help | grep spec-draft to confirm your build has it.
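The same check can be scripted, for instance before a benchmark run starts. The sketch below assumes the binary lives at `./llama-server`. The helper names are illustrative, and it only looks for the flag name in the help text, nothing more.

```python
import subprocess


def help_mentions_flag(help_text, flag="--spec-draft-p-min"):
    """Return True if the flag name appears anywhere in the --help output."""
    return flag in help_text


def build_supports_flag(binary="./llama-server", flag="--spec-draft-p-min"):
    """Run `<binary> --help` and look for the flag in its output."""
    result = subprocess.run([binary, "--help"], capture_output=True, text=True)
    # Check both streams in case the help text is printed to stderr.
    return help_mentions_flag(result.stdout + result.stderr, flag)
```

Calling `build_supports_flag()` returns `False` on a build that predates the flag, and raises `FileNotFoundError` if the binary path is wrong.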

Paterson also notes that --reasoning-budget 256 is worth stacking on top: it caps runaway reasoning chains and saves roughly 10 seconds per request on Qwen3.6-27B without measurable quality regression.

When to reach for this

If you are running Qwen3.6-27B (or a similarly sized model) on a 24 GiB card and your primary use case produces outputs longer than ~900 tokens (code generation, long-form drafting, agentic scratchpads), the flag is worth testing. The throughput improvement at 2000 tokens is 68% over plain autoregressive and 62% over DFlash.
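Those percentages can be checked against the table's decode rates. The Python sketch below copies the table into a dictionary and computes MTP's relative gain at each output length. `gain_pct` is an illustrative helper name, not something from the benchmark.

```python
# Decode rates in tok/s, copied from the table.
RATES = {
    100:  {"autoregressive": 28.9, "dflash": 46.9, "mtp": 44.6},
    500:  {"autoregressive": 29.1, "dflash": 37.0, "mtp": 39.1},
    1000: {"autoregressive": 29.1, "dflash": 30.2, "mtp": 44.4},
    2000: {"autoregressive": 29.0, "dflash": 30.1, "mtp": 48.9},
}


def gain_pct(new, old):
    """Percentage throughput gain of `new` over `old`."""
    return (new / old - 1.0) * 100.0


for n_tokens, r in RATES.items():
    vs_ar = gain_pct(r["mtp"], r["autoregressive"])
    vs_df = gain_pct(r["mtp"], r["dflash"])
    print(f"{n_tokens:>5} tokens: MTP {vs_ar:+.1f}% vs autoregressive, {vs_df:+.1f}% vs DFlash")
```

The 2000-token row gives gains between 68% and 69% over autoregressive and between 62% and 63% over DFlash, in line with the figures quoted. The 100-token row comes out negative against DFlash, which is the short-output penalty described earlier.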

If you are doing interactive chat with typical 100–300 token responses, DFlash remains the better pick.
