Speculative Decoding Explained: Why Your Local Model Got 2× Faster in 2026

The fastest free performance upgrade in local AI in 2026 was not a new model. It was speculative decoding shipping as a default feature in the runners people actually use — llama.cpp, MLX, vLLM, TGI. The same Qwen3.6-27B running on the same GPU got ~2× faster between January and April. No retraining, no quantization changes, no new hardware.

If you have not turned it on, you are leaving half your tokens-per-second on the table. Here is what is actually happening underneath.

The Problem It Solves

Local inference on consumer hardware is memory-bandwidth-bound. The GPU is not waiting for math — it is waiting for weights to arrive from VRAM. For a 27B model in 4-bit quantization, the GPU has to pull roughly 14 GB through memory bandwidth per generated token. On a 4090 with about 1 TB/s of memory bandwidth, that caps you at roughly 70 tokens/sec regardless of how fast the GPU’s compute units are.

The compute units, meanwhile, sit mostly idle. They could process many more tokens per second if the data were sitting in their caches. The bottleneck is the trip from VRAM to the compute units, not the work the compute units do once data arrives.

Speculative decoding exploits this gap. The idea: if the GPU is going to read all the weights anyway, give it more work to do per memory access.

How It Works

The pattern has been around since 2023 (Leviathan et al. at Google), but it had three flavors that mattered in different stages:

Draft model + target model. The original version. A small “draft” model (say, 1B parameters) generates a guess of the next N tokens. The large “target” model verifies all N guesses in a single forward pass. If the draft was right about the first 4 of 7 guesses, you got 4 tokens for the price of 1 forward pass. If it was wrong on the first guess, you got 1 token for the price of 1 — same as without speculation.

Medusa heads. Instead of running a separate small model, attach extra prediction “heads” to the target model itself. Each head predicts a token at offset +1, +2, +3 from the current position. Same verify-in-parallel mechanic. Cheaper to deploy because there is no second model to load.

EAGLE-2. A more refined version of the Medusa idea — the extra heads use the target model’s hidden states more cleverly to produce better drafts. By 2026, the standard.

MTP (Multi-Token Prediction). The version Qwen3.6 was specifically trained for. The model itself predicts multiple future tokens in one pass, with no separate draft model and no glued-on heads. This is the cleanest version because the training and inference both expect it.

Why It Actually Works

The non-obvious win is that verifying N tokens is almost as cheap as generating one. When the GPU does a forward pass, the cost is dominated by reading the weights through memory bandwidth. The math of processing one position vs N adjacent positions is essentially free in comparison.

So the optimization is: generate cheaply (draft model, small heads, or trained-in MTP), verify expensively but in parallel. As long as the draft is right more than ~30% of the time, you come out ahead.

On Qwen3.6 with MTP, draft acceptance hits 60-70% in normal coding workloads. You get an average of 2-3 tokens per forward pass instead of 1. That is your 2× — sometimes 3×.

What Speedup You Actually See

Rough numbers from May 2026, single 4090, Qwen3.6-27B, 4-bit quantization:

Setting	Tokens/sec	Notes
No speculation	68	Baseline
MTP, default settings	138	The free upgrade
MTP, tuned (—n-predict 6)	152	Marginal returns past 4
EAGLE-2 on a non-MTP model	110	Worth it but smaller win

The win is largest on the workloads with the most predictability — code completion, long structured outputs, repetitive formatting. It is smallest on free-form creative writing or anything where every token is genuinely unpredictable.

Where It Does Not Help

Three cases:

Small models (under 4B). Memory bandwidth is not your bottleneck on a 2B model — the weights are small enough that the GPU is compute-bound or at least balanced. Speculative decoding overhead can actually slow you down.

Large batch sizes. If you are serving 32 simultaneous requests on a server, each forward pass is already amortizing the weight read across many sequences. Adding speculation per-sequence does not help and the bookkeeping costs you something.

Anything below the verification compute budget. If you are running on a Raspberry Pi or a CPU-only setup where forward passes are slow because of compute, not memory, speculation makes things worse.

For local single-user inference on a consumer GPU running a 20-30B model — which is the modal use case in 2026 — speculative decoding is essentially always a win.

How to Turn It On

llama.cpp. As of the early 2026 rename, the flag is --mtp for models that support it natively (Qwen3.6 family), or --draft-model plus a smaller draft model for everything else. The default -ngl (n_gpu_layers) and -c (context size) settings still apply.

MLX (Apple Silicon). Speculative decoding ships in mlx-lm with the --draft-model flag. Native MTP support for Qwen3.6 landed in March 2026.

vLLM. --speculative-model and --num-speculative-tokens flags. vLLM is more focused on batched serving so the gains are smaller, but they exist for low-batch workloads.

Ollama. Hidden behind environment variables in May 2026 — OLLAMA_SPECULATIVE_DECODING=1. Will likely become default by late 2026.

Most tooling that uses these runners as a backend (Continue, Open WebUI, the OpenAI-compatible API consumers) inherits the speedup automatically once the runner has it on.

Why This Was Not a Default Until 2026

Three reasons it took until 2026 to ship as default:

Implementation complexity. The bookkeeping around accepting partial drafts, rolling back KV cache state, and handling the various failure modes is genuinely fiddly. The reference implementations took years to stabilize.
Training requirement (for the best version). MTP needs the model to be trained with the multi-token objective from the start. The first wave of MTP-native models is the Qwen3.6 family, late 2025 / early 2026.
The benchmarks were not pushing it. Throughput benchmarks measured tokens-per-second at batch size 32 on H100s, where speculation barely helps. The single-user-local-inference case where speculation wins big was not the case anyone with a tooling-development budget was optimizing for.

Three things changed at roughly the same time: the runners shipped solid speculation implementations, MTP-trained models became available, and the community of people doing serious work on local inference got large enough to surface the gap.

What This Means Going Forward

Speculative decoding is now the floor, not the ceiling. The interesting performance research in 2026 is past speculation — kv-cache compression, attention-pattern optimization, hardware-aware quantization. Those are the next 2× behind speculative decoding, and they are more architecturally invasive.

But for “what should I do today to make my local model faster” — speculation is the answer, and turning it on takes one flag. If you have not, do.

Your Local Qwen3.6 Throughput Probably Just Halved — the llama.cpp flag rename watch-out
Running Qwen3.6-27B Locally — hardware, quantization, runners
The Local AI Inflection Point: May 2026 — the wider story