Your 12GB MTP Throughput Just Jumped 23%

Qwen3.6-35B-A3B is the current workhorse for mid-tier GPUs, but squeezing it into a 12GB VRAM card while keeping MTP (multi-token prediction) active usually means accepting heavy CPU offloading or constant OOM crashes. Upstream llama.cpp handles the memory layout, but its throughput leaves performance on the table.

I ran the mtp-bench.py workload against the same IQ4_XS 4.19 bpw GGUF on an RTX 4070 Super (12GB) + Ryzen 7 9700X. The results show a consistent gap between the official build and the community-maintained ik_llama.cpp fork.

Metric	Upstream llama.cpp	ik_llama.cpp	Delta
Avg Throughput	89.76 tok/s	110.24 tok/s	+23%
MTP Accept Rate	0.9393	0.8749	-6.4 pts (-6.9% relative)
KV Cache Type	q8_0	q8_0	same

The fork trades a lower draft acceptance rate (0.87 vs 0.94) for a meaningful jump in raw tokens per second. It appears to draft less conservatively, and on this workload the cost of the extra rejections is smaller than the throughput upstream gives up by playing it safe. For agentic loops or long-horizon coding tasks, that extra 20 tok/s compounds quickly.
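The deltas in the table are easy to recheck by hand. The sketch below only reuses the four numbers reported above; the helper name `relative_delta` is mine, not something from mtp-bench.py.

```python
def relative_delta(baseline, candidate):
    """Relative change of candidate versus baseline, as a fraction."""
    return (candidate - baseline) / baseline

# Numbers copied from the benchmark table above.
upstream_tps, fork_tps = 89.76, 110.24
upstream_accept, fork_accept = 0.9393, 0.8749

throughput_gain = relative_delta(upstream_tps, fork_tps)
accept_drop_points = fork_accept - upstream_accept          # absolute difference
accept_drop_relative = relative_delta(upstream_accept, fork_accept)

print(f"throughput: {throughput_gain:+.1%}")
print(f"accept rate: {accept_drop_points * 100:+.1f} pts, {accept_drop_relative:+.1%} relative")
```

The acceptance rate falls by 6.4 percentage points, which is a slightly larger drop when expressed relative to the upstream rate.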

How `--fit-margin` Prevents OOM

Running without OOM requires the --fit flag paired with --fit-margin. The margin tells the runtime how much VRAM to reserve as a buffer before falling back to CPU offloading; the values used here are on a MiB scale, not bytes. For a 12GB card running the 35B A3B variant, --fit-margin 1664 is the sweet spot. If you hit OOM during long prompts or heavy context loading, bump it to 1792 or 2048.
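To get a feel for what each margin step costs, here is a rough budget sketch. It assumes the margin is in MiB and that the runtime simply subtracts it from free VRAM; that is my simplified model, not the fork's actual logic, and the names `usable_vram_mib` and `next_margin` are hypothetical.

```python
# Margins suggested in the text, smallest first (assumed to be MiB).
MARGIN_LADDER_MIB = [1664, 1792, 2048]

def usable_vram_mib(free_mib, margin_mib):
    """VRAM left for weights and KV cache after reserving the margin."""
    return max(free_mib - margin_mib, 0)

def next_margin(current_mib):
    """Next larger margin to try after an OOM, or None if the ladder is exhausted."""
    for margin in MARGIN_LADDER_MIB:
        if margin > current_mib:
            return margin
    return None

# Example: a 12 GiB card (12288 MiB) with about 126 MiB already in use.
free = 12288 - 126
for margin in MARGIN_LADDER_MIB:
    print(margin, "->", usable_vram_mib(free, margin), "MiB usable")
```

Each step up the ladder hands a little more of the model to the CPU, so start at the smallest margin that survives your longest prompt.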

The exact invocation looks like this:

./ik_llama.cpp -m qwen3.6-35b-a3b-iq4_xs.gguf \
  --ctx-size 131072 --cache-type-k q8_0 \
  --speculative qwen3.6-35b-a3b-draft.gguf \
  --spec-draft-max 3 --spec-draft-p-min 0.75 \
  --fit --fit-margin 1664 -t 8

The --spec-draft-p-min 0.75 filter keeps low-probability drafts from clogging the decode loop. It pairs well with the fork's aggressive scheduler.
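As an illustration of what a probability floor does to a draft, here is a small sketch. It assumes the common speculative-decoding behavior where drafting stops at the first token whose probability falls below the threshold; I have not checked that the fork implements it exactly this way, and `truncate_draft` is a hypothetical name.

```python
def truncate_draft(token_probs, p_min=0.75, draft_max=3):
    """Keep drafted tokens until one falls below p_min or draft_max is reached."""
    kept = []
    for prob in token_probs[:draft_max]:
        if prob < p_min:
            break  # a weak guess ends the draft; later tokens depend on it
        kept.append(prob)
    return kept

# Per-token draft probabilities for two hypothetical decode steps.
print(truncate_draft([0.90, 0.80, 0.60, 0.95]))  # stops at the 0.60 token
print(truncate_draft([0.90, 0.90, 0.90, 0.90]))  # capped by draft_max
```

Shorter drafts mean fewer tokens for the main model to verify and throw away, which is the point of pairing the floor with a small --spec-draft-max.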

Reclaiming VRAM From the Desktop Compositor

If the model still sits right at the edge of a 12GB workstation card, you can reclaim nearly a gigabyte of VRAM by forcing your desktop compositor to render on the CPU. On KDE Wayland, create a custom SDDM session with these environment variables:

export LIBGL_ALWAYS_SOFTWARE=1
export GALLIUM_DRIVER=llvmpipe
export KWIN_COMPOSE=Q

Idle VRAM drops from over 1024 MB to roughly 126 MB. You lose smooth window animations and hardware-accelerated effects, but the compositor renders on the CPU instead of taking VRAM from your inference runtime. For a terminal-only or distraction-free setup, the tradeoff is worth it.
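To check the saving on your own machine, compare idle VRAM before and after switching sessions. This sketch assumes an NVIDIA card with nvidia-smi on the PATH; the function names are mine. Note that nvidia-smi reports MiB.

```python
import subprocess

def parse_used_mib(smi_output):
    """Parse one integer per GPU from nvidia-smi's headerless, unitless CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]

def query_used_mib():
    """Ask nvidia-smi for the memory currently in use on each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_used_mib(result.stdout)

def reclaimed_mib(before_mib, after_mib):
    """VRAM freed by the compositor change."""
    return before_mib - after_mib

# With the idle figures quoted above: at least this many MiB come back.
print(reclaimed_mib(1024, 126))
```

Call `query_used_mib()` once in the normal session and once in the software-rendered one, with nothing else running, and feed both readings to `reclaimed_mib`.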

The Tradeoff

The catch is that ik_llama.cpp is a fork. You are not running the latest upstream merges. The maintainer patches selectively, so you need to watch the commit log for breaking changes or upstream MTP adjustments. I would not bet on the gap holding as upstream optimizes its scheduler, but right now the 23% delta is real and reproducible.

Qwen3.6-35B-A3B MTP on 12GB is no longer a memory experiment. With the right fork, explicit VRAM margins, and a stripped-down compositor, it averages about 110 tok/s. Personally, I will stay on upstream and wait for its scheduler patches to land, but if you need the throughput today, the fork delivers.
