#benchmarks — Hans Christian Thjømøe

May 18, 2026

Your vLLM Thinking Budget Was Doing Nothing With MTP On

vLLM 0.21.0 shipped Friday with a quiet fix: thinking_token_budget was being silently ignored when MTP speculative decoding was enabled. If you serve reasoning models with speculative decoding, you have been paying for reasoning tokens the budget was supposed to cap.

May 17, 2026

Claude Code v2.1.100+ Burns ~20K Phantom Tokens Per Request

A server-side bug in Claude Code v2.1.100+ inflates every request by roughly 20K cache_creation tokens — about 40% overhead. Pin v2.1.98 until fixed.

May 8, 2026

A 27B Model on a Single GPU Is 10 Points Off Claude Opus 4.7

Qwen3.6-27B running locally now scores within 10 points of frontier closed models on SWE-bench Verified. The benchmark table, lined up side by side.
