
Why speculative decoding speeds up LLMs

Transcript

Instead of generating one token at a time, speculative decoding lets a small draft model propose several tokens, and the large model then checks them all in one parallel pass, often cutting latency by roughly two to three times. The magic lies in memory: during decoding, modern GPUs often spend more time waiting on weight reads than on arithmetic, so verifying several candidate tokens in a single forward pass reuses those weight reads and puts the spare compute and the KV cache to work. The large model keeps the drafted tokens up to its first disagreement and supplies its own token there, so the output matches what it would have produced on its own. It can fail when the draft and target models disagree too often, which lowers the acceptance rate and turns the extra drafting work into overhead. That is why speculative decoding shines most in low-batch, memory-bound workloads with predictable text, but loses its edge under heavy concurrency, high-entropy sampling, or a poor draft match.
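
To make the draft-then-verify loop concrete, here is a minimal sketch in Python. It is not from the transcript: `draft_next` and `target_next` are hypothetical stand-ins for the small and large models, each mapping a token sequence to its next token, and the toy demo at the bottom only illustrates the accept-until-mismatch logic, not a real GPU kernel where the k verification positions share one forward pass.

```python
def speculative_decode(prompt, draft_next, target_next, k=4, max_new=32):
    """Greedy speculative decoding sketch: draft k tokens, verify with the target."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) Draft: the small model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2) Verify: the large model scores every drafted position. On real
        #    hardware these positions share a single forward pass, which is why
        #    the extra tokens are nearly free when decoding is memory-bound.
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)        # draft agreed: keep the token
            else:
                accepted.append(expected)   # first mismatch: take the target's token, stop
                break
        else:
            # All k drafts accepted: the target's pass still yields one bonus token.
            accepted.append(target_next(tokens + accepted))

        tokens.extend(accepted)
    return tokens


if __name__ == "__main__":
    # Toy demo (assumption, not from the transcript): both "models" continue a
    # counting sequence, but the draft occasionally stalls, so some drafts are
    # rejected while most loop iterations still emit several tokens at once.
    target_next = lambda seq: seq[-1] + 1
    draft_next = lambda seq: seq[-1] + 1 if seq[-1] % 7 else seq[-1]
    print(speculative_decode([0], draft_next, target_next, k=4, max_new=12))
```

The acceptance rate in the transcript corresponds to how far the inner verification loop gets before its first mismatch: when the draft tracks the target closely, each outer iteration emits several tokens for one large-model pass; when it doesn't, the loop degrades toward one token per pass plus wasted drafting work.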