Speculative decoding is an advanced technique in natural language processing that significantly enhances the inference speed of large language models (LLMs). This document provides a comprehensive analysis of speculative decoding, including its underlying mechanisms, mathematical foundations, and performance optimization strategies.
Speculative decoding operates on a "Draft-then-Verify" paradigm, consisting of three key steps:
- Drafting: A smaller, faster model generates N speculative tokens ahead in the sequence.
- Verification: The main, more accurate LLM verifies these N speculative tokens in parallel.
- Acceptance and Rejection: Accepted tokens are appended to the final output, while rejected tokens lead to restarting the drafting process from the rejection point.
This approach transforms sequential token generation into a more parallelized operation, leveraging parallel verification to achieve significant speed improvements.
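The three steps above can be sketched in a few lines. Here `draft_next` and `verify_parallel` are hypothetical callables standing in for the draft and main models; this greedy-matching sketch illustrates the control flow only, not the full rejection-sampling rule discussed later:

```python
# Minimal sketch of one draft-then-verify cycle.
# draft_next(ctx) -> next token from the draft model (hypothetical).
# verify_parallel(prefix, draft) -> the main model's own token choice at
# each drafted position, computed in a single parallel pass (hypothetical).
def speculative_step(draft_next, verify_parallel, prefix, n_draft):
    # Drafting: the small model proposes n_draft tokens one by one.
    ctx = list(prefix)
    for _ in range(n_draft):
        ctx.append(draft_next(ctx))
    draft = ctx[len(prefix):]

    # Verification: the main model checks all drafted positions at once.
    target = verify_parallel(prefix, draft)

    # Acceptance and rejection: keep drafted tokens up to the first
    # disagreement, then substitute the main model's token and stop;
    # drafting restarts from this rejection point on the next cycle.
    out = list(prefix)
    for d, t in zip(draft, target):
        if d == t:
            out.append(d)
        else:
            out.append(t)
            break
    return out
```

A real implementation would also reuse the KV caches of both models across cycles rather than rescoring the prefix each time.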
Key parameters:
- N: Number of tokens speculated in each cycle
- P: Probability of a speculated token being accepted
Speculative decoding uses rejection sampling to maintain output quality while achieving a speedup. A drafted token j at position i is accepted with probability:

    min(1, q_i(j) / p_i(j))

where q_i(j) is the main model's probability and p_i(j) is the draft model's probability for token j at position i. On rejection, a replacement token is sampled from the normalized residual distribution max(0, q_i - p_i). This ensures the final output follows the main model's distribution exactly.
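The acceptance rule for a single drafted token can be sketched as follows, assuming `q` and `p` are token-to-probability mappings for the main and draft models at the current position (a simplified illustration, not a production sampler):

```python
import random

def accept_or_resample(token, q, p, rng=random.random):
    """Apply the rejection-sampling rule to one drafted token."""
    # Accept the drafted token with probability min(1, q(j)/p(j)).
    if rng() < min(1.0, q[token] / p[token]):
        return token, True
    # On rejection, resample from the residual distribution
    # max(0, q(j) - p(j)), renormalized. This correction is what makes
    # the overall output follow the main model's distribution q exactly.
    residual = {j: max(0.0, q[j] - p[j]) for j in q}
    total = sum(residual.values())
    r, acc = rng() * total, 0.0
    for j, w in residual.items():
        acc += w
        if r < acc:
            return j, False
    return j, False  # floating-point edge case: return the last token
```

When the draft and main distributions agree exactly, every token is accepted; the more the draft model diverges from the main model, the more often the residual resampling path is taken.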
We model the speculation process as follows: let E_k be the expected number of additional tokens that can be successfully speculated after k tokens have already been accepted, given the overall per-token acceptance probability P.
Base Case:

    E_N = 0    (no more tokens can be speculated beyond N)
Recursive Relation: For 0 ≤ k < N:

    E_k = P · (1 + E_{k+1})
The closed-form solution for E_0 (expected additional tokens from the start) is:

    E_0 = P · (1 - P^N) / (1 - P)

This formula gives the expected number of additional tokens that can be successfully speculated when attempting to speculate N tokens ahead, given an overall acceptance probability P.
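The closed form can be checked numerically against the recursion, under the document's assumption of a constant per-token acceptance probability P:

```python
def expected_extra_tokens(P, N):
    """Unroll the recursion E_k = P * (1 + E_{k+1}) with base case E_N = 0."""
    e = 0.0                       # E_N = 0
    for _ in range(N):            # compute E_{N-1}, ..., E_0 in turn
        e = P * (1.0 + e)
    return e                      # E_0

def closed_form(P, N):
    """Closed-form E_0 = P * (1 - P^N) / (1 - P), valid for P < 1."""
    return P * (1 - P ** N) / (1 - P)
```

For example, with P = 0.8 and N = 5, both give roughly 2.69 extra tokens per cycle on average.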
Timing components:
- Speculative Time: T_s · N (time for the draft model to generate N speculative tokens)
- Verification Time: T_v (time for the main model to verify the batch of N tokens in parallel)
For each speculation cycle, the total time is:

    T_cycle = T_s · N + T_v

The expected number of tokens per cycle is E_0 + 1, which simplifies to:

    (1 - P^{N+1}) / (1 - P)

Dividing expected tokens per cycle by cycle time gives the throughput:

    Throughput = (1 - P^{N+1}) / ((1 - P) · (T_s · N + T_v))

This formula represents the number of tokens generated per unit time, providing a direct measure of the system's throughput.
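The throughput model translates directly into code (illustrative timing values only):

```python
def throughput(P, N, T_s, T_v):
    """Expected tokens per second under the speculation model above."""
    tokens_per_cycle = (1 - P ** (N + 1)) / (1 - P)   # E_0 + 1
    cycle_time = T_s * N + T_v                        # drafting + verification
    return tokens_per_cycle / cycle_time
```

For example, with hypothetical timings T_s = 0.01 s, T_v = 0.05 s, P = 0.8, and N = 5, throughput exceeds the non-speculative baseline of 1 / T_v tokens per second.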
Speculative decoding significantly speeds up token generation in LLMs through parallel verification. Key optimization insights include:
- The draft model must balance model size (which determines T_s) against acceptance probability (P) to achieve high speedups.
- The optimal number of speculated tokens (N) remains small unless the draft model has both a very high acceptance rate and very fast generation.
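Both insights can be seen in a small sweep over N using the throughput formula (hypothetical timings; the exact optimum depends on the real T_s, T_v, and P of a deployment):

```python
def best_n(P, T_s, T_v, max_n=32):
    """Return the N in [1, max_n] that maximizes modeled throughput."""
    def tput(n):
        # Expected tokens per cycle saturate at 1/(1-P) as n grows,
        # while drafting cost T_s * n keeps growing linearly.
        return ((1 - P ** (n + 1)) / (1 - P)) / (T_s * n + T_v)
    return max(range(1, max_n + 1), key=tput)
```

With a modest acceptance rate (e.g. P = 0.5) the optimum stays at just a couple of tokens; only a very accurate and very fast draft model (high P, tiny T_s) justifies speculating far ahead.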