Speculative Decoding: Mathematical Analysis and Performance Optimization

Introduction

Speculative decoding is an advanced technique in natural language processing that significantly enhances the inference speed of large language models (LLMs). This document provides a comprehensive analysis of speculative decoding, including its underlying mechanisms, mathematical foundations, and performance optimization strategies.

Mechanism of Speculative Decoding

Speculative decoding operates on a "Draft-then-Verify" paradigm, consisting of three key steps:

  1. Drafting: A smaller, faster model generates N speculative tokens ahead in the sequence.
  2. Verification: The main, more accurate LLM verifies these N speculative tokens in parallel.
  3. Acceptance and Rejection: Accepted tokens are appended to the final output; at the first rejected token, the remaining speculated tokens are discarded and drafting restarts from the corrected position.

This approach turns strictly sequential token generation into a partially parallel operation: the main model verifies a whole batch of drafted tokens in a single forward pass instead of generating them one at a time.
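To make the loop concrete, here is a minimal single-cycle sketch in Python. The `draft_model` and `main_model` callables are placeholders (anything mapping a token sequence to a next-token probability vector works), the bonus token normally sampled after a fully accepted batch is omitted for brevity, and real implementations batch the verification step into one parallel pass rather than looping.

```python
import numpy as np

def speculative_decode_step(prefix, draft_model, main_model, n_spec, rng):
    """One draft-then-verify cycle. draft_model and main_model are
    hypothetical callables mapping a token sequence to a probability
    vector over the vocabulary."""
    # 1. Drafting: sample N tokens autoregressively from the small model.
    drafted, draft_probs = [], []
    ctx = list(prefix)
    for _ in range(n_spec):
        p = draft_model(ctx)
        tok = rng.choice(len(p), p=p)
        drafted.append(tok)
        draft_probs.append(p)
        ctx.append(tok)

    # 2. Verification: the main model scores every drafted position
    #    (written as a loop here; in practice this is one parallel pass).
    main_probs = [main_model(list(prefix) + drafted[:i]) for i in range(n_spec)]

    # 3. Acceptance and rejection, using the rule derived in the next section.
    output = []
    for i, tok in enumerate(drafted):
        q, p = main_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, q / p):
            output.append(tok)  # accepted: keep the token and continue
        else:
            # Rejected: resample from the residual distribution and stop.
            resid = np.maximum(main_probs[i] - draft_probs[i], 0.0)
            output.append(int(rng.choice(len(resid), p=resid / resid.sum())))
            break
    return output
```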

Key parameters:

  • N: Number of tokens speculated in each cycle
  • P: Probability of a speculated token being accepted

Rejection Sampling in Speculative Decoding

Speculative decoding uses rejection sampling to maintain output quality while achieving speedup. For each speculated token, the acceptance probability is:

$$P = \min(1, \frac{q_i(j)}{p_i(j)})$$

where q_i(j) is the main model's probability and p_i(j) is the draft model's probability for token j at position i. On rejection, a replacement token is resampled from the normalized residual distribution max(q_i − p_i, 0); together these two rules guarantee that the output follows the main model's distribution exactly.
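A quick Monte Carlo check makes this concrete. The three-token vocabulary and the two distributions below are illustrative values, not taken from any model:

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.6, 0.3, 0.1])   # main model's distribution (illustrative)
p = np.array([0.3, 0.5, 0.2])   # draft model's distribution (illustrative)

counts = np.zeros(3)
for _ in range(200_000):
    j = rng.choice(3, p=p)                     # draft proposes token j
    if rng.random() < min(1.0, q[j] / p[j]):   # accept with min(1, q/p)
        counts[j] += 1
    else:                                      # reject: resample the residual
        resid = np.maximum(q - p, 0.0)
        counts[rng.choice(3, p=resid / resid.sum())] += 1

print(counts / counts.sum())   # approximately [0.6, 0.3, 0.1], i.e. q
```

Although every proposal comes from the draft distribution p, the empirical output frequencies match the main model's q, which is exactly the point of the rejection-sampling rule.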

Probabilistic Model for Token Acceptance

We model the expected number of speculated tokens accepted per cycle, treating each speculated token as accepted independently with the overall acceptance probability P.

Recursive Formulation

Let E_k be the expected number of additional tokens that can be successfully speculated after k tokens have already been accepted.

Base Case:

$$ E_N = 0 $$

(no more tokens can be speculated beyond N)

Recursive Relation: For 0 ≤ k < N:

$$E_k = P(1 + E_{k+1})$$

Solution

Unrolling the recursion gives a geometric series, $E_0 = P + P^2 + \dots + P^N$, so the closed-form solution for E_0 (expected additional tokens from the start) is:

$$E_0 = P \, \frac{1-P^N}{1-P}$$

This is the expected number of speculated tokens accepted when attempting to speculate N tokens ahead, given an overall acceptance probability P.
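A short numeric check, evaluating the recursion from the back and comparing it with the closed form (P = 0.8 and N = 5 are arbitrary example values):

```python
def expected_extra_tokens_recursive(P, N):
    """E_k = P * (1 + E_{k+1}) with E_N = 0, evaluated backwards to E_0."""
    e = 0.0
    for _ in range(N):
        e = P * (1.0 + e)
    return e

def expected_extra_tokens_closed(P, N):
    """Closed form: E_0 = P * (1 - P**N) / (1 - P)."""
    return P * (1.0 - P**N) / (1.0 - P)

print(expected_extra_tokens_recursive(0.8, 5))  # ≈ 2.68928
print(expected_extra_tokens_closed(0.8, 5))     # ≈ 2.68928
```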

Speed Formula Derivation

Components

  1. Speculative Time: T_s · N, where T_s is the draft model's per-token generation time
  2. Verification Time: T_v, the main model's time for one parallel verification pass over the batch of N tokens

Total Process Time

For each speculation cycle:

$$T_{total} = T_s \cdot N + T_v$$

Expected Tokens per Cycle

Each cycle yields the E_0 accepted speculative tokens plus one token produced by the verification step itself (the corrected sample at the first rejection, or the bonus token when all N are accepted), so the expected number of tokens per cycle is E_0 + 1:

$$E_{tokens} = 1 + P\,\frac{1-P^N}{1-P} = \frac{1-P^{N+1}}{1-P}$$

Generation Speed

$$\text{Generation Speed} = \frac{1 - P^{N+1}}{(T_s \cdot N + T_v) \cdot (1-P)}$$

This formula represents the number of tokens generated per unit time, providing a direct measure of the system's throughput.
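The formula translates directly into a small helper for exploring the trade-offs (the function below is a plain transcription of the formula, not code from this repository):

```python
def generation_speed(P, N, T_s, T_v):
    """Expected tokens per unit time for one speculation cycle:
    (1 - P**(N+1)) / (1 - P) expected tokens over a cycle time
    of T_s * N + T_v."""
    expected_tokens = (1.0 - P**(N + 1)) / (1.0 - P)
    return expected_tokens / (T_s * N + T_v)
```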

Conclusion

Speculative decoding significantly speeds up token generation in LLMs by replacing sequential decoding with parallel verification. Key optimization insights include:

  • The draft model should balance per-token speed (T_s) against acceptance probability (P): a larger draft model raises P but also raises T_s.
  • The optimal number of speculated tokens (N) stays small unless the draft model has both a very high acceptance rate and very fast generation, as the sweep below illustrates.
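A small sweep using the generation_speed helper defined above, under two illustrative timing regimes (the numbers are assumptions chosen for contrast, not measurements):

```python
# Regime 1: modest acceptance, relatively slow draft model.
# Regime 2: very high acceptance, very fast draft model.
for P, T_s, T_v in [(0.6, 2.0, 10.0), (0.95, 0.5, 10.0)]:
    speeds = {n: generation_speed(P, n, T_s, T_v) for n in range(1, 31)}
    best = max(speeds, key=speeds.get)
    print(f"P={P}, T_s={T_s}: best N = {best}, "
          f"speed = {speeds[best]:.3f} tokens per unit time")
```

With these numbers the optimum is N = 2 in the first regime and around N = 21 in the second, matching the qualitative conclusion above.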
