Large Language Models (LLMs) have revolutionized AI, but serving them efficiently remains a challenge. vLLM's PagedAttention addresses the memory fragmentation problem in attention mechanisms, enabling significant improvements in throughput and memory utilization.
The Memory Fragmentation Problem
Traditional attention mechanisms allocate contiguous memory blocks for key-value (KV) caches. For a sequence of length \(L\), the per-layer memory requirement (in elements) is:

\[ M_{\text{KV}}(L) = 2 \times L \times d_{\text{model}} \]

where \(d_{\text{model}}\) is the model dimension and the factor of 2 accounts for keys and values. For a batch with variable sequence lengths, this leads to significant memory fragmentation.
The fragmentation ratio — the fraction of reserved memory left unused when every sequence gets a contiguous slot of the maximum length — can be expressed as:

\[ F = 1 - \frac{\sum_{i=1}^{B} L_i}{B \times L_{\max}} \]

where \(B\) is the batch size, \(L_i\) is the length of sequence \(i\), and \(L_{\max}\) is the longest sequence the system must accommodate.
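As a quick sanity check, the fragmentation ratio is easy to compute directly. This is a sketch of the formula above; the batch of lengths is an illustrative assumption:

```python
def fragmentation_ratio(lengths, l_max):
    """Fraction of reserved KV-cache slots that go unused when every
    sequence is given a contiguous slot of l_max tokens."""
    used = sum(lengths)
    reserved = len(lengths) * l_max
    return 1 - used / reserved

# A batch with highly variable lengths: most of the reservation is wasted.
lengths = [128, 512, 64, 2048]
print(fragmentation_ratio(lengths, l_max=2048))
```

With these numbers about two thirds of the reserved slots never hold a token, which is exactly the waste PagedAttention targets.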
PagedAttention Architecture
PagedAttention divides the KV cache into fixed-size pages, similar to virtual memory paging in operating systems. Each page contains \(P\) tokens, where typically \(P = 16\).
The number of pages required for a sequence of length \(L\) is:

\[ N_{\text{pages}} = \left\lceil \frac{L}{P} \right\rceil \]

The total memory allocation becomes:

\[ M_{\text{paged}}(L) = \left\lceil \frac{L}{P} \right\rceil \times P \times 2 \times d_{\text{model}} \]

so at most \(P - 1\) token slots per sequence are wasted, regardless of how lengths vary across the batch.
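The page accounting above fits in a few lines. A minimal sketch, using \(P = 16\) as in the text:

```python
import math

P = 16  # page size in tokens, as used throughout the article

def pages_needed(length):
    """Number of fixed-size pages for a sequence of `length` tokens."""
    return math.ceil(length / P)

def wasted_slots(length):
    """Unused token slots in the last, partially filled page."""
    return pages_needed(length) * P - length

for L in (1, 16, 17, 100):
    print(L, pages_needed(L), wasted_slots(L))
```

Note that the waste is bounded by \(P - 1 = 15\) slots no matter how long the sequence grows.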
Attention Computation with Paging
The standard attention mechanism computes:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V \]

With PagedAttention, we compute attention over pages. For query token \(q_i\) and page \(p_j\), holding keys and values \(\{(k_k, v_k) : k \in p_j\}\), the unnormalized per-page contribution is:

\[ o_{ij} = \sum_{k \in p_j} e^{s_{ik}}\, v_k, \qquad s_{ik} = \frac{q_i \cdot k_k}{\sqrt{d_k}} \]

The final output aggregates over all pages, dividing by the global softmax normalizer:

\[ o_i = \frac{\sum_j o_{ij}}{\sum_j \sum_{k \in p_j} e^{s_{ik}}} \]

(in practice the partial sums are accumulated with a running maximum, as in online softmax, for numerical stability).
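The page-wise aggregation above produces exactly the same result as ordinary attention. A NumPy sketch to verify that, with illustrative shapes (this mirrors the math, not vLLM's actual kernel layout):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, T, P = 8, 20, 16                  # head dim, cached tokens, page size
q = rng.standard_normal(d_k)           # one query vector
K = rng.standard_normal((T, d_k))      # cached keys
V = rng.standard_normal((T, d_k))      # cached values

# Reference: standard softmax attention over the full cache.
s = K @ q / np.sqrt(d_k)
ref = np.exp(s - s.max()) @ V / np.exp(s - s.max()).sum()

# Paged: accumulate per-page exp-weighted sums, normalize at the end,
# carrying a running maximum (online softmax) for stability.
num, den, m = np.zeros(d_k), 0.0, -np.inf
for start in range(0, T, P):
    Kp, Vp = K[start:start + P], V[start:start + P]
    sp = Kp @ q / np.sqrt(d_k)
    m_new = max(m, sp.max())
    scale = np.exp(m - m_new)          # rescale previous partial sums
    num = num * scale + np.exp(sp - m_new) @ Vp
    den = den * scale + np.exp(sp - m_new).sum()
    m = m_new
out = num / den

print(np.allclose(out, ref))           # page-wise and full attention agree
```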
Memory Efficiency Gains
The memory efficiency improvement can be quantified. For a batch with average sequence length \(\bar{L}\) and maximum length \(L_{\max}\), contiguous allocation reserves \(B \times L_{\max}\) token slots, while paging reserves \(\sum_i \lceil L_i / P \rceil \times P\). The gain is:

\[ G = \frac{B \times L_{\max}}{\sum_{i=1}^{B} \left\lceil \frac{L_i}{P} \right\rceil \times P} \approx \frac{L_{\max}}{\bar{L}} \]

In the best case for contiguous allocation, when all sequences have length exactly \(L_{\max}\):

\[ G \approx 1 \]

and paging offers little advantage. In the worst case, with high variance in sequence lengths, \(\bar{L} \ll L_{\max}\) and \(G\) approaches \(L_{\max} / \bar{L}\), which can be an order of magnitude or more.
Block Tables
PagedAttention uses block tables to map logical blocks to physical pages. For sequence \(s\) with \(N_s\) pages, the block table \(T_s\) maps the logical block index to the physical page holding that block's tokens:

\[ T_s : i \mapsto \text{physical page containing tokens } [iP, (i+1)P) \]

where \(i \in [0, N_s - 1]\). The memory overhead for block tables is one page index per page:

\[ M_{\text{tables}} = \sum_s N_s \times c \]

where \(c\) is the size of a page index (a few bytes), negligible next to the pages themselves.
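The allocator behind a block table can be sketched in a few lines. This is a toy illustration with made-up names, not vLLM's actual block manager:

```python
class PagePool:
    """Toy paged KV-cache allocator with per-sequence block tables."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))  # free physical page ids
        self.tables = {}                    # seq id -> list of physical pages
        self.lengths = {}                   # seq id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; allocate a new physical page
        only when the last page is full. Returns (page id, slot offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.page_size == 0:         # last page full (or first token)
            table.append(self.free.pop())   # grab any free physical page
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.page_size

    def release(self, seq_id):
        """Return a finished sequence's pages to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = PagePool(num_pages=8, page_size=4)
for _ in range(6):                          # 6 tokens span 2 pages of 4
    pool.append_token("seq0")
print(len(pool.tables["seq0"]), len(pool.free))
```

Because pages are allocated lazily and returned on release, physical pages need not be contiguous — the block table absorbs the indirection, just as a page table does in an OS.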
Prefill and Decode Phases
LLM inference consists of two phases:
Prefill Phase
During prefill, we process the entire prompt in a single pass. The attention cost for a prompt of length \(L\) is:

\[ O(L^2 \times d_{\text{model}}) \]

With PagedAttention, we can parallelize across pages: the prompt's KV entries are written into \(\lceil L/P \rceil\) independent pages, and each page contributes

\[ O(P \times d_{\text{model}}) \]

work per query, which partitions naturally across parallel workers; the total arithmetic is unchanged.
Decode Phase
During decoding, we generate one token at a time. The attention computation for token \(t\) is:

\[ o_t = \sum_{k=0}^{t-1} \alpha_{tk}\, v_k \]

where the attention weights are:

\[ \alpha_{tk} = \frac{\exp\!\left( q_t \cdot k_k / \sqrt{d_k} \right)}{\sum_{k'=0}^{t-1} \exp\!\left( q_t \cdot k_{k'} / \sqrt{d_k} \right)} \]

With PagedAttention, we only need to access the pages containing tokens \([0, t-1]\):

\[ N_{\text{accessed}} = \left\lceil \frac{t}{P} \right\rceil \]

so each decode step touches exactly the pages holding the live cache and nothing else.
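A decode step reads the cache through the block table. The NumPy sketch below gathers one sequence's scattered pages, trims to \(t\) tokens, and runs standard softmax attention; the layout and numbers are illustrative assumptions:

```python
import numpy as np

P, d_k = 4, 8
rng = np.random.default_rng(1)

num_phys_pages = 6
k_cache = rng.standard_normal((num_phys_pages, P, d_k))  # paged key store
v_cache = rng.standard_normal((num_phys_pages, P, d_k))  # paged value store

block_table = [5, 2, 0]   # logical block -> physical page for one sequence
t = 10                    # tokens cached so far: occupies ceil(10/4) = 3 pages

q = rng.standard_normal(d_k)  # query for the token being decoded

# Gather this sequence's keys/values via the block table, trim to t tokens.
K = k_cache[block_table].reshape(-1, d_k)[:t]
V = v_cache[block_table].reshape(-1, d_k)[:t]

# Standard softmax attention over the gathered cache.
s = K @ q / np.sqrt(d_k)
w = np.exp(s - s.max())
w /= w.sum()
o = w @ V
print(o.shape)            # one output vector per decode step
```

The pages referenced by `block_table` are deliberately non-contiguous (5, 2, 0): the gather through the table is what frees the allocator from contiguity.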
Throughput Improvement
The throughput improvement comes from better memory utilization and reduced fragmentation. If we define throughput as:

\[ \text{Throughput} = \frac{\text{tokens generated}}{\text{unit time}} \propto B \]

then the batch size \(B\) that fits in a KV-cache budget of \(M\) token slots is what matters. With PagedAttention, we can fit more sequences in memory:

\[ B_{\text{paged}} = \frac{M}{\left\lceil \bar{L}/P \right\rceil \times P} \approx \frac{M}{\bar{L}} \]

compared to:

\[ B_{\text{contig}} = \frac{M}{L_{\max}} \]

The throughput improvement is:

\[ \frac{B_{\text{paged}}}{B_{\text{contig}}} \approx \frac{L_{\max}}{\bar{L}} \]
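A back-of-the-envelope comparison makes the ratio concrete. The budget and lengths below are illustrative assumptions, not measurements:

```python
import math

M = 1_000_000   # KV-cache budget, in token slots
P = 16          # page size in tokens
l_max = 8192    # longest sequence the server must support
l_avg = 1024    # average sequence length actually observed

b_contig = M // l_max                        # contiguous slots of l_max each
b_paged = M // (math.ceil(l_avg / P) * P)    # pages sized to actual lengths

print(b_contig, b_paged, b_paged / b_contig)
```

With these numbers paging fits roughly eight times as many sequences, matching the \(L_{\max}/\bar{L} = 8\) ratio from the formula.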
Continuous Batching
PagedAttention enables continuous batching, where new requests can start while others are still processing. The system maintains:

- a pool of free physical pages shared by all sequences
- a block table per active sequence mapping logical blocks to physical pages
- per-sequence lengths, so finished sequences release their pages immediately

The memory can be dynamically allocated one page at a time as sequences grow:

\[ M_{\text{used}} = \sum_{s \in \text{active}} N_s \times P \times 2 \times d_{\text{model}} \]

so allocation tracks the live working set instead of being reserved up front.
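A toy scheduling loop shows the mechanism: requests are admitted whenever free pages exist, and finished requests return their pages mid-run. This is a deliberately simplified sketch (no preemption, one token per request per step), not vLLM's scheduler:

```python
from collections import deque

P, TOTAL_PAGES = 4, 8

def pages_for(tokens):
    return -(-tokens // P)                   # ceiling division

waiting = deque([("a", 6), ("b", 10), ("c", 5)])  # (request id, total tokens)
active, free_pages, done = {}, TOTAL_PAGES, []

step = 0
while waiting or active:
    # Admit waiting requests while free pages exist (continuous batching).
    while waiting and free_pages > 0:
        rid, total = waiting.popleft()
        active[rid] = [0, total]             # [tokens so far, target]
        free_pages -= 1                      # first page for the request
    # One decode step: every active request emits one token.
    for rid in list(active):
        active[rid][0] += 1
        n, total = active[rid]
        if n % P == 1 and n > 1:             # crossed into a new page
            free_pages -= 1
        if n == total:                       # finished: free all its pages
            free_pages += pages_for(total)
            del active[rid]
            done.append(rid)
    step += 1

print(done, step)
```

Request "c" finishes first and its pages flow straight back into the pool; in a real server those pages would immediately admit the next waiting request rather than sitting idle until the whole batch drains.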
Performance Metrics
Key metrics for evaluating PagedAttention:

- Memory utilization: the fraction of allocated KV-cache slots that actually hold tokens,
  \[ U = \frac{\sum_{i} L_i}{\sum_{i} \left\lceil \frac{L_i}{P} \right\rceil \times P} \]
  which paging keeps close to 1, since at most \(P - 1\) slots per sequence sit empty.
- Throughput per GPU: tokens generated per second across all sequences resident on the device.
- Latency: time to first token (dominated by prefill) and time per output token (dominated by decode steps).
Implementation Considerations
The page size \(P\) is a critical hyperparameter. Smaller pages reduce waste — the expected internal fragmentation per sequence is roughly \((P-1)/2\) token slots — but increase overhead: the block table grows to \(\lceil L/P \rceil\) entries per sequence, and attention kernels issue more, smaller memory accesses. The optimal page size balances these factors; the typical \(P = 16\) sits in this middle ground.
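The two opposing costs are easy to tabulate. A sketch under an assumed average sequence length (the values are illustrative):

```python
import math

L_AVG = 1024  # assumed average sequence length

print("P  expected_waste  table_entries")
for P in (1, 8, 16, 64, 256):
    waste = (P - 1) / 2                    # expected empty slots in last page
    table_entries = math.ceil(L_AVG / P)   # block-table size per sequence
    print(P, waste, table_entries)
```

At \(P = 1\) there is no waste but a table entry per token; at \(P = 256\) the table is tiny but a quarter of a page sits empty on average — the sweet spot lies in between.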
Conclusion
PagedAttention revolutionizes LLM serving by:
- Eliminating memory fragmentation through paged KV cache management
- Enabling continuous batching for better GPU utilization
- Providing predictable memory allocation patterns
- Scaling efficiently with variable sequence lengths
The mathematical framework shows how paging shrinks the KV-cache reservation from \(B \times L_{\max}\) token slots to \(\sum_{i} \lceil L_i/P \rceil \times P\), enabling significant throughput improvements in production LLM serving systems.