Large Language Models (LLMs) have revolutionized AI, but serving them efficiently remains a challenge. vLLM's PagedAttention addresses the memory fragmentation problem in attention mechanisms, enabling significant improvements in throughput and memory utilization.
The Memory Fragmentation Problem
Traditional attention mechanisms allocate contiguous memory blocks for key-value (KV) caches. For a sequence of length \(L\), the per-layer memory requirement (in elements) is:

\[ M_{\text{KV}}(L) = 2 \times L \times d_{\text{model}} \]

where \(d_{\text{model}}\) is the model dimension and the factor of 2 accounts for keys and values. For a batch with variable sequence lengths, this leads to significant memory fragmentation.
The fragmentation ratio — the fraction of reserved memory left unused when every sequence gets a contiguous slot of the maximum length — can be expressed as:

\[ F = 1 - \frac{\sum_{i=1}^{B} L_i}{B \times L_{\max}} \]

where \(B\) is the batch size, \(L_i\) is the length of sequence \(i\), and \(L_{\max}\) is the longest sequence the system must accommodate.
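As a quick sanity check, the fragmentation ratio is easy to compute directly. This is a sketch of the formula above; the batch of lengths is an illustrative assumption:

```python
def fragmentation_ratio(lengths, l_max):
    """Fraction of reserved KV-cache slots that go unused when every
    sequence is given a contiguous slot of l_max tokens."""
    used = sum(lengths)
    reserved = len(lengths) * l_max
    return 1 - used / reserved

# A batch with highly variable lengths: most of the reservation is wasted.
lengths = [128, 512, 64, 2048]
print(fragmentation_ratio(lengths, l_max=2048))
```

With these numbers about two thirds of the reserved slots never hold a token, which is exactly the waste PagedAttention targets.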
PagedAttention Architecture
PagedAttention divides the KV cache into fixed-size pages, similar to virtual memory paging in operating systems. Each page contains \(P\) tokens, where typically \(P = 16\).
The number of pages required for a sequence of length \(L\) is:

\[ N_{\text{pages}} = \left\lceil \frac{L}{P} \right\rceil \]

The total memory allocation becomes:

\[ M_{\text{paged}}(L) = \left\lceil \frac{L}{P} \right\rceil \times P \times 2 \times d_{\text{model}} \]

so at most \(P - 1\) token slots per sequence are wasted, regardless of how lengths vary across the batch.
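The page accounting above fits in a few lines. A minimal sketch, using \(P = 16\) as in the text:

```python
import math

P = 16  # page size in tokens, as used throughout the article

def pages_needed(length):
    """Number of fixed-size pages for a sequence of `length` tokens."""
    return math.ceil(length / P)

def wasted_slots(length):
    """Unused token slots in the last, partially filled page."""
    return pages_needed(length) * P - length

for L in (1, 16, 17, 100):
    print(L, pages_needed(L), wasted_slots(L))
```

Note that the waste is bounded by \(P - 1 = 15\) slots no matter how long the sequence grows.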
Attention Computation with Paging
The standard attention mechanism computes:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V \]

With PagedAttention, we compute attention over pages. For query token \(q_i\) and page \(p_j\), holding keys and values \(\{(k_k, v_k) : k \in p_j\}\), the unnormalized per-page contribution is:

\[ o_{ij} = \sum_{k \in p_j} e^{s_{ik}}\, v_k, \qquad s_{ik} = \frac{q_i \cdot k_k}{\sqrt{d_k}} \]

The final output aggregates over all pages, dividing by the global softmax normalizer:

\[ o_i = \frac{\sum_j o_{ij}}{\sum_j \sum_{k \in p_j} e^{s_{ik}}} \]

(in practice the partial sums are accumulated with a running maximum, as in online softmax, for numerical stability).
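The page-wise aggregation above produces exactly the same result as ordinary attention. A NumPy sketch to verify that, with illustrative shapes (this mirrors the math, not vLLM's actual kernel layout):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, T, P = 8, 20, 16                  # head dim, cached tokens, page size
q = rng.standard_normal(d_k)           # one query vector
K = rng.standard_normal((T, d_k))      # cached keys
V = rng.standard_normal((T, d_k))      # cached values

# Reference: standard softmax attention over the full cache.
s = K @ q / np.sqrt(d_k)
ref = np.exp(s - s.max()) @ V / np.exp(s - s.max()).sum()

# Paged: accumulate per-page exp-weighted sums, normalize at the end,
# carrying a running maximum (online softmax) for stability.
num, den, m = np.zeros(d_k), 0.0, -np.inf
for start in range(0, T, P):
    Kp, Vp = K[start:start + P], V[start:start + P]
    sp = Kp @ q / np.sqrt(d_k)
    m_new = max(m, sp.max())
    scale = np.exp(m - m_new)          # rescale previous partial sums
    num = num * scale + np.exp(sp - m_new) @ Vp
    den = den * scale + np.exp(sp - m_new).sum()
    m = m_new
out = num / den

print(np.allclose(out, ref))           # page-wise and full attention agree
```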
Memory Efficiency Gains
The memory efficiency improvement can be quantified. For a batch with average sequence length \(\bar{L}\) and maximum length \(L_{\max}\), contiguous allocation reserves \(B \times L_{\max}\) token slots, while paging reserves \(\sum_i \lceil L_i / P \rceil \times P\). The gain is:

\[ G = \frac{B \times L_{\max}}{\sum_{i=1}^{B} \left\lceil \frac{L_i}{P} \right\rceil \times P} \approx \frac{L_{\max}}{\bar{L}} \]

In the best case for contiguous allocation, when all sequences have length exactly \(L_{\max}\):

\[ G \approx 1 \]

and paging offers little advantage. In the worst case, with high variance in sequence lengths, \(\bar{L} \ll L_{\max}\) and \(G\) approaches \(L_{\max} / \bar{L}\), which can be an order of magnitude or more.
Block Tables
PagedAttention uses block tables to map logical blocks to physical pages. For sequence \(s\) with \(N_s\) pages, the block table \(T_s\) maps the logical block index to the physical page holding that block's tokens:

\[ T_s : i \mapsto \text{physical page containing tokens } [iP, (i+1)P) \]

where \(i \in [0, N_s - 1]\). The memory overhead for block tables is one page index per page:

\[ M_{\text{tables}} = \sum_s N_s \times c \]

where \(c\) is the size of a page index (a few bytes), negligible next to the pages themselves.
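The allocator behind a block table can be sketched in a few lines. This is a toy illustration with made-up names, not vLLM's actual block manager:

```python
class PagePool:
    """Toy paged KV-cache allocator with per-sequence block tables."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))  # free physical page ids
        self.tables = {}                    # seq id -> list of physical pages
        self.lengths = {}                   # seq id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; allocate a new physical page
        only when the last page is full. Returns (page id, slot offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.page_size == 0:         # last page full (or first token)
            table.append(self.free.pop())   # grab any free physical page
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.page_size

    def release(self, seq_id):
        """Return a finished sequence's pages to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = PagePool(num_pages=8, page_size=4)
for _ in range(6):                          # 6 tokens span 2 pages of 4
    pool.append_token("seq0")
print(len(pool.tables["seq0"]), len(pool.free))
```

Because pages are allocated lazily and returned on release, physical pages need not be contiguous — the block table absorbs the indirection, just as a page table does in an OS.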
Prefill and Decode Phases
LLM inference consists of two phases:
Prefill Phase
During prefill, we process the entire prompt in a single pass. The attention cost for a prompt of length \(L\) is:

\[ O(L^2 \times d_{\text{model}}) \]

With PagedAttention, we can parallelize across pages: the prompt's KV entries are written into \(\lceil L/P \rceil\) independent pages, and each page contributes

\[ O(P \times d_{\text{model}}) \]

work per query, which partitions naturally across parallel workers; the total arithmetic is unchanged.
Decode Phase
During decoding, we generate one token at a time. The attention computation for token \(t\) is:

\[ o_t = \sum_{k=0}^{t-1} \alpha_{tk}\, v_k \]

where the attention weights are:

\[ \alpha_{tk} = \frac{\exp\!\left( q_t \cdot k_k / \sqrt{d_k} \right)}{\sum_{k'=0}^{t-1} \exp\!\left( q_t \cdot k_{k'} / \sqrt{d_k} \right)} \]

With PagedAttention, we only need to access the pages containing tokens \([0, t-1]\):

\[ N_{\text{accessed}} = \left\lceil \frac{t}{P} \right\rceil \]

so each decode step touches exactly the pages holding the live cache and nothing else.
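A decode step reads the cache through the block table. The NumPy sketch below gathers one sequence's scattered pages, trims to \(t\) tokens, and runs standard softmax attention; the layout and numbers are illustrative assumptions:

```python
import numpy as np

P, d_k = 4, 8
rng = np.random.default_rng(1)

num_phys_pages = 6
k_cache = rng.standard_normal((num_phys_pages, P, d_k))  # paged key store
v_cache = rng.standard_normal((num_phys_pages, P, d_k))  # paged value store

block_table = [5, 2, 0]   # logical block -> physical page for one sequence
t = 10                    # tokens cached so far: occupies ceil(10/4) = 3 pages

q = rng.standard_normal(d_k)  # query for the token being decoded

# Gather this sequence's keys/values via the block table, trim to t tokens.
K = k_cache[block_table].reshape(-1, d_k)[:t]
V = v_cache[block_table].reshape(-1, d_k)[:t]

# Standard softmax attention over the gathered cache.
s = K @ q / np.sqrt(d_k)
w = np.exp(s - s.max())
w /= w.sum()
o = w @ V
print(o.shape)            # one output vector per decode step
```

The pages referenced by `block_table` are deliberately non-contiguous (5, 2, 0): the gather through the table is what frees the allocator from contiguity.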
Throughput Improvement
The throughput improvement comes from better memory utilization and reduced fragmentation. If we define throughput as:

\[ \text{Throughput} = \frac{\text{tokens generated}}{\text{unit time}} \propto B \]

then the batch size \(B\) that fits in a KV-cache budget of \(M\) token slots is what matters. With PagedAttention, we can fit more sequences in memory:

\[ B_{\text{paged}} = \frac{M}{\left\lceil \bar{L}/P \right\rceil \times P} \approx \frac{M}{\bar{L}} \]

compared to:

\[ B_{\text{contig}} = \frac{M}{L_{\max}} \]

The throughput improvement is:

\[ \frac{B_{\text{paged}}}{B_{\text{contig}}} \approx \frac{L_{\max}}{\bar{L}} \]
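A back-of-the-envelope comparison makes the ratio concrete. The budget and lengths below are illustrative assumptions, not measurements:

```python
import math

M = 1_000_000   # KV-cache budget, in token slots
P = 16          # page size in tokens
l_max = 8192    # longest sequence the server must support
l_avg = 1024    # average sequence length actually observed

b_contig = M // l_max                        # contiguous slots of l_max each
b_paged = M // (math.ceil(l_avg / P) * P)    # pages sized to actual lengths

print(b_contig, b_paged, b_paged / b_contig)
```

With these numbers paging fits roughly eight times as many sequences, matching the \(L_{\max}/\bar{L} = 8\) ratio from the formula.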
Continuous Batching
PagedAttention enables continuous batching, where new requests can start while others are still processing. The system maintains:

- a pool of free physical pages shared by all sequences
- a block table per active sequence mapping logical blocks to physical pages
- per-sequence lengths, so finished sequences release their pages immediately

The memory can be dynamically allocated one page at a time as sequences grow:

\[ M_{\text{used}} = \sum_{s \in \text{active}} N_s \times P \times 2 \times d_{\text{model}} \]

so allocation tracks the live working set instead of being reserved up front.
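A toy scheduling loop shows the mechanism: requests are admitted whenever free pages exist, and finished requests return their pages mid-run. This is a deliberately simplified sketch (no preemption, one token per request per step), not vLLM's scheduler:

```python
from collections import deque

P, TOTAL_PAGES = 4, 8

def pages_for(tokens):
    return -(-tokens // P)                   # ceiling division

waiting = deque([("a", 6), ("b", 10), ("c", 5)])  # (request id, total tokens)
active, free_pages, done = {}, TOTAL_PAGES, []

step = 0
while waiting or active:
    # Admit waiting requests while free pages exist (continuous batching).
    while waiting and free_pages > 0:
        rid, total = waiting.popleft()
        active[rid] = [0, total]             # [tokens so far, target]
        free_pages -= 1                      # first page for the request
    # One decode step: every active request emits one token.
    for rid in list(active):
        active[rid][0] += 1
        n, total = active[rid]
        if n % P == 1 and n > 1:             # crossed into a new page
            free_pages -= 1
        if n == total:                       # finished: free all its pages
            free_pages += pages_for(total)
            del active[rid]
            done.append(rid)
    step += 1

print(done, step)
```

Request "c" finishes first and its pages flow straight back into the pool; in a real server those pages would immediately admit the next waiting request rather than sitting idle until the whole batch drains.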
Performance Metrics
Key metrics for evaluating PagedAttention:

- Memory utilization: the fraction of allocated KV-cache slots that actually hold tokens,
  \[ U = \frac{\sum_{i} L_i}{\sum_{i} \left\lceil \frac{L_i}{P} \right\rceil \times P} \]
  which paging keeps close to 1, since at most \(P - 1\) slots per sequence sit empty.
- Throughput per GPU: tokens generated per second across all sequences resident on the device.
- Latency: time to first token (dominated by prefill) and time per output token (dominated by decode steps).
Implementation Considerations
The page size \(P\) is a critical hyperparameter. Smaller pages reduce waste — the expected internal fragmentation per sequence is roughly \((P-1)/2\) token slots — but increase overhead: the block table grows to \(\lceil L/P \rceil\) entries per sequence, and attention kernels issue more, smaller memory accesses. The optimal page size balances these factors; the typical \(P = 16\) sits in this middle ground.
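The two opposing costs are easy to tabulate. A sketch under an assumed average sequence length (the values are illustrative):

```python
import math

L_AVG = 1024  # assumed average sequence length

print("P  expected_waste  table_entries")
for P in (1, 8, 16, 64, 256):
    waste = (P - 1) / 2                    # expected empty slots in last page
    table_entries = math.ceil(L_AVG / P)   # block-table size per sequence
    print(P, waste, table_entries)
```

At \(P = 1\) there is no waste but a table entry per token; at \(P = 256\) the table is tiny but a quarter of a page sits empty on average — the sweet spot lies in between.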
Conclusion
PagedAttention revolutionizes LLM serving by:
- Eliminating memory fragmentation through paged KV cache management
- Enabling continuous batching for better GPU utilization
- Providing predictable memory allocation patterns
- Scaling efficiently with variable sequence lengths
The mathematical framework shows how paging shrinks the KV-cache reservation from \(B \times L_{\max}\) token slots to \(\sum_{i} \lceil L_i/P \rceil \times P\), enabling significant throughput improvements in production LLM serving systems.