Large Language Models (LLMs) have revolutionized AI, but serving them efficiently remains a challenge. vLLM's PagedAttention addresses the memory fragmentation problem in attention mechanisms, enabling significant improvements in throughput and memory utilization.

The Memory Fragmentation Problem

Traditional attention mechanisms allocate contiguous memory blocks for key-value (KV) caches. For a sequence of length \(L\), the memory requirement is:

\[ M_{\text{KV}} = 2 \times L \times d_{\text{model}} \times \text{batch\_size} \times \text{bytes\_per\_element} \]

where \(d_{\text{model}}\) is the model dimension; this is the cost per transformer layer, so the model total scales with the layer count as well. For a batch with variable sequence lengths, this leads to significant memory fragmentation.

The fragmentation ratio can be expressed as:

\[ \text{Fragmentation} = 1 - \frac{\text{Used Memory}}{\text{Allocated Memory}} = 1 - \frac{\sum_{i=1}^{B} L_i}{\max(L_i) \times B} \]

where \(B\) is the batch size and \(L_i\) is the length of sequence \(i\).
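
A quick numerical check of the ratio, with a toy batch of lengths:

```python
# Fragmentation under contiguous per-sequence allocation: each sequence
# reserves max(L_i) slots up front but only uses L_i of them.
def fragmentation(lengths):
    used = sum(lengths)
    allocated = max(lengths) * len(lengths)
    return 1 - used / allocated

# Four sequences with highly variable lengths.
print(fragmentation([100, 500, 2000, 50]))  # 0.66875: ~2/3 of the reservation is wasted
print(fragmentation([64, 64, 64]))          # 0.0: uniform lengths waste nothing
```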

PagedAttention Architecture

PagedAttention divides the KV cache into fixed-size pages, similar to virtual memory paging in operating systems. Each page contains \(P\) tokens, where typically \(P = 16\).

The number of pages required for a sequence of length \(L\) is:

\[ N_{\text{pages}} = \left\lceil \frac{L}{P} \right\rceil \]

The total memory allocation becomes:

\[ M_{\text{paged}} = N_{\text{pages}} \times P \times d_{\text{model}} \times 2 \times \text{bytes\_per\_element} \]
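
A minimal sketch of these two formulas (fp16 assumed, i.e. 2 bytes per element, per layer):

```python
import math

def num_pages(L, P=16):
    # N_pages = ceil(L / P)
    return math.ceil(L / P)

def paged_kv_bytes(L, d_model, P=16, bytes_per_element=2):
    # M_paged = N_pages * P * d_model * 2 (K and V) * bytes_per_element
    return num_pages(L, P) * P * d_model * 2 * bytes_per_element

print(num_pages(100))              # 7 pages for a 100-token sequence
print(paged_kv_bytes(100, 4096))   # 7 * 16 * 4096 * 2 * 2 = 1835008 bytes
```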

Attention Computation with Paging

The standard attention mechanism computes:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

With PagedAttention, we compute attention over pages. For query token \(q_i\) and page \(p_j\):

\[ \text{Attention}(q_i, K_{p_j}, V_{p_j}) = \text{softmax}\left(\frac{q_i K_{p_j}^T}{\sqrt{d_k}}\right) V_{p_j} \]

The final output aggregates the partial results over all pages. Because each page computes its own softmax, the partial results must be rescaled by their per-page normalization constants \(Z_{p_j} = \sum_{k \in p_j} \exp\!\left(q_i k^T / \sqrt{d_k}\right)\) (the online-softmax trick) so that the sum matches full attention:

\[ \text{Output}_i = \sum_{j=1}^{N_{\text{pages}}} \frac{Z_{p_j}}{\sum_{j'=1}^{N_{\text{pages}}} Z_{p_{j'}}} \, \text{Attention}(q_i, K_{p_j}, V_{p_j}) \]
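
To illustrate the per-page aggregation, a NumPy sketch with toy dimensions (assumed, not vLLM's kernel): each page's partial numerator and softmax normalizer are accumulated, and the rescaled combination reproduces full attention. A single global max is used for numerical stability here; production kernels track a running max instead.

```python
import numpy as np

def full_attention(q, K, V):
    # Reference: softmax(q K^T / sqrt(d_k)) V over the whole KV cache.
    s = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

def paged_attention(q, K, V, P=4):
    # Accumulate per-page numerators and normalizers Z_j, then divide:
    # the sum of rescaled page results equals full attention.
    d_k = q.shape[-1]
    m = (q @ K.T / np.sqrt(d_k)).max()   # global max for stability
    num, Z = 0.0, 0.0
    for j in range(0, len(K), P):        # one fixed-size page at a time
        Kp, Vp = K[j:j + P], V[j:j + P]
        w = np.exp(q @ Kp.T / np.sqrt(d_k) - m)
        num = num + w @ Vp
        Z = Z + w.sum()
    return num / Z

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K, V = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
print(np.allclose(full_attention(q, K, V), paged_attention(q, K, V)))  # True
```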

Memory Efficiency Gains

The memory efficiency improvement can be quantified. For a batch with average sequence length \(\bar{L}\) and maximum length \(L_{\max}\):

\[ \text{Efficiency Gain} = \frac{M_{\text{contiguous}}}{M_{\text{paged}}} = \frac{L_{\max} \times B}{\sum_{i=1}^{B} \left\lceil \frac{L_i}{P} \right\rceil \times P} \]

In the best case, when all sequences have length exactly \(L_{\max}\) (and \(P\) divides \(L_{\max}\)):

\[ \text{Efficiency Gain} = 1 \]

In the worst case with high variance:

\[ \text{Efficiency Gain} \approx \frac{L_{\max}}{\bar{L}} \]
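
A small check of the gain formula against the \(L_{\max}/\bar{L}\) bound, using toy lengths:

```python
import math

def efficiency_gain(lengths, P=16):
    # Contiguous allocates L_max slots per sequence; paged allocates
    # ceil(L_i / P) * P slots per sequence.
    contiguous = max(lengths) * len(lengths)
    paged = sum(math.ceil(L / P) * P for L in lengths)
    return contiguous / paged

lengths = [100, 500, 2000, 50]
print(efficiency_gain(lengths))                     # ~2.98x less memory
print(max(lengths) * len(lengths) / sum(lengths))   # ~3.02, the L_max / L_bar bound
```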

Block Tables

PagedAttention uses block tables to map logical blocks to physical pages. For sequence \(s\) with \(N_s\) pages, the block table \(T_s\) maps:

\[ T_s[i] = \text{ID of the physical page storing logical block } i \]

where \(i \in [0, N_s - 1]\). The memory overhead for block tables is:

\[ M_{\text{tables}} = \sum_{s=1}^{B} N_s \times \text{bytes\_per\_entry} \]
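
To make the mapping concrete, a hypothetical Python sketch of a block table (names and layout assumed; vLLM's block manager is more elaborate):

```python
class BlockTable:
    """Maps a sequence's logical block index i -> physical page ID."""
    def __init__(self):
        self.table = []              # entry i holds the physical page ID

    def append_page(self, physical_page_id):
        self.table.append(physical_page_id)

    def lookup(self, token_pos, P=16):
        # Token at position t lives in logical block t // P,
        # at offset t % P within that block's physical page.
        return self.table[token_pos // P], token_pos % P

bt = BlockTable()
for page in [7, 3, 42]:              # physical pages need not be contiguous
    bt.append_page(page)
print(bt.lookup(20))                 # token 20 -> (page 3, offset 4)
```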

Prefill and Decode Phases

LLM inference consists of two phases:

Prefill Phase

During prefill, we process the entire prompt. The computational complexity is:

\[ C_{\text{prefill}} = O(L^2 \times d_{\text{model}}) \]

With PagedAttention, the key/value pages can be processed in parallel, each page contributing:

\[ C_{\text{page}} = O(L \times P \times d_{\text{model}}) \]

Decode Phase

During decoding, we generate one token at a time. The attention computation for token \(t\) is:

\[ o_t = \text{Attention}(q_t, K_{0:t-1}, V_{0:t-1}) \]

where the attention weights are:

\[ \alpha_{t,j} = \frac{\exp\left(q_t k_j^T / \sqrt{d_k}\right)}{\sum_{j'=0}^{t-1} \exp\left(q_t k_{j'}^T / \sqrt{d_k}\right)} \]

With PagedAttention, we only need to access pages containing tokens \([0, t-1]\):

\[ N_{\text{active}} = \left\lceil \frac{t}{P} \right\rceil \]
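
A small sketch of the page lookup at a decode step (toy block table, assumed names):

```python
import math

def pages_for_decode(t, block_table, P=16):
    # At decode step t we attend over tokens [0, t-1], which occupy
    # the first ceil(t / P) entries of the block table.
    n_active = math.ceil(t / P)
    return block_table[:n_active]

table = [7, 3, 42, 9]                # physical pages for one sequence
print(pages_for_decode(33, table))   # 33 cached tokens span 3 pages: [7, 3, 42]
```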

Throughput Improvement

The throughput improvement comes from better memory utilization and reduced fragmentation. If we define throughput as:

\[ \text{Throughput} = \frac{\text{total tokens generated}}{\text{total time}} \]

With PagedAttention, we can fit more sequences in memory. Writing \(\bar{N}_{\text{pages}}\) for the average page count per sequence:

\[ B_{\text{paged}} = \left\lfloor \frac{M_{\text{GPU}}}{\bar{N}_{\text{pages}} \times P \times d_{\text{model}} \times 2 \times \text{bytes\_per\_element}} \right\rfloor \]

compared to:

\[ B_{\text{contiguous}} = \left\lfloor \frac{M_{\text{GPU}}}{L_{\max} \times d_{\text{model}} \times 2 \times \text{bytes\_per\_element}} \right\rfloor \]

The throughput improvement is:

\[ \frac{\text{Throughput}_{\text{paged}}}{\text{Throughput}_{\text{contiguous}}} \approx \frac{B_{\text{paged}}}{B_{\text{contiguous}}} \]
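
Plugging in assumed, purely illustrative numbers (8 GiB KV budget for one layer, fp16, \(d_{\text{model}} = 4096\), \(L_{\max} = 2048\), an average of 42 pages per sequence) gives a feel for the batch-size gap:

```python
def kv_bytes_per_token(d_model=4096, bytes_per_element=2):
    # K and V per token, per layer.
    return 2 * d_model * bytes_per_element

def batch_contiguous(M_gpu, L_max, per_token):
    # Every sequence reserves L_max token slots up front.
    return M_gpu // (L_max * per_token)

def batch_paged(M_gpu, avg_pages, P, per_token):
    # Sequences only hold the pages they actually use.
    return M_gpu // (avg_pages * P * per_token)

per_token = kv_bytes_per_token()     # 16 KiB per token per layer
M_gpu = 8 * 1024**3                  # assumed 8 GiB KV budget
print(batch_contiguous(M_gpu, 2048, per_token))                 # 256 sequences
print(batch_paged(M_gpu, avg_pages=42, P=16, per_token=per_token))  # 780 sequences
```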

Continuous Batching

PagedAttention enables continuous batching, where new requests can start while others are still processing. The system maintains a pool of free physical pages:

\[ N_{\text{free}} = N_{\text{total}} - \sum_{s \in \text{active}} N_s \]

Memory is allocated dynamically: sequence \(s\) receives a new page only when it outgrows its current allocation,

\[ N_s \leftarrow N_s + 1 \quad \text{whenever } L_s > N_s \times P \]
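
A minimal sketch of the free-page pool this enables, with hypothetical names; real engines add preemption, swapping, and prefix sharing on top:

```python
class PagePool:
    """Free-list page allocator: sequences grow one page at a time."""
    def __init__(self, total_pages):
        self.free = list(range(total_pages))

    def alloc(self):
        if not self.free:
            raise MemoryError("no free pages: preempt or queue the request")
        return self.free.pop()

    def release(self, pages):
        # A finished sequence returns all its pages to the pool,
        # letting a queued request start immediately.
        self.free.extend(pages)

pool = PagePool(total_pages=4)
seq_pages = [pool.alloc(), pool.alloc()]   # a sequence grows to 2 pages
pool.release(seq_pages)                    # the sequence finishes
print(len(pool.free))                      # all 4 pages free again
```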

Performance Metrics

Key metrics for evaluating PagedAttention:

  1. Memory Utilization

    \[ \text{Utilization} = \frac{\sum_{i=1}^{B} L_i}{\sum_{i=1}^{B} \left\lceil \frac{L_i}{P} \right\rceil \times P} \]

  2. Throughput per GPU

    \[ \text{Throughput} = \frac{\text{total tokens generated}}{T \times N_{\text{GPU}}} \]

  3. Latency (prefill time plus per-token decode time)

    \[ \text{Latency} = T_{\text{prefill}} + N_{\text{out}} \times \bar{T}_{\text{decode}} \]

Implementation Considerations

The page size \(P\) is a critical hyperparameter. Smaller pages reduce internal waste, since each sequence wastes at most one partial page:

\[ \text{Waste}_i = \left\lceil \frac{L_i}{P} \right\rceil \times P - L_i \leq P - 1 \]

but they increase bookkeeping overhead, since the block table grows:

\[ \text{Entries}_i = \left\lceil \frac{L_i}{P} \right\rceil \]

The optimal page size balances these factors.
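
To see the tradeoff numerically, a small sweep over page sizes with toy sequence lengths (assumed):

```python
import math

def waste_and_overhead(lengths, P):
    # Waste: unused slots in each sequence's last page (< P per sequence).
    # Overhead: number of block-table entries to maintain.
    waste = sum(math.ceil(L / P) * P - L for L in lengths)
    entries = sum(math.ceil(L / P) for L in lengths)
    return waste, entries

lengths = [100, 500, 2000, 50]
for P in (8, 16, 32, 64):
    print(P, waste_and_overhead(lengths, P))  # waste grows, entries shrink
```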

Conclusion

PagedAttention revolutionizes LLM serving by:

  1. Eliminating KV-cache fragmentation through fixed-size pages

  2. Allocating memory on demand as sequences grow

  3. Enabling continuous batching of variable-length requests

The mathematical framework shows how paging transforms the memory allocation problem from \(O(L_{\max} \times B)\) to \(O(\sum \lceil L_i/P \rceil \times P)\), enabling significant throughput improvements in production LLM serving systems.