Pipeline parallelism makes it possible to train large models that don't fit into a single GPU's memory.

Example: Hugging Face's BLOOM model[4] (Hugging Face, "The Technology Behind BLOOM Training," Hugging Face Blog, 2022) is a 176B-parameter Transformer model. Storing the weights in bfloat16 requires 352GB, but the GPUs used to train BLOOM 'only' have 80GB of memory each, and training requires much more memory than just loading the model weights. The final training run was therefore distributed across 384 GPUs.

This is made possible by assigning different layers of the model to different GPUs, a process called model partitioning. Implemented naively, model partitioning results in low GPU utilization. In this post, we'll first discuss the naive implementation of pipeline parallelism and some of its problems. Then, we'll talk about GPipe and PipeDream, two more recent algorithms that alleviate some of the issues with naive pipeline parallelism.

Naive Model Parallelism

Naive model parallelism is the most straightforward way of implementing pipeline-parallel training. We split our model into multiple parts, and assign each one to a GPU. Then we run regular training on minibatches, inserting communication steps at the boundaries where we've split the model.

Let's take this 4-layer sequential model as an example:

\[ \text{output}=\text{L}_4(\text{L}_3(\text{L}_2(\text{L}_1(\text{input})))) \]

We split the computation between two GPUs: the first two layers, \(\text{L}_1\) and \(\text{L}_2\), are assigned to GPU1, and the last two, \(\text{L}_3\) and \(\text{L}_4\), to GPU2.
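As a minimal sketch, we can treat the layers as plain Python functions (the layer definitions below are made up for illustration):

```python
# Hypothetical stand-ins for the four layers.
def L1(x): return 2 * x
def L2(x): return x + 1
def L3(x): return 3 * x
def L4(x): return x - 5

# GPU1 holds L1 and L2; GPU2 holds L3 and L4.
def gpu1_partition(x):
    return L2(L1(x))

def gpu2_partition(x):
    return L4(L3(x))

# Forward pass: compute the intermediate on GPU1, transfer it to GPU2
# (here just a function call), then compute the final output on GPU2.
intermediate = gpu1_partition(10)
output = gpu2_partition(intermediate)
assert output == L4(L3(L2(L1(10))))  # identical to the unpartitioned model
```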

The communication cost for transferring activations between GPUs can be expressed as:

\[ C_{\text{comm}} = \frac{\text{Activation Size}}{\text{Bandwidth}} = \frac{b \times d \times \text{bytes\_per\_element}}{B} \]

where \(b\) is the batch size, \(d\) is the hidden dimension, and \(B\) is the inter-GPU bandwidth in bytes per second. The computation time for a \(d \times d\) linear layer is the number of floating-point operations divided by the GPU's peak throughput:

\[ T_{\text{comp}} = \frac{\text{FLOPs}}{\text{Peak FLOP/s}} = \frac{2 \times b \times d^2}{F_{\text{peak}}} \]

For pipeline parallelism to be beneficial, we need:

\[ T_{\text{comp}} > C_{\text{comm}} \implies \frac{2bd^2}{F_{\text{peak}}} > \frac{bd \times \text{bytes}}{B} \]

Simplifying:

\[ \frac{2d}{F_{\text{peak}}} > \frac{\text{bytes}}{B} \implies d > \frac{F_{\text{peak}} \times \text{bytes}}{2B} \]
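To get a feel for the numbers, here's a quick sketch. The hardware figures are illustrative assumptions (roughly an A100-class GPU with bfloat16 compute and a few hundred GB/s of interconnect bandwidth), not measurements:

```python
# Assumed, illustrative hardware numbers.
F_peak = 312e12        # peak bf16 throughput, FLOP/s
B = 300e9              # inter-GPU bandwidth, bytes/s
bytes_per_element = 2  # bfloat16

# Minimum hidden dimension for compute to dominate communication:
# d > F_peak * bytes / (2 * B)
d_min = F_peak * bytes_per_element / (2 * B)
print(d_min)  # 1040.0
```

Under these assumptions, the hidden dimension only needs to exceed roughly a thousand, which large Transformers easily do.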

To complete a forward pass, we compute the intermediate activation on GPU1 and transfer the resulting tensor to GPU2. GPU2 then computes the model's output and starts the backward pass. For the backward pass, we send the gradient w.r.t. the intermediate activation from GPU2 to GPU1, and GPU1 completes the backward pass from there. This way, model-parallel training produces the same outputs and gradients as single-node training.

Because the sending doesn't modify any bits, naive model-parallel training is, unlike data-parallel training, bit-equal to sequential training. This makes debugging much easier.

Code Example

Here's a simple example of how you might implement this:

    import torch

    def forward_pass(input_data, model_partitions, devices):
        """Naive pipeline-parallel forward pass.

        Runs each partition on its assigned device, moving the
        activations across the GPU boundary between partitions.
        """
        activations = input_data
        for partition, device in zip(model_partitions, devices):
            # Transfer the activations to this partition's GPU,
            # then run the partition's layers on it.
            activations = partition(activations.to(device))
        return activations

The GPipe Algorithm: Splitting Minibatches into Microbatches

GPipe[1] (Huang, Y., et al., "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism," NeurIPS 2019, arXiv:1811.06965) increases efficiency by splitting each minibatch into even smaller, equal-sized microbatches. We can then compute the forward and backward pass independently for each microbatch.

[Figure: GPipe schedule across multiple GPUs, showing the interleaving of forward and backward passes.]

This holds as long as the model contains no batch norm. It's still possible to combine batch norm with GPipe by computing the normalization statistics over each microbatch, which often works in practice, but the result is no longer equal to sequential training.

If we sum up the gradients for each microbatch, we get back the gradient over the whole minibatch, because (just as in data-parallel training) the gradient of a sum is the sum of the gradients of its terms. This process is called gradient accumulation.

Mathematically, if the loss over the full minibatch is the mean of the \(M\) microbatch losses,

\[ \mathcal{L}_{\text{full}} = \frac{1}{M} \sum_{m=1}^{M} \mathcal{L}_m, \]

then by linearity of differentiation, accumulating the microbatch gradients recovers the full-batch gradient:

\[ \nabla_W \mathcal{L}_{\text{full}} = \nabla_W \left(\frac{1}{M} \sum_{m=1}^{M} \mathcal{L}_m\right) = \frac{1}{M} \sum_{m=1}^{M} \nabla_W \mathcal{L}_m \]

where \(\nabla_W \mathcal{L}_m\) is the gradient of the loss on microbatch \(m\) with respect to the weights \(W\).
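This equivalence is easy to verify numerically. Here's a tiny sketch using a toy least-squares loss (all data and dimensions are made up):

```python
import numpy as np

# Toy check that accumulating microbatch gradients reproduces the
# full-batch gradient, for the loss 0.5 * mean((Xw - y)^2).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # full minibatch: 8 examples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)

def grad(X, y, w):
    """Gradient of 0.5 * mean((Xw - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

full_grad = grad(X, y, w)

# Gradient accumulation over M = 4 microbatches of 2 examples each.
M = 4
accumulated = np.zeros_like(w)
for X_m, y_m in zip(np.split(X, M), np.split(y, M)):
    accumulated += grad(X_m, y_m, w)
accumulated /= M

assert np.allclose(accumulated, full_grad)
```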

GPipe Schedule

Let's consider a model partitioned across 4 GPUs. With naive pipeline parallelism, the resulting schedule would look like this:

    Timestep   0    1    2    3    4    5    6    7
    GPU3                      FWD  BWD
    GPU2                 FWD            BWD
    GPU1            FWD                      BWD
    GPU0       FWD                                BWD

With GPipe, we now split our minibatch into microbatches; let's say 4 of them:

    Timestep   0   1   2   3   4   5   6   7   8   9   10  11  12  13
    GPU3                   F1  F2  F3  F4  B4  B3  B2  B1
    GPU2               F1  F2  F3  F4          B4  B3  B2  B1
    GPU1           F1  F2  F3  F4                  B4  B3  B2  B1
    GPU0       F1  F2  F3  F4                          B4  B3  B2  B1

Here F1 means performing the forward pass of microbatch 1 using the layer partition stored on the current GPU. For more details on the implementation, see the original GPipe paper[1], and see PipeDream[2] (Narayanan, D., et al., "PipeDream: Generalized Pipeline Parallelism for DNN Training," SOSP 2019, arXiv:1806.03377) for an alternative approach.
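From schedules like these we can read off a simple utilization model. With \(K\) pipeline stages and \(M\) microbatches, and assuming the forward and backward pass of one microbatch each take one timestep per stage, every GPU is busy for \(2M\) of the \(2(M + K - 1)\) total timesteps. A quick sketch:

```python
def pipeline_utilization(num_stages: int, num_microbatches: int) -> float:
    """Fraction of timesteps each GPU is busy, assuming the forward and
    backward pass of one microbatch each take one timestep per stage."""
    busy = 2 * num_microbatches
    total = 2 * (num_microbatches + num_stages - 1)
    return busy / total

print(pipeline_utilization(4, 1))  # naive schedule: 0.25
print(pipeline_utilization(4, 4))  # GPipe, 4 microbatches: ~0.57
```

Increasing the number of microbatches shrinks the pipeline "bubble" and pushes utilization toward 1.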

Conclusion

Pipeline parallelism is a way of training large models that do not fit into a single GPU's memory, by partitioning the model's layers across GPUs. We perform GPU-to-GPU communication between the model partitions during the forward pass (to send activations) and the backward pass (to send gradients).

We saw how naive model parallelism suffers from poor GPU utilization. This is alleviated by GPipe, which splits minibatches into smaller microbatches, keeping multiple GPUs busy at any given time.