Graphics Processing Units (GPUs) have become the cornerstone of modern deep learning and high-performance computing. Understanding CUDA primitives is essential for writing efficient GPU code and optimizing neural network operations.

GPU Architecture Overview

Modern GPUs consist of thousands of cores organized into Streaming Multiprocessors (SMs). Each SM contains multiple CUDA cores that execute threads in groups called warps. The key to GPU performance lies in maximizing parallelism and minimizing memory access latency.

The theoretical peak performance of a GPU can be calculated as:

\[ \text{Peak FLOPS} = \text{Number of SMs} \times \text{Cores per SM} \times \text{Clock Frequency} \times \text{Operations per Cycle} \]

For example, an NVIDIA A100 GPU with 108 SMs, 64 cores per SM, running at 1.41 GHz, can theoretically achieve:

\[ \text{Peak FLOPS} = 108 \times 64 \times 1.41 \times 10^9 \times 2 = 19.5 \text{ TFLOPS} \]

where the factor of 2 accounts for fused multiply-add (FMA) operations.
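As a sanity check, the arithmetic above can be expressed as a small host-side helper (the A100 figures are from the example; other GPUs substitute their own values):

```cuda
#include <cstdio>

// Host-side helper: theoretical peak FLOPS from the formula above.
// ops_per_cycle = 2 accounts for fused multiply-add (one multiply + one add).
double peak_flops(int num_sms, int cores_per_sm, double clock_hz, int ops_per_cycle) {
    return (double)num_sms * cores_per_sm * clock_hz * ops_per_cycle;
}

int main() {
    // NVIDIA A100: 108 SMs, 64 FP32 cores per SM, 1.41 GHz boost clock.
    double flops = peak_flops(108, 64, 1.41e9, 2);
    printf("Peak: %.1f TFLOPS\n", flops / 1e12);  // ~19.5 TFLOPS
    return 0;
}
```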

CUDA Memory Hierarchy

CUDA provides several memory types, each with different characteristics:

  1. Global Memory: Large, high-latency, accessible by all threads
  2. Shared Memory: Fast, on-chip, shared among threads in a block
  3. Registers: Fastest, private to each thread
  4. Constant Memory: Cached read-only memory
  5. Texture Memory: Optimized for 2D spatial locality
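
As a sketch, the first four memory spaces map onto CUDA qualifiers roughly as follows (the kernel and variable names are illustrative; texture memory, typically accessed through a `cudaTextureObject_t`, is omitted for brevity):

```cuda
__constant__ float coeffs[16];          // constant memory: cached, read-only

__global__ void memory_spaces_demo(const float* in, float* out, int n) {
    extern __shared__ float tile[];     // shared memory: on-chip, per-block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = 0.0f;                     // register: private to this thread
    if (i < n) {
        x = in[i];                      // global memory: high-latency load
        tile[threadIdx.x] = x * coeffs[0];
    }
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x];
}
```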

The memory bandwidth can be estimated using:

\[ \text{Bandwidth} = \frac{\text{Data Transferred}}{\text{Time}} = \frac{N \times \text{Size per Element}}{\text{Latency} + \frac{N \times \text{Size per Element}}{\text{Peak Bandwidth}}} \]

Parallel Reduction Pattern

One of the most fundamental CUDA patterns is parallel reduction. Given an array of \(N\) elements, we want to compute:

\[ \text{Sum} = \sum_{i=0}^{N-1} a[i] \]

The sequential algorithm has \(O(N)\) time complexity. Using parallel reduction with \(P\) processors:

\[ T_{\text{parallel}} = O\left(\frac{N}{P} + \log P\right) \]

Each reduction step halves the number of active threads:

\[ \text{Active Threads at step } k = \left\lfloor \frac{N}{2^k} \right\rfloor \]

Reduction Implementation

__global__ void reduce_kernel(float* input, float* output, int n) {
    extern __shared__ float sdata[];    // sized at launch: blockDim.x floats
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage one element per thread into shared memory; pad with 0 past the end.
    sdata[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    // Assumes blockDim.x is a power of two.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) output[blockIdx.x] = sdata[0];
}
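
A host-side driver for this kernel might look like the following sketch. Note that each block produces one partial sum, so the per-block results must be reduced again (here on the host, for simplicity):

```cuda
#include <cuda_runtime.h>

// reduce_kernel as defined above.

float reduce_on_gpu(const float* h_input, int n) {
    const int block = 256;                    // power of two, as the kernel assumes
    const int grid = (n + block - 1) / block;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, grid * sizeof(float));
    cudaMemcpy(d_in, h_input, n * sizeof(float), cudaMemcpyHostToDevice);

    // The third launch parameter sizes the extern __shared__ array.
    reduce_kernel<<<grid, block, block * sizeof(float)>>>(d_in, d_out, n);

    // Each block wrote one partial sum; finish the reduction on the host.
    float* partial = new float[grid];
    cudaMemcpy(partial, d_out, grid * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int b = 0; b < grid; ++b) sum += partial[b];

    delete[] partial;
    cudaFree(d_in);
    cudaFree(d_out);
    return sum;
}
```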

Matrix Multiplication Optimization

Matrix multiplication is central to deep learning. The naive implementation has complexity:

\[ C_{ij} = \sum_{k=0}^{K-1} A_{ik} \times B_{kj} \]

For matrices \(A \in \mathbb{R}^{M \times K}\) and \(B \in \mathbb{R}^{K \times N}\), this requires \(O(MNK)\) operations.

Using shared memory tiling, we can improve memory access patterns: each \(T \times T\) tile of \(A\) and \(B\) is loaded from global memory once and then reused by \(T\) threads. Each of the \(MN/T^2\) output tiles consumes \(K/T\) tile pairs of \(2T^2\) elements each:

\[ \text{Memory Accesses} = \frac{MN}{T^2} \times \frac{K}{T} \times 2T^2 = \frac{2MNK}{T} \]

compared to \(2MNK\) for the naive approach, giving a reduction in global memory traffic of:

\[ \text{Speedup} = \frac{2MNK}{2MNK/T} = T \]
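
A minimal tiled multiplication kernel, assuming for clarity that \(M\), \(N\), and \(K\) are multiples of the tile width:

```cuda
#define TILE 16  // tile width T; assumed to divide M, N, and K

__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K / TILE; ++t) {
        // Each thread stages one element of the A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Every element loaded above is reused by TILE threads.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```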

Warp-Level Primitives

CUDA warps consist of 32 threads that execute in lockstep. Warp-level primitives enable efficient communication:

Shuffle Operations

The shuffle operation allows threads within a warp to exchange data:

\[ \text{Shuffle}(x, \text{lane}) = x_{\text{lane}} \]

This enables efficient reductions within a warp:

\[ x_{\text{lane}} \leftarrow x_{\text{lane}} + x_{\text{lane}+s}, \quad s = 16, 8, 4, 2, 1 \]
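
Using `__shfl_down_sync`, this warp-level reduction can be sketched as:

```cuda
// Sums `val` across the 32 lanes of a warp; lane 0 holds the result.
__device__ float warp_reduce_sum(float val) {
    // Offsets 16, 8, 4, 2, 1: log2(32) = 5 steps, no __syncthreads() needed.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```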

Warp Reduce Complexity

For a warp reduction, we need \(\log_2(32) = 5\) steps:

\[ \text{Steps} = \log_2(\text{Warp Size}) = \log_2(32) = 5 \]

Memory Coalescing

Optimal memory access patterns are crucial. For coalesced access, consecutive threads access consecutive memory locations:

\[ \text{Address}(t) = \text{Base} + t \times \text{Stride} \]

where \(\text{Stride} = 1\) for optimal coalescing. The bandwidth utilization is:

\[ \text{Bandwidth Utilization} = \frac{\text{Requested Bytes}}{\text{Transferred Bytes}} \times 100\% \]
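
The contrast can be sketched with two copy kernels; in the first, a warp's 32 loads fall on consecutive addresses and combine into a few transactions, while in the second they scatter across many cache lines:

```cuda
// Coalesced: thread t accesses element t, so consecutive threads
// touch consecutive floats.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread t accesses element t * stride, spreading the warp's
// loads across many cache lines and wasting bandwidth when stride > 1.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```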

Occupancy and Resource Limits

GPU occupancy measures how many warps can run concurrently on an SM. The occupancy is limited by:

\[ \text{Occupancy} = \frac{\text{Active Warps per SM}}{\text{Maximum Warps per SM}} \]

For optimal performance, we typically aim for occupancy > 50%.
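
The CUDA runtime can report the achievable occupancy of a given launch configuration. A sketch using `reduce_kernel` from earlier (device 0 is assumed):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// reduce_kernel as defined earlier.

void report_occupancy(int block_size, size_t shared_bytes) {
    int max_blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks, reduce_kernel, block_size, shared_bytes);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int active_warps = max_blocks * block_size / prop.warpSize;
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %.0f%%\n", 100.0 * active_warps / max_warps);
}
```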

Performance Metrics

Key metrics for GPU performance evaluation:

  1. Throughput: Operations per second

    \[ \text{Throughput} = \frac{\text{Total Operations}}{\text{Execution Time}} \]

  2. Efficiency: Actual vs theoretical performance

    \[ \text{Efficiency} = \frac{\text{Achieved FLOPS}}{\text{Peak FLOPS}} \]

  3. Memory Bandwidth Utilization

    \[ \text{Utilization} = \frac{\text{Achieved Bandwidth}}{\text{Peak Bandwidth}} \]
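
These metrics can be measured with CUDA events; a sketch that times a simple copy kernel and reports achieved bandwidth (kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void copy_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

void measure_bandwidth(const float* d_in, float* d_out, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // milliseconds

    double bytes_moved = 2.0 * n * sizeof(float);  // one read + one write per element
    printf("Achieved bandwidth: %.1f GB/s\n", bytes_moved / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```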

Conclusion

Understanding CUDA primitives and GPU architecture is essential for writing efficient parallel code. The key takeaway: the mathematical relationships between parallelism, memory access patterns, and performance provide a framework for optimizing GPU kernels.