Chapter 26: GPU Benchmarking

Part VII: AI/HPC

"The GPU is the new CPU." — Jensen Huang

The First Time Everything Looks Wrong

Carlos had been a CPU performance engineer for eight years. When his company pivoted to AI, he was assigned to optimize their training infrastructure. "How different can GPUs be?" he thought. "Cores are cores. Memory is memory."

His first profiling session was humbling.

He ran Intel VTune out of habit—it showed barely any useful data. He tried perf—the GPU was invisible. He looked at nvidia-smi and saw "GPU Utilization: 73%." Was that good? Bad? What was the other 27% doing?

When Carlos finally got Nsight Compute working, the metrics were alien. "SM Occupancy: 42%." "Warp Stall: Memory." "L2 Hit Rate: 31%." Nothing mapped to his mental model built on branch prediction, instruction-level parallelism, and cache hierarchies.

"I feel like I'm starting from scratch," he told his new teammate, Wei, a GPU specialist.

"You kind of are," Wei admitted. "But the good news is, the fundamentals transfer. You still care about memory access patterns, instruction throughput, and utilization. It's just that everything is scaled up by a factor of 1000, and the vocabulary is different."

This chapter will help you navigate that transition.

Why GPU Profiling Is Different

GPU performance analysis differs from CPU analysis in three fundamental ways:

1. Parallelism at Unprecedented Scale

When Carlos worked on CPUs, "high parallelism" meant 64 threads across 32 cores. A GPU like the NVIDIA H100 can keep over 270,000 threads resident at once (132 SMs × 2,048 threads each). The optimization strategies that work at 64-thread scale, such as careful load balancing and avoiding contention, become both more critical and harder to reason about at 270,000-thread scale.

System	Parallel Units	Threads	Ratio
Xeon 8490H	60 cores	120 threads	1x
H100 SXM5	132 SMs	270,336 threads	2,250x

2. Memory Bandwidth Dominates

CPUs are designed for low-latency access to small amounts of data. GPUs are designed for high-bandwidth access to large amounts of data. The H100 delivers roughly 3 TB/s of memory bandwidth, about ten times what a high-end server CPU gets from eight channels of DDR5-4800 (roughly 300 GB/s). But that bandwidth is shared across all 270,000 threads, so per-thread bandwidth is actually far lower than on a CPU.
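The per-thread claim is simple division. The sketch below uses the GPU figures from this section. The CPU bandwidth of 307.2 GB/s (eight DDR5-4800 channels) is an assumption and not a measured value, so substitute the number for your own system.

```python
def per_thread_bandwidth(total_bytes_per_s: float, threads: int) -> float:
    """Bandwidth each thread gets if all threads share memory evenly."""
    return total_bytes_per_s / threads

# Figures from the text: H100 at ~3 TB/s shared by 270,336 resident threads.
gpu = per_thread_bandwidth(3e12, 270_336)

# Assumed CPU figure (not from the text): 307.2 GB/s shared by 120 threads.
cpu = per_thread_bandwidth(307.2e9, 120)

print(f"GPU: {gpu / 1e6:.1f} MB/s per thread")
print(f"CPU: {cpu / 1e9:.2f} GB/s per thread")
```

Under these assumptions each GPU thread gets on the order of megabytes per second while each CPU thread gets gigabytes, which is why wasted memory traffic hurts so much more on a GPU.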

3. Different Execution Model

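CPUs execute threads independently. GPUs execute threads in groups called warps (NVIDIA, 32 threads) or wavefronts (AMD, 32 or 64 threads depending on the architecture). All threads in a warp execute the same instruction at the same time. When threads in a warp take different branches, the hardware runs each path in turn with the other threads masked off, so a two-way split can roughly halve throughput for that section of code.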

Memory Hierarchy

GPU Memory Hierarchy:

┌─────────────────────────────────────────────────────────┐
│                    HBM (80 GB)                          │
│                    ~3 TB/s                              │
├─────────────────────────────────────────────────────────┤
│                    L2 Cache (50 MB)                     │
│                    ~12 TB/s                             │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │ Shared Mem  │  │ Shared Mem  │  │ Shared Mem  │ ... │
│  │  (228 KB)   │  │  (228 KB)   │  │  (228 KB)   │     │
│  │  ~20 TB/s   │  │  ~20 TB/s   │  │  ~20 TB/s   │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Registers  │  │  Registers  │  │  Registers  │ ... │
│  │  (256 KB)   │  │  (256 KB)   │  │  (256 KB)   │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
└─────────────────────────────────────────────────────────┘

Execution Model

CUDA Execution Model:

Grid
 └── Block (up to 1024 threads)
      └── Warp (32 threads, SIMT)
           └── Thread

Key concepts:
  - Warp is the minimum scheduling unit
  - Threads in same warp execute same instruction
  - Warp divergence reduces efficiency
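To make the hierarchy concrete, here is a small sketch that maps a one-dimensional thread index to its position in the hierarchy. The function names are illustrative and not part of any CUDA API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs, as stated above

def locate_thread(block_idx: int, thread_idx: int, block_dim: int) -> dict:
    """Map a 1-D (blockIdx.x, threadIdx.x) pair to its global index, warp, and lane."""
    if not 0 <= thread_idx < block_dim:
        raise ValueError("thread_idx must be in [0, block_dim)")
    return {
        "global_idx": block_idx * block_dim + thread_idx,
        "warp_in_block": thread_idx // WARP_SIZE,
        "lane": thread_idx % WARP_SIZE,
    }

def warps_per_block(block_dim: int) -> int:
    """Warps needed for a block; a partial warp still occupies a full warp slot."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE

print(locate_thread(block_idx=2, thread_idx=70, block_dim=256))
print(warps_per_block(100))
```

A block of 100 threads needs four warps, and the last one runs with 28 lanes idle. This is one reason block sizes are usually chosen as multiples of 32.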

CUDA Performance Analysis Tools

nvidia-smi

The most basic GPU monitoring tool:

# One-time snapshot of current GPU state
nvidia-smi

# Continuous monitoring (update every second)
nvidia-smi -l 1

# Query specific metrics
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw \
           --format=csv -l 1

Output example:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA H100 80GB    On   | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0   115W / 700W |   1024MiB / 81559MiB |     45%      Default |
+-------------------------------+----------------------+----------------------+
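For benchmarking, the query form is easier to post-process than the table above. The sketch below parses rows in the shape produced by `--format=csv,noheader,nounits`. The sample string is illustrative text rather than output captured from a real run, and `query_gpu` is a hypothetical helper that only works on a machine with an NVIDIA driver.

```python
import csv
import io
import subprocess

QUERY = "utilization.gpu,memory.used,power.draw"

def read_samples(text: str) -> list[dict]:
    """Parse rows produced with --format=csv,noheader,nounits."""
    samples = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        util, mem, power = row
        samples.append(
            {"util_pct": float(util), "mem_mib": float(mem), "power_w": float(power)}
        )
    return samples

def query_gpu() -> str:
    """Run nvidia-smi once; requires an NVIDIA driver on the machine."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Illustrative text in the expected shape, not captured from a real run.
sample = "45, 1024, 115.00\n73, 2048, 310.50\n"
samples = read_samples(sample)
mean_util = sum(s["util_pct"] for s in samples) / len(samples)
print(f"mean GPU utilization: {mean_util:.1f}%")
```

On a GPU machine, `read_samples(query_gpu())` would replace the sample string. Averaging over many samples is more informative than a single reading like the 73% that puzzled Carlos.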

Nsight Compute

Modern CUDA kernel analysis tool:

# Basic analysis
ncu ./my_cuda_app

# Detailed analysis (all sections)
ncu --set full ./my_cuda_app

# Analyze specific kernel
ncu --kernel-name "myKernel" ./my_cuda_app

# Output report
ncu -o report ./my_cuda_app

Nsight Systems

System-level profiling:

# Basic trace
nsys profile ./my_cuda_app

# Trace CUDA API calls, kernels, and NVTX ranges
nsys profile --trace=cuda,nvtx ./my_cuda_app

# Output report
nsys profile -o timeline ./my_cuda_app

Key Performance Metrics

SM Occupancy

Occupancy = Active Warps / Maximum Warps per SM

Influencing factors:
  - Registers used per thread
  - Shared memory used per block
  - Block size
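The three factors interact through per-SM resource limits. The sketch below is a simplified model, not the vendor's occupancy calculator. The default limits (64 warps, 32 blocks, 65,536 registers, and 228 KB of shared memory per SM) are assumed H100-class values, and the model ignores allocation granularity and per-block reservations.

```python
import math

WARP_SIZE = 32

def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                       max_warps=64, max_blocks=32,
                       regs_per_sm=65536, smem_per_sm=228 * 1024):
    """Return (occupancy, limiter) for one SM under a simplified resource model."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # How many blocks fit on one SM according to each resource.
    limits = {
        "warps": max_warps // warps_per_block,
        "registers": regs_per_sm // (regs_per_thread * WARP_SIZE * warps_per_block),
        "shared memory": smem_per_sm // smem_per_block if smem_per_block else max_blocks,
        "blocks": max_blocks,
    }
    limiter = min(limits, key=limits.get)  # the scarcest resource wins
    resident_blocks = limits[limiter]
    return resident_blocks * warps_per_block / max_warps, limiter

print(estimate_occupancy(256, 64, 0))          # register-limited
print(estimate_occupancy(256, 32, 48 * 1024))  # shared-memory-limited
```

In this model, both example configurations land at 50% occupancy for different reasons, which is why knowing the limiter matters more than knowing the percentage.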

Tensor Core Generations

```text
Generation   GPU          Supported Precision       Matrix Size
─────────────────────────────────────────────────────────────────
V1 (Volta)   V100         FP16                      4×4×4
V2 (Turing)  RTX 20xx     FP16, INT8, INT4          8×8×4
V3 (Ampere)  A100         FP16, BF16, TF32,         8×8×4
                          FP64, INT8
V4 (Hopper)  H100         FP16, BF16, TF32,         16×8×16
                          FP8, FP64, INT8
```

Tensor Core Efficiency

Conditions for high Tensor Core efficiency:

1. Matrix dimension alignment
   - m, n, k should be multiples of 8 or 16
   - Depends on precision and GPU generation

2. Memory alignment
   - Matrix start address aligned to 16 bytes
   - Leading dimension aligned

3. Sufficient parallelism
   - Need enough tiles to fill GPU
   - Small matrices have low efficiency

Typical efficiency:
  Large matrices (4096+):  80-95% of peak
  Medium matrices (1024):  50-80% of peak
  Small matrices (256):    20-50% of peak
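The "percent of peak" figures above come from comparing achieved FLOP/s with the hardware's rated peak. Here is a minimal sketch of that calculation, along with a helper for padding dimensions to an aligned size. The peak of 989 TFLOPS (dense FP16 on an H100 SXM) and the 0.2 ms timing are assumptions for illustration, so use your GPU's datasheet value and your own measurement.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple

def gemm_efficiency(m: int, n: int, k: int, seconds: float, peak_flops: float) -> float:
    """Fraction of peak achieved by an m x n x k matrix multiply.

    A GEMM performs 2*m*n*k floating-point operations (one multiply and
    one add per inner-product term).
    """
    achieved = 2 * m * n * k / seconds
    return achieved / peak_flops

# Assumed peak, not from the text: dense FP16 Tensor Core rate of an H100 SXM.
ASSUMED_PEAK = 989e12

print(round_up(1001, 8))  # pad a dimension to a multiple of 8
# Hypothetical measurement: a 4096^3 GEMM that took 0.2 ms.
print(f"{gemm_efficiency(4096, 4096, 4096, 2e-4, ASSUMED_PEAK):.1%}")
```

Padding a dimension from 1001 to 1008 wastes under 1% of the work but keeps every tile aligned, which is usually a good trade.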

Practical Profiling Workflow

Step 1: System-Level Analysis

# Use Nsight Systems for overall view
nsys profile -o overview ./my_app

# View report
nsys-ui overview.nsys-rep

Identify:

  - CPU vs GPU time distribution
  - Kernel execution time
  - Data transfer overhead
  - Synchronization waits

Step 2: Identify Hotspots

From Nsight Systems report, find:

1. Longest-running kernels
2. Frequently called kernels
3. Data transfer bottlenecks
4. Unnecessary synchronization
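Once kernel timings are exported from a trace, ranking them takes a few lines. The sketch below assumes a list of `(kernel_name, duration_us)` records. Both the record format and the kernel names are hypothetical, not a format that Nsight Systems emits directly.

```python
from collections import defaultdict

def rank_kernels(records):
    """Aggregate (name, duration_us) records into (name, total_us, calls), longest first."""
    totals = defaultdict(lambda: [0.0, 0])
    for name, duration_us in records:
        totals[name][0] += duration_us
        totals[name][1] += 1
    ranked = [(name, total, calls) for name, (total, calls) in totals.items()]
    return sorted(ranked, key=lambda r: r[1], reverse=True)

# Hypothetical records for illustration only.
records = [
    ("attention", 900.0), ("layernorm", 40.0), ("attention", 880.0),
    ("layernorm", 42.0), ("layernorm", 41.0), ("memcpy_h2d", 300.0),
]
for name, total_us, calls in rank_kernels(records):
    print(f"{name:12s} {total_us:8.1f} us over {calls} calls")
```

Sorting by total time rather than per-call time surfaces both kinds of hotspot from the list above: one long kernel and many short ones can cost the same.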

Step 3: Deep Analysis

# Detailed analysis of specific kernel
ncu --set full \
    --kernel-name "hotKernel" \
    -o detailed_report \
    ./my_app

Step 4: Bottleneck Diagnosis

Nsight Compute provides bottleneck analysis:

Memory Bound:
  - High memory throughput
  - Low compute throughput
  - Solution: Optimize access patterns, use shared memory

Compute Bound:
  - High compute throughput
  - Low memory throughput
  - Solution: Algorithm optimization, use Tensor Cores

Latency Bound:
  - Low occupancy
  - High stall percentage
  - Solution: Increase parallelism, reduce dependencies

Instruction Bound:
  - Instruction issue becomes bottleneck
  - Solution: Reduce instruction count, use vectorization
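The first three categories can be sketched as a rough decision rule on two numbers: memory throughput and compute throughput, each as a percentage of peak. The 60% threshold is an assumption chosen for illustration, and the rule does not separate the instruction-bound case, so treat the result as a starting hypothesis and not a verdict.

```python
def classify_kernel(memory_pct: float, compute_pct: float, high: float = 60.0) -> str:
    """Label a kernel from its memory and compute throughput, both as % of peak.

    `high` is an assumed threshold for "high" utilization.
    """
    if max(memory_pct, compute_pct) < high:
        # Neither unit is busy: the kernel is mostly waiting.
        return "latency bound"
    return "memory bound" if memory_pct >= compute_pct else "compute bound"

print(classify_kernel(85.0, 30.0))
print(classify_kernel(25.0, 90.0))
print(classify_kernel(20.0, 15.0))
```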

Common Performance Issues

Warp Divergence

// Problematic code
// expensive_function_a/b stand in for __device__ functions defined elsewhere
__global__ void divergent(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // Guard threads past the end of the array
    if (idx % 2 == 0) {
        // Even threads take this path
        data[idx] = expensive_function_a(data[idx]);
    } else {
        // Odd threads take this path
        data[idx] = expensive_function_b(data[idx]);
    }
}

// Even and odd lanes of the same warp take different branches,
// so the warp executes both paths one after the other
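A host-side model makes the cost visible without a GPU. The sketch below counts how many distinct branch paths each warp must execute. The helper name and the warp-aligned alternative are illustrative assumptions, not code from the kernel above.

```python
WARP_SIZE = 32

def paths_per_warp(branch_of, num_threads: int) -> list[int]:
    """For each warp, count the distinct branches its threads take.

    A count above 1 means the warp runs the paths one after another.
    """
    counts = []
    for start in range(0, num_threads, WARP_SIZE):
        lanes = range(start, min(start + WARP_SIZE, num_threads))
        counts.append(len({branch_of(idx) for idx in lanes}))
    return counts

# Branch on thread parity, as in the kernel above: every warp diverges.
by_thread = paths_per_warp(lambda idx: idx % 2, 128)

# Branch on warp parity instead: each warp takes a single path.
by_warp = paths_per_warp(lambda idx: (idx // WARP_SIZE) % 2, 128)

print(by_thread, by_warp)
```

Branching per warp changes which elements get which function, so it only applies when the data can be reordered to match.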

Memory Coalescing

// Problem: Non-coalesced access
__global__ void strided(float* data, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    data[idx] = 1.0f;  // Strided access, low efficiency
}

// Solution: Coalesced access
__global__ void coalesced(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = 1.0f;  // Contiguous access, high efficiency
}
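The efficiency gap can be estimated by counting how many memory sectors one warp touches. The sketch below assumes 32-byte sectors, 4-byte elements, and a sector-aligned base address. It models a single warp-wide access and is not a full simulation of the cache hierarchy.

```python
SECTOR_BYTES = 32  # assumed granularity of a global memory transaction
WARP_SIZE = 32

def sectors_touched(stride: int, elem_bytes: int = 4) -> int:
    """Number of distinct sectors one warp touches for a given element stride."""
    addresses = [lane * stride * elem_bytes for lane in range(WARP_SIZE)]
    return len({addr // SECTOR_BYTES for addr in addresses})

def access_efficiency(stride: int, elem_bytes: int = 4) -> float:
    """Useful bytes divided by bytes actually moved."""
    useful = WARP_SIZE * elem_bytes
    return useful / (sectors_touched(stride, elem_bytes) * SECTOR_BYTES)

for stride in (1, 2, 8):
    print(stride, sectors_touched(stride), f"{access_efficiency(stride):.1%}")
```

In this model, once the stride reaches eight floats every thread lands in its own sector, and seven of every eight bytes moved are discarded.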

Bank Conflicts

// Problem: Shared memory bank conflict
__shared__ float smem[32][32];

// All 32 threads of the warp read column 0. Rows are 32 floats apart,
// so every address maps to the same bank
float val = smem[threadIdx.x][0];  // 32-way bank conflict

// Solution: Padding
__shared__ float smem_padded[32][33];  // Add one column padding
float val_padded = smem_padded[threadIdx.x][0];  // Rows are 33 floats apart: no conflict
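The effect of the extra column can be checked with bank arithmetic. The sketch below assumes 32 banks, each 4 bytes wide, so the bank of a float is its word index modulo 32. The helper name is illustrative.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 banks, each 4 bytes wide
WARP_SIZE = 32

def conflict_degree(row_width: int, column: int = 0) -> int:
    """Worst-case lanes per bank when lane i reads element [i][column]
    of a float array with `row_width` columns."""
    banks = Counter(
        (lane * row_width + column) % NUM_BANKS for lane in range(WARP_SIZE)
    )
    return max(banks.values())

print(conflict_degree(32))  # unpadded rows
print(conflict_degree(33))  # one column of padding
```

With 32-float rows every lane maps to the same bank. With 33-float rows each lane's offset shifts by one bank, so the 32 reads spread across all 32 banks.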

Low Occupancy

Diagnosis:
  ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./my_app

Causes and solutions:
  1. Too many registers used
     - Reduce per-thread variables
     - Use __launch_bounds__

  2. Too much shared memory used
     - Reduce shared memory size
     - Use dynamic allocation sized to what each launch actually needs

  3. Unsuitable block size
     - Adjust block dimensions
     - Ensure multiple of 32

Carlos's First Optimization Win

Three months into his GPU journey, Carlos achieved his first major optimization win.

The team's attention kernel was running at 35% of theoretical peak. After careful analysis with Nsight Compute, he identified the bottleneck: bank conflicts in shared memory during the softmax computation.

"In CPU terms," he explained to the team, "it's like having eight cores all trying to access the same cache line. On a GPU, that's 32 threads fighting for the same memory bank, and they have to take turns."

He restructured the data layout to eliminate the conflicts. The kernel jumped to 68% of peak—nearly doubling throughput.

"The weird thing," Carlos admitted, "is that once you understand the GPU execution model, this stuff becomes obvious. The tools show you exactly where the problem is. The hard part is learning to read what they're telling you."

Wei nodded. "Welcome to GPU performance. The tools are powerful. The bottlenecks are visible. The optimization strategies are well-documented. You just have to unlearn your CPU instincts first."

GPU benchmarking isn't harder than CPU benchmarking—it's different. The same principles apply: measure before optimizing, understand the hardware model, and let data guide your decisions. The vocabulary and tools are new, but the discipline is the same.

Summary

GPU benchmarking requires understanding the GPU's architecture and the tools built around it:

Key Tools

  - nvidia-smi: Basic monitoring
  - Nsight Systems: System-level analysis
  - Nsight Compute: Kernel-level analysis
  - rocprof/Omniperf: AMD GPU

Key Metrics

  - Occupancy: Warp utilization
  - Memory Throughput: Memory bandwidth utilization
  - Compute Throughput: Compute unit utilization
  - Tensor Core Utilization: Matrix operation efficiency

Common Bottlenecks

  - Warp divergence
  - Non-coalesced memory access
  - Bank conflicts
  - Low occupancy

Best Practices

1. Use Nsight Systems first for an overall view
2. Identify hotspot kernels
3. Use Nsight Compute for deep analysis
4. Choose an optimization strategy based on the bottleneck type
