Chapter 26: GPU Benchmarking
Part VII: AI/HPC
"The GPU is the new CPU." — Jensen Huang
The First Time Everything Looks Wrong
Carlos had been a CPU performance engineer for eight years. When his company pivoted to AI, he was assigned to optimize their training infrastructure. "How different can GPUs be?" he thought. "Cores are cores. Memory is memory."
His first profiling session was humbling.
He ran Intel VTune out of habit—it showed barely any useful data. He tried perf—the GPU was invisible. He looked at nvidia-smi and saw "GPU Utilization: 73%." Was that good? Bad? What was the other 27% doing?
When Carlos finally got Nsight Compute working, the metrics were alien. "SM Occupancy: 42%." "Warp Stall: Memory." "L2 Hit Rate: 31%." Nothing mapped to his mental model built on branch prediction, instruction-level parallelism, and cache hierarchies.
"I feel like I'm starting from scratch," he told his new teammate, Wei, a GPU specialist.
"You kind of are," Wei admitted. "But the good news is, the fundamentals transfer. You still care about memory access patterns, instruction throughput, and utilization. It's just that everything is scaled up by a factor of 1000, and the vocabulary is different."
This chapter will help you navigate that transition.
Why GPU Profiling Is Different
GPU performance analysis differs from CPU analysis in three fundamental ways:
1. Parallelism at Unprecedented Scale
When Carlos worked on CPUs, "high parallelism" meant 64 threads across 32 cores. A GPU like the NVIDIA H100 can keep over 270,000 threads resident at once (132 SMs, each holding up to 2,048 threads). The optimization strategies that work at 64-thread scale, such as careful load balancing and avoiding contention, become both more critical and harder to reason about at 270,000-thread scale.
| System | Parallel Units | Max Resident Threads | Thread Ratio vs. CPU |
|---|---|---|---|
| Xeon 8490H | 60 cores | 120 threads | 1x |
| H100 SXM5 | 132 SMs | 270,336 threads | ~2,250x |
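The thread counts in the table follow from per-unit limits. The sketch below reproduces them, assuming two hardware threads per Xeon core and the H100 limit of 64 resident warps (2,048 threads) per SM; the constant names are illustrative.

```python
# Resident-thread arithmetic behind the table above.
XEON_CORES = 60
XEON_THREADS_PER_CORE = 2   # assumption: Hyper-Threading enabled

H100_SMS = 132
H100_WARPS_PER_SM = 64      # maximum resident warps per SM
WARP_SIZE = 32

cpu_threads = XEON_CORES * XEON_THREADS_PER_CORE
gpu_threads = H100_SMS * H100_WARPS_PER_SM * WARP_SIZE
ratio = gpu_threads / cpu_threads

print(cpu_threads, gpu_threads, round(ratio))  # 120 270336 2253
```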
2. Memory Bandwidth Dominates
CPUs are designed for low-latency access to small amounts of data. GPUs are designed for high-bandwidth access to large amounts of data. The H100 SXM5 delivers roughly 3 TB/s of peak HBM bandwidth, about 10x that of a high-end server CPU (an eight-channel DDR5-4800 socket peaks near 0.3 TB/s). But that bandwidth is shared across all 270,000 threads, so per-thread bandwidth is actually far lower than on a CPU.
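To make the per-thread claim concrete, here is a rough calculation. The GPU figures come from the text; the CPU figure is an assumption (an eight-channel DDR5-4800 socket at about 307 GB/s peak, shared by 120 hardware threads), so substitute your own numbers.

```python
# Per-thread share of peak memory bandwidth (peak figures, not measured).
GPU_BANDWIDTH = 3.0e12      # bytes/s, H100 HBM figure from the text
GPU_THREADS = 270_336       # maximum resident threads

CPU_BANDWIDTH = 307.2e9     # bytes/s, assumed: 8 channels x 4800 MT/s x 8 bytes
CPU_THREADS = 120

gpu_per_thread = GPU_BANDWIDTH / GPU_THREADS
cpu_per_thread = CPU_BANDWIDTH / CPU_THREADS

print(f"GPU: {gpu_per_thread / 1e6:.1f} MB/s per thread")
print(f"CPU: {cpu_per_thread / 1e9:.2f} GB/s per thread")
```

Under these assumptions each GPU thread gets a share in the tens of MB/s while each CPU thread gets a few GB/s, which is why GPU kernels depend so heavily on coalescing and on-chip reuse.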
3. Different Execution Model
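CPUs execute threads independently. GPUs execute threads in groups called warps (NVIDIA, 32 threads) or wavefronts (AMD, 32 or 64 threads depending on the architecture). All threads in a warp execute the same instruction at the same time. When threads in a warp take different branches, the hardware runs each path in turn with the other threads masked off, so throughput can drop by a factor of up to the number of distinct paths.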
Memory Hierarchy
GPU Memory Hierarchy:
┌─────────────────────────────────────────────────────────┐
│                       HBM (80 GB)                       │
│                         ~3 TB/s                         │
├─────────────────────────────────────────────────────────┤
│                    L2 Cache (50 MB)                     │
│                        ~12 TB/s                         │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │ Shared Mem  │  │ Shared Mem  │  │ Shared Mem  │ ...  │
│  │ (228 KB)    │  │ (228 KB)    │  │ (228 KB)    │      │
│  │ ~20 TB/s    │  │ ~20 TB/s    │  │ ~20 TB/s    │      │
│  └─────────────┘  └─────────────┘  └─────────────┘      │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │ Registers   │  │ Registers   │  │ Registers   │ ...  │
│  │ (256 KB)    │  │ (256 KB)    │  │ (256 KB)    │      │
│  └─────────────┘  └─────────────┘  └─────────────┘      │
└─────────────────────────────────────────────────────────┘
Execution Model
CUDA Execution Model:
Grid
└── Block (up to 1024 threads)
    └── Warp (32 threads, SIMT)
        └── Thread
Key concepts:
- A warp is the minimum scheduling unit
- Threads in the same warp execute the same instruction
- Warp divergence reduces efficiency
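The hierarchy above can be sketched as plain index arithmetic. This is a host-side model of a 1D launch, not CUDA code; `launch_shape` and `locate` are illustrative names and the block size of 256 is an assumption.

```python
WARP_SIZE = 32

def launch_shape(n_elements, block_size=256):
    """Return (grid_size, warps_per_block) for a 1D launch covering n_elements."""
    grid_size = (n_elements + block_size - 1) // block_size      # ceiling division
    warps_per_block = (block_size + WARP_SIZE - 1) // WARP_SIZE
    return grid_size, warps_per_block

def locate(global_idx, block_size=256):
    """Map a global thread index to (block, warp within block, lane within warp)."""
    block, thread_in_block = divmod(global_idx, block_size)
    warp, lane = divmod(thread_in_block, WARP_SIZE)
    return block, warp, lane

print(launch_shape(1_000_000))  # (3907, 8)
print(locate(1000))             # (3, 7, 8)
```

The last block of the first example is only partly used, which is why kernels normally guard against indices past the end of the data.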
CUDA Performance Analysis Tools
nvidia-smi
The most basic GPU monitoring tool:
# Real-time monitoring
nvidia-smi
# Continuous monitoring (update every second)
nvidia-smi -l 1
# Query specific metrics
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw \
--format=csv -l 1
Output example:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA H100 80GB    On   | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0   115W / 700W |   1024MiB / 81559MiB |     45%      Default |
+-------------------------------+----------------------+----------------------+
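The CSV form of the query shown earlier is easier to post-process than the table. The sketch below parses such output; the `SAMPLE` text is made-up data in the format `--format=csv` produces, and `parse_smi_csv` is an illustrative helper that assumes every field is numeric with a unit suffix.

```python
import csv
import io

# Hypothetical sample in the format produced by:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv -l 1
SAMPLE = """utilization.gpu [%], memory.used [MiB], power.draw [W]
45 %, 1024 MiB, 115.00 W
73 %, 2048 MiB, 310.52 W
"""

def parse_smi_csv(text):
    """Return one dict per sample, keyed by metric name, with units stripped."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    names = [column.split(" [")[0] for column in next(reader)]
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        values = [float(field.split()[0]) for field in fields]
        rows.append(dict(zip(names, values)))
    return rows

samples = parse_smi_csv(SAMPLE)
mean_util = sum(s["utilization.gpu"] for s in samples) / len(samples)
print(f"mean GPU utilization: {mean_util:.1f}%")
```

Keep in mind that this utilization figure reports the fraction of time at least one kernel was running, not how well the kernel used the hardware.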
Nsight Compute
Modern CUDA kernel analysis tool:
# Basic analysis
ncu ./my_cuda_app
# Detailed analysis (all sections)
ncu --set full ./my_cuda_app
# Analyze specific kernel
ncu --kernel-name "myKernel" ./my_cuda_app
# Output report
ncu -o report ./my_cuda_app
Nsight Systems
System-level profiling:
# Basic trace
nsys profile ./my_cuda_app
# Include CUDA API and kernels
nsys profile --trace=cuda,nvtx ./my_cuda_app
# Output report
nsys profile -o timeline ./my_cuda_app
Key Performance Metrics
SM Occupancy
Occupancy = Active Warps / Maximum Warps per SM
Influencing factors:
- Registers used per thread
- Shared memory used per block
- Block size
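A simplified estimate shows how these three factors interact. The sketch assumes H100-class per-SM limits (64 warps, 32 blocks, 65,536 registers, 228 KB shared memory) and ignores allocation granularity and per-block reserved shared memory, so treat its output as an approximation rather than what the profiler will report.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SMLimits:
    """Per-SM resource limits (assumed H100-class values)."""
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 228 * 1024
    warp_size: int = 32

def estimate_occupancy(block_size, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Return (occupancy, limiting resource) for one kernel configuration."""
    warps_per_block = -(-block_size // sm.warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * sm.warp_size
    # How many blocks fit on one SM according to each resource
    limits = {
        "warps": sm.max_warps // warps_per_block,
        "blocks": sm.max_blocks,
        "registers": sm.registers // regs_per_block,
        "shared_mem": sm.shared_mem_bytes // smem_per_block if smem_per_block else sm.max_blocks,
    }
    limiter = min(limits, key=limits.get)
    active_warps = limits[limiter] * warps_per_block
    return active_warps / sm.max_warps, limiter

print(estimate_occupancy(256, 64, 48 * 1024))  # (0.5, 'registers')
print(estimate_occupancy(256, 32, 0))          # (1.0, 'warps')
```

In the first case registers and shared memory each allow only four blocks of eight warps, so half of the 64 warp slots stay empty.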
Tensor Core Generations
```text
Generation     GPU        Supported Precision          Matrix Size
─────────────────────────────────────────────────────────────────
V1 (Volta)     V100       FP16                         4×4×4
V2 (Turing)    RTX 20xx   FP16, INT8, INT4             8×8×4
V3 (Ampere)    A100       FP16, BF16, TF32,            8×8×4
                          FP64, INT8
V4 (Hopper)    H100       FP16, BF16, TF32,            16×8×16
                          FP8, FP64, INT8
```
Tensor Core Efficiency
Conditions for high Tensor Core efficiency:
1. Matrix dimension alignment
- m, n, k should be multiples of 8 or 16
- Depends on precision and GPU generation
2. Memory alignment
- Matrix start address aligned to 16 bytes
- Leading dimension aligned
3. Sufficient parallelism
- Need enough tiles to fill GPU
- Small matrices have low efficiency
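Typical efficiency (rough rules of thumb; actual results depend on precision, layout, and library version):
Large matrices (4096+ per dimension): 80-95% of peak
Medium matrices (~1024 per dimension): 50-80% of peak
Small matrices (~256 per dimension): 20-50% of peak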
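Two of the conditions above are easy to check before launching a GEMM: dimension alignment and whether there are enough output tiles to occupy every SM. The sketch below assumes one 128×128 output tile per SM at a time and 132 SMs; real libraries choose tile sizes dynamically, so the numbers are only indicative.

```python
def pad_to_multiple(dim, multiple=16):
    """Round a matrix dimension up to the next multiple (8 or 16 in practice)."""
    return -(-dim // multiple) * multiple

def sm_fill(m, n, tile_m=128, tile_n=128, num_sms=132):
    """Return (tiles, fraction of SM slots doing useful work across all waves)."""
    tiles = -(-m // tile_m) * -(-n // tile_n)
    waves = -(-tiles // num_sms)  # rounds of tiles needed to cover the output
    return tiles, tiles / (waves * num_sms)

print(pad_to_multiple(1000))  # 1008
for size in (256, 1024, 4096):
    tiles, fill = sm_fill(size, size)
    print(f"{size}x{size}: {tiles} tiles, {fill:.0%} of SM slots busy")
```

Under these assumptions a 256×256 output produces only 4 tiles for 132 SMs, while 4096×4096 produces 1,024 tiles and keeps nearly every SM busy, which matches the trend in the efficiency ranges above.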
Practical Profiling Workflow
Step 1: System-Level Analysis
# Use Nsight Systems for overall view
nsys profile -o overview ./my_app
# View report
nsys-ui overview.nsys-rep
Identify:
- CPU vs GPU time distribution
- Kernel execution time
- Data transfer overhead
- Synchronization waits
Step 2: Identify Hotspots
From Nsight Systems report, find:
1. Longest-running kernels
2. Frequently called kernels
3. Data transfer bottlenecks
4. Unnecessary synchronization
Step 3: Deep Analysis
# Detailed analysis of specific kernel
ncu --set full \
--kernel-name "hotKernel" \
-o detailed_report \
./my_app
Step 4: Bottleneck Diagnosis
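Nsight Compute's Speed of Light section reports memory and compute throughput as a percentage of peak, which points to one of four bottleneck types: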
Memory Bound:
- High memory throughput
- Low compute throughput
- Solution: Optimize access patterns, use shared memory
Compute Bound:
- High compute throughput
- Low memory throughput
- Solution: Algorithm optimization, use Tensor Cores
Latency Bound:
- Low occupancy
- High stall percentage
- Solution: Increase parallelism, reduce dependencies
Instruction Bound:
- Instruction issue becomes bottleneck
- Solution: Reduce instruction count, use vectorization
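The four categories above can be turned into a first-pass triage rule. The function below is a hypothetical helper, not Nsight Compute's own logic, and the 60% threshold is an arbitrary assumption; the inputs are the memory, compute, and instruction-issue utilization percentages you read from the report.

```python
def classify_kernel(memory_pct, compute_pct, issue_pct=0.0, high=60.0):
    """Rough bottleneck triage from percent-of-peak figures (threshold is assumed)."""
    if memory_pct >= high and memory_pct >= compute_pct:
        return "memory bound"
    if compute_pct >= high:
        return "compute bound"
    if issue_pct >= high:
        return "instruction bound"
    # Neither memory nor compute is near peak: the kernel is mostly waiting
    return "latency bound"

print(classify_kernel(85, 30))                # memory bound
print(classify_kernel(25, 90))                # compute bound
print(classify_kernel(30, 40, issue_pct=80))  # instruction bound
print(classify_kernel(20, 25))                # latency bound
```

A rule like this only narrows the search; the stall-reason breakdown in the report is what confirms the diagnosis.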
Common Performance Issues
Warp Divergence
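// Problematic code: the branch depends on the parity of the thread index.
// expensive_function_a/b stand for any two __device__ functions defined elsewhere.
__global__ void divergent(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // Guard threads past the end of the array
    if (idx % 2 == 0) {
        // Even threads take this path
        data[idx] = expensive_function_a(data[idx]);
    } else {
        // Odd threads take this path
        data[idx] = expensive_function_b(data[idx]);
    }
}
// Even and odd lanes of every warp take different branches,
// so each warp executes both paths one after the other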
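A small host-side model makes the cost visible: for each warp, count how many distinct branch paths its 32 threads take, since the hardware executes those paths one after another. The function name and the alternative warp-aligned branch condition are illustrative assumptions.

```python
WARP_SIZE = 32

def divergence_passes(branch_of, n_threads):
    """Average number of serialized branch paths per warp."""
    total_paths = 0
    warps = 0
    for start in range(0, n_threads, WARP_SIZE):
        lanes = range(start, min(start + WARP_SIZE, n_threads))
        total_paths += len({branch_of(i) for i in lanes})  # distinct paths in this warp
        warps += 1
    return total_paths / warps

# Branch on thread parity, as in the kernel above: both paths run in every warp
print(divergence_passes(lambda i: i % 2, 1024))                 # 2.0
# Branch on warp parity instead: each warp follows a single path
print(divergence_passes(lambda i: (i // WARP_SIZE) % 2, 1024))  # 1.0
```

Making the condition uniform within each warp removes the serialization, although it also changes which elements take which path, so the data layout usually has to change with it.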
Memory Coalescing
// Problem: non-coalesced access
__global__ void strided(float* data, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    data[idx] = 1.0f;  // Neighboring threads write `stride` floats apart, low efficiency
}

// Solution: coalesced access
__global__ void coalesced(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = 1.0f;  // Neighboring threads write adjacent floats, high efficiency
}
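The difference between the two kernels can be estimated by counting the 32-byte memory sectors that one warp touches. This is a simplified model that assumes 4-byte elements and a warp whose first address is sector-aligned; the function names are illustrative.

```python
WARP_SIZE = 32
SECTOR_BYTES = 32  # granularity of global memory transactions

def sectors_per_warp(stride, elem_bytes=4):
    """Number of distinct 32-byte sectors touched by one warp."""
    addresses = [lane * stride * elem_bytes for lane in range(WARP_SIZE)]
    return len({addr // SECTOR_BYTES for addr in addresses})

def access_efficiency(stride, elem_bytes=4):
    """Useful bytes divided by bytes actually moved."""
    useful = WARP_SIZE * elem_bytes
    moved = sectors_per_warp(stride, elem_bytes) * SECTOR_BYTES
    return useful / moved

for stride in (1, 2, 8, 32):
    print(stride, sectors_per_warp(stride), access_efficiency(stride))
```

With a stride of 1 the warp touches 4 sectors and every byte moved is used; from a stride of 8 upward each thread lands in its own sector and only one eighth of the traffic is useful.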
Bank Conflicts
// Problem: shared memory bank conflict
__shared__ float smem[32][32];
// Rows are 32 floats apart, so column 0 of every row maps to bank 0
float val = smem[threadIdx.x][0];  // 32-way bank conflict

// Solution: pad each row with one extra column
__shared__ float smem_padded[32][33];
// Rows are now 33 floats apart, so column 0 of row r maps to bank r % 32
float val_padded = smem_padded[threadIdx.x][0];  // No conflict
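The effect of the padding can be checked with bank arithmetic. The sketch assumes 32 banks that are each 4 bytes wide, so the bank of a float is its word index modulo 32; `conflict_degree` is an illustrative helper.

```python
from collections import Counter

NUM_BANKS = 32
WARP_SIZE = 32

def conflict_degree(row_width, column=0):
    """Worst-case number of lanes hitting one bank when lane i reads smem[i][column]."""
    banks = Counter(
        (lane * row_width + column) % NUM_BANKS  # word index of smem[lane][column]
        for lane in range(WARP_SIZE)
    )
    return max(banks.values())

print(conflict_degree(32))  # 32: every lane maps to the same bank
print(conflict_degree(33))  # 1: each lane maps to its own bank
```

A degree of 32 means the access is split into 32 serialized steps, while a degree of 1 completes in a single step.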
Low Occupancy
Diagnosis:
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./my_app
Causes and solutions:
1. Too many registers used
- Reduce per-thread variables
- Use __launch_bounds__
2. Too much shared memory used
- Reduce shared memory size
- Use dynamic allocation sized at launch instead of a worst-case static array
3. Unsuitable block size
- Adjust block dimensions
- Ensure multiple of 32
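When registers are the limiter, it helps to know the per-thread budget that a target number of resident blocks implies, which is the figure `__launch_bounds__` asks the compiler to respect. The sketch assumes 65,536 registers per SM and a hardware cap of 255 registers per thread, and it ignores register allocation granularity.

```python
REGISTERS_PER_SM = 65536
MAX_REGS_PER_THREAD = 255  # per-thread hardware limit

def round_up_to_warp(block_size, warp_size=32):
    """Blocks occupy whole warps, so round the size up to a multiple of 32."""
    return -(-block_size // warp_size) * warp_size

def register_budget(block_size, blocks_per_sm):
    """Largest registers-per-thread count that still fits blocks_per_sm blocks."""
    threads = round_up_to_warp(block_size) * blocks_per_sm
    return min(MAX_REGS_PER_THREAD, REGISTERS_PER_SM // threads)

print(round_up_to_warp(100))    # 128
print(register_budget(256, 4))  # 64
print(register_budget(256, 8))  # 32
```

For example, a kernel declared with `__launch_bounds__(256, 4)` needs to stay at or below roughly 64 registers per thread under these assumptions.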
Carlos's First Optimization Win
Three months into his GPU journey, Carlos achieved his first major optimization win.
The team's attention kernel was running at 35% of theoretical peak. After careful analysis with Nsight Compute, he identified the bottleneck: bank conflicts in shared memory during the softmax computation.
"In CPU terms," he explained to the team, "it's like having eight cores all trying to access the same cache line. On a GPU, that's 32 threads fighting for the same memory bank, and they have to take turns."
He restructured the data layout to eliminate the conflicts. The kernel jumped to 68% of peak—nearly doubling throughput.
"The weird thing," Carlos admitted, "is that once you understand the GPU execution model, this stuff becomes obvious. The tools show you exactly where the problem is. The hard part is learning to read what they're telling you."
Wei nodded. "Welcome to GPU performance. The tools are powerful. The bottlenecks are visible. The optimization strategies are well-documented. You just have to unlearn your CPU instincts first."
GPU benchmarking isn't harder than CPU benchmarking—it's different. The same principles apply: measure before optimizing, understand the hardware model, and let data guide your decisions. The vocabulary and tools are new, but the discipline is the same.
Summary
GPU benchmarking requires understanding the GPU's architecture and the tools built around it:
Key Tools
- nvidia-smi: Basic monitoring
- Nsight Systems: System-level analysis
- Nsight Compute: Kernel-level analysis
- rocprof/Omniperf: Profiling on AMD GPUs
Key Metrics
- Occupancy: Warp utilization
- Memory Throughput: Memory bandwidth utilization
- Compute Throughput: Compute unit utilization
- Tensor Core Utilization: Matrix operation efficiency
Common Bottlenecks
- Warp divergence
- Non-coalesced memory access
- Bank conflicts
- Low occupancy
Best Practices
- First use Nsight Systems for overall view
- Identify hotspot kernels
- Use Nsight Compute for deep analysis
- Choose optimization strategy based on bottleneck type