Chapter 32: Case Study: ML Inference Optimization
Part VIII: Case Studies
"Training is science. Inference is engineering." — An ML Engineer
The Story of "GPU Utilization at Only 0.13%"
We deployed an image classification service using ResNet-50 on an NVIDIA A100 GPU, which has a theoretical peak of 312 TFLOPS (dense FP16/BF16 Tensor Core; the dense TF32 peak is 156 TFLOPS).
But we measured only 50 images per second.
A ResNet-50 forward pass on one image requires about 8 GFLOPs. Theoretically:
312 TFLOPS / 8 GFLOPs = 39,000 images/sec
We achieved only 50 images/sec.
Compute utilization: 50 / 39,000 ≈ 0.13% of peak
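The bound above is simple enough to reproduce in a few lines. This is a minimal sketch that uses only the figures quoted in this section; the variable names are illustrative.

```python
# Peak-throughput bound from the numbers quoted in the text.
PEAK_FLOPS = 312e12        # A100 peak, FLOP/s
FLOPS_PER_IMAGE = 8e9      # approximate cost of one ResNet-50 forward pass
MEASURED_IMG_PER_S = 50    # measured throughput of the service

peak_img_per_s = PEAK_FLOPS / FLOPS_PER_IMAGE
utilization = MEASURED_IMG_PER_S / peak_img_per_s

print(f"upper bound: {peak_img_per_s:,.0f} images/s")
print(f"utilization: {utilization:.2%}")
```

The bound assumes every FLOP runs at peak rate, which no real kernel achieves, so it is a ceiling rather than a target.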
Where's the problem?
1. Image transfer from CPU to GPU: ~5ms
2. GPU computation: ~0.2ms
3. Result transfer from GPU to CPU: ~0.1ms
4. Python overhead: ~10ms
5. Image decoding (CPU): ~5ms
────────────────────────────────────
Total: ~20ms per image
GPU computation accounts for only about 1% of the per-image time (0.2 ms out of ~20 ms). The other 99% is spent "feeding data."
This is the core challenge of ML inference optimization: keeping the GPU busy.
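The breakdown above can be tabulated to see which stage dominates. The sketch below assumes the stages run strictly one after another, with no overlap, which matches the per-image total given in the list; the dictionary keys are just labels.

```python
# Per-image latency breakdown from the list above, in milliseconds.
stages_ms = {
    "image decoding (CPU)": 5.0,
    "H2D transfer": 5.0,
    "GPU computation": 0.2,
    "D2H transfer": 0.1,
    "Python overhead": 10.0,
}

total_ms = sum(stages_ms.values())
throughput = 1000.0 / total_ms   # images/s when stages run serially
gpu_share = stages_ms["GPU computation"] / total_ms

# Print the stages from most to least expensive.
for name, ms in sorted(stages_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:<22} {ms:5.1f} ms  {ms / total_ms:6.1%}")
print(f"total {total_ms:.1f} ms -> {throughput:.0f} images/s, "
      f"GPU share {gpu_share:.1%}")
```

Sorting by cost makes the priority obvious: Python overhead, decoding, and the host-to-device copy each cost far more than the model itself.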
ML Inference Characteristics
Training vs Inference
| Characteristic | Training | Inference |
|---|---|---|
| Batch size | Large (32-4096) | Small (1-64) |
| Latency requirement | Not important | Critical |
| Precision requirement | FP32/BF16 | Can be lower (INT8) |
| Frequency | Once | Continuous |
| Optimization goal | Throughput | Latency + Throughput |
Common Bottlenecks
┌────────────────────────────────────────────────────────────────┐
│                       Inference Pipeline                       │
├─────────┬──────────┬──────────┬──────────┬──────────┬──────────┤
│ Input   │ Preproc  │ H2D Copy │ Compute  │ D2H Copy │ Postproc │
│ (I/O)   │ (CPU)    │ (PCIe)   │ (GPU)    │ (PCIe)   │ (CPU)    │
└─────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Common bottlenecks:
1. I/O: Reading data
2. CPU preprocessing: Decode, resize, normalize
3. PCIe transfer: CPU ↔ GPU
4. GPU compute: Model inference
5. CPU postprocessing: NMS, decoding
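Before reaching for a full profiler, a coarse per-stage timer is often enough to tell which of the stages listed above dominates. The following is a minimal sketch using only the standard library; `timed`, `stage_totals`, and the two stand-in stage functions are hypothetical names, not part of any framework.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals = defaultdict(float)  # stage name -> accumulated seconds

@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start

# Stand-in stages (hypothetical); replace the bodies with real pipeline calls.
def preprocess(x):
    return [v / 255.0 for v in x]

def infer(x):
    return sum(x)

for _ in range(100):
    with timed("preprocess"):
        batch = preprocess(range(1000))
    with timed("compute"):
        result = infer(batch)

total = sum(stage_totals.values())
for stage, seconds in stage_totals.items():
    print(f"{stage:<12} {seconds * 1e3:8.2f} ms  {seconds / total:6.1%}")
```

One caveat when timing GPU stages this way: CUDA kernel launches are asynchronous, so the host must synchronize (for example with `torch.cuda.synchronize()`) before reading the clock, otherwise the time is attributed to whichever later stage happens to block.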
Measurement Tools
NVIDIA Nsight Systems
# Record profile
nsys profile -o report python inference.py
# View report
nsys-ui report.nsys-rep
Nsight Systems shows:
- GPU kernel execution time
- CPU/GPU synchronization points
- Memory copy (H2D/D2H)
- CUDA API calls
PyTorch Profiler
import torch
from torch.profiler import profile, record_function, ProfilerActivity
# model and input_tensor are assumed to already be on the GPU
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    with torch.no_grad():
        for i in range(100):
            with record_function("inference"):
                output = model(input_tensor)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Optimization Strategies
Optimization 1: Batching
A single-image inference cannot fully utilize the GPU's parallelism:
# Original: process one by one
for image in images:
    result = model(image.unsqueeze(0))  # batch_size = 1
# Optimized: batch processing
batch = torch.stack(images)  # batch_size = len(images), e.g. 32
results = model(batch)
Batching effect:
| Batch Size | Latency (ms) | Throughput (img/s) | GPU Util |
|---|---|---|---|
| 1 | 5.2 | 192 | 15% |
| 8 | 8.1 | 988 | 45% |
| 32 | 18.5 | 1730 | 78% |
| 128 | 62.3 | 2055 | 92% |
| 256 | 118.7 | 2157 | 95% |
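As batch size grows, throughput rises, but so does latency. The returns diminish quickly: going from batch 128 to 256 adds under 5% throughput while nearly doubling latency.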
Trade-off: Latency vs Throughput
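One way to make this trade-off concrete is to fix a latency budget and pick the measured batch size with the highest throughput that still fits. The sketch below uses the batch-size and latency figures from the table above; `best_batch_size` is a hypothetical helper, not a library function.

```python
# (batch size, latency in ms) pairs copied from the table above.
MEASUREMENTS = [(1, 5.2), (8, 8.1), (32, 18.5), (128, 62.3), (256, 118.7)]

def best_batch_size(latency_budget_ms):
    """Return (batch, latency_ms, images_per_s) with the highest throughput
    among measured batch sizes whose latency fits the budget, or None."""
    candidates = [
        (batch, ms, batch / ms * 1000.0)
        for batch, ms in MEASUREMENTS
        if ms <= latency_budget_ms
    ]
    return max(candidates, key=lambda c: c[2]) if candidates else None

for budget in (10, 20, 100):
    print(budget, best_batch_size(budget))
```

The latencies here cover model execution only. In a serving system, the time a request waits for its batch to fill also counts against the budget, so the usable batch size is typically smaller than this calculation suggests.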
Optimization 2: Reduce CPU-GPU Data Transfer
Use Pinned Memory
# Normal memory → GPU: requires extra copy
tensor = torch.randn(batch_size, 3, 224, 224)
tensor_gpu = tensor.to('cuda') # slow
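The snippet above shows only the pageable-memory path. A common counterpart, sketched here as an assumption about what the pinned version might look like, places the host tensor in pinned (page-locked) memory and requests an asynchronous copy. `pin_memory()` and `non_blocking=True` are standard PyTorch calls; the helper name `to_gpu` and the CPU fallback are illustrative.

```python
import torch

def to_gpu(batch: torch.Tensor) -> torch.Tensor:
    """Copy a host tensor to the GPU through pinned memory when CUDA exists."""
    if not torch.cuda.is_available():
        return batch  # CPU-only fallback so the sketch still runs
    pinned = batch.pin_memory()                  # page-locked host buffer
    return pinned.to("cuda", non_blocking=True)  # async copy on the current stream

batch_size = 32
tensor = torch.randn(batch_size, 3, 224, 224)
tensor_gpu = to_gpu(tensor)
print(tensor_gpu.device, tuple(tensor_gpu.shape))
```

Calling `pin_memory()` on every batch still performs a host-side copy, so the benefit mostly appears when the pinned buffer is allocated once and reused (for example with `torch.empty(..., pin_memory=True)`) or when a `DataLoader` is created with `pin_memory=True`.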
Optimization 4: Quantization
FP32 → FP16
# Cast weights and inputs to FP16 (a plain half-precision cast, not autocast/AMP)
model = model.half()
input_tensor = input_tensor.half()
output = model(input_tensor)
FP32 → INT8 (requires calibration)
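import torch
import torch.quantization as quant
# Eager-mode post-training static quantization. 'fbgemm' is the x86 CPU
# backend, so the converted model runs on the CPU; INT8 on NVIDIA GPUs is
# usually done through TensorRT instead.
model.eval()
# Prepare for quantization (inserts observers)
model.qconfig = quant.get_default_qconfig('fbgemm')
quant.prepare(model, inplace=True)
# Calibrate (with representative data)
with torch.no_grad():
    for data in calibration_loader:
        model(data)
# Convert observed modules to INT8
quant.convert(model, inplace=True)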
Quantization effect:
| Precision | Model Size | Latency | Accuracy Drop |
|---|---|---|---|
| FP32 | 98 MB | 5.2 ms | baseline |
| FP16 | 49 MB | 2.8 ms | ~0% |
| INT8 | 25 MB | 1.5 ms | 0.5-1% |
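To see why INT8 needs calibration while FP16 does not, it helps to look at what calibration computes. The sketch below is a generic min/max affine scheme in plain Python, offered as an illustration of the idea; it is not necessarily what the PyTorch observers compute, and the function names and sample data are made up.

```python
def calibrate(values, num_bits=8):
    """Derive an affine (scale, zero_point) pair from observed min/max."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must include 0
    scale = (hi - lo) / (qmax - qmin) or 1.0   # guard against an all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))   # clamp to the unsigned 8-bit range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

calibration_data = [-1.0, -0.25, 0.0, 0.5, 3.0]   # made-up activations
scale, zp = calibrate(calibration_data)
for x in calibration_data:
    q = quantize(x, scale, zp)
    print(f"{x:6.2f} -> {q:3d} -> {dequantize(q, scale, zp):6.3f}")
```

The scale is set by the observed range, which is why the calibration data must be representative: values outside the calibrated range are clamped, and an overly wide range wastes resolution.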
Optimization 5: Preprocessing Optimization
CPU preprocessing is often the bottleneck:
# Original: PIL + torchvision (slow)
from PIL import Image
from torchvision import transforms
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
image = Image.open("image.jpg").convert("RGB")  # force 3 channels for Normalize
tensor = transform(image)
Use NVIDIA DALI (GPU preprocessing)
from nvidia.dali import pipeline_def, fn
import nvidia.dali.types as types
@pipeline_def
def image_pipeline():
    jpegs, labels = fn.readers.file(file_root="images/")
    images = fn.decoders.image(jpegs, device="mixed")  # decode on the GPU
    images = fn.resize(images, resize_shorter=256)  # matches transforms.Resize(256)
    # Center crop to 224x224, normalize, and convert HWC uint8 to CHW float
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        crop=(224, 224),
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels
pipe = image_pipeline(batch_size=32, num_threads=4, device_id=0)
pipe.build()
LLM Inference Special Optimizations
Large language models have unique challenges:
Memory-bound Problem
LLaMA-7B parameters: 7B × 2 bytes (FP16) = 14 GB
A100 memory bandwidth: ~2 TB/s
Generating each token requires reading all parameters once (at batch size 1)
Theoretical max speed: 2000 GB/s ÷ 14 GB ≈ 143 tokens/sec per sequence
KV cache reads and writes lower this bound further
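The same bound can be written as a small function, which makes it easy to see how weight precision moves the ceiling. This is a sketch under the assumptions stated above (batch size 1, every weight read once per token, KV cache traffic ignored); `max_tokens_per_s` is a hypothetical helper.

```python
def max_tokens_per_s(params_billion, bytes_per_param, bandwidth_gb_s):
    """Bandwidth-only upper bound on decode speed for one sequence.

    Assumes every weight is read once per generated token and ignores
    KV cache traffic and compute time.
    """
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb

for name, nbytes in [("FP16", 2), ("INT8", 1)]:
    print(f"{name}: {max_tokens_per_s(7, nbytes, 2000):.0f} tokens/s")
```

Because the bound scales inversely with bytes per parameter, halving the weight size doubles the ceiling. It also explains why batching helps decoding: one pass over the weights can serve every sequence in the batch.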
KV Cache Optimization
# KV cache uses significant memory
# Sequence length 2048, batch 16, 32 layers, hidden size 4096, FP16
# KV cache size = 2 (K and V) × 16 × 32 × 2048 × 4096 × 2 bytes ≈ 17.2 GB (16 GiB)
# Use PagedAttention (vLLM)
from vllm import LLM, SamplingParams
prompts = ["Explain the KV cache in one sentence."]  # example input
llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)
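The size formula in the comments above can be evaluated directly. The sketch below assumes standard multi-head attention, where the key and value width per token equals the hidden size; models that use grouped-query attention need proportionally less. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(batch, layers, seq_len, hidden, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and position.

    Assumes the per-token K (and V) width equals the hidden size.
    """
    return 2 * batch * layers * seq_len * hidden * bytes_per_elem

size = kv_cache_bytes(batch=16, layers=32, seq_len=2048, hidden=4096)
print(f"{size / 1e9:.1f} GB ({size / 2**30:.0f} GiB)")   # 17.2 GB (16 GiB)

per_token = kv_cache_bytes(batch=1, layers=32, seq_len=1, hidden=4096)
print(f"{per_token / 2**10:.0f} KiB per token")          # 512 KiB per token
```

The per-token figure is the useful one for capacity planning: preallocating the full sequence length for every request reserves memory that short responses never use, which is the waste that paged allocation targets.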
Speculative Decoding
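Use a small draft model to "guess" several tokens, then have the large model verify them:
Traditional: Large model generates 1 token → generates the next → ...
Speculative:
1. Small model quickly generates 4 tokens
2. Large model verifies all 4 in a single forward pass
3. If the first 3 are accepted, one large-model pass yields 3 tokens instead of three passes, saving 2 large-model calls (most schemes also keep the large model's own token at the first mismatch, so the pass yields 4 tokens)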
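The draft-then-verify loop can be sketched with toy models. The example below covers greedy decoding only, where a draft token is accepted when it equals the large model's own choice; sampling-based schemes use a probabilistic accept/reject rule instead. Both "models" are stand-in functions over digits, and every name here is hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Run one draft-then-verify round and return the tokens it commits.

    draft_next / target_next map a token list to the next token (greedy).
    """
    # 1. The small model proposes k tokens, one after another.
    context = list(prefix)
    draft = []
    for _ in range(k):
        token = draft_next(context)
        draft.append(token)
        context.append(token)

    # 2. The large model checks every proposed position. A real system does
    #    this in one batched forward pass; here it is a loop for clarity.
    context = list(prefix)
    committed = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            committed.append(expected)   # keep the large model's own token
            return committed
        committed.append(token)
        context.append(token)
    committed.append(target_next(context))  # all accepted: one bonus token
    return committed


# Toy "models" over digits: the target counts upward, and the draft agrees
# except right after a 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step([0], draft_next, target_next, k=4))  # [1, 2, 3, 4]
```

In this run the first three draft tokens are accepted and the fourth is replaced by the large model's token, so one verification round commits four tokens. The output always matches what the large model would have produced on its own, so the gain is speed, not quality.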
Summary
ML Inference Challenges
GPU computation is fast, but:
1. CPU-GPU data transfer is slow
2. CPU preprocessing is slow
3. Small batches can't fully utilize GPU
Optimization Strategies
| Problem | Solution |
|---|---|
| Low GPU utilization | Batching, dynamic batching |
| Slow data transfer | Pinned memory, CUDA streams |
| Model too large | Quantization (FP16/INT8), distillation |
| Preprocessing bottleneck | DALI (GPU preprocessing) |
| Framework overhead | TensorRT, ONNX Runtime |
LLM Special Optimizations
- KV cache management (PagedAttention)
- Speculative decoding
- Continuous batching
Tools
- Profiling: Nsight Systems, PyTorch Profiler
- Runtime: TensorRT, ONNX Runtime, vLLM
- Serving: Triton Inference Server
Remember
Theoretical TFLOPS ≠ actual performance
Real bottlenecks are usually:
1. Data movement
2. Memory bandwidth
3. Software overhead
Optimization order:
Pipeline → Batching → Quantization → Model architecture