Chapter 32: Case Study: ML Inference Optimization

Part VIII: Case Studies

"Training is science. Inference is engineering." — An ML Engineer

The Story of "GPU Utilization at Only 3%"

We deployed an image classification service using ResNet-50. The hardware was an NVIDIA A100 GPU, with a theoretical peak of 312 TFLOPS (dense FP16/BF16 Tensor Core; the dense TF32 peak is 156 TFLOPS).

But we measured only 50 images per second.

A ResNet-50 forward pass on one image requires about 8 GFLOPs (counting multiplies and adds separately). Theoretically:

312 TFLOPS / 8 GFLOPs = 39,000 images/sec

We achieved only 50 images/sec.

Utilization relative to theoretical peak: 50 / 39,000 ≈ 0.13%
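
The arithmetic above can be checked in a few lines. This is only a back-of-the-envelope sketch: the constant names are illustrative, and the figures are the ones quoted in the text.

```python
# Back-of-the-envelope peak throughput for the numbers quoted above.
PEAK_FLOPS = 312e12        # A100 theoretical peak, FLOP/s
FLOPS_PER_IMAGE = 8e9      # approximate ResNet-50 forward pass, FLOPs
MEASURED_IMG_PER_SEC = 50  # what the service actually delivered

peak_img_per_sec = PEAK_FLOPS / FLOPS_PER_IMAGE
utilization = MEASURED_IMG_PER_SEC / peak_img_per_sec

print(f"peak: {peak_img_per_sec:,.0f} images/sec")
print(f"utilization: {utilization:.2%}")
```

The peak figure assumes every Tensor Core is busy on every cycle, which no real workload achieves, so it is best read as an upper bound rather than a target.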

Where's the problem?

1. Image transfer from CPU to GPU: ~5ms
2. GPU computation: ~0.2ms
3. Result transfer from GPU to CPU: ~0.1ms
4. Python overhead: ~10ms
5. Image decoding (CPU): ~5ms
────────────────────────────────────
Total: ~20ms per image

GPU computation accounts for only about 1% of the per-image time (0.2 ms of ~20 ms). The other 99% is spent "feeding data."
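
To see which stages dominate, the breakdown above can be turned into shares of the total. This is a minimal sketch that hard-codes the approximate timings from the list; the dictionary keys are just labels.

```python
# Per-image stage timings from the breakdown above, in milliseconds.
stages_ms = {
    "H2D copy": 5.0,
    "GPU compute": 0.2,
    "D2H copy": 0.1,
    "Python overhead": 10.0,
    "image decode (CPU)": 5.0,
}

total_ms = sum(stages_ms.values())

# Print stages from slowest to fastest with their share of the total.
for name, ms in sorted(stages_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {ms:5.1f} ms  {ms / total_ms:6.1%}")

gpu_share = stages_ms["GPU compute"] / total_ms
serial_throughput = 1000.0 / total_ms  # images/sec if stages run back to back
```

Sorting by cost makes the priority clear: Python overhead, decoding, and the host-to-device copy each cost far more than the model itself.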

This is the core challenge of ML inference optimization: keeping the GPU busy.

ML Inference Characteristics

Training vs Inference

Characteristic	Training	Inference
Batch size	Large (32-4096)	Small (1-64)
Latency requirement	Not important	Critical
Precision requirement	FP32/BF16	Can be lower (INT8)
Frequency	Once	Continuous
Optimization goal	Throughput	Latency + Throughput

Common Bottlenecks

┌─────────────────────────────────────────────────────────────────┐
│                    Inference Pipeline                           │
├─────────┬──────────┬──────────┬──────────┬──────────┬──────────┤
│  Input  │ Preproc  │ H2D Copy │  Compute │ D2H Copy │ Postproc │
│  (I/O)  │  (CPU)   │  (PCIe)  │  (GPU)   │  (PCIe)  │  (CPU)   │
└─────────┴──────────┴──────────┴──────────┴──────────┴──────────┘

Common bottlenecks:
1. I/O: Reading data
2. CPU preprocessing: Decode, resize, normalize
3. PCIe transfer: CPU ↔ GPU
4. GPU compute: Model inference
5. CPU postprocessing: NMS, decoding

Measurement Tools

NVIDIA Nsight Systems

# Record profile
nsys profile -o report python inference.py

# View report
nsys-ui report.nsys-rep

Nsight Systems shows:

GPU kernel execution time
CPU/GPU synchronization points
Memory copy (H2D/D2H)
CUDA API calls

PyTorch Profiler

import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Assumes `model` and `input_tensor` are already on the GPU
model.eval()

with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for i in range(100):
        with record_function("inference"):
            output = model(input_tensor)

# Show the 10 operators with the most total GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
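
Before reaching for a full profiler, a coarse wall-clock timer per stage is often enough to reproduce a breakdown like the one at the start of this chapter. The sketch below is not part of Nsight Systems or the PyTorch Profiler. `StageTimer` is a hypothetical helper, and the `time.sleep` calls are stand-ins for real preprocessing and inference work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)  # seconds per stage
        self.counts = defaultdict(int)    # calls per stage

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # For GPU stages, call torch.cuda.synchronize() inside the block
            # before it exits: CUDA kernels launch asynchronously.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Mean milliseconds per call for each stage."""
        return {n: 1000.0 * self.totals[n] / self.counts[n] for n in self.totals}

timer = StageTimer()
for _ in range(5):
    with timer.stage("preprocess"):
        time.sleep(0.002)  # stand-in for decode/resize
    with timer.stage("compute"):
        time.sleep(0.001)  # stand-in for model(input_tensor)

for name, ms in timer.report().items():
    print(f"{name}: {ms:.2f} ms/call")
```

A timer like this only gives host-side wall-clock numbers, so it complements the profilers above rather than replacing them.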

Optimization Strategies

Optimization 1: Batching

A single-image inference cannot fully utilize the GPU's parallelism:

# Original: process one by one
for image in images:
    result = model(image.unsqueeze(0))  # batch_size = 1

# Optimized: batch processing
batch = torch.stack(images)  # batch_size = 32
results = model(batch)

Batching effect:

Batch Size    Latency (ms)    Throughput (img/s)    GPU Util
─────────────────────────────────────────────────────────────
    1              5.2              192               15%
    8              8.1              988               45%
   32             18.5             1730               78%
  128             62.3             2055               92%
  256            118.7             2157               95%

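As batch size grows, throughput rises, but so does latency. The gains also flatten: going from batch 128 to 256 adds only about 5% throughput (2055 → 2157 img/s) while latency nearly doubles.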

Trade-off: Latency vs Throughput
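
One way to act on this trade-off is to pick the highest-throughput batch size that still fits a latency budget. The sketch below uses the measurements from the table above. `best_batch` and the budgets tried are illustrative, not part of any serving framework.

```python
# (batch size, latency in ms) pairs copied from the table above.
measurements = [(1, 5.2), (8, 8.1), (32, 18.5), (128, 62.3), (256, 118.7)]

def throughput(batch_size, latency_ms):
    """Images per second when one batch completes every latency_ms."""
    return batch_size / latency_ms * 1000.0

def best_batch(latency_budget_ms):
    """Highest-throughput batch size whose latency fits the budget, or None."""
    feasible = [
        (throughput(b, lat), b)
        for b, lat in measurements
        if lat <= latency_budget_ms
    ]
    return max(feasible)[1] if feasible else None

for budget in (10, 20, 100):
    print(f"{budget} ms budget -> batch size {best_batch(budget)}")
```

The table's throughput column is simply batch size divided by latency, so `throughput(32, 18.5)` reproduces the 1730 img/s entry.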

Optimization 2: Reduce CPU-GPU Data Transfer

Use Pinned Memory

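# Pageable (normal) memory → GPU: the driver stages data through a pinned buffer first
tensor = torch.randn(batch_size, 3, 224, 224)
tensor_gpu = tensor.to('cuda')  # slow: extra host-side copy

# Pinned (page-locked) memory → GPU: direct DMA transfer that can overlap with compute
pinned = torch.randn(batch_size, 3, 224, 224).pin_memory()
pinned_gpu = pinned.to('cuda', non_blocking=True)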


Optimization 4: Quantization

FP32 → FP16

# Plain FP16 conversion (not autocast-based mixed precision)
model = model.half()  # Convert weights to FP16
input_tensor = input_tensor.half()  # Inputs must match the weight dtype
output = model(input_tensor)

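FP32 → INT8 (requires calibration)

import torch
import torch.ao.quantization as quant

# Eager-mode static quantization runs on the CPU ('fbgemm' targets x86);
# INT8 on NVIDIA GPUs typically goes through TensorRT instead.
# The model must be in eval mode and wrap its forward pass in
# QuantStub/DeQuantStub (e.g. torchvision.models.quantization.resnet50).
model.eval()

# Prepare for quantization
model.qconfig = quant.get_default_qconfig('fbgemm')
quant.prepare(model, inplace=True)

# Calibrate (with representative data)
with torch.no_grad():
    for data in calibration_loader:
        model(data)

# Convert
quant.convert(model, inplace=True)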

Quantization effect:

Precision    Model Size    Latency    Accuracy Drop
───────────────────────────────────────────────────
FP32         98 MB         5.2 ms     baseline
FP16         49 MB         2.8 ms     ~0%
INT8         25 MB         1.5 ms     0.5-1%
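
The size column follows roughly from parameter count times bytes per parameter, and the latency column implies the speedups. The sketch below assumes about 25.6 million parameters for ResNet-50 (an approximation not stated in the text) and reuses the latencies from the table.

```python
PARAMS = 25.6e6  # approximate ResNet-50 parameter count (assumption)
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}
LATENCY_MS = {"FP32": 5.2, "FP16": 2.8, "INT8": 1.5}  # from the table above

# Weights-only size in MiB and speedup relative to FP32.
sizes_mib = {p: PARAMS * b / 2**20 for p, b in BYTES_PER_PARAM.items()}
speedups = {p: LATENCY_MS["FP32"] / ms for p, ms in LATENCY_MS.items()}

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {sizes_mib[precision]:5.1f} MiB weights, "
          f"{speedups[precision]:.2f}x vs FP32")
```

This is a weights-only estimate. Saved checkpoints also carry buffers and metadata, so on-disk sizes differ slightly.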

Optimization 5: Preprocessing Optimization

CPU preprocessing is often the bottleneck:

# Original: PIL + torchvision (slow)
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

image = Image.open("image.jpg").convert("RGB")  # force 3 channels for Normalize
tensor = transform(image)

Use NVIDIA DALI (GPU preprocessing)

from nvidia.dali import pipeline_def, fn
import nvidia.dali.types as types

@pipeline_def
def image_pipeline():
    jpegs, labels = fn.readers.file(file_root="images/")
    images = fn.decoders.image(jpegs, device="mixed")  # GPU decode
    images = fn.resize(images, resize_shorter=256)  # matches transforms.Resize(256)
    # Center crop, normalize, and convert HWC uint8 → CHW float in one op
    images = fn.crop_mirror_normalize(
        images,
        crop=(224, 224),
        mean=[0.485*255, 0.456*255, 0.406*255],
        std=[0.229*255, 0.224*255, 0.225*255],
        dtype=types.FLOAT,
        output_layout="CHW")
    return images, labels

pipe = image_pipeline(batch_size=32, num_threads=4, device_id=0)
pipe.build()
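
The DALI pipeline multiplies the mean and std by 255 because decoded images hold raw 0-255 values, whereas torchvision's `ToTensor` rescales to [0, 1] before `Normalize` runs. The pure-Python sketch below checks that the two conventions agree for a single pixel. The function names and the sample pixel are illustrative only.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_unit_range(pixel):
    """torchvision-style: scale uint8 to [0, 1] first, then normalize."""
    return [(v / 255.0 - m) / s for v, m, s in zip(pixel, MEAN, STD)]

def normalize_byte_range(pixel):
    """DALI-style: normalize raw 0-255 values with mean and std scaled by 255."""
    return [(v - m * 255.0) / (s * 255.0) for v, m, s in zip(pixel, MEAN, STD)]

pixel = [124, 116, 104]  # one RGB pixel, uint8 values
a = normalize_unit_range(pixel)
b = normalize_byte_range(pixel)

# The two results match up to floating-point rounding.
max_diff = max(abs(x - y) for x, y in zip(a, b))
print(f"max difference: {max_diff:.2e}")
```

Forgetting the factor of 255 in one pipeline but not the other is an easy way to lose accuracy silently when moving preprocessing to the GPU.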

LLM Inference Special Optimizations

Large language models have unique challenges:

Memory-bound Problem

LLaMA-7B parameters: 7B × 2 bytes (FP16) = 14 GB
A100 (80 GB) memory bandwidth: ~2 TB/s
Each generated token requires reading all parameters once

Theoretical max speed at batch size 1: 2000 GB/s ÷ 14 GB ≈ 143 tokens/sec
KV cache reads and writes lower this further
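
The same bound can be written as a small function, which makes it easy to see how precision changes the ceiling. This is a sketch of the estimate above only. It ignores KV cache traffic and compute time, and `max_tokens_per_sec` is an illustrative name.

```python
def max_tokens_per_sec(n_params, bytes_per_param, bandwidth_bytes_per_sec):
    """Upper bound on single-sequence decode speed when every token
    requires streaming all weights from GPU memory once."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_sec / weight_bytes

fp16 = max_tokens_per_sec(7e9, 2, 2e12)  # 14 GB of weights at 2 TB/s
int8 = max_tokens_per_sec(7e9, 1, 2e12)  # halving the bytes doubles the bound

print(f"FP16: {fp16:.0f} tokens/sec, INT8: {int8:.0f} tokens/sec")
```

Because the bound depends on bytes moved rather than FLOPs, quantizing weights raises it directly. Batching helps for the same reason: one pass over the weights serves several sequences.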

KV Cache Optimization

# KV cache uses significant memory
# Sequence length 2048, batch 16, 32 layers, hidden size 4096
# KV cache size = 2 (K and V) × 16 × 32 × 2048 × 4096 × 2 bytes ≈ 17.2 GB (16 GiB)

# Use PagedAttention (vLLM)
from vllm import LLM, SamplingParams

prompts = ["Explain KV caching in one sentence."]  # example input
llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)
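
The KV cache formula in the comment above is easy to get wrong by a factor of two, so it helps to evaluate it explicitly. The sketch below assumes standard multi-head attention where the K and V width equals the hidden size. Models using grouped-query attention store less. `kv_cache_bytes` is an illustrative helper, not a vLLM function.

```python
def kv_cache_bytes(batch, layers, seq_len, hidden, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Leading 2: one K tensor and one V tensor per layer.
    return 2 * batch * layers * seq_len * hidden * bytes_per_elem

size = kv_cache_bytes(batch=16, layers=32, seq_len=2048, hidden=4096)
per_token = kv_cache_bytes(batch=1, layers=32, seq_len=1, hidden=4096)

print(f"full cache: {size / 1e9:.1f} GB ({size / 2**30:.0f} GiB)")
print(f"per token:  {per_token / 2**20:.1f} MiB")
```

For this configuration the formula evaluates to 2^34 bytes, more than the 14 GB of FP16 weights, which is why cache management matters so much for LLM serving.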

Speculative Decoding

A small draft model "guesses" several tokens, and the large model verifies them in a single forward pass:

Traditional: Large model generates 1 token → appends it → generates the next → ...

Speculative:
1. Small model quickly generates 4 tokens
2. Large model verifies all 4 in one forward pass
3. If the first 3 are accepted, that one pass yields 3 tokens instead of 1, saving 2 large-model calls
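
The steps above can be sketched with toy "models" that map a token prefix to the next token. This is a simplified greedy-matching variant under stated assumptions. Production implementations accept or reject draft tokens probabilistically and run the verification as one batched forward pass. All names here (`speculative_step`, `draft`, `target`) are hypothetical.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft: Model, target: Model,
                     k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The target model checks each position. A real implementation gets all
    #    k predictions from ONE batched forward pass; this loop only simulates it.
    accepted: List[int] = []
    for i, guess in enumerate(proposal):
        expected = target(prefix + proposal[:i])
        if guess != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            break
        accepted.append(guess)
    return accepted

def target(prefix):  # toy "large" model: count upward modulo 10
    return (prefix[-1] + 1) % 10

def draft(prefix):   # toy "small" model: agrees with target except after a 3
    return 9 if prefix[-1] == 3 else (prefix[-1] + 1) % 10

new_tokens = speculative_step([0], draft, target, k=4)
print(new_tokens)  # [1, 2, 3, 4]: three accepted guesses plus one correction
```

In this variant the output always matches what the target model alone would have produced. The draft model only changes how many target passes are needed.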

Summary

ML Inference Challenges

GPU computation is fast, but:
1. CPU-GPU data transfer is slow
2. CPU preprocessing is slow
3. Small batches can't fully utilize GPU

Optimization Strategies

Problem	Solution
Low GPU utilization	Batching, dynamic batching
Slow data transfer	Pinned memory, CUDA streams
Model too large	Quantization (FP16/INT8), distillation
Preprocessing bottleneck	DALI (GPU preprocessing)
Framework overhead	TensorRT, ONNX Runtime

LLM Special Optimizations

KV cache management (PagedAttention)
Speculative decoding
Continuous batching

Tools

Profiling: Nsight Systems, PyTorch Profiler
Runtime: TensorRT, ONNX Runtime, vLLM
Serving: Triton Inference Server

Remember

Theoretical TFLOPS ≠ actual performance

Real bottlenecks are usually:
1. Data movement
2. Memory bandwidth
3. Software overhead

Optimization order:
Pipeline → Batching → Quantization → Model architecture
