Chapter 32: Case Study: ML Inference Optimization

Part VIII: Case Studies


"Training is science. Inference is engineering." — An ML Engineer

The Story of "GPU Utilization at Only 3%"

We deployed an image classification service using ResNet-50 on an NVIDIA A100 GPU, which has a theoretical peak of 312 TFLOPS (dense FP16 Tensor Core; the dense TF32 peak is 156 TFLOPS).

But we measured only 50 images per second.

One ResNet-50 forward pass requires about 8 GFLOPs per image (roughly 4 billion multiply-accumulates). Theoretically:

312 TFLOPS / 8 GFLOPs = 39,000 images/sec

We achieved only 50 images/sec.

GPU utilization (fraction of peak compute): 50 / 39,000 ≈ 0.13%
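The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and the 312 TFLOPS, 8 GFLOPs, and 50 images/sec figures are the ones quoted in the text.

```python
def peak_images_per_sec(peak_tflops: float, gflops_per_image: float) -> float:
    """Upper bound on throughput if the GPU did nothing but math."""
    return (peak_tflops * 1e12) / (gflops_per_image * 1e9)


def compute_utilization(measured_ips: float, peak_ips: float) -> float:
    """Fraction of the theoretical peak actually achieved (0.0 to 1.0)."""
    return measured_ips / peak_ips


peak = peak_images_per_sec(peak_tflops=312, gflops_per_image=8)
util = compute_utilization(measured_ips=50, peak_ips=peak)
print(f"peak: {peak:,.0f} images/sec, utilization: {util:.2%}")
```

The bound ignores everything except arithmetic, which is exactly why it is so far from the measured number.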

Where's the problem?

1. Image transfer from CPU to GPU: ~5ms
2. GPU computation: ~0.2ms
3. Result transfer from GPU to CPU: ~0.1ms
4. Python overhead: ~10ms
5. Image decoding (CPU): ~5ms
────────────────────────────────────
Total: ~20ms per image

GPU computation accounts for only about 1% of the per-image time (0.2 ms out of roughly 20 ms). The other 99% is spent "feeding data."
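To see where the time goes, the breakdown can be turned into per-stage shares. The sketch below simply copies the stage times listed above; the dictionary keys and the `share` helper are illustrative names, not part of any library.

```python
# Per-image stage times in milliseconds, copied from the breakdown above.
stages_ms = {
    "H2D copy": 5.0,
    "GPU compute": 0.2,
    "D2H copy": 0.1,
    "Python overhead": 10.0,
    "Image decoding (CPU)": 5.0,
}

total_ms = sum(stages_ms.values())


def share(stage: str) -> float:
    """Fraction of the per-image time spent in one stage."""
    return stages_ms[stage] / total_ms


# Print stages from most to least expensive.
for name, ms in sorted(stages_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {ms:5.1f} ms  {share(name):6.1%}")

print(f"total: {total_ms:.1f} ms per image")
```

Sorting by cost makes the priority obvious: Python overhead, transfer, and decoding dominate, while GPU compute barely registers.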

This is the core challenge of ML inference optimization: keeping the GPU busy.

ML Inference Characteristics

Training vs Inference

Characteristic           Training           Inference
──────────────────────────────────────────────────────────────
Batch size               Large (32-4096)    Small (1-64)
Latency requirement      Not important      Critical
Precision requirement    FP32/BF16          Can be lower (INT8)
Frequency                Once               Continuous
Optimization goal        Throughput         Latency + Throughput

Common Bottlenecks

┌─────────────────────────────────────────────────────────────────┐
│                    Inference Pipeline                           │
├─────────┬──────────┬──────────┬──────────┬──────────┬──────────┤
│  Input  │ Preproc  │ H2D Copy │  Compute │ D2H Copy │ Postproc │
│  (I/O)  │  (CPU)   │  (PCIe)  │  (GPU)   │  (PCIe)  │  (CPU)   │
└─────────┴──────────┴──────────┴──────────┴──────────┴──────────┘

Common bottlenecks:
1. I/O: Reading data
2. CPU preprocessing: Decode, resize, normalize
3. PCIe transfer: CPU ↔ GPU
4. GPU compute: Model inference
5. CPU postprocessing: NMS, decoding
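Before reaching for a full profiler, a rough per-stage timer is often enough to tell which of these bottlenecks dominates. The following is a minimal sketch using only the standard library; the `timed` helper, the stage names, and the stand-in workloads are all hypothetical and would be replaced by real decode, copy, and inference calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals = defaultdict(float)  # stage name -> accumulated seconds


@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start


def report() -> None:
    """Print each stage's total time and its share of the measured time."""
    total = sum(stage_totals.values())
    for stage, seconds in sorted(stage_totals.items(), key=lambda kv: -kv[1]):
        print(f"{stage:<12} {seconds * 1e3:8.2f} ms  {seconds / total:6.1%}")


# Stand-in workloads (hypothetical); replace with real pipeline stages.
for _ in range(10):
    with timed("preprocess"):
        sum(i * i for i in range(20_000))
    with timed("compute"):
        sum(range(1_000))
    with timed("postprocess"):
        sorted(range(1_000), reverse=True)

report()
```

One caveat when adapting this to GPU stages: CUDA kernels launch asynchronously, so the GPU must be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock, otherwise the time is attributed to whichever later stage happens to wait.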

Measurement Tools

NVIDIA Nsight Systems

# Record profile
nsys profile -o report python inference.py

# View report
nsys-ui report.nsys-rep

Nsight Systems shows:

  • GPU kernel execution time
  • CPU/GPU synchronization points
  • Memory copy (H2D/D2H)
  • CUDA API calls

PyTorch Profiler

import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for i in range(100):
        with record_function("inference"):
            output = model(input_tensor)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Optimization Strategies

Optimization 1: Batching

Running inference on one image at a time cannot fully utilize the GPU's parallelism:

# Original: process one by one
for image in images:
    result = model(image.unsqueeze(0))  # batch_size = 1

# Optimized: batch processing
batch = torch.stack(images)  # batch_size = 32
results = model(batch)

Batching effect:

Batch Size    Latency (ms)    Throughput (img/s)    GPU Util
─────────────────────────────────────────────────────────────
    1              5.2              192               15%
    8              8.1              988               45%
   32             18.5             1730               78%
  128             62.3             2055               92%
  256            118.7             2157               95%

As batch size increases, throughput rises, but so does latency, and the throughput gains flatten out beyond batch size 128.

Trade-off: Latency vs Throughput
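One way to make this trade-off concrete is to pick the batch size with the highest throughput that still fits a latency budget. The sketch below reuses the measurements from the table; the budgets and the `best_batch_size` helper are hypothetical, and the latency here excludes any time requests spend waiting for a batch to fill.

```python
# (batch size, latency in ms) pairs copied from the table above.
measurements = [(1, 5.2), (8, 8.1), (32, 18.5), (128, 62.3), (256, 118.7)]


def throughput(batch_size: int, latency_ms: float) -> float:
    """Images per second when one batch completes every `latency_ms`."""
    return batch_size / (latency_ms / 1000)


def best_batch_size(latency_budget_ms: float):
    """Highest-throughput batch size whose latency fits the budget, or None."""
    feasible = [(b, ms) for b, ms in measurements if ms <= latency_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda pair: throughput(*pair))[0]


for budget in (10, 20, 100):  # hypothetical latency budgets in ms
    print(budget, "ms budget ->", best_batch_size(budget))
```

With these numbers, a 10 ms budget selects batch size 8, a 20 ms budget selects 32, and a 100 ms budget selects 128.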

Optimization 2: Reduce CPU-GPU Data Transfer

Use Pinned Memory

# Normal memory → GPU: requires extra copy
tensor = torch.randn(batch_size, 3, 224, 224)
tensor_gpu = tensor.to('cuda')  # slow


Optimization 4: Quantization

FP32 → FP16

# Cast the whole model to FP16 (a plain cast; automatic mixed precision would use torch.autocast instead)
model = model.half()  # Convert weights to FP16
input_tensor = input_tensor.half()  # Inputs must match the weight dtype
output = model(input_tensor)

FP32 → INT8 (requires calibration)

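import torch
import torch.quantization as quant

# Eager-mode static quantization. The 'fbgemm' backend targets x86 CPUs,
# so the quantized model runs on the CPU, not the GPU. The model must also
# wrap its inputs/outputs in QuantStub/DeQuantStub.
model.eval()

# Prepare for quantization (inserts observers)
model.qconfig = quant.get_default_qconfig('fbgemm')
quant.prepare(model, inplace=True)

# Calibrate (with representative data)
with torch.no_grad():
    for data in calibration_loader:
        model(data)

# Convert observed modules to INT8
quant.convert(model, inplace=True)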
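What calibration actually produces is a mapping from floats to 8-bit integers. The following is a minimal pure-Python sketch of symmetric per-tensor quantization, assuming the scale is chosen from the largest magnitude seen during calibration. PyTorch's observers are more elaborate (for example, they can also use a zero point), and the sample values here are made up.

```python
def calibrate_scale(samples: list[float]) -> float:
    """Choose a scale so the largest observed magnitude maps to 127."""
    max_abs = max(abs(x) for x in samples)
    return max_abs / 127 if max_abs > 0 else 1.0


def quantize(x: float, scale: float) -> int:
    """Map a float to a signed 8-bit integer, clamping out-of-range values."""
    return max(-128, min(127, round(x / scale)))


def dequantize(q: int, scale: float) -> float:
    """Recover an approximation of the original float."""
    return q * scale


calibration_data = [-1.5, -0.2, 0.0, 0.4, 2.0]  # made-up activations
scale = calibrate_scale(calibration_data)

for x in [0.4, 1.0, 3.0]:  # 3.0 lies outside the calibrated range and saturates
    q = quantize(x, scale)
    print(f"{x:+.2f} -> {q:4d} -> {dequantize(q, scale):+.4f}")
```

Values inside the calibrated range come back with an error of at most half a quantization step, while values outside it are clamped. This is why calibration data needs to be representative of production inputs.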

Quantization effect:

Precision    Model Size    Latency    Accuracy Drop
───────────────────────────────────────────────────
FP32         98 MB         5.2 ms     baseline
FP16         49 MB         2.8 ms     ~0%
INT8         25 MB         1.5 ms     0.5-1%
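The model sizes in the table follow almost directly from the parameter count. As a rough check, assuming ResNet-50 has about 25.6 million parameters (an approximation not stated in the text) and ignoring metadata such as quantization scales:

```python
RESNET50_PARAMS = 25.6e6  # approximate parameter count (assumption)
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}


def model_size_mib(num_params: float, precision: str) -> float:
    """Weight storage in MiB, ignoring metadata such as quantization scales."""
    return num_params * BYTES_PER_PARAM[precision] / 2**20


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {model_size_mib(RESNET50_PARAMS, precision):.0f} MiB")
```

The results land within about 1 MB of the table; small differences are expected from rounding and per-layer overhead.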

Optimization 5: Preprocessing Optimization

CPU preprocessing is often the bottleneck:

# Original: PIL + torchvision (slow)
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

image = Image.open("image.jpg")
tensor = transform(image)

Use NVIDIA DALI (GPU preprocessing)

from nvidia.dali import pipeline_def, fn
import nvidia.dali.types as types

@pipeline_def
def image_pipeline():
    jpegs, labels = fn.readers.file(file_root="images/")
    images = fn.decoders.image(jpegs, device="mixed")  # GPU decode
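    images = fn.resize(images, resize_shorter=256)  # match transforms.Resize(256)
    # Center crop, per-channel normalize, and HWC → CHW in one fused operator
    # (fn.normalize does not accept a per-channel `std` list)
    images = fn.crop_mirror_normalize(
        images,
        crop=(224, 224),
        mean=[0.485*255, 0.456*255, 0.406*255],
        std=[0.229*255, 0.224*255, 0.225*255],
        dtype=types.FLOAT,
        output_layout="CHW")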
    return images, labels

pipe = image_pipeline(batch_size=32, num_threads=4, device_id=0)
pipe.build()

LLM Inference Special Optimizations

Large language models have unique challenges:

Memory-bound Problem

LLaMA-7B parameters: 7B × 2 bytes (FP16) = 14 GB
A100 memory bandwidth: 2 TB/s
Each generated token requires reading all parameters from GPU memory

Theoretical max speed at batch size 1: 2000 GB/s ÷ 14 GB ≈ 143 tokens/sec
KV cache reads and writes lower this ceiling further
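The same bandwidth argument shows why weight quantization and batching help LLM decoding. This is a back-of-the-envelope sketch: the function name is illustrative, and it assumes each decode step streams all weights exactly once while ignoring KV cache traffic and compute limits.

```python
def max_tokens_per_sec(num_params: float, bytes_per_param: float,
                       bandwidth_gb_s: float, batch_size: int = 1) -> float:
    """Bandwidth-bound ceiling on decode speed.

    Each decode step streams all weights once and yields one token per
    sequence in the batch. KV cache traffic and compute limits are ignored.
    """
    weight_gb = num_params * bytes_per_param / 1e9
    steps_per_sec = bandwidth_gb_s / weight_gb
    return steps_per_sec * batch_size


print(round(max_tokens_per_sec(7e9, 2, 2000)))      # FP16, batch 1
print(round(max_tokens_per_sec(7e9, 1, 2000)))      # INT8 weights, batch 1
print(round(max_tokens_per_sec(7e9, 2, 2000, 16)))  # FP16, batch 16 (aggregate)
```

Halving the bytes per weight doubles the ceiling, and batching amortizes one weight read across many sequences. Real systems fall short of these numbers once KV cache traffic grows.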

KV Cache Optimization

# KV cache uses significant memory
# Sequence length 2048, batch 16, 32 layers, hidden size 4096
# KV cache size = 2 (K and V) × 16 × 32 × 2048 × 4096 × 2 bytes ≈ 17.2 GB (16 GiB)

# Use PagedAttention (vLLM)
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]  # example prompt
llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)
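The KV cache arithmetic, and the motivation for allocating it in pages, can be sketched as follows. Evaluating the product for the configuration above gives about 17.2 GB. The second half is a simplified model of the idea behind paged allocation, not vLLM's actual implementation; the block size of 16 tokens and the sequence lengths are assumptions chosen for illustration.

```python
import math


def kv_cache_bytes(batch: int, seq_len: int, layers: int, hidden: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes for K and V across all layers and tokens (standard multi-head attention)."""
    return 2 * batch * layers * seq_len * hidden * bytes_per_elem


def reserved_tokens_contiguous(seq_lens, max_len):
    """Token slots reserved when every sequence preallocates max_len."""
    return len(seq_lens) * max_len


def reserved_tokens_paged(seq_lens, block_size=16):
    """Token slots reserved when memory is handed out in fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


total = kv_cache_bytes(batch=16, seq_len=2048, layers=32, hidden=4096)
print(f"{total / 1e9:.1f} GB ({total / 2**30:.0f} GiB)")

seq_lens = [100, 350, 1200, 2048]  # hypothetical actual sequence lengths
print(reserved_tokens_contiguous(seq_lens, max_len=2048))
print(reserved_tokens_paged(seq_lens))
```

In this example, preallocating the maximum length reserves 8192 token slots, while block-based allocation reserves 3712. The freed memory can hold more concurrent sequences.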

Speculative Decoding

Use a small draft model to "guess" several tokens, then let the large model verify them in one pass:

Traditional: Large model generates 1 token → append it → generate next → ...

Speculative:
1. Small model quickly generates 4 tokens
2. Large model verifies all 4 in a single forward pass
3. If the first 3 are accepted, one large-model call has produced 3 tokens instead of 1, saving 2 large-model calls
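The steps above can be sketched with toy stand-ins for the two models. Everything here is hypothetical: `target_next` and `draft_next` are deterministic functions over integer tokens, and acceptance uses exact greedy matching, whereas real speculative sampling uses a probabilistic accept/reject rule. The point is only that the output matches ordinary decoding while using fewer large-model calls.

```python
def target_next(prefix):
    """Stand-in for the large model: deterministic greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 10


def draft_next(prefix):
    """Stand-in for the small model: wrong whenever len(prefix) is a multiple of 4."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 10


def generate_baseline(prompt, n):
    """Ordinary decoding: one large-model call per token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):], n


def generate_speculative(prompt, n, k=4):
    """Greedy speculative decoding; returns (new tokens, large-model calls)."""
    tokens, calls = list(prompt), 0
    while len(tokens) - len(prompt) < n:
        # 1. Draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. One large-model call checks every draft position
        #    (a single batched forward pass in a real system).
        calls += 1
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                # 3. Keep the matching prefix, then the large model's correction.
                tokens.extend(draft[:i] + [expected])
                break
        else:
            tokens.extend(draft)  # all k draft tokens accepted
    return tokens[len(prompt):len(prompt) + n], calls


prompt = [1, 2, 3]
base, base_calls = generate_baseline(prompt, 10)
spec, spec_calls = generate_speculative(prompt, 10)
print(spec == base, base_calls, spec_calls)
```

The speedup depends entirely on how often the draft model agrees with the large model. A poor draft model degrades toward one token per large-model call, plus the wasted draft work.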

Summary

ML Inference Challenges

GPU computation is fast, but:
1. CPU-GPU data transfer is slow
2. CPU preprocessing is slow
3. Small batches can't fully utilize GPU

Optimization Strategies

Problem                     Solution
───────────────────────────────────────────────────────────────────
Low GPU utilization         Batching, dynamic batching
Slow data transfer          Pinned memory, CUDA streams
Model too large             Quantization (FP16/INT8), distillation
Preprocessing bottleneck    DALI (GPU preprocessing)
Framework overhead          TensorRT, ONNX Runtime

LLM Special Optimizations

  • KV cache management (PagedAttention)
  • Speculative decoding
  • Continuous batching

Tools

  • Profiling: Nsight Systems, PyTorch Profiler
  • Runtime: TensorRT, ONNX Runtime, vLLM
  • Serving: Triton Inference Server

Remember

Theoretical TFLOPS ≠ actual performance

Real bottlenecks are usually:
1. Data movement
2. Memory bandwidth
3. Software overhead

Optimization order:
Pipeline → Batching → Quantization → Model architecture