Chapter 24: AI/ML Benchmarks
Part VII: AI/HPC
"In machine learning, the only benchmark that matters is your production workload." — Unknown
The Benchmark That Proved Nothing
Linda was evaluating AI accelerator chips for her company's inference infrastructure. Vendor A claimed "500 TOPS," Vendor B claimed "450 TOPS." Easy decision, right? Go with the bigger number.
She ran her actual workload—a BERT-based text classifier—on both chips. Vendor B, with the "smaller" TOPS number, was 40% faster.
"How is this possible?" she asked Vendor A's sales engineer.
He shifted uncomfortably. "Well, our 500 TOPS is at INT4 precision. Your model uses FP16. And our number is for batch size 256—you're running batch size 1. Also, that's peak theoretical throughput. Sustained performance depends on..."
Linda cut him off. "So that number on your datasheet is essentially meaningless for my use case?"
"I wouldn't say meaningless..."
This is the fundamental problem with AI benchmarks. Unlike traditional software where "faster" has a clear meaning, AI performance depends on model architecture, precision, batch size, optimization level, and a dozen other factors. Two chips with identical specs can have 3x performance differences on real workloads.
Why AI Benchmarking Is Different
AI performance evaluation is more complex than traditional software benchmarking for three fundamental reasons:
1. "Correct" Is Not Binary
When a sorting algorithm produces [1, 3, 2, 4, 5], it's wrong. Period. But when an image classifier achieves 76.3% accuracy instead of 78.1%, is that acceptable? It depends on the application, the cost savings, the latency requirements. Traditional benchmarks measure speed at fixed correctness. AI benchmarks must navigate a speed-accuracy trade-off.
2. Performance and Accuracy Are Intertwined
Quantization (running a model at lower precision) makes inference faster but usually costs some accuracy. The figures below are illustrative; actual values depend on the model, the hardware, and the quantization method:
| Precision | Relative Speed | Typical Accuracy Loss |
|---|---|---|
| FP32 | 1.0x | 0% (baseline) |
| FP16 | ~2.0x | 0.1% |
| INT8 | ~2.5x | 0.5% |
| INT4 | ~4.0x | 2.0% |
A benchmark that only reports speed is incomplete. A benchmark that only reports accuracy ignores practical constraints.
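One way to make the trade-off concrete is to pick the fastest precision that still fits an accuracy budget. The sketch below is a minimal illustration that reuses the illustrative figures from the table; the function name `fastest_within_budget` and the dictionary layout are assumptions made for this example, not part of any benchmark suite.

```python
# Illustrative figures taken from the table above; real values depend on
# the model, the hardware, and the quantization method.
PRECISIONS = {
    "FP32": {"speedup": 1.0, "accuracy_loss": 0.0},
    "FP16": {"speedup": 2.0, "accuracy_loss": 0.1},
    "INT8": {"speedup": 2.5, "accuracy_loss": 0.5},
    "INT4": {"speedup": 4.0, "accuracy_loss": 2.0},
}


def fastest_within_budget(max_accuracy_loss, table=PRECISIONS):
    """Return the fastest precision whose accuracy loss fits the budget."""
    candidates = [
        (spec["speedup"], name)
        for name, spec in table.items()
        if spec["accuracy_loss"] <= max_accuracy_loss
    ]
    if not candidates:
        raise ValueError("no precision meets the accuracy budget")
    # max() compares speedup first, so the fastest candidate wins.
    return max(candidates)[1]


print(fastest_within_budget(0.5))   # INT8
print(fastest_within_budget(0.05))  # FP32
```

The point of the sketch is that speed is only ranked among configurations that already satisfy the accuracy constraint, which is how a useful benchmark report should be read.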
3. Hardware Diversity Creates Comparison Nightmares
The same PyTorch model can run on CPUs, NVIDIA GPUs, AMD GPUs, Google TPUs, Apple Neural Engine, Qualcomm NPUs, Intel Gaudi, and dozens of custom ASICs. Each platform has different optimization paths, different supported operations, different precision formats. Comparing "apples to apples" requires carefully controlled methodology.
MLPerf: The Industry Standard
The industry's answer to the AI benchmarking problem is MLPerf, maintained by MLCommons—a consortium including NVIDIA, Google, Intel, AMD, Meta, Microsoft, and dozens of other companies. MLPerf attempts to create standardized, reproducible benchmarks that allow meaningful comparisons across different hardware platforms.
Think of MLPerf as the SPEC CPU of the AI world: a standardized suite with strict rules about what you can and cannot change.
The MLPerf Family
MLPerf isn't a single benchmark—it's a family of benchmarks targeting different scenarios:
| Benchmark | What It Measures | Typical Submitters |
|---|---|---|
| MLPerf Training | Time to train to target accuracy | NVIDIA, Google, Intel, hyperscalers |
| MLPerf Inference | Inference latency and throughput | AI chip startups, cloud providers |
| MLPerf HPC | ML on supercomputers | National labs, research institutions |
| MLPerf Tiny | Performance on microcontrollers | Embedded chip vendors |
| MLPerf Mobile | Performance on phones/tablets | Qualcomm, MediaTek, Samsung |
| MLPerf Storage | Data pipeline performance | Storage vendors |
For most readers of this book, Training and Inference are the benchmarks you'll encounter most often.
MLPerf Training: Racing to Accuracy
The Training benchmark measures one thing: how long does it take to train a model from random initialization to a specified target accuracy?
Representative benchmarks and their quality targets (the suite and its targets change between rounds):
| Benchmark | Model | Dataset | Target Quality |
|---|---|---|---|
| ResNet | ResNet-50 v1.5 | ImageNet | 75.9% Top-1 |
| RetinaNet | RetinaNet | Open Images | 34.0% mAP |
| BERT | BERT-Large | Wikipedia | 0.72 masked-LM accuracy |
| DLRM | DLRM | Criteo | 0.8025 AUC |
| 3D U-Net | 3D U-Net | KiTS19 | 0.908 Mean Dice |
| GPT-3 | GPT-3 175B | C4 | 2.69 log perplexity |
| Stable Diffusion | Stable Diffusion | LAION-400M | FID ≤ 90 and CLIP ≥ 0.15 |
Result Format
Example result (illustrative numbers, not official submissions):
System: 8x NVIDIA H100 SXM5
Benchmark: BERT-Large Training
Time: 2.3 minutes (to reach the 0.72 masked-LM accuracy target)
Comparison:
  DGX H100 (8 GPU): 2.3 min
  DGX A100 (8 GPU): 5.1 min
  Cloud TPU v4 (16): 3.8 min
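The time-to-train idea can be sketched as a loop that trains, evaluates, and stops the clock once the quality target is reached. This is a minimal sketch, not MLPerf's actual timing rules: `train_one_epoch` and `evaluate` are hypothetical callables standing in for a real training loop and validation pass, and the comparison assumes a higher-is-better metric (for log perplexity or FID the comparison would be reversed).

```python
import time


def time_to_accuracy(train_one_epoch, evaluate, target, max_epochs=100,
                     clock=time.perf_counter):
    """Train until evaluate() reaches target; return (seconds, epochs).

    train_one_epoch and evaluate are hypothetical callables supplied by
    the caller. Assumes a higher-is-better quality metric.
    """
    start = clock()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target:
            return clock() - start, epoch
    raise RuntimeError(f"target {target} not reached in {max_epochs} epochs")


# Toy stand-in for a model: quality improves by 0.25 per epoch.
state = {"accuracy": 0.0}


def fake_train():
    state["accuracy"] += 0.25


def fake_eval():
    return state["accuracy"]


seconds, epochs = time_to_accuracy(fake_train, fake_eval, target=0.72)
print(f"reached target after {epochs} epochs in {seconds:.6f} s")
```

Measuring wall-clock time to a fixed quality bar, rather than time per step, prevents a system from looking fast by taking cheap steps that converge slowly.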
Closed vs Open Division
Closed Division:
- Must use specified model architecture
- Must achieve specified accuracy
- Can adjust only a permitted set of hyperparameters, such as batch size and learning rate
- Purpose: Fair hardware comparison
Open Division:
- Can modify model architecture
- Can use different optimization techniques
- Purpose: Showcase innovative methods
MLPerf Inference
MLPerf Inference measures how quickly a trained model serves predictions under load patterns that mirror real deployment scenarios.
Scenario Definitions
| Scenario | Description | Primary Metric |
|---|---|---|
| Server | Concurrent requests, latency constraints | QPS (within latency SLO) |
| Offline | Batch processing, no latency constraints | Throughput (samples/sec) |
| SingleStream | One request at a time | Latency (ms) |
| MultiStream | Multiple independent streams | Number of streams (early rounds); per-query tail latency (later rounds) |
Server Scenario Details
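The Server scenario simulates a real online service:
Request arrives → Queue → Process → Response
The latency constraint covers the whole path from arrival to response, so time spent waiting in the queue counts against it.
SLO (Service Level Objective):
- Example: 99% of requests must complete within 15 ms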
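To see why a tail-latency SLO limits usable throughput, the queue above can be simulated. The sketch below is a toy model under stated assumptions: a single worker, a fixed service time, and Poisson request arrivals. It is not LoadGen, and the names `simulate_server` and `meets_slo` are made up for this example.

```python
import random


def simulate_server(qps, service_ms, num_requests=20000, seed=0):
    """Single-worker FIFO queue with Poisson arrivals; returns latencies in ms."""
    rng = random.Random(seed)
    arrival = 0.0          # arrival time of the current request (ms)
    server_free_at = 0.0   # time at which the worker becomes idle (ms)
    latencies = []
    for _ in range(num_requests):
        arrival += rng.expovariate(qps) * 1000.0  # inter-arrival gap in ms
        start = max(arrival, server_free_at)      # wait if the worker is busy
        server_free_at = start + service_ms
        latencies.append(server_free_at - arrival)  # queueing + processing
    return latencies


def meets_slo(qps, service_ms, slo_ms=15.0, required_fraction=0.99):
    """True if at least required_fraction of requests finish within slo_ms."""
    latencies = simulate_server(qps, service_ms)
    within = sum(1 for latency in latencies if latency <= slo_ms)
    return within / len(latencies) >= required_fraction


# A 5 ms model could in theory serve 200 QPS, but the SLO fails well before that.
for qps in (10, 100, 150, 190):
    print(qps, meets_slo(qps, service_ms=5.0))
```

The reported Server metric is the highest request rate at which the SLO still holds, which is typically well below the peak throughput seen in the Offline scenario.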
DeepBench
DeepBench, an open-source suite from Baidu Research, takes a different approach from MLPerf: instead of complete models, it times the low-level operations that dominate deep learning workloads.
Measurement Items
```text
1. GEMM (General Matrix Multiplication)
   - Matrix sizes: from small (256×256) to large (4096×4096)
   - Precision: FP32, FP16, INT8
   - Simulates: Fully connected layers, Attention
2. Convolution
   - Various kernel sizes (1×1, 3×3, 5×5)
   - Stride, padding variations
   - Simulates: CNN convolution layers
3. RNN (Recurrent Neural Networks)
   - LSTM, GRU
   - Different hidden sizes and sequence lengths
   - Simulates: Sequence models
4. All-Reduce
   - Measures distributed training communication performance
   - Different data sizes and GPU counts
```
Typical Results
NVIDIA A100 DeepBench Results:
GEMM (4096×4096, FP16):
Peak: 312 TFLOPS
Achieved: 285 TFLOPS (91.3%)
Conv2D (3×3, 256 channels):
Peak: 312 TFLOPS
Achieved: 198 TFLOPS (63.5%)
Reason: Convolution moves more data through memory per unit of compute than a large GEMM, so it does not reach pure GEMM efficiency.
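The "achieved versus peak" percentages come from simple arithmetic: count the floating-point operations in the kernel, divide by the measured time, and compare with the datasheet peak. A minimal sketch follows; the helper names are assumptions for illustration, and the timing would come from your own measurement.

```python
def gemm_flops(m, n, k):
    """FLOPs for C = A(m x k) @ B(k x n): one multiply and one add per term."""
    return 2 * m * n * k


def achieved_tflops(m, n, k, seconds):
    """Sustained TFLOPS for a GEMM that took `seconds` to run."""
    return gemm_flops(m, n, k) / seconds / 1e12


def efficiency_percent(achieved, peak):
    """Achieved throughput as a percentage of the theoretical peak."""
    return 100.0 * achieved / peak


# Reproduce the percentages quoted in the results above.
print(round(efficiency_percent(285, 312), 1))  # 91.3
print(round(efficiency_percent(198, 312), 1))  # 63.5

# FLOP count for the 4096 x 4096 x 4096 GEMM case.
print(gemm_flops(4096, 4096, 4096))  # 137438953472
```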
Other AI Benchmarks
AI Benchmark (ETH Zürich)
AI performance testing designed specifically for mobile devices:
Features:
- Targets phone NPUs and GPUs
- Covers multiple AI tasks
- Has Android App for direct testing
Test items:
1. Image Classification (MobileNet, EfficientNet)
2. Object Detection (YOLO, SSD)
3. Image Segmentation
4. Face Recognition
5. Super Resolution
6. Language Models
Result format:
- Total score plus individual per-test scores
- Scores can be compared with those of other devices
DAWNBench
Developed at Stanford, DAWNBench focuses on the time and cost needed to train to a target accuracy:
Core metrics:
Time-to-Accuracy
Cost-to-Accuracy ($)
Example:
"How much time/money to reach 93% Top-5 accuracy on ImageNet?"
These metrics are more practical for budgeting:
- They capture cost, not just speed
- They account for cloud computing pricing
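Cost-to-accuracy is just time-to-accuracy multiplied by what the hardware costs per hour. The sketch below uses made-up prices and training times purely for illustration; `cost_to_accuracy` is a hypothetical helper, not a DAWNBench API.

```python
def cost_to_accuracy(hours_to_target, price_per_instance_hour, num_instances=1):
    """Total cost in dollars of training until the target accuracy is reached."""
    return hours_to_target * price_per_instance_hour * num_instances


# Hypothetical configurations: times and prices are invented for illustration.
configs = {
    "8-GPU instance": cost_to_accuracy(hours_to_target=3.0,
                                       price_per_instance_hour=24.0),
    "1-GPU instance": cost_to_accuracy(hours_to_target=20.0,
                                       price_per_instance_hour=3.0),
}

cheapest = min(configs, key=configs.get)
for name, dollars in configs.items():
    print(f"{name}: ${dollars:.2f}")
print("cheapest:", cheapest)  # cheapest: 1-GPU instance
```

In this invented example the slower configuration is the cheaper one, which is exactly the kind of result a speed-only benchmark hides.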
Geekbench ML
Cross-platform consumer AI benchmark:
Pros:
- Easy to run (download app)
- Cross-platform (Windows, macOS, iOS, Android)
- Result database for easy comparison
Cons:
- Not fully transparent (not all workload details are disclosed)
- Vendors may optimize specifically for the benchmark
- Not suitable for serious performance analysis
Running MLPerf: Practical Guide
Environment Setup
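# 1. Get the MLPerf Inference code
git clone https://github.com/mlcommons/inference.git
cd inference
# 2. Choose a benchmark (ResNet-50 as the example)
cd vision/classification_and_detection
# 3. Prepare the dataset
# ImageNet validation set (50,000 images)
# Must be downloaded from the official source
# 4. Install dependencies
# (setup steps change between releases; if this file is absent in your checkout,
# follow the README in this directory, which also covers building LoadGen)
pip install -r requirements.txt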
Running Inference Benchmark
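# Paths are placeholders; option names follow the reference implementation's
# README and may change between releases
export MODEL_DIR=/path/to/models
export DATA_DIR=/path/to/imagenet
# SingleStream scenario (measure latency)
./run_local.sh onnxruntime resnet50 cpu \
--scenario SingleStream \
--accuracy
# Server scenario (measure QPS)
./run_local.sh onnxruntime resnet50 cpu \
--scenario Server \
--qps 100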
Interpreting Results
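MLPerf Inference Result Example (simplified):
TestScenario.SingleStream:
  qps: 156.25
  latency (ns): 6400000 (6.4 ms)
Result summary:
  samples processed: 50000
  accuracy: 76.15%
  reference accuracy: 76.46%
Result: 6.4 ms latency and 76.15% accuracy. The 76.46% figure is the FP32 reference accuracy for ResNet-50; the closed division requires at least 99% of it (about 75.70%), so this run qualifies. A run below that threshold would be invalid and would need a less aggressive optimization.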
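The numbers in a result like this can be cross-checked with a few lines of arithmetic. The sketch below is an assumption-laden helper, not part of the MLPerf tooling: `required_fraction` expresses how much of the reference accuracy a run must retain. MLPerf's closed division commonly uses 0.99; setting it to 1.0 treats the reference value as a hard floor.

```python
def latency_ns_to_ms(latency_ns):
    """Convert a latency in nanoseconds to milliseconds."""
    return latency_ns / 1e6


def single_stream_qps(latency_ns):
    """With one request in flight at a time, QPS is the inverse of latency."""
    return 1e9 / latency_ns


def accuracy_ok(measured, reference, required_fraction=0.99):
    """True if measured accuracy retains the required share of the reference."""
    return measured >= reference * required_fraction


print(latency_ns_to_ms(6_400_000))                            # 6.4
print(single_stream_qps(6_400_000))                           # 156.25
print(accuracy_ok(76.15, 76.46))                              # True
print(accuracy_ok(76.15, 76.46, required_fraction=1.0))       # False
```

Which threshold applies depends on the benchmark rules or on your own product requirements, so state it explicitly next to any accuracy figure you report.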
Building Your Own AI Benchmark
For specific applications, you may need custom benchmarks:
Design Principles
1. Define clear metrics
- Latency (P50, P95, P99)
- Throughput (samples/sec)
- Accuracy (specific definition)
- Resource usage (memory, power)
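2. Reproducibility
- Fix random seeds
- Record the complete environment (hardware, drivers, library versions)
- Keep benchmark code and configuration under version control
3. Reflect real workload
- Use the actual data distribution
- Simulate real request patterns
- Include the mix of batch sizes seen in production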
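The latency metrics in these principles can be collected with a small harness. The following is a minimal sketch using only the standard library; `measure_latency` and `summarize` are names invented for this example, and `fn` stands for whatever callable runs one inference. For GPU workloads, `fn` would need to wait for the device to finish before returning, otherwise only the launch time is measured.

```python
import math
import statistics
import time


def measure_latency(fn, iterations=1000, warmup=50):
    """Call fn repeatedly and return per-call latencies in milliseconds."""
    for _ in range(warmup):          # warm caches, JITs, and lazy initializers
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Mean, standard deviation, and P50/P95/P99 of latency samples."""
    return {
        "mean": statistics.mean(samples_ms),
        "std": statistics.stdev(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


# Toy workload standing in for a model's forward pass.
stats = summarize(measure_latency(lambda: sum(range(10_000)), iterations=200))
for name, value in stats.items():
    print(f"{name}: {value:.4f} ms")
```

Reporting P95 and P99 alongside the mean matters because a service's user-visible behavior is governed by its slowest requests, not its average one.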
Report Format Suggestion
AI Benchmark Report
═══════════════════════════════════════════════════════════
Model: YOLOv8-Medium
Hardware: NVIDIA RTX 4090
Precision: FP16
Batch Size: 1
Latency Results (1000 iterations):
Mean: 4.2 ms
P50: 4.1 ms
P95: 5.8 ms
P99: 7.2 ms
Std: 0.9 ms
Throughput (60 seconds):
238 images/second
Accuracy (on validation set):
mAP@0.5: 45.2%
mAP@0.5:0.95: 33.1%
GPU Utilization: 85%
GPU Memory: 4.2 GB / 24 GB
Power: 280W average
Environment:
CUDA: 12.2
cuDNN: 8.9
TensorRT: 8.6
Driver: 535.86
OS: Ubuntu 22.04
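Part of the Environment section of such a report can be captured automatically so it is never filled in from memory. This is a minimal sketch using only the Python standard library; the package list is an assumption to adapt to your stack, and GPU driver, CUDA, and cuDNN versions are not covered here because they need vendor tools.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages=("numpy", "torch", "onnxruntime")):
    """Collect OS, Python, and package versions for a benchmark report.

    The default package names are examples; packages that are not
    installed are recorded as None rather than raising an error.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


print(json.dumps(capture_environment(), indent=2))
```

Storing this output next to every set of results makes it possible to explain later why two runs of the "same" benchmark disagree.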
What Linda Learned
Linda eventually made her chip decision. Neither Vendor A nor Vendor B.
She built her own benchmark using her actual production model, with her actual batch sizes, at her actual precision requirements. She tested latency at P50, P95, and P99. She measured power consumption under load. She calculated cost per inference.
Vendor C, which she'd initially dismissed because of lower published TOPS numbers, turned out to be the best fit for her specific workload—40% lower cost per inference than either A or B.
"The published specs weren't wrong," she explained to her team. "They just weren't measuring what mattered to us. TOPS at INT4 with batch size 256 is a valid metric—it's just not our metric."
The lesson: standardized benchmarks like MLPerf provide valuable apples-to-apples comparisons, but the most important benchmark is always the one that reflects your actual production workload.
Summary
AI/ML benchmarking has its own challenges and methods:
Major Benchmarks
- MLPerf: Industry standard, complete models, strict rules
- DeepBench: Core operations, low-level performance analysis
- AI Benchmark: Mobile devices, consumer-oriented
- DAWNBench: Cost-oriented, time/money
Key Considerations
- Performance and accuracy trade off against each other
- Hardware diversity makes comparison difficult
- Need clear measurement conditions
Choosing a Benchmark
- Hardware procurement evaluation: MLPerf
- Low-level optimization analysis: DeepBench
- Consumer comparison: Geekbench ML / AI Benchmark
- Cost-sensitive scenarios: DAWNBench
Custom Benchmarks
- Define clear metrics
- Ensure reproducibility
- Reflect real workload
- Report complete environment