Chapter 24: AI/ML Benchmarks

Part VII: AI/HPC

"In machine learning, the only benchmark that matters is your production workload." — Unknown

The Benchmark That Proved Nothing

Linda was evaluating AI accelerator chips for her company's inference infrastructure. Vendor A claimed "500 TOPS," Vendor B claimed "450 TOPS." Easy decision, right? Go with the bigger number.

She ran her actual workload—a BERT-based text classifier—on both chips. Vendor B, with the "smaller" TOPS number, was 40% faster.

"How is this possible?" she asked Vendor A's sales engineer.

He shifted uncomfortably. "Well, our 500 TOPS is at INT4 precision. Your model uses FP16. And our number is for batch size 256—you're running batch size 1. Also, that's peak theoretical throughput. Sustained performance depends on..."

Linda cut him off. "So that number on your datasheet is essentially meaningless for my use case?"

"I wouldn't say meaningless..."

This is the fundamental problem with AI benchmarks. Unlike traditional software, where "faster" has a clear meaning, AI performance depends on model architecture, precision, batch size, optimization level, and a dozen other factors. Two chips with near-identical datasheet specs can differ by as much as 3x on real workloads.

Why AI Benchmarking Is Different

AI performance evaluation is more complex than traditional software for three fundamental reasons:

1. "Correct" Is Not Binary

When a sorting algorithm produces [1, 3, 2, 4, 5], it's wrong. Period. But when an image classifier achieves 76.3% accuracy instead of 78.1%, is that acceptable? It depends on the application, the cost savings, the latency requirements. Traditional benchmarks measure speed at fixed correctness. AI benchmarks must navigate a speed-accuracy trade-off.

2. Performance and Accuracy Are Intertwined

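Quantization (running a model at lower precision) makes inference faster but usually costs some accuracy. The figures below are illustrative; actual speedups and losses depend on the model and the hardware: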

Precision	Relative Speed	Typical Accuracy Loss
FP32	1.0x	0% (baseline)
FP16	~2.0x	-0.1%
INT8	~2.5x	-0.5%
INT4	~4.0x	-2.0%

A benchmark that only reports speed is incomplete. A benchmark that only reports accuracy ignores practical constraints.
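One way to make the trade-off concrete is to treat accuracy loss as a budget and pick the fastest precision that fits inside it. The sketch below is a minimal illustration using the figures from the table above; the `PrecisionOption` type and `fastest_within_budget` helper are hypothetical names introduced here, and real numbers would have to come from measuring your own model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrecisionOption:
    name: str
    relative_speed: float  # speedup over FP32, taken from the table above
    accuracy_loss: float   # percentage points lost vs FP32, from the table above


OPTIONS = [
    PrecisionOption("FP32", 1.0, 0.0),
    PrecisionOption("FP16", 2.0, 0.1),
    PrecisionOption("INT8", 2.5, 0.5),
    PrecisionOption("INT4", 4.0, 2.0),
]


def fastest_within_budget(options, max_accuracy_loss):
    """Return the fastest option whose accuracy loss fits the budget."""
    eligible = [o for o in options if o.accuracy_loss <= max_accuracy_loss]
    if not eligible:
        raise ValueError("no precision meets the accuracy budget")
    return max(eligible, key=lambda o: o.relative_speed)


if __name__ == "__main__":
    for budget in (0.0, 0.2, 0.5, 2.0):
        choice = fastest_within_budget(OPTIONS, budget)
        print(f"budget {budget:.1f} pts -> {choice.name} ({choice.relative_speed}x)")
```

Framing the choice this way forces the accuracy requirement to be stated up front, which is the same idea MLPerf uses when it fixes a target accuracy and then measures speed.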

3. Hardware Diversity Creates Comparison Nightmares

The same PyTorch model can run on CPUs, NVIDIA GPUs, AMD GPUs, Google TPUs, Apple Neural Engine, Qualcomm NPUs, Intel Gaudi, and dozens of custom ASICs. Each platform has different optimization paths, different supported operations, different precision formats. Comparing "apples to apples" requires carefully controlled methodology.

MLPerf: The Industry Standard

The industry's answer to the AI benchmarking problem is MLPerf, maintained by MLCommons—a consortium including NVIDIA, Google, Intel, AMD, Meta, Microsoft, and dozens of other companies. MLPerf attempts to create standardized, reproducible benchmarks that allow meaningful comparisons across different hardware platforms.

Think of MLPerf as the SPEC CPU of the AI world: a standardized suite with strict rules about what you can and cannot change.

The MLPerf Family

MLPerf isn't a single benchmark—it's a family of benchmarks targeting different scenarios:

Benchmark	What It Measures	Typical Submitters
MLPerf Training	Time to train to target accuracy	NVIDIA, Google, Intel, hyperscalers
MLPerf Inference	Inference latency and throughput	AI chip startups, cloud providers
MLPerf HPC	ML on supercomputers	National labs, research institutions
MLPerf Tiny	Performance on microcontrollers	Embedded chip vendors
MLPerf Mobile	Performance on phones/tablets	Qualcomm, MediaTek, Samsung
MLPerf Storage	Data pipeline performance	Storage vendors

Of these, Training and Inference are the ones most readers of this book will encounter.

MLPerf Training: Racing to Accuracy

The Training benchmark measures one thing: how long does it take to train a model from random initialization to a specified target accuracy?

Benchmark        Model             Dataset         Target Accuracy
─────────────────────────────────────────────────────────────────
ResNet           ResNet-50 v1.5    ImageNet        75.9% Top-1
RetinaNet        RetinaNet         COCO            34.0% mAP
BERT             BERT-Large        Wikipedia       0.72 Mask-LM accuracy
DLRM             DLRM              Criteo          0.8025 AUC
3D U-Net         3D U-Net          KiTS19          0.908 Mean Dice
GPT-3            GPT-3 175B        C4              2.69 log perplexity
Stable Diffusion Stable Diffusion  LAION-400M      FID ≤ 90, CLIP ≥ 0.15

Result Format

Typical result:

System: 8x NVIDIA H100 SXM5
Benchmark: BERT-Large Training
Time: 2.3 minutes (to reach 0.72 Mask-LM accuracy)

Comparison:
  DGX H100 (8 GPU):     2.3 min
  DGX A100 (8 GPU):     5.1 min
  Cloud TPU v4 (16):    3.8 min
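When the systems in a comparison use different accelerator counts, raw time-to-train alone can mislead. The sketch below uses the three figures listed above to compute a speedup against a baseline and a rough "accelerator-minutes" normalization. The normalization is an assumption made for illustration: training rarely scales linearly with chip count, so it is a sanity check rather than a per-chip ranking.

```python
# (system, accelerator count, minutes to target) from the comparison above
RESULTS = [
    ("DGX H100", 8, 2.3),
    ("DGX A100", 8, 5.1),
    ("Cloud TPU v4", 16, 3.8),
]


def speedup(baseline_minutes, minutes):
    """How many times faster a run is than the baseline run."""
    return baseline_minutes / minutes


def accelerator_minutes(count, minutes):
    """Total accelerator time consumed; assumes all chips are busy for the whole run."""
    return count * minutes


if __name__ == "__main__":
    baseline = 5.1  # DGX A100 time, used as the reference point
    for name, count, minutes in RESULTS:
        print(
            f"{name:<14} {minutes:4.1f} min  "
            f"speedup vs A100: {speedup(baseline, minutes):4.2f}x  "
            f"accelerator-minutes: {accelerator_minutes(count, minutes):5.1f}"
        )
```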

Closed vs Open Division

Closed Division:
  - Must use specified model architecture
  - Must achieve specified accuracy
  - Can adjust only a limited set of hyperparameters (batch size, learning rate, etc.)
  - Purpose: Fair hardware comparison

Open Division:
  - Can modify model architecture
  - Can use different optimization techniques
  - Purpose: Showcase innovative methods

MLPerf Inference

MLPerf Inference measures how a trained model performs under realistic deployment scenarios, each with its own load pattern and primary metric.

Scenario Definitions

Scenario      Description                 Primary Metric
─────────────────────────────────────────────────────────
Server        Concurrent requests,        QPS (within latency SLO)
              latency constraints
Offline       Batch processing,           Throughput (samples/sec)
              no latency constraints
SingleStream  One request at a time       Latency (ms)
MultiStream   Multiple independent        Number of streams (early rounds);
              streams                     per-query latency (later rounds)

Server Scenario Details

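The Server scenario simulates an online service. Queries arrive at random intervals (MLPerf's load generator draws them from a Poisson process), wait in a queue, and must complete within a latency bound: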

Request arrives → Queue → Process → Response
                    ↑
              Latency constraint

SLO (Service Level Objective):
  - Example: 99% of requests must complete within 15ms
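To see why a latency bound limits usable QPS well before the hardware is saturated, it helps to simulate the queue. The sketch below is a toy model and not MLPerf's load generator: it assumes a single worker, a fixed service time per request, and Poisson arrivals. The function names, the 5 ms service time, and the QPS values are all assumptions chosen for illustration.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def simulate_server(qps, service_ms, n_requests, seed=0):
    """Latency (ms) of each request in a single-worker FIFO queue with Poisson arrivals."""
    rng = random.Random(seed)
    arrival_ms = 0.0  # arrival time of the current request
    free_at_ms = 0.0  # time at which the worker has cleared its backlog
    latencies = []
    for _ in range(n_requests):
        arrival_ms += rng.expovariate(qps) * 1000.0  # exponential gap, mean 1/qps s
        start_ms = max(arrival_ms, free_at_ms)       # wait if the worker is busy
        free_at_ms = start_ms + service_ms
        latencies.append(free_at_ms - arrival_ms)    # queueing delay + service time
    return latencies


def meets_slo(latencies, slo_ms=15.0, pct=99):
    """True if the pct-th percentile latency is within the SLO."""
    return percentile(latencies, pct) <= slo_ms


if __name__ == "__main__":
    # A 5 ms service time caps throughput at 200 QPS; the SLO bites earlier.
    for qps in (50, 100, 150, 190):
        lat = simulate_server(qps=qps, service_ms=5.0, n_requests=20_000)
        p99 = percentile(lat, 99)
        print(f"{qps:>4} QPS  p99={p99:8.2f} ms  meets 15 ms SLO: {meets_slo(lat)}")
```

The reported Server metric is the highest QPS at which the percentile check still passes, which is why it is usually lower than the Offline throughput for the same system.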

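DeepBench

DeepBench, originally released by Baidu Research, benchmarks the core operations underlying deep learning rather than complete models, which makes it a tool for low-level performance analysis.

Measurement Items

1. GEMM (General Matrix Multiplication)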
   - Matrix sizes: from small (256×256) to large (4096×4096)
   - Precision: FP32, FP16, INT8
   - Simulates: Fully connected layers, Attention

2. Convolution
   - Various kernel sizes (1×1, 3×3, 5×5)
   - Stride, padding variations
   - Simulates: CNN convolution layers

3. RNN (Recurrent Neural Networks)
   - LSTM, GRU
   - Different hidden sizes and sequence lengths
   - Simulates: Sequence models

4. All-Reduce
   - Measures distributed training communication performance
   - Different data sizes and GPU counts

Typical Results

NVIDIA A100 DeepBench Results:

GEMM (4096×4096, FP16):
  Peak:      312 TFLOPS
  Achieved:  285 TFLOPS (91.3%)

Conv2D (3×3, 256 channels):
  Peak:      312 TFLOPS
  Achieved:  198 TFLOPS (63.5%)

Reason: convolution needs more memory traffic per FLOP than a large GEMM, so it cannot reach pure GEMM efficiency.
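The efficiency percentages above are simply achieved throughput divided by peak, and achieved throughput itself comes from an operation count divided by measured time. The sketch below works through that arithmetic for the figures quoted above. The helper names are hypothetical, and the kernel time in the example is back-calculated from the quoted 285 TFLOPS, not a measurement.

```python
def gemm_flops(m, n, k):
    """FLOPs for C[m,n] = A[m,k] @ B[k,n]: one multiply and one add per inner term."""
    return 2 * m * n * k


def achieved_tflops(flops, seconds):
    """Convert an operation count and a measured time into TFLOPS."""
    return flops / seconds / 1e12


def efficiency_pct(achieved, peak):
    """Achieved throughput as a percentage of the datasheet peak."""
    return 100.0 * achieved / peak


if __name__ == "__main__":
    flops = gemm_flops(4096, 4096, 4096)
    # Time a 4096x4096 GEMM would take at the quoted 285 TFLOPS (derived, not measured).
    seconds = flops / 285e12
    print(f"FLOPs per GEMM: {flops:,}")
    print(f"Implied kernel time: {seconds * 1e3:.3f} ms")
    print(f"GEMM efficiency:   {efficiency_pct(285, 312):.1f}%")
    print(f"Conv2D efficiency: {efficiency_pct(198, 312):.1f}%")
```

In a real run the time would come from a profiler or timed kernel launches, and the same two divisions turn it into the "Achieved" and percentage columns.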

Other AI Benchmarks

AI Benchmark (ETH Zürich)

AI performance testing designed specifically for mobile devices:

Features:
- Targets phone NPUs and GPUs
- Covers multiple AI tasks
- Has Android App for direct testing

Test items:
1. Image Classification (MobileNet, EfficientNet)
2. Object Detection (YOLO, SSD)
3. Image Segmentation
4. Face Recognition
5. Super Resolution
6. Language Models

Result format:
  Total score + individual scores
  Comparable with other devices

DAWNBench

Developed by Stanford, focuses on "cost to train to target accuracy":

Core metrics:
  Time-to-Accuracy
  Cost-to-Accuracy ($)

Example:
  "How much time/money to reach 93% Top-5 accuracy on ImageNet?"

This metric is more practical:
  - Not just speed, but also cost
  - Considers cloud computing pricing
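Cost-to-accuracy is time-to-accuracy multiplied by what the hardware costs per hour. The sketch below shows the calculation with two made-up configurations; the instance counts, durations, and the $3.00 hourly price are hypothetical placeholders, not DAWNBench results or real cloud prices.

```python
def cost_to_accuracy(hours, instances, price_per_instance_hour):
    """Dollars spent to reach the target accuracy."""
    return hours * instances * price_per_instance_hour


# Hypothetical configurations that both reach the same target accuracy.
CONFIGS = {
    "16 instances": {"hours": 2.0, "instances": 16, "price_per_instance_hour": 3.0},
    "4 instances": {"hours": 5.0, "instances": 4, "price_per_instance_hour": 3.0},
}


def cheapest(configs):
    """Name of the configuration with the lowest cost-to-accuracy."""
    return min(configs, key=lambda name: cost_to_accuracy(**configs[name]))


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(f"{name:<13} {cfg['hours']:.1f} h  ${cost_to_accuracy(**cfg):.2f}")
    print("cheapest:", cheapest(CONFIGS))
```

In this made-up case the larger cluster finishes first but costs more, which is exactly the distinction a time-only metric hides.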

Geekbench ML

Cross-platform consumer AI benchmark (renamed Geekbench AI in 2024):

Pros:
- Easy to run (download app)
- Cross-platform (Windows, macOS, iOS, Android)
- Result database for easy comparison

Cons:
- Not transparent (doesn't disclose all details)
- Vendors may optimize specifically for it
- Not suitable for serious performance analysis

Running MLPerf: Practical Guide

Environment Setup

# 1. Get MLPerf code
git clone https://github.com/mlcommons/inference.git
cd inference

# 2. Choose benchmark (ResNet-50 as example)
cd vision/classification_and_detection

# 3. Prepare dataset
# ImageNet validation set (50,000 images)
# Need to download from official source

# 4. Install dependencies
pip install -r requirements.txt

Running Inference Benchmark

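# Treat these commands as a template: the script location and flag names
# differ between MLPerf Inference releases, so check the README in this directory.
# SingleStream scenario (measure latency)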
python3 main.py --backend onnxruntime \
    --model resnet50 \
    --scenario SingleStream \
    --accuracy

# Server scenario (measure QPS)
python3 main.py --backend onnxruntime \
    --model resnet50 \
    --scenario Server \
    --qps 100

Interpreting Results

MLPerf Inference Result Example:

TestScenario.SingleStream:
  qps: 156.25
  latency (ns): 6400000 (6.4 ms)

result summary:
  samples processed: 50000
  accuracy: 76.15%
  target accuracy: 76.46%

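Result: 6.4 ms latency, 76.15% accuracy. That is below the 76.46% FP32 reference, but MLPerf Inference typically requires reaching a fixed fraction of the reference (99% for ResNet-50, about 75.70%), so confirm which threshold applies before deciding that the run needs adjustment.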
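Two quick checks help when reading a summary like this one: whether the QPS and latency agree with each other, and whether the accuracy clears the threshold that applies. The sketch below uses the numbers from the example. The function names are hypothetical, and the required fraction is left as a parameter because it depends on the rules of the benchmark being run; 0.99 is shown as an assumption.

```python
def single_stream_qps(latency_ns):
    """With one request in flight at a time, throughput is the inverse of latency."""
    return 1e9 / latency_ns


def meets_accuracy(measured_pct, reference_pct, required_fraction=1.0):
    """True if measured accuracy reaches required_fraction of the reference."""
    return measured_pct >= required_fraction * reference_pct


if __name__ == "__main__":
    print("QPS implied by 6.4 ms latency:", single_stream_qps(6_400_000))
    print("Meets full reference (76.46%):", meets_accuracy(76.15, 76.46))
    print("Meets 99% of reference:       ", meets_accuracy(76.15, 76.46, 0.99))
```

Under a strict comparison against the reference the run falls short; under a 99%-of-reference rule it passes. Knowing which rule applies changes the conclusion.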

Building Your Own AI Benchmark

For specific applications, you may need custom benchmarks:

Design Principles

1. Define clear metrics
   - Latency (P50, P95, P99)
   - Throughput (samples/sec)
   - Accuracy (specific definition)
   - Resource usage (memory, power)

2. Reproducibility
   - Fix random seeds
   - Record complete environment
   - Use version control

3. Reflect real workload
   - Use actual data distribution
   - Simulate real request patterns
   - Consider batch mixing

Report Format Suggestion

AI Benchmark Report
═══════════════════════════════════════════════════════════

Model: YOLOv8-Medium
Hardware: NVIDIA RTX 4090
Precision: FP16
Batch Size: 1

Latency Results (1000 iterations):
  Mean:   4.2 ms
  P50:    4.1 ms
  P95:    5.8 ms
  P99:    7.2 ms
  Std:    0.9 ms

Throughput (60 seconds):
  238 images/second

Accuracy (on validation set):
  mAP@0.5: 45.2%
  mAP@0.5:0.95: 33.1%

GPU Utilization: 85%
GPU Memory: 4.2 GB / 24 GB
Power: 280W average

Environment:
  CUDA: 12.2
  cuDNN: 8.9
  TensorRT: 8.6
  Driver: 535.86
  OS: Ubuntu 22.04
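The Environment block of a report is easy to forget and hard to reconstruct later, so it is worth capturing automatically. The sketch below records what the Python standard library can see; the default package list is only an example, and driver, CUDA, and cuDNN versions are outside its reach and would need vendor tools such as nvidia-smi.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages=("torch", "onnxruntime", "tensorrt")):
    """Collect interpreter, OS, and installed package versions for a benchmark report."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Store this JSON next to the results so every number can be traced to a setup.
    print(json.dumps(capture_environment(), indent=2))
```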

What Linda Learned

Linda eventually made her chip decision. Neither Vendor A nor Vendor B.

She built her own benchmark using her actual production model, with her actual batch sizes, at her actual precision requirements. She tested latency at P50, P95, and P99. She measured power consumption under load. She calculated cost per inference.

Vendor C, which she'd initially dismissed because of lower published TOPS numbers, turned out to be the best fit for her specific workload—40% lower cost per inference than either A or B.

"The published specs weren't wrong," she explained to her team. "They just weren't measuring what mattered to us. TOPS at INT4 with batch size 256 is a valid metric—it's just not our metric."

The lesson: standardized benchmarks like MLPerf provide valuable apples-to-apples comparisons, but the most important benchmark is always the one that reflects your actual production workload.

Summary

AI/ML benchmarking has its own challenges and methods:

Major Benchmarks

MLPerf: Industry standard, complete models, strict rules
DeepBench: Core operations, low-level performance analysis
AI Benchmark: Mobile devices, consumer-oriented
DAWNBench: Cost-oriented, time/money

Key Considerations

Performance and accuracy trade off against each other
Hardware diversity makes comparison difficult
Need clear measurement conditions

Choosing a Benchmark

Hardware procurement evaluation: MLPerf
Low-level optimization analysis: DeepBench
Consumer comparison: Geekbench ML / AI Benchmark
Cost-sensitive scenarios: DAWNBench

Custom Benchmarks

Define clear metrics
Ensure reproducibility
Reflect real workload
Report complete environment

Performance and Benchmarking