Chapter 23: Evolution of Performance Metrics
Part VII: AI/HPC
"The metrics you optimize for determine the systems you build." — Anonymous
The Presentation That Fell Flat
Marcus had spent three weeks benchmarking the company's new AI accelerator. His presentation to the executive team was packed with data: cache hit rates, branch prediction accuracy, instructions per cycle, memory bandwidth utilization. He was proud of the thoroughness.
The VP of Engineering interrupted five minutes in. "Marcus, this is great detail, but I need one number. How does this compare to the A100 we're currently using?"
Marcus pulled up his IPC comparison chart. "As you can see, our chip achieves 4.2 IPC compared to—"
"IPC?" The VP frowned. "Nobody talks about IPC for AI workloads. What's our TFLOPS? What's the tokens per second for Llama inference?"
Marcus stared at his slides. He'd spent weeks measuring the wrong things.
That evening, Marcus called his former colleague Sarah, now at a leading AI chip startup. "I feel like an idiot," he admitted. "I've been doing CPU performance analysis for fifteen years. When did everything change?"
Sarah laughed sympathetically. "It's not just you. The entire industry went through a metrics revolution. The numbers that mattered for compiling code and running databases are almost irrelevant for training transformers. Let me walk you through what happened."
From IPC to TOPS
Twenty years ago, the core metric for evaluating CPU performance was IPC (Instructions Per Cycle). Engineers used it to compare efficiency across different microarchitectures. A higher IPC meant the processor executed more instructions per clock cycle and, at a given clock frequency, more instructions in the same amount of time: a clear indicator of "better."
But Marcus discovered what many engineers learn the hard way: metrics that worked for one era can be meaningless in another.
Today, if you ask an AI engineer "what's the IPC of this GPU," they'd look at you the way a race car driver would look at someone asking about their vehicle's cup holder capacity. It's not wrong, exactly—it's just irrelevant.
Modern AI/HPC uses completely different metrics. Where CPUs were measured in instructions, AI accelerators are measured in operations:
| Era | Primary Metric | What It Measures |
|---|---|---|
| 1990s-2000s | IPC (Instructions Per Cycle) | CPU efficiency for general-purpose code |
| 2010s | GFLOPS (Billion FP ops/sec) | GPU compute for graphics and early ML |
| 2020s | TFLOPS/TOPS | AI accelerator throughput at various precisions |
TFLOPS (Tera Floating-point Operations Per Second) has replaced IPC as the primary metric. Furthermore, TOPS (Tera Operations Per Second) describes performance for low-precision operations (INT8, INT4) that dominate AI inference.
Why Do Metrics Evolve?
This change reflects several fundamental shifts in how we compute:
1. Vectorization of Compute Units
Think about what happens when you execute an ADD instruction on a traditional CPU: you add two numbers and get one result. One instruction, one operation.
Now consider a Tensor Core on an NVIDIA H100. A single matrix-multiply-accumulate (MMA) instruction performs hundreds of FP16 multiply-add operations at once; we use 256 as an illustrative figure here, since the exact count depends on the instruction shape and GPU generation. If you measured this in "instructions per cycle," you'd get a small number, maybe 1 or 2. But in terms of actual useful work for AI, each instruction does 256 times as much as a scalar CPU instruction.
Traditional CPU:
ADD r1, r2, r3 → 1 addition
Tensor Core (NVIDIA H100):
MMA.F16 → 256 FP16 multiply-add operations
Comparing IPC between these two architectures would be like comparing delivery trucks by "trips per hour" when one truck carries 1 box per trip and another carries 256. The metric doesn't capture what matters.
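The truck analogy can be put into numbers. The sketch below multiplies instruction rate by operations per instruction; both "units" and their clock rates are hypothetical values chosen only to mirror the 1-versus-256 comparison above, not measurements of any real chip.

```python
def ops_per_second(instr_per_cycle: float, ops_per_instr: int, clock_hz: float) -> float:
    """Useful operations per second = IPC x operations per instruction x clock rate."""
    return instr_per_cycle * ops_per_instr * clock_hz


# Hypothetical units (assumed numbers, not real hardware specifications).
scalar_unit = ops_per_second(instr_per_cycle=4.0, ops_per_instr=1, clock_hz=3.0e9)
matrix_unit = ops_per_second(instr_per_cycle=1.0, ops_per_instr=256, clock_hz=1.5e9)

print(f"scalar unit: IPC 4.0 -> {scalar_unit / 1e9:.0f} Gops/s")
print(f"matrix unit: IPC 1.0 -> {matrix_unit / 1e9:.0f} Gops/s")
print(f"matrix / scalar throughput: {matrix_unit / scalar_unit:.0f}x")
```

The unit with a quarter of the IPC and half the clock rate still comes out far ahead, which is why IPC alone says little about AI throughput.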
2. The Memory Wall Changed the Game
When Sarah explained this to Marcus, she drew a simple graph on a napkin. "Compute capability has been growing at roughly 2x every two years," she said. "Memory bandwidth? Maybe 1.3x. After two decades, compute is about 1000x faster while memory bandwidth is only about 14x faster."
This means that for many workloads, the processor spends most of its time waiting for data. Measuring "instructions per second" becomes meaningless when the bottleneck is "bytes per second." The Roofline model, which we'll explore shortly, captures this reality.
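Sarah's napkin arithmetic can be checked by compounding the two growth rates she quotes. The 20-year horizon and the helper name `compound_growth` are assumptions made for this sketch; the rates themselves are her rough figures, not measured data.

```python
def compound_growth(factor_per_period: float, years: float, period_years: float = 2.0) -> float:
    """Total growth after `years` when capability multiplies by `factor_per_period` every period."""
    return factor_per_period ** (years / period_years)


YEARS = 20  # assumed horizon: ten two-year periods

compute_gain = compound_growth(2.0, YEARS)    # 2x every two years
bandwidth_gain = compound_growth(1.3, YEARS)  # 1.3x every two years

print(f"compute:   {compute_gain:,.0f}x")
print(f"bandwidth: {bandwidth_gain:,.1f}x")
print(f"gap:       {compute_gain / bandwidth_gain:,.0f}x more compute per byte of bandwidth")
```

A modest difference in growth rate compounds into a gap of well over an order of magnitude, which is the memory wall in one number.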
3. Workloads Became More Homogeneous
Traditional CPU workloads are diverse: branching code, pointer chasing, string manipulation, system calls. Every program is different, so a general metric like IPC made sense.
AI workloads are remarkably similar at their core: they're dominated by matrix multiplication (GEMM). Whether you're training a vision model, a language model, or a recommendation system, 80-95% of the compute is matrix multiply. For such homogeneous workloads, measuring FLOPS directly is far more meaningful than counting abstract "instructions."
The New Metrics Landscape
Here's how the transition from traditional to modern metrics looks across different dimensions:
| Traditional Metric | Modern Metric | Why It Changed |
|---|---|---|
| CPI / IPC | FLOPS / TOPS | Single instruction now does hundreds of ops |
| Memory Bandwidth | Roofline Model | Compute/bandwidth ratio determines bottleneck |
| Amdahl's Law | Comm-to-Compute Ratio | Network, not serial code, limits scaling |
| Latency vs. Throughput | TTFT vs. TPS | LLM streaming creates new user experience |
| Power (W) | Energy Efficiency (GFLOPS/W) | Electricity is now a major cost center |
Let's explore each evolution in depth, starting with the most fundamental: the shift from counting instructions to counting operations.
CPI/IPC → FLOPS/TOPS
Traditional: CPI and IPC
CPI (Cycles Per Instruction) and IPC (Instructions Per Cycle) are reciprocal metrics:
CPI = Total_Cycles / Total_Instructions
IPC = Total_Instructions / Total_Cycles = 1 / CPI
These metrics assume:
- Each instruction has roughly equal "value"
- Execution time is proportional to instruction count (at a fixed CPI and clock rate)
- Microarchitecture improvements are reflected in IPC gains
Modern: FLOPS and TOPS
FLOPS (Floating-point Operations Per Second) directly measures compute capability:
GFLOPS = Floating_Point_Operations / Time / 10^9
TFLOPS = GFLOPS / 1000
TOPS (Tera Operations Per Second) is used for integer operations, especially low-precision AI inference:
TOPS = Operations / Time / 10^12
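As a worked instance of these formulas, the sketch below counts the floating-point operations in a dense matrix multiply using the usual 2·M·N·K convention (one multiply and one add per inner-product term) and converts a run time into TFLOPS. The matrix size and the 2 ms timing are made-up inputs for illustration.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    return 2 * m * n * k  # one multiply + one add per inner-product term


def achieved_tflops(flops: float, seconds: float) -> float:
    """TFLOPS = Floating_Point_Operations / Time / 10^12."""
    return flops / seconds / 1e12


flops = gemm_flops(8192, 8192, 8192)
tflops = achieved_tflops(flops, seconds=0.002)  # hypothetical measured kernel time

print(f"{flops:,} FLOPs in 2 ms -> {tflops:.1f} TFLOPS")
```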
TOPS at Different Precisions
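Modern AI accelerators support multiple precisions, each with a different peak throughput:
NVIDIA H100 (SXM) Theoretical Peak:
┌────────────────────┬─────────────┐
│ Precision          │ Peak        │
├────────────────────┼─────────────┤
│ FP64 (Tensor Core) │ 67 TFLOPS   │
│ FP32               │ 67 TFLOPS   │
│ TF32 (Tensor Core) │ 989 TFLOPS  │
│ FP16 (Tensor Core) │ 1979 TFLOPS │
│ FP8 (Tensor Core)  │ 3958 TFLOPS │
│ INT8 (Tensor Core) │ 3958 TOPS   │
└────────────────────┴─────────────┘
Note: These are theoretical peaks. The TF32, FP16, FP8, and INT8 figures are the datasheet values with 2:4 structured sparsity; the dense peaks are half of those. Actual performance is typically 50-80% of peak, depending on workload characteristics.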
Peak vs Sustained
When reporting FLOPS, you must distinguish:
- Peak FLOPS: Theoretical maximum, assuming 100% utilization
- Sustained FLOPS: Performance maintainable under actual workloads
Typical ratios:
Matrix multiply (GEMM): 70-95% of peak
Convolution (Conv): 50-80% of peak
Attention mechanism: 30-70% of peak
Memory-intensive ops: 10-30% of peak
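One way to use these ratios is as a sanity check on a measurement. The sketch below encodes the typical ranges listed above and reports whether a measured sustained fraction falls inside them; the function names, the category keys, and the 1000 TFLOPS peak are hypothetical choices for illustration.

```python
# Typical sustained/peak ranges, taken from the list above.
TYPICAL_RANGES = {
    "gemm": (0.70, 0.95),
    "conv": (0.50, 0.80),
    "attention": (0.30, 0.70),
    "memory_intensive": (0.10, 0.30),
}


def sustained_fraction(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of theoretical peak actually sustained."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


def assess(kind: str, achieved_tflops: float, peak_tflops: float) -> str:
    """Compare a measurement against the typical range for its workload kind."""
    low, high = TYPICAL_RANGES[kind]
    fraction = sustained_fraction(achieved_tflops, peak_tflops)
    if fraction < low:
        return "below typical"
    if fraction > high:
        return "above typical"
    return "typical"


PEAK_TFLOPS = 1000.0  # hypothetical accelerator peak
print("gemm at 800 TFLOPS:     ", assess("gemm", 800.0, PEAK_TFLOPS))
print("attention at 200 TFLOPS:", assess("attention", 200.0, PEAK_TFLOPS))
```

A result flagged "above typical" is usually worth a second look at the FLOP count or the timer before it goes on a slide.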
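Memory Bandwidth → Roofline Model
The Roofline model plots attainable performance against arithmetic intensity: the number of floating-point operations performed per byte moved to or from memory (FLOPs/Byte). Attainable performance is the lower of two limits, min(Peak Compute, Memory Bandwidth × Arithmetic Intensity).
Roofline diagram construction:
Performance (GFLOPS)
  ^
  |                ____________________  ◄─── Compute-Bound Region
  |               /                          (horizontal: at the Peak Compute roof)
  |              /  ◄─── Ridge Point
  |             /
  |            /
  |           /  ◄─── Memory-Bound Region
  |          /        Performance scales linearly with arithmetic intensity
  |         /         (slope = memory bandwidth)
  |        /
  |       /
  |      /
  +──────────────────────────────────────────> Arithmetic Intensity (FLOPs/Byte)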
Interpreting Roofline
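Two "roofs":
- Horizontal line: Peak Compute (compute ceiling)
- Diagonal line: Memory Bandwidth × Arithmetic Intensity (memory ceiling)
The lower of the two roofs at an application's arithmetic intensity is its bottleneck: left of the ridge point the application is memory-bound, right of it compute-bound. The ridge point sits at Peak Compute / Memory Bandwidth, so the ranges below are rough guides that shift with the hardware:
| Arithmetic Intensity (FLOPs/Byte) | Bottleneck | Typical Applications |
|---|---|---|
| < 10 | Memory | Vector add, STREAM |
| 10-50 | Boundary | Sparse matrix, Conv2D |
| 50-200 | Boundary/Compute | Dense matrix multiply |
| > 200 | Compute | Highly optimized GEMM |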
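The two roofs reduce to a one-line formula: attainable performance is the smaller of peak compute and bandwidth times arithmetic intensity. The sketch below applies it to a hypothetical accelerator (100 TFLOPS peak, 2000 GB/s); those hardware numbers and the GEMM intensity of 200 are assumptions for illustration, while the vector-add intensity follows from counting one FLOP per 12 bytes of FP32 traffic.

```python
def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOPs/Byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def bottleneck(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> str:
    """Memory-bound left of the ridge point, compute-bound at or right of it."""
    return "memory" if intensity < ridge_point(peak_gflops, bandwidth_gbs) else "compute"


# Hypothetical accelerator (assumed numbers, not a real product).
PEAK_GFLOPS = 100_000.0
BANDWIDTH_GBS = 2_000.0

kernels = {
    # FP32 c[i] = a[i] + b[i]: 1 FLOP per 12 bytes (two 4-byte loads, one 4-byte store).
    "vector add": 1 / 12,
    # Assumed intensity for a large, well-tiled matrix multiply.
    "dense GEMM": 200.0,
}

print(f"ridge point: {ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS):.0f} FLOPs/Byte")
for name, intensity in kernels.items():
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    kind = bottleneck(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"{name:>10}: {bound:>9,.0f} GFLOPS attainable ({kind}-bound)")
```

On this hypothetical machine the vector add can never exceed a fraction of a percent of peak, no matter how well the kernel is written.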
Amdahl's Law → Communication-to-Computation
When Marcus first learned parallel programming, Amdahl's Law was gospel. "If 10% of your code is sequential," his professor had said, "you can never get more than 10x speedup, no matter how many processors you throw at it."
That mental model worked fine for multi-core CPUs. But when Marcus started working on distributed AI training across hundreds of GPUs, he discovered a new bottleneck that Amdahl never considered: the network.
Traditional: Amdahl's Law
Amdahl's Law describes the theoretical speedup limit of parallelization:
Speedup = 1 / ((1 - P) + P/N)
Where:
P = parallelizable fraction
N = number of processors
If 95% of your code is parallelizable (P = 0.95), then even with infinite processors, your speedup is limited to 1 / 0.05 = 20x. The sequential 5% becomes the ceiling.
This law assumes that parallel work is truly parallel—that processors can work independently without coordination. In a shared-memory multi-core system, this is approximately true.
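The formula is easy to tabulate. The sketch below evaluates it for P = 0.95 at a few processor counts (the counts are arbitrary choices) to show how quickly the curve flattens toward the 20x ceiling.

```python
def amdahl_speedup(p: float, n: float) -> float:
    """Speedup = 1 / ((1 - P) + P / N) for parallel fraction P on N processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


P = 0.95
for n in (1, 8, 64, 1024):
    print(f"N = {n:>4}: {amdahl_speedup(P, n):5.2f}x")

# Ceiling as N grows without bound: 1 / (1 - P).
print(f"limit:    {1.0 / (1.0 - P):5.2f}x")
```

Going from 64 to 1024 processors buys only a few more multiples of speedup, because the sequential 5% already dominates the run time.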
Modern: Communication-to-Computation Ratio
In distributed AI training, a new bottleneck emerges: communication.
Typical data-parallel training:
GPU 0 ─┬── Forward ──┬── Backward ──┬── AllReduce ──┐
GPU 1 ─┤             │              │               │
GPU 2 ─┤             │              │               │
GPU 3 ─┘             │              │               │
                     ▼              ▼               ▼
                  Compute        Compute      Communication
AllReduce operations need to synchronize gradients across all GPUs—this is the main communication bottleneck.
Communication-to-Computation Ratio
C2C Ratio = Communication_Time / Computation_Time
Ideal: C2C << 1 (communication time much less than compute time)
Reality: C2C worsens as GPU count increases
Influencing factors:
- Model size: Larger gradients mean more data to transfer
- Batch size: Larger per-GPU batches increase compute time per step while the gradient volume stays the same, lowering (improving) C2C
- Network bandwidth: InfiniBand vs Ethernet makes a huge difference
- AllReduce algorithm: Ring AllReduce, Hierarchical AllReduce
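A rough estimate of the ratio can be built from the factors above. The sketch below uses the standard bandwidth-only cost of Ring AllReduce, in which each GPU sends and receives 2(N-1)/N times the gradient size; it ignores per-message latency, slower inter-node links, and any overlap of communication with the backward pass. The model size, link speed, and step time are hypothetical.

```python
def ring_allreduce_seconds(grad_bytes: float, n_gpus: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only Ring AllReduce time: each GPU moves 2(N-1)/N of the gradient bytes."""
    if n_gpus < 2:
        return 0.0  # a single GPU has nothing to synchronize
    return 2.0 * (n_gpus - 1) / n_gpus * grad_bytes / link_bytes_per_s


def c2c_ratio(communication_s: float, computation_s: float) -> float:
    """C2C Ratio = Communication_Time / Computation_Time."""
    return communication_s / computation_s


# Hypothetical setup: 7e9 parameters with 2-byte gradients, 50 GB/s per-GPU link.
GRAD_BYTES = 7e9 * 2
LINK_BYTES_PER_S = 50e9
COMPUTE_S = 1.0  # assumed forward + backward time per step

for n in (2, 8, 64):
    comm = ring_allreduce_seconds(GRAD_BYTES, n, LINK_BYTES_PER_S)
    print(f"{n:>2} GPUs: comm {comm:.3f} s, C2C {c2c_ratio(comm, COMPUTE_S):.2f}")

# Doubling the per-GPU batch doubles compute time but not the gradient size.
comm_8 = ring_allreduce_seconds(GRAD_BYTES, 8, LINK_BYTES_PER_S)
print(f"8 GPUs, 2x batch: C2C {c2c_ratio(comm_8, 2 * COMPUTE_S):.2f}")
```

In this simplified model the ratio rises with GPU count and then levels off near 2S/B; the further degradation seen in practice comes from the terms the sketch leaves out.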
Latency vs. Throughput → TTFT vs. TPS
The traditional latency/throughput trade-off still exists in AI, but LLMs introduced something new: the user experience of streaming.
When you chat with an LLM, you don't wait for the entire response to be generated before seeing anything. The words appear one by one, like watching someone type in real-time. This streaming experience created entirely new metrics that capture what users actually care about.
Traditional: Latency and Throughput
For a traditional web service, performance is simple:
- Latency: How long until the response is complete?
- Throughput: How many requests can we handle per second?
You optimize for one, the other, or some balance. A user either sees the result or doesn't.
Modern: LLM's TTFT and TPS
LLMs broke this model because users experience the response progressively. A 10-second response feels fast if words start appearing immediately. The same 10-second response feels slow if there's a 3-second pause before anything appears.
This led to two new metrics that capture different aspects of user experience:
TTFT (Time To First Token): How long until the user sees something?
This is the "perceived responsiveness" metric. Users are more tolerant of slow generation if the response starts quickly. TTFT is dominated by the Prefill phase—processing the entire input prompt before any output can begin.
TPS (Tokens Per Second): How fast do subsequent words appear?
TPS = Tokens generated per second (Decode phase)
Factors affecting TPS:
1. KV Cache size
2. Batch size
3. Memory bandwidth (usually the bottleneck)
Two Phases of LLM Inference
Request processing flow:
[Input Prompt] ──► [Prefill] ──► [Decode × N] ──► [Complete]
                       ▼               ▼
                     TTFT          TPOT × N
Where:
TTFT = Time To First Token
TPOT = Time Per Output Token
N = number of output tokens
Total latency = TTFT + (N × TPOT)
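The latency formula can be evaluated directly. In the sketch below the timing values are made up, and treating decode TPS as 1 / TPOT is an assumption that holds for a single request stream; the second example shows two responses with the same total latency but very different perceived responsiveness.

```python
def total_latency_s(ttft_s: float, tpot_s: float, n_output_tokens: int) -> float:
    """Total latency = TTFT + (N x TPOT)."""
    return ttft_s + n_output_tokens * tpot_s


def decode_tps(tpot_s: float) -> float:
    """Tokens per second seen by one request stream (assumes TPS = 1 / TPOT)."""
    return 1.0 / tpot_s


# Hypothetical request: 0.5 s to first token, 40 ms per output token, 200 tokens.
total = total_latency_s(ttft_s=0.5, tpot_s=0.04, n_output_tokens=200)
print(f"total latency: {total:.1f} s at {decode_tps(0.04):.0f} tokens/s")

# Same total latency, different user experience.
fast_start = total_latency_s(ttft_s=0.2, tpot_s=0.049, n_output_tokens=200)
slow_start = total_latency_s(ttft_s=3.0, tpot_s=0.035, n_output_tokens=200)
print(f"fast start: {fast_start:.1f} s total, first token after 0.2 s")
print(f"slow start: {slow_start:.1f} s total, first token after 3.0 s")
```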
Prefill vs Decode Characteristics
| Phase | Characteristics | Bottleneck |
|---|---|---|
| Prefill | Processes the prompt in parallel; computes many positions at once | Compute-bound; uses Tensor Cores |
| Decode | Autoregressive generation; processes 1 token at a time | Memory-bound; needs to read the KV Cache |
This explains why:
- TTFT increases with prompt length
- TPS depends far less on prompt length than TTFT does, although very long prompts enlarge the KV Cache that each decode step must read
- Batch processing can significantly improve overall throughput
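Because decode is memory-bound, a common back-of-the-envelope estimate bounds single-stream TPS by memory bandwidth divided by the bytes read per token. The sketch below assumes batch size 1, that every weight is read once per generated token, and that compute and kernel overheads are negligible; the model size and bandwidth figures are hypothetical.

```python
def decode_tps_upper_bound(
    n_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
    kv_cache_bytes: float = 0.0,
) -> float:
    """Bandwidth-bound ceiling on decode tokens/s at batch size 1 (a rough estimate)."""
    bytes_per_token = n_params * bytes_per_param + kv_cache_bytes
    return mem_bandwidth_bytes_per_s / bytes_per_token


BANDWIDTH = 2e12  # hypothetical 2 TB/s of memory bandwidth
PARAMS = 70e9     # hypothetical 70-billion-parameter model

fp16 = decode_tps_upper_bound(PARAMS, 2, BANDWIDTH)
int8 = decode_tps_upper_bound(PARAMS, 1, BANDWIDTH)
fp16_long_ctx = decode_tps_upper_bound(PARAMS, 2, BANDWIDTH, kv_cache_bytes=20e9)

print(f"FP16 weights:              <= {fp16:.1f} tokens/s")
print(f"INT8 weights:              <= {int8:.1f} tokens/s")
print(f"FP16 + 20 GB of KV Cache:  <= {fp16_long_ctx:.1f} tokens/s")
```

Batching improves overall throughput because the same weight read is shared across all requests in the batch, so the ceiling applies per step rather than per request.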
Power → Energy Efficiency
Traditional: Power (Watts)
Early system evaluation treated power as a "constraint" rather than a "metric":
Traditional thinking:
"This CPU draws 100W, ensure adequate cooling"
"Server room power capacity is XX kW"
Power was seen as a problem to "handle," not a target to "optimize."
Modern: Energy Efficiency
In two extreme scenarios, energy efficiency becomes a core metric:
1. Hyperscale Data Centers
Training GPT-4 scale models:
- Thousands of GPUs
- Megawatts of power consumption
- Electricity becomes major cost
Efficiency metrics: GFLOPS/W, TOPS/W
2. Edge Devices
AI inference on phones/IoT:
- Limited battery capacity
- Thermal Design Power (TDP) limits
- User experience affected by heat
Efficiency metrics: Inferences/mAh, TOPS/W
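Both scenarios come down to simple ratios. The sketch below computes GFLOPS/W, the electricity bill for a training run, and inferences per mAh for a battery-powered device (1 mAh at V volts stores 3.6 × V joules). Every input value, including the power draw, electricity price, and energy per inference, is a hypothetical figure chosen for illustration.

```python
def gflops_per_watt(sustained_gflops: float, watts: float) -> float:
    """Energy efficiency: sustained throughput per watt of power draw."""
    return sustained_gflops / watts


def energy_kwh(watts: float, hours: float) -> float:
    """Energy consumed, in kilowatt-hours."""
    return watts * hours / 1000.0


def inferences_per_mah(joules_per_inference: float, battery_volts: float) -> float:
    """1 mAh = 3.6 coulombs, so 1 mAh at V volts holds 3.6 * V joules."""
    return 3.6 * battery_volts / joules_per_inference


# Data center: 1000 GPUs at an assumed 700 W each, running for 30 days.
cluster_watts = 1000 * 700.0
run_kwh = energy_kwh(cluster_watts, hours=24 * 30)
cost = run_kwh * 0.10  # assumed electricity price per kWh

print(f"efficiency:  {gflops_per_watt(400_000.0, 700.0):.0f} GFLOPS/W per GPU")
print(f"energy:      {run_kwh:,.0f} kWh")
print(f"electricity: ${cost:,.0f}")

# Edge device: assumed 0.05 J per inference on a 3.8 V battery.
print(f"edge:        {inferences_per_mah(0.05, 3.8):.1f} inferences/mAh")
```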
Epilogue: Marcus's Second Presentation
Three weeks later, Marcus gave his second presentation to the executive team. This time, his slides told a different story:
"Our chip achieves 847 TFLOPS at FP16, putting it between the A100 and H100. For Llama-70B inference at batch size 1, we measure 23 tokens per second—competitive with the A100."
He showed a Roofline diagram. "We're currently memory-bound for decode-heavy workloads, achieving 78% of theoretical bandwidth. For prefill-heavy workloads, we hit 65% of peak compute."
The VP nodded. "Now I understand what we're buying. Good work."
After the meeting, Sarah texted him: "Heard it went well. What changed?"
Marcus replied: "I stopped measuring what's easy and started measuring what matters."
That's the first lesson of AI/HPC performance analysis: know which metrics matter for your workload, because the right metric is worth more than a thousand benchmarks.
Summary
Performance metric evolution reflects fundamental changes in computing workloads:
From Single to Multi-dimensional
- Traditional: Single IPC or MHz could explain performance
- Modern: Need combination of metrics (FLOPS, bandwidth, efficiency, latency)
From Absolute to Relative
- Traditional: Pursue highest absolute performance
- Modern: Pursue best "ratios" (Roofline, efficiency)
From Hardware to Application
- Traditional: Hardware specs determine performance
- Modern: Application characteristics (arithmetic intensity, C2C, TTFT) determine evaluation approach
Key Metric Evolution
- IPC → FLOPS/TOPS: Vectorized computation
- Bandwidth → Roofline: Compute/bandwidth ratio
- Amdahl → C2C: Communication becomes new bottleneck
- Latency/Throughput → TTFT/TPS: LLM-specific streaming metrics
- Power → Efficiency: Performance per watt
- Code Size → Quantization: Reduce data movement