Chapter 23: Evolution of Performance Metrics

Part VII: AI/HPC

"The metrics you optimize for determine the systems you build." — Anonymous

The Presentation That Fell Flat

Marcus had spent three weeks benchmarking the company's new AI accelerator. His presentation to the executive team was packed with data: cache hit rates, branch prediction accuracy, instructions per cycle, memory bandwidth utilization. He was proud of the thoroughness.

The VP of Engineering interrupted five minutes in. "Marcus, this is great detail, but I need one number. How does this compare to the A100 we're currently using?"

Marcus pulled up his IPC comparison chart. "As you can see, our chip achieves 4.2 IPC compared to—"

"IPC?" The VP frowned. "Nobody talks about IPC for AI workloads. What's our TFLOPS? What's the tokens per second for Llama inference?"

Marcus stared at his slides. He'd spent weeks measuring the wrong things.

That evening, Marcus called his former colleague Sarah, now at a leading AI chip startup. "I feel like an idiot," he admitted. "I've been doing CPU performance analysis for fifteen years. When did everything change?"

Sarah laughed sympathetically. "It's not just you. The entire industry went through a metrics revolution. The numbers that mattered for compiling code and running databases are almost irrelevant for training transformers. Let me walk you through what happened."

From IPC to TOPS

Twenty years ago, the core metric for evaluating CPU performance was IPC (Instructions Per Cycle). Engineers used it to compare efficiency across different microarchitectures. A higher IPC meant the processor could execute more instructions in the same amount of time—a clear indicator of "better."

But Marcus discovered what many engineers learn the hard way: metrics that worked for one era can be meaningless in another.

Today, if you ask an AI engineer "what's the IPC of this GPU," they'd look at you the way a race car driver would look at someone asking about their vehicle's cup holder capacity. It's not wrong, exactly—it's just irrelevant.

Modern AI/HPC uses completely different metrics. Where CPUs were measured in instructions, AI accelerators are measured in operations:

Era	Primary Metric	What It Measures
1990s-2000s	IPC (Instructions Per Cycle)	CPU efficiency for general-purpose code
2010s	GFLOPS (Billion FP ops/sec)	GPU compute for graphics and early ML
2020s	TFLOPS/TOPS	AI accelerator throughput at various precisions

For AI accelerators, TFLOPS (Tera Floating-point Operations Per Second) has replaced IPC as the headline metric. Its counterpart, TOPS (Tera Operations Per Second), describes throughput for the low-precision integer operations (INT8, INT4) that dominate AI inference.

Why Do Metrics Evolve?

This change reflects several fundamental shifts in how we compute:

1. Vectorization of Compute Units

Think about what happens when you execute an ADD instruction on a traditional CPU: you add two numbers and get one result. One instruction, one operation.

Now consider a Tensor Core on an NVIDIA H100. A single matrix-multiply-accumulate (MMA) instruction performs 256 FP16 multiply-add operations simultaneously. If you measured this in "instructions per cycle," you'd get a small number—maybe 1 or 2. But in terms of actual useful work for AI, it's doing 256 times more than a CPU instruction.

Traditional CPU:
  ADD r1, r2, r3    →  1 addition

Tensor Core (NVIDIA H100):
  MMA.F16           →  256 FP16 multiply-add operations

Comparing IPC between these two architectures would be like comparing a delivery truck's "packages per trip" when one truck carries 1 box and another carries 256. The metric doesn't capture what matters.

2. The Memory Wall Changed the Game

When Sarah explained this to Marcus, she drew a simple graph on a napkin. "Compute capability has been growing at roughly 2x every two years," she said. "Memory bandwidth? Maybe 1.3x over the same period. After two decades, compute is roughly 1000x faster while memory is only about 10x faster."

This means that for many workloads, the processor spends most of its time waiting for data. Measuring "instructions per second" becomes meaningless when the bottleneck is "bytes per second." The Roofline model, which we'll explore shortly, captures this reality.
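Sarah's napkin arithmetic can be checked with compound growth. The sketch below takes her rough rates at face value (2x per two years for compute, 1.3x per two years for memory bandwidth). The 20-year horizon and the function name are illustrative assumptions, not measured industry data.

```python
def cumulative_growth(factor_per_period: float, years: float,
                      period_years: float = 2.0) -> float:
    """Total growth after `years`, given a growth factor per period."""
    return factor_per_period ** (years / period_years)


years = 20  # assumed horizon for the illustration
compute = cumulative_growth(2.0, years)  # 2x every two years
memory = cumulative_growth(1.3, years)   # 1.3x every two years

print(f"compute:          {compute:.0f}x")
print(f"memory bandwidth: {memory:.1f}x")
print(f"gap:              {compute / memory:.0f}x")
```

Under these assumptions, compute grows about 1000x while memory bandwidth grows roughly 14x, the same order of magnitude as the napkin figures.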

3. Workloads Became More Homogeneous

Traditional CPU workloads are diverse: branching code, pointer chasing, string manipulation, system calls. Every program is different, so a general metric like IPC made sense.

AI workloads are remarkably similar at their core: they're dominated by matrix multiplication (GEMM). Whether you're training a vision model, a language model, or a recommendation system, 80-95% of the compute is matrix multiply. For such homogeneous workloads, measuring FLOPS directly is far more meaningful than counting abstract "instructions."

The New Metrics Landscape

Here's how the transition from traditional to modern metrics looks across different dimensions:

Traditional Metric	Modern Metric	Why It Changed
CPI / IPC	FLOPS / TOPS	Single instruction now does hundreds of ops
Memory Bandwidth	Roofline Model	Compute/bandwidth ratio determines bottleneck
Amdahl's Law	Comm-to-Compute Ratio	Network, not serial code, limits scaling
Latency vs. Throughput	TTFT vs. TPS	LLM streaming creates new user experience
Power (W)	Energy Efficiency (GFLOPS/W)	Electricity is now a major cost center

Let's explore each evolution in depth, starting with the most fundamental: the shift from counting instructions to counting operations.

CPI/IPC → FLOPS/TOPS

Traditional: CPI and IPC

CPI (Cycles Per Instruction) and IPC (Instructions Per Cycle) are reciprocal metrics:

CPI = Total_Cycles / Total_Instructions
IPC = Total_Instructions / Total_Cycles = 1 / CPI

These metrics assume:

Each instruction has roughly equal "value"
Program performance is proportional to instruction count
Microarchitecture improvements are reflected in IPC gains
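The definitions above, and the way the "equal value" assumption breaks down, can be sketched in a few lines. The counter readings are hypothetical, and the 256 operations per instruction is the illustrative Tensor Core figure used earlier in this chapter, not a measured value.

```python
def cpi(total_cycles: int, total_instructions: int) -> float:
    """Cycles per instruction."""
    return total_cycles / total_instructions


def ipc(total_cycles: int, total_instructions: int) -> float:
    """Instructions per cycle (the reciprocal of CPI)."""
    return total_instructions / total_cycles


def ops_per_cycle(ipc_value: float, ops_per_instruction: int) -> float:
    """Useful arithmetic operations per cycle."""
    return ipc_value * ops_per_instruction


# Hypothetical scalar CPU core: 4 instructions per cycle, 1 operation each.
cpu_ipc = ipc(total_cycles=1_000_000, total_instructions=4_000_000)
# Hypothetical matrix unit: 1 instruction per cycle, 256 operations each.
mma_ipc = ipc(total_cycles=1_000_000, total_instructions=1_000_000)

print(f"CPU:    IPC={cpu_ipc:.1f}, ops/cycle={ops_per_cycle(cpu_ipc, 1):.0f}")
print(f"Matrix: IPC={mma_ipc:.1f}, ops/cycle={ops_per_cycle(mma_ipc, 256):.0f}")
```

In this sketch the CPU wins on IPC (4.0 against 1.0) while the matrix unit does 64 times more arithmetic per cycle, which is why IPC alone misleads.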

Modern: FLOPS and TOPS

FLOPS (Floating-point Operations Per Second) directly measures compute capability:

GFLOPS = Floating_Point_Operations / Time / 10^9
TFLOPS = GFLOPS / 1000

TOPS (Tera Operations Per Second) is used for integer operations, especially low-precision AI inference:

TOPS = Operations / Time / 10^12
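As a worked example of the formulas above, the sketch below counts the floating-point operations in a dense matrix multiply and converts a measured time into TFLOPS. It uses the usual convention that one multiply-add counts as two operations, so an M×K by K×N GEMM costs 2·M·N·K FLOPs. The matrix size and elapsed time are hypothetical.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C[m,n] = A[m,k] @ B[k,n]: k multiplies and k adds per output."""
    return 2 * m * n * k


def achieved_tflops(flops: float, seconds: float) -> float:
    """Convert an operation count and elapsed time into TFLOPS."""
    return flops / seconds / 1e12


flops = gemm_flops(8192, 8192, 8192)
elapsed_s = 0.002  # hypothetical measured kernel time

print(f"FLOPs:  {flops:.3e}")
print(f"TFLOPS: {achieved_tflops(flops, elapsed_s):.1f}")
```

The same function gives TOPS when the count is of integer operations rather than floating-point ones.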

TOPS at Different Precisions

Modern AI accelerators support multiple precisions, each with a different peak throughput:

NVIDIA H100 (SXM) Theoretical Peak:
┌─────────────────────┬───────────────┐
│ Precision           │ Peak          │
├─────────────────────┼───────────────┤
│ FP64 (Tensor Core)  │   67 TFLOPS   │
│ FP32                │   67 TFLOPS   │
│ TF32 (Tensor Core)  │  989 TFLOPS*  │
│ FP16 (Tensor Core)  │ 1979 TFLOPS*  │
│ FP8 (Tensor Core)   │ 3958 TFLOPS*  │
│ INT8 (Tensor Core)  │ 3958 TOPS*    │
└─────────────────────┴───────────────┘
* With structured sparsity; dense throughput is half these values.

Note: These are theoretical peaks. Actual performance on compute-heavy kernels is typically 50-80% of peak, and lower still for memory-bound ones.

Peak vs Sustained

When reporting FLOPS, you must distinguish:

Peak FLOPS: Theoretical maximum, assuming 100% utilization
Sustained FLOPS: Performance maintainable under actual workloads

Typical ratios:
  Matrix multiply (GEMM):   70-95% of peak
  Convolution (Conv):       50-80% of peak
  Attention mechanism:      30-70% of peak
  Memory-intensive ops:     10-30% of peak
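To turn these ratios into planning numbers, one can multiply a datasheet peak by the typical range for the workload. The sketch below encodes the ranges quoted above; the 1000 TFLOPS peak and the measured 650 TFLOPS are hypothetical values chosen for round arithmetic.

```python
# Typical sustained/peak ratios quoted above, as (low, high) fractions.
TYPICAL_RATIOS = {
    "gemm": (0.70, 0.95),
    "conv": (0.50, 0.80),
    "attention": (0.30, 0.70),
    "memory_intensive": (0.10, 0.30),
}


def sustained_range(peak_tflops: float, workload: str) -> tuple[float, float]:
    """Expected sustained TFLOPS range for a workload class."""
    low, high = TYPICAL_RATIOS[workload]
    return peak_tflops * low, peak_tflops * high


def utilization(sustained_tflops: float, peak_tflops: float) -> float:
    """Fraction of peak actually achieved."""
    return sustained_tflops / peak_tflops


PEAK_TFLOPS = 1000.0  # hypothetical datasheet peak

for workload in TYPICAL_RATIOS:
    low, high = sustained_range(PEAK_TFLOPS, workload)
    print(f"{workload:<17} {low:6.0f} to {high:6.0f} TFLOPS")

print(f"measured utilization: {utilization(650.0, PEAK_TFLOPS):.0%}")
```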

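Memory Bandwidth → Roofline Model

Raw memory bandwidth says how fast data can move, but not whether a given workload is limited by data movement or by arithmetic. The Roofline model answers that question by plotting attainable performance against arithmetic intensity (AI): the number of floating-point operations performed per byte moved to or from memory. In this section, "AI" means arithmetic intensity.

Roofline diagram construction:

Performance (GFLOPS)
     ^
     |                          Compute-Bound Region
     |         Ridge Point      (Horizontal = at compute ceiling)
     |                  ▼
     |                   _______________________ Peak Compute (roof)
     |                  /
     |                 /
     |                /
     |               /
     |              /
     |             /
     |            /  ◄─── Memory-Bound Region
     |           /       Performance scales linearly with AI
     |          /        (Slope = memory bandwidth)
     |         /
     |        /
     |       /
     |      /
     |     /
     |────┴─────────────────────────────────────────> Arithmetic Intensity
                                                        (FLOPs/Byte)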

Interpreting Roofline

Two "roofs":

Horizontal line: Peak Compute (compute ceiling)
Diagonal line: Memory Bandwidth × AI (memory ceiling)

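At a given arithmetic intensity, whichever line is lower is the ceiling that applies, and that ceiling is the application's bottleneck. The ranges below are illustrative; the actual boundary is the hardware's ridge point (peak compute divided by memory bandwidth):

AI (FLOPs/Byte)  Bottleneck        Typical Applications
───────────────────────────────────────────────────────
    < 10         Memory            Vector add, STREAM
   10-50         Boundary          Sparse matrix, Conv2D
   50-200        Boundary/Compute  Dense matrix multiply
    > 200        Compute           Highly optimized GEMM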
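The two roofs reduce to a single expression: attainable performance is the minimum of peak compute and memory bandwidth times arithmetic intensity. The sketch below applies it to a hypothetical accelerator (100 TFLOPS peak, 2 TB/s bandwidth); those hardware numbers and the function names are assumptions for illustration.

```python
def ridge_point(peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return peak_gflops / bandwidth_gb_s


def attainable_gflops(ai: float, peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gb_s * ai)


def bottleneck(ai: float, peak_gflops: float, bandwidth_gb_s: float) -> str:
    """Classify a kernel by which roof limits it."""
    return "memory" if ai < ridge_point(peak_gflops, bandwidth_gb_s) else "compute"


PEAK_GFLOPS = 100_000.0   # hypothetical: 100 TFLOPS
BANDWIDTH_GB_S = 2_000.0  # hypothetical: 2 TB/s

kernels = {
    # FP32 vector add: 1 FLOP per 12 bytes (two 4-byte reads, one 4-byte write).
    "vector add (FP32)": 1 / 12,
    "dense GEMM (assumed AI)": 200.0,
}

print(f"ridge point: {ridge_point(PEAK_GFLOPS, BANDWIDTH_GB_S):.0f} FLOPs/byte")
for name, ai in kernels.items():
    ceiling = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GB_S)
    limit = bottleneck(ai, PEAK_GFLOPS, BANDWIDTH_GB_S)
    print(f"{name:<24} ceiling={ceiling:>9.0f} GFLOPS ({limit}-bound)")
```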

Amdahl's Law → Communication-to-Computation

When Marcus first learned parallel programming, Amdahl's Law was gospel. "If 10% of your code is sequential," his professor had said, "you can never get more than 10x speedup, no matter how many processors you throw at it."

That mental model worked fine for multi-core CPUs. But when Marcus started working on distributed AI training across hundreds of GPUs, he discovered a new bottleneck that Amdahl never considered: the network.

Traditional: Amdahl's Law

Amdahl's Law describes the theoretical speedup limit of parallelization:

Speedup = 1 / ((1 - P) + P/N)

Where:
  P = parallelizable fraction
  N = number of processors

If 95% of your code is parallelizable (P = 0.95), then even with infinite processors, your speedup is limited to 1 / 0.05 = 20x. The sequential 5% becomes the ceiling.

This law assumes that parallel work is truly parallel—that processors can work independently without coordination. In a shared-memory multi-core system, this is approximately true.
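The formula is easy to explore numerically. This minimal sketch evaluates it for the 95% parallel case discussed above and shows the speedup flattening toward the 20x ceiling as processors are added; the processor counts are arbitrary.

```python
def amdahl_speedup(p: float, n: float) -> float:
    """Speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Speedup ceiling as n grows without bound."""
    return 1.0 / (1.0 - p)


for n in (8, 64, 1024, 1_000_000):
    print(f"N={n:>9}: speedup = {amdahl_speedup(0.95, n):.2f}x")

print(f"ceiling for P=0.95: {amdahl_limit(0.95):.0f}x")
```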

Modern: Communication-to-Computation Ratio

In distributed AI training, a new bottleneck emerges: communication.

Typical data-parallel training:

GPU 0 ─┬── Forward ──┬── Backward ──┬── AllReduce ──┐
GPU 1 ─┤             │              │               │
GPU 2 ─┤             │              │               │
GPU 3 ─┘             │              │               │
                     ▼              ▼               ▼
                 Compute         Compute       Communication

AllReduce operations need to synchronize gradients across all GPUs—this is the main communication bottleneck.

Communication-to-Computation Ratio

C2C Ratio = Communication_Time / Computation_Time

Ideal: C2C << 1 (communication time much less than compute time)
Reality: C2C worsens as GPU count increases

Influencing factors:

Model size: Larger gradients mean more data to transfer
Batch size: Larger batches increase compute time, improving C2C
Network bandwidth: InfiniBand vs Ethernet makes a huge difference
AllReduce algorithm: Ring AllReduce, Hierarchical AllReduce
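A rough way to see how these factors interact is to estimate the C2C ratio for Ring AllReduce. The sketch below uses a bandwidth-only cost model in which each GPU sends and receives about 2(N-1)/N times the gradient size. It ignores latency and any overlap of communication with the backward pass. The gradient size, link bandwidth, and compute time are hypothetical, and the global batch is held fixed so per-GPU compute shrinks as GPUs are added.

```python
def ring_allreduce_seconds(gradient_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of Ring AllReduce time (no latency term)."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    bytes_per_gpu = 2 * (num_gpus - 1) / num_gpus * gradient_bytes
    return bytes_per_gpu / link_bytes_per_s


def c2c_ratio(communication_s: float, computation_s: float) -> float:
    """Communication-to-computation ratio."""
    return communication_s / computation_s


GRADIENT_BYTES = 14e9       # hypothetical: 7B parameters x 2 bytes (FP16)
LINK_BYTES_PER_S = 50e9     # hypothetical per-GPU link bandwidth
SINGLE_GPU_COMPUTE_S = 8.0  # hypothetical step time on one GPU

for n in (2, 8, 64):
    comm = ring_allreduce_seconds(GRADIENT_BYTES, n, LINK_BYTES_PER_S)
    comp = SINGLE_GPU_COMPUTE_S / n  # fixed global batch split across GPUs
    print(f"N={n:>3}: comm={comm:.2f}s comp={comp:.3f}s C2C={c2c_ratio(comm, comp):.2f}")
```

In this model the communication time stays close to constant while compute per GPU keeps shrinking, so the ratio climbs with GPU count.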

Latency vs. Throughput → TTFT vs. TPS

The traditional latency/throughput trade-off still exists in AI, but LLMs introduced something new: the user experience of streaming.

When you chat with an LLM, you don't wait for the entire response to be generated before seeing anything. The words appear one by one, like watching someone type in real-time. This streaming experience created entirely new metrics that capture what users actually care about.

Traditional: Latency and Throughput

For a traditional web service, performance is simple:

Latency: How long until the response is complete?
Throughput: How many requests can we handle per second?

You optimize for one, the other, or some balance. A user either sees the result or doesn't.

Modern: LLM's TTFT and TPS

LLMs broke this model because users experience the response progressively. A 10-second response feels fast if words start appearing immediately. The same 10-second response feels slow if there's a 3-second pause before anything appears.

This led to two new metrics that capture different aspects of user experience:

TTFT (Time To First Token): How long until the user sees something?

This is the "perceived responsiveness" metric. Users are more tolerant of slow generation if the response starts quickly. TTFT is dominated by the Prefill phase—processing the entire input prompt before any output can begin.

TPS (Tokens Per Second): How fast do subsequent words appear?

TPS = Tokens generated per second (Decode phase)

Factors affecting TPS:
1. KV Cache size
2. Batch size
3. Memory bandwidth (usually the bottleneck)

Two Phases of LLM Inference

Request processing flow:

[Input Prompt] ──► [Prefill] ──► [Decode × N] ──► [Complete]
                    ▼               ▼
                 TTFT            TPOT × N

Where:
  TTFT = Time To First Token
  TPOT = Time Per Output Token
  N = number of output tokens

Total latency = TTFT + (N × TPOT)
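The latency formula above can be applied directly. The sketch below compares two hypothetical services with the same total latency but different TTFT, echoing the 10-second example earlier in this section; all timing values are made up for illustration.

```python
def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Total latency = TTFT + (N x TPOT)."""
    return ttft_s + output_tokens * tpot_s


def tokens_per_second(tpot_s: float) -> float:
    """Decode-phase TPS is the reciprocal of time per output token."""
    return 1.0 / tpot_s


# Two hypothetical services generating 200 tokens each.
responsive = total_latency_s(ttft_s=0.5, tpot_s=0.0475, output_tokens=200)
sluggish = total_latency_s(ttft_s=3.0, tpot_s=0.035, output_tokens=200)

print(f"responsive: {responsive:.1f}s total, first token after 0.5s")
print(f"sluggish:   {sluggish:.1f}s total, first token after 3.0s")
print(f"decode speed at TPOT=40ms: {tokens_per_second(0.040):.0f} tokens/s")
```

Both services finish in about 10 seconds, yet the first one starts streaming six times sooner, which is the difference users notice.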

Prefill vs Decode Characteristics

Phase     Characteristics                     Bottleneck
────────────────────────────────────────────────────────────────────────
Prefill   Processes the prompt in parallel;   Compute-bound
          computes many positions at once     (uses Tensor Cores)

Decode    Autoregressive generation;          Memory-bound
          processes 1 token at a time         (needs to read the KV Cache)

This explains why:

TTFT increases with prompt length
TPS is far less sensitive to prompt length (a longer prompt only enlarges the KV Cache that each decode step reads)
Batch processing can significantly improve overall throughput
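Because decode is memory-bound, a back-of-envelope ceiling on TPS follows from bandwidth alone: every decode step must read the model weights once, plus each sequence's KV Cache. The sketch below is a simplification that ignores the compute ceiling and all other overheads. The model size, precision, bandwidth, and KV Cache size are hypothetical.

```python
def decode_tps_upper_bound(param_count: float, bytes_per_param: float,
                           bandwidth_bytes_per_s: float,
                           batch_size: int = 1,
                           kv_cache_bytes_per_seq: float = 0.0) -> float:
    """Bandwidth-only ceiling on aggregate decode tokens per second."""
    weight_bytes = param_count * bytes_per_param
    # One step reads the weights once and every sequence's KV Cache.
    bytes_per_step = weight_bytes + batch_size * kv_cache_bytes_per_seq
    steps_per_second = bandwidth_bytes_per_s / bytes_per_step
    return batch_size * steps_per_second  # one token per sequence per step


PARAMS = 70e9       # hypothetical 70B-parameter model
BYTES_PER_PARAM = 2  # FP16 weights
BANDWIDTH = 2e12    # hypothetical 2 TB/s memory bandwidth
KV_BYTES = 1e9      # hypothetical KV Cache per sequence

for batch in (1, 8, 32):
    tps = decode_tps_upper_bound(PARAMS, BYTES_PER_PARAM, BANDWIDTH, batch, KV_BYTES)
    print(f"batch={batch:>2}: at most {tps:.0f} tokens/s in aggregate")
```

The weight read is shared across the batch, which is why batching raises aggregate throughput so sharply until the KV Cache traffic or the compute roof takes over.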

Power → Energy Efficiency

Traditional: Power (Watts)

Early system evaluation treated power as a "constraint" rather than a "metric":

Traditional thinking:
  "This CPU draws 100W, ensure adequate cooling"
  "Server room power capacity is XX kW"

Power was seen as a problem to "handle," not a target to "optimize."

Modern: Energy Efficiency

In two extreme scenarios, energy efficiency becomes a core metric:

1. Hyperscale Data Centers

Training GPT-4 scale models:
  - Thousands of GPUs
  - Megawatts of power consumption
  - Electricity becomes major cost

Efficiency metrics: GFLOPS/W, TOPS/W

2. Edge Devices

AI inference on phones/IoT:
  - Limited battery capacity
  - Thermal Design Power (TDP) limits
  - User experience affected by heat

Efficiency metrics: Inferences/mAh, TOPS/W
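Both scenarios come down to simple ratios. The sketch below computes TOPS/W for an edge chip and the electricity bill for a training cluster; every figure (device throughput, power draw, cluster size, electricity price) is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Energy efficiency: throughput delivered per watt."""
    return tops / watts


def energy_kwh(power_watts: float, hours: float) -> float:
    """Energy consumed at constant power over a period."""
    return power_watts / 1000.0 * hours


def electricity_cost(power_watts: float, hours: float, price_per_kwh: float) -> float:
    """Cost of running at constant power for the given time."""
    return energy_kwh(power_watts, hours) * price_per_kwh


# Hypothetical edge NPU: 10 TOPS at 2 W.
print(f"edge NPU: {tops_per_watt(10.0, 2.0):.1f} TOPS/W")

# Hypothetical cluster: 1000 accelerators at 700 W each, run for 30 days.
cluster_watts = 1000 * 700.0
hours = 30 * 24
print(f"cluster energy: {energy_kwh(cluster_watts, hours):,.0f} kWh")
print(f"cluster cost at $0.10/kWh: ${electricity_cost(cluster_watts, hours, 0.10):,.0f}")
```

This counts accelerator power only; cooling and other facility overheads would add to the total.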

Epilogue: Marcus's Second Presentation

Three weeks later, Marcus gave his second presentation to the executive team. This time, his slides told a different story:

"Our chip achieves 847 TFLOPS at FP16, putting it between the A100 and H100. For Llama-70B inference at batch size 1, we measure 23 tokens per second—competitive with the A100."

He showed a Roofline diagram. "We're currently memory-bound for decode-heavy workloads, achieving 78% of theoretical bandwidth. For prefill-heavy workloads, we hit 65% of peak compute."

The VP nodded. "Now I understand what we're buying. Good work."

After the meeting, Sarah texted him: "Heard it went well. What changed?"

Marcus replied: "I stopped measuring what's easy and started measuring what matters."

That's the first lesson of AI/HPC performance analysis: know which metrics matter for your workload, because the right metric is worth more than a thousand benchmarks.

Summary

Performance metric evolution reflects fundamental changes in computing workloads:

From Single to Multi-dimensional

Traditional: Single IPC or MHz could explain performance
Modern: Need combination of metrics (FLOPS, bandwidth, efficiency, latency)

From Absolute to Relative

Traditional: Pursue highest absolute performance
Modern: Pursue best "ratios" (Roofline, efficiency)

From Hardware to Application

Traditional: Hardware specs determine performance
Modern: Application characteristics (arithmetic intensity, C2C, TTFT) determine the evaluation approach

Key Metric Evolution

IPC → FLOPS/TOPS: Vectorized computation
Bandwidth → Roofline: Compute/bandwidth ratio
Amdahl → C2C: Communication becomes new bottleneck
Latency/Throughput → TTFT/TPS: LLM-specific streaming metrics
Power → Efficiency: Performance per watt
Code Size → Quantization: Reduce data movement
