Chapter 25: HPC Benchmarks
Part VII: AI/HPC
"The TOP500 list is not about who has the biggest computer, but about who can solve the biggest problems." — Jack Dongarra
When Your Supercomputer Ranks #1 But Can't Run Your Code
Dr. Zhang's research group had just gotten access to their national lab's newest supercomputer—ranked in the top 20 of the TOP500 list. Exciting times. Their climate simulation code, which took 48 hours on the previous system, should scream on this new machine.
It didn't. The simulation took 52 hours. Slower than before.
"How is this possible?" Dr. Zhang asked the system administrator. "This machine has 10× the LINPACK score of our old system."
The admin nodded sympathetically. "LINPACK measures dense linear algebra. Your code is sparse matrix-heavy with irregular memory access patterns. This new system has great compute, but the memory bandwidth per core actually went down. Your code is memory-bound, not compute-bound."
Dr. Zhang had just learned one of the fundamental lessons of HPC benchmarking: the benchmark that ranks supercomputers tells you almost nothing about how those supercomputers will perform on your specific workload.
This chapter explores the benchmarks used to evaluate HPC systems, their strengths, their limitations, and how to interpret them for real applications.
A Brief History of HPC Performance Measurement
Since 1993, the TOP500 list has been published twice yearly, ranking the world's most powerful supercomputers. It's become the definitive measure of "who has the biggest computer"—and also a cautionary tale about what benchmarks can and cannot tell you.
The performance growth has been staggering:
| Year | Milestone | HPL performance (system) |
|---|---|---|
| 1993 | First TOP500 list | 59.7 GFLOPS (CM-5, Los Alamos) |
| 1997 | First teraFLOPS | 1.0 TFLOPS (ASCI Red, Intel) |
| 2008 | First petaFLOPS | 1.0 PFLOPS (Roadrunner, IBM) |
| 2022 | First exaFLOPS | 1.1 EFLOPS (Frontier, HPE/AMD) |
Over roughly three decades, the measured HPL performance of the top-ranked system has grown by a factor of about 18 million (from 59.7 GFLOPS to 1.1 EFLOPS). But how is this measured, and what does it actually mean?
LINPACK: The Benchmark That Defines the TOP500
LINPACK (and its modern parallel version, HPL—High Performance LINPACK) is the benchmark used for TOP500 rankings. It measures performance on a specific mathematical operation: solving a dense system of linear equations.
What LINPACK Actually Computes
The problem is deceptively simple:
Solve Ax = b for x
Where:
A is an n×n dense matrix
b is a known vector
x is the unknown we're solving for
The standard approach uses LU decomposition: factorize A into lower and upper triangular matrices (L and U), then solve through forward and backward substitution. The computational complexity is O(n³), dominated by floating-point multiply-add operations.
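To make the procedure concrete, here is a minimal serial sketch in C of solving Ax = b by LU decomposition with partial pivoting. It is an illustration only: the name `lu_solve` and the row-major layout are assumptions made here, and forward substitution is folded into the elimination loop for brevity. Real HPL runs use blocked, distributed algorithms on top of an optimized BLAS, not a loop nest like this one.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_val(double v) { return v < 0.0 ? -v : v; }

/* Solve A x = b using LU decomposition with partial pivoting.
 * a: n*n matrix, row-major (overwritten with the L multipliers and U)
 * b: right-hand side of length n (overwritten with the solution x)
 * Returns 0 on success, -1 if a zero pivot column is found (singular A). */
int lu_solve(double *a, double *b, size_t n)
{
    assert(a != NULL && b != NULL && n > 0);

    for (size_t k = 0; k < n; k++) {
        /* Partial pivoting: choose the row with the largest |a[i][k]|. */
        size_t pivot = k;
        double max = abs_val(a[k * n + k]);
        for (size_t i = k + 1; i < n; i++) {
            double v = abs_val(a[i * n + k]);
            if (v > max) { max = v; pivot = i; }
        }
        if (max == 0.0) return -1;

        if (pivot != k) {
            for (size_t j = 0; j < n; j++) {
                double t = a[k * n + j];
                a[k * n + j] = a[pivot * n + j];
                a[pivot * n + j] = t;
            }
            double t = b[k]; b[k] = b[pivot]; b[pivot] = t;
        }

        /* Eliminate below the pivot. The multipliers are the entries of L.
         * This triple loop nest is the source of the O(n^3) cost. */
        for (size_t i = k + 1; i < n; i++) {
            double m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;
            for (size_t j = k + 1; j < n; j++)
                a[i * n + j] -= m * a[k * n + j];
            b[i] -= m * b[k];  /* forward substitution, applied on the fly */
        }
    }

    /* Backward substitution with the upper triangular factor U. */
    for (size_t i = n; i-- > 0; ) {
        double sum = b[i];
        for (size_t j = i + 1; j < n; j++)
            sum -= a[i * n + j] * b[j];
        b[i] = sum / a[i * n + i];
    }
    return 0;
}
```

Calling `lu_solve(a, b, n)` leaves the solution in `b`. The innermost update `a[i][j] -= m * a[k][j]` is a multiply-add, which is why the benchmark rewards hardware with high fused multiply-add throughput.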
Why This Particular Problem?
When Jack Dongarra and colleagues created LINPACK in the 1970s, they needed a benchmark that was:
- Mathematically rigorous: The correct answer can be verified
- Scalable: You can always use a bigger matrix
- Compute-intensive: Limited by arithmetic, not I/O
- Representative: Linear algebra was central to many scientific codes
For the computing of that era, these were reasonable choices. Dense linear algebra was a dominant workload. Memory bandwidth was relatively fast compared to compute.
HPL (High Performance LINPACK)
HPL is the parallel implementation of LINPACK for distributed-memory systems:
HPL Parameters:
N: Problem size (matrix dimension)
NB: Block size
P×Q: Processor grid
Performance calculation:
FLOPS = (2/3 × N³ + 2 × N²) / Time
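The performance formula can be written as a small helper. This is a sketch under the assumption that the operation count is exactly the one quoted above; the function names `hpl_flop_count` and `hpl_gflops` are illustrative and are not part of HPL itself.

```c
#include <assert.h>

/* Floating-point operations credited for an N x N solve,
 * using the count quoted in the text: 2/3 N^3 + 2 N^2. */
double hpl_flop_count(double n)
{
    assert(n > 0.0);
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
}

/* Sustained rate in GFLOPS for problem size n and wall-clock seconds. */
double hpl_gflops(double n, double seconds)
{
    assert(seconds > 0.0);
    return hpl_flop_count(n) / seconds / 1.0e9;
}
```

Because the N³ term dominates, doubling N multiplies the credited work by about eight while the matrix itself only grows fourfold. This is one reason large problem sizes tend to report higher rates.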
Result Interpretation
Typical HPL output:
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 100000 256 8 8 1234.56 5.40e+02
--------------------------------------------------------------------------------
Interpretation:
N = 100000 (matrix dimension)
NB = 256 (block size)
P×Q = 8 × 8 = 64 MPI processes
Time = 1234.56 seconds
Gflops = 5.40e+02 = 540 GFLOPS (0.54 TFLOPS)
The rate follows from the formula: (2/3 × 100000³) / 1234.56 s ≈ 5.4 × 10¹¹ FLOPS.
Efficiency Calculation
HPL Efficiency = Achieved FLOPS / Theoretical Peak FLOPS
Typical efficiency:
Well-optimized systems: 70-85%
Average systems: 50-70%
Unoptimized: 30-50%
Factors affecting efficiency:
- Problem size (a larger N, up to what memory allows, amortizes communication over more computation)
- Network bandwidth and latency
- Memory bandwidth
- Software optimization level
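The efficiency ratio can be sketched in a few lines of C. The peak calculation below assumes a homogeneous CPU system where peak is nodes × cores × clock × FLOPs per cycle. That model, and the names `peak_gflops` and `hpl_efficiency`, are assumptions made for illustration; accelerator-based systems need a different peak estimate.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS, assuming every core retires its maximum
 * number of double-precision operations on every cycle. */
double peak_gflops(int nodes, int cores_per_node,
                   double clock_ghz, int flops_per_cycle)
{
    assert(nodes > 0 && cores_per_node > 0);
    assert(clock_ghz > 0.0 && flops_per_cycle > 0);
    return (double)nodes * cores_per_node * clock_ghz * flops_per_cycle;
}

/* HPL efficiency as a fraction: achieved rate over theoretical peak. */
double hpl_efficiency(double achieved_gflops, double peak)
{
    assert(peak > 0.0 && achieved_gflops >= 0.0);
    return achieved_gflops / peak;
}
```

For example, a hypothetical 4-node system with 64 cores per node at 2.0 GHz and 16 FLOPs per cycle has a peak of 8192 GFLOPS, so a measured 6144 GFLOPS corresponds to 75% efficiency.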
HPCG
HPCG (High Performance Conjugate Gradients) was designed to complement LINPACK by exercising the memory-bound and communication-bound behavior that HPL does not stress.
Design Motivation
LINPACK's problems:
- Compute-intensive, regular memory access
- Modern applications are often memory-intensive
- High LINPACK efficiency doesn't mean high application efficiency
HPCG's goals:
- Access patterns closer to real applications
- Measure memory system performance
- Reflect sparse matrix operations
Mathematical Background
HPCG uses the preconditioned conjugate gradient method to solve a sparse linear system:
Ax = b
Where A is a sparse matrix (arising from a 27-point stencil on a 3D grid)
Main operations:
1. SpMV (Sparse Matrix-Vector multiply)
2. Vector operations (AXPY, dot product)
3. Multigrid preconditioning (with a symmetric Gauss-Seidel smoother)
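The first of these operations, SpMV, shows why this benchmark behaves so differently from HPL. Below is a minimal sketch using the compressed sparse row (CSR) layout. CSR and the name `spmv_csr` are assumptions chosen for illustration and are not necessarily the data structure used by the HPCG reference code.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a sparse matrix stored in CSR form.
 * row_ptr has n_rows + 1 entries; the nonzeros of row i are
 * values[row_ptr[i] .. row_ptr[i+1]-1], in columns col_idx[...]. */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
              const double *values, const double *x, double *y)
{
    assert(row_ptr && col_idx && values && x && y);
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += values[k] * x[col_idx[k]];  /* indirect load of x */
        y[i] = sum;
    }
}
```

Each nonzero contributes only two floating-point operations but requires loading a value, a column index, and an indirectly addressed element of `x`. With so little arithmetic per byte moved, throughput is governed by the memory system, not the floating-point units.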
## Graph500
**Graph500** measures graph analysis performance, reflecting data-intensive applications.
### Why Graph500 Is Needed
Many important applications are graph-oriented:
- Social network analysis
- Web page ranking
- Bioinformatics
- Cybersecurity
Characteristics of these applications:
- Irregular data access
- Low computational density
- High memory bandwidth requirements
Benchmark Content
Graph500 contains three kernels:
1. Graph Construction
- Build graph data structure from edge list
- Measures data processing capability
2. BFS (Breadth-First Search)
- Breadth-first search from random starting point
- Primary performance metric
3. SSSP (Single-Source Shortest Path)
- Calculate shortest paths
- More complex graph algorithm
Performance Metric: GTEPS
GTEPS = Giga Traversed Edges Per Second
= Billions of edges traversed per second
Calculation:
GTEPS = Traversed_Edges / Time_in_seconds / 10^9
Typical values:
Top systems: 10,000+ GTEPS
General HPC: 100-1000 GTEPS
Single node: 1-10 GTEPS
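The BFS kernel and the GTEPS calculation can be sketched as follows. This is a serial illustration over a CSR adjacency structure; the names `bfs_csr` and `gteps` are hypothetical. One simplification is flagged in the comments: the function returns the number of adjacency entries scanned, which for an undirected graph stored in both directions is twice the number of edges in the component, so it is not the official edge count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Breadth-first search over a graph in CSR adjacency form.
 * parent[v] receives the BFS parent of v (the root is its own parent),
 * or -1 if v is unreachable from root.
 * Returns the number of adjacency entries scanned, or -1 if the queue
 * cannot be allocated. Note: scanned entries, not unique edges. */
long bfs_csr(size_t n_vertices, const size_t *adj_ptr, const size_t *adj,
             size_t root, long *parent)
{
    assert(adj_ptr && adj && parent && root < n_vertices);

    size_t *queue = malloc(n_vertices * sizeof *queue);
    if (queue == NULL) return -1;

    for (size_t v = 0; v < n_vertices; v++) parent[v] = -1;

    size_t head = 0, tail = 0;
    long scanned = 0;
    parent[root] = (long)root;
    queue[tail++] = root;

    while (head < tail) {
        size_t u = queue[head++];
        for (size_t k = adj_ptr[u]; k < adj_ptr[u + 1]; k++) {
            size_t v = adj[k];          /* data-dependent, irregular access */
            scanned++;
            if (parent[v] < 0) {        /* first visit: record and enqueue */
                parent[v] = (long)u;
                queue[tail++] = v;
            }
        }
    }
    free(queue);
    return scanned;
}

/* GTEPS from an edge count and elapsed seconds, per the formula above. */
double gteps(double edges, double seconds)
{
    assert(seconds > 0.0);
    return edges / seconds / 1.0e9;
}
```

The loop does almost no arithmetic. Nearly all of its time goes to following `adj[k]` to unpredictable locations in `parent`, which is the irregular access pattern the benchmark is meant to expose.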
Other HPC Benchmarks
STREAM
Classic benchmark for measuring memory bandwidth:
Four kernels:
Copy: a[i] = b[i]
Scale: a[i] = q * b[i]
Add: a[i] = b[i] + c[i]
Triad: a[i] = b[i] + q * c[i]
Result units: GB/s or MB/s
// STREAM Triad core loop: two loads, one store, one multiply-add per element
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    a[i] = b[i] + scalar * c[i];
}
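To turn the Triad loop into a bandwidth figure, the bytes moved are divided by the elapsed time. The sketch below assumes the usual accounting for Triad of three arrays of doubles touched per iteration (two read, one written). The function names are illustrative, and timing is left to the caller, so this is not the official STREAM harness.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM Triad: a[i] = b[i] + scalar * c[i]. */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    assert(a && b && c);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes credited to one Triad pass: two arrays read, one written. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Bandwidth in GB/s (10^9 bytes per second). */
double bandwidth_gbs(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1.0e9;
}
```

For the result to reflect main memory and not cache, the arrays need to be much larger than the last-level cache, and the reported figure is normally the best of several timed repetitions.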
NAS Parallel Benchmarks (NPB)
Parallel computing benchmark suite developed by NASA:
| Benchmark | Description | Characteristics |
|---|---|---|
| EP | Embarrassingly Parallel | No communication |
| MG | Multigrid | Long- and short-range communication |
| CG | Conjugate Gradient | Irregular memory access |
| FT | FFT | All-to-all communication |
| IS | Integer Sort | Random memory access |
| LU | Lower-Upper Gauss-Seidel solver | Regular communication |
| SP | Scalar Pentadiagonal solver | Regular communication |
| BT | Block Tridiagonal solver | Regular communication |
OSU Micro-Benchmarks
Measures MPI communication performance:
Measurement items:
- Point-to-point latency
- Point-to-point bandwidth
- Collective operations (AllReduce, Broadcast, etc.)
- One-sided operations
Typical results:
InfiniBand HDR:
Latency: ~1 μs
Bandwidth: ~200 Gb/s
Ethernet 100G:
Latency: ~5 μs
Bandwidth: ~100 Gb/s
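Latency and bandwidth figures like these are often combined with a simple latency-bandwidth model, T = latency + size / bandwidth, to estimate message cost. The model is introduced here as an assumption for illustration (it ignores congestion, protocol switches, and software overhead), and the name `message_time_us` is hypothetical.

```c
#include <assert.h>

/* Estimated one-way transfer time in microseconds under the
 * latency-bandwidth model: T = latency + size / bandwidth.
 * Bandwidth is given in gigabits per second, as in the figures above. */
double message_time_us(double latency_us, double bandwidth_gbit_s,
                       double bytes)
{
    assert(latency_us >= 0.0 && bandwidth_gbit_s > 0.0 && bytes >= 0.0);
    double bits = bytes * 8.0;
    double transfer_us = bits / (bandwidth_gbit_s * 1.0e9) * 1.0e6;
    return latency_us + transfer_us;
}
```

With the approximate HPL-era InfiniBand HDR numbers above (1 μs, 200 Gb/s), an 8-byte message costs essentially the latency alone, while a 1 MB message costs about 41 μs and is dominated by bandwidth. This is why small-message latency and large-message bandwidth are measured separately.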
Result Analysis and Reporting
Performance Analysis
When analyzing HPC benchmark results, consider:
1. Efficiency
Achieved performance / Theoretical peak
2. Scalability
How performance changes with node count
3. Bottleneck identification
- High HPL but low HPCG → Memory bottleneck
- Low Graph500 → Memory latency issues
- High OSU latency → Network problems
4. Energy efficiency
GFLOPS/W or GTEPS/W
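One common way to reason about the "memory bottleneck" case is the roofline model, which is not named in this chapter and is brought in here as an illustrative assumption. It bounds attainable performance by the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The function names are hypothetical.

```c
#include <assert.h>

/* Roofline bound in GFLOPS: limited either by peak compute or by
 * memory bandwidth (GB/s) times arithmetic intensity (FLOPs/byte). */
double roofline_gflops(double peak, double mem_bw_gbs, double flops_per_byte)
{
    assert(peak > 0.0 && mem_bw_gbs > 0.0 && flops_per_byte > 0.0);
    double memory_bound = mem_bw_gbs * flops_per_byte;
    return memory_bound < peak ? memory_bound : peak;
}

/* Returns 1 if the kernel sits on the bandwidth-limited part of the
 * roofline, 0 if it is limited by peak compute. */
int is_memory_bound(double peak, double mem_bw_gbs, double flops_per_byte)
{
    assert(peak > 0.0 && mem_bw_gbs > 0.0 && flops_per_byte > 0.0);
    return mem_bw_gbs * flops_per_byte < peak;
}
```

On a hypothetical node with 1000 GFLOPS peak and 100 GB/s of bandwidth, a kernel at 0.25 FLOPs per byte is capped at 25 GFLOPS regardless of the HPL score. Under this model, raising arithmetic intensity or bandwidth helps such a kernel, and adding compute does not.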
Common Issues
1. Low HPL efficiency
- Problem size not large enough
- Block size not suitable
- Network becoming bottleneck
- BLAS library not optimized
2. Very low HPCG efficiency
- This is normal (typically 1-5% of theoretical peak)
- Reflects memory system limitations, not necessarily a misconfiguration
- Review the memory configuration before attempting further tuning
3. Poor Graph500 performance
- Memory latency is key
- NUMA configuration matters
- Consider using huge pages
Future of HPC Benchmarks
Emerging Benchmarks
1. HPL-MxP (Mixed Precision)
- Solves the same dense system using low-precision arithmetic plus iterative refinement
- Reflects AI hardware capabilities
- Introduced as HPL-AI in 2019 and later renamed HPL-MxP
2. MLPerf HPC
- AI applications on HPC
- Scientific computing + machine learning
3. IO500
- Storage system performance
- Important for data-intensive applications
Trends
1. From FLOPS to multi-dimensional metrics
- Performance isn't just compute speed
- Memory, communication, energy efficiency all matter
2. Application-oriented
- Real applications are more meaningful than synthetic benchmarks
- Mini-apps (small proxies for full applications) are increasingly used
3. Energy efficiency first
- The Green500 list is growing in importance
- Power is becoming a primary design constraint
The End of Dr. Zhang's Story
Six months later, Dr. Zhang's research group completed their analysis. The "slower" new supercomputer wasn't slower after all—it just required different optimization.
The old system had high memory bandwidth per core, which matched their original code. The new system had more compute per core but less bandwidth, requiring them to restructure their algorithms to improve arithmetic intensity.
After optimization, the new system ran their climate simulation in 12 hours instead of 48—a 4× improvement. But they had to earn that improvement through months of algorithm work.
"LINPACK told us nothing about this," Dr. Zhang said at a department meeting. "The new machine had 10× the LINPACK score, but our speedup was 4× after significant effort. Someone using a different code might get 8×. Someone with an even more memory-bound code might get 0.5×."
"So what's the point of LINPACK?" a student asked.
"It's a common yardstick," Dr. Zhang replied. "It tells you something about the system, just not everything. The real lesson is that no single benchmark captures all aspects of performance. The TOP500 ranks supercomputers, but it doesn't rank how well they'll run your code."
Summary
HPC benchmarks provide standard methods for evaluating supercomputer performance:
Major Benchmarks
- HPL/LINPACK: Dense linear algebra, TOP500 foundation
- HPCG: Sparse operations, closer to real applications
- Graph500: Graph analysis, data-intensive
- STREAM: Memory bandwidth
Key Insights
- High HPL efficiency doesn't mean high application efficiency
- HPCG efficiency is typically only 1-5% of theoretical peak
- Different benchmarks measure different aspects
Practical Recommendations
- Use multiple benchmarks to evaluate systems
- Focus on efficiency rather than absolute performance
- Consider energy efficiency (Green500)
- Combine with application benchmarks for evaluation