Chapter 25: HPC Benchmarks

Part VII: AI/HPC

"The TOP500 list is not about who has the biggest computer, but about who can solve the biggest problems." — Jack Dongarra

When Your Supercomputer Ranks #1 But Can't Run Your Code

Dr. Zhang's research group had just gotten access to their national lab's newest supercomputer—ranked in the top 20 of the TOP500 list. Exciting times. Their climate simulation code, which took 48 hours on the previous system, should scream on this new machine.

It didn't. The simulation took 52 hours. Slower than before.

"How is this possible?" Dr. Zhang asked the system administrator. "This machine has 10x the LINPACK score of our old system."

The admin nodded sympathetically. "LINPACK measures dense linear algebra. Your code is sparse matrix-heavy with irregular memory access patterns. This new system has great compute, but the memory bandwidth per core actually went down. Your code is memory-bound, not compute-bound."

Dr. Zhang had just learned one of the fundamental lessons of HPC benchmarking: the benchmark that ranks supercomputers tells you almost nothing about how those supercomputers will perform on your specific workload.

This chapter explores the benchmarks used to evaluate HPC systems, their strengths, their limitations, and how to interpret them for real applications.

A Brief History of HPC Performance Measurement

Since 1993, the TOP500 list has been published twice yearly, ranking the world's most powerful supercomputers. It's become the definitive measure of "who has the biggest computer"—and also a cautionary tale about what benchmarks can and cannot tell you.

The performance growth has been staggering:

Year	Milestone	HPL performance (system)
1993	First TOP500 list	59.7 GFLOPS (CM-5, Los Alamos)
1997	First teraFLOPS	1.0 TFLOPS (ASCI Red, Intel)
2008	First petaFLOPS	1.0 PFLOPS (Roadrunner, IBM)
2022	First exaFLOPS	1.1 EFLOPS (Frontier, HPE/AMD)

Over roughly 30 years, the measured performance of the top-ranked system has grown about 18 million times, from 59.7 GFLOPS to 1.1 EFLOPS. But how is this measured, and what does it actually mean?

LINPACK: The Benchmark That Defines the TOP500

LINPACK (and its modern parallel version, HPL—High Performance LINPACK) is the benchmark used for TOP500 rankings. It measures performance on a specific mathematical operation: solving a dense system of linear equations.

What LINPACK Actually Computes

The problem is deceptively simple:

Solve Ax = b for x

Where:
  A is an n×n dense matrix
  b is a known vector
  x is the unknown we're solving for

The standard approach uses LU decomposition: factorize A into lower and upper triangular matrices (L and U), then solve through forward and backward substitution. The computational complexity is O(n³), dominated by floating-point multiply-add operations.
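
To make the two phases concrete, here is a minimal sketch of the method in plain Python. It only shows the structure (pivoting, elimination, then the two substitution passes). The names `lu_solve` and `max_residual` are illustrative, and real HPL implementations use blocked, distributed algorithms built on tuned BLAS rather than loops like these.

```python
def lu_solve(a, b):
    """Solve a x = b by LU factorization with partial pivoting.

    a is a list of n rows (each a list of n floats); b is a list of n floats.
    Inputs are copied, so the caller's data is left untouched.
    """
    n = len(a)
    lu = [row[:] for row in a]   # will hold L (below diagonal) and U (on and above)
    x = b[:]                     # right-hand side, permuted alongside the rows

    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k to row k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        x[k], x[p] = x[p], x[k]

        # Eliminate below the pivot; this update loop is the O(n^3) part.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]             # multiplier, stored in L's slot
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]

    # Forward substitution: solve L y = P b (L has an implicit unit diagonal).
    for i in range(n):
        for j in range(i):
            x[i] -= lu[i][j] * x[j]

    # Backward substitution: solve U x = y.
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def max_residual(a, x, b):
    """Largest absolute entry of a x - b, used to verify the answer."""
    return max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
               for row, bi in zip(a, b))


A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
b = [4.0, 10.0, 24.0]
x = lu_solve(A, b)            # exact solution is [1, 1, 1]
print(x, max_residual(A, x, b))
```

The triple-nested elimination loop is where the O(n³) work lives, and the residual check mirrors the benchmark's property that the answer can be verified.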

Why This Particular Problem?

LINPACK began in the 1970s as a linear algebra library written by Jack Dongarra and colleagues, and the benchmark grew out of it. As a yardstick it had four attractive properties:

Mathematically rigorous: The correct answer can be verified
Scalable: You can always use a bigger matrix
Compute-intensive: Limited by arithmetic, not I/O
Representative: Linear algebra was central to many scientific codes

For the computing of that era, these were reasonable choices. Dense linear algebra was a dominant workload, and memory could largely keep pace with the arithmetic units, so a compute-bound benchmark was a fair proxy for real codes.

HPL (High Performance LINPACK)

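HPL is the parallel implementation of LINPACK for distributed-memory systems, built on MPI:

HPL Parameters:
  N:  Problem size (matrix dimension)
  NB: Block size used to distribute and factor the matrix
  P×Q: Process grid (P rows by Q columns of MPI processes)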

Performance calculation:
  FLOPS = (2/3 × N³ + 3/2 × N²) / Time
  (Time in seconds; for large N the N³ term dominates)
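
Recomputing the rate from N and the wall time is a useful sanity check on any reported figure. The sketch below does that for N = 100000 solved in 1234.56 seconds. The function names are illustrative, and the low-order coefficient is assumed to be 3/2 as in the HPL source (the classic LINPACK count uses 2). At this problem size the choice changes the result by far less than one percent.

```python
def hpl_flop_count(n):
    """Floating-point operations credited for an n x n solve.

    Assumption: low-order term 3/2 n^2; the n^3 term dominates for large n.
    """
    return (2.0 / 3.0) * n**3 + 1.5 * n**2


def hpl_gflops(n, seconds):
    """Reported rate in GFLOPS for problem size n solved in `seconds`."""
    return hpl_flop_count(n) / seconds / 1e9


rate = hpl_gflops(100_000, 1234.56)
print(f"{rate:.1f} GFLOPS")   # roughly 540 GFLOPS
```

Because the operation count is fixed by N, the only measured quantity is time. A faster algorithm with fewer operations would still be credited with the same count.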

Result Interpretation

Typical HPL output:

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      100000   256     8     8            1234.56              5.40e+02
--------------------------------------------------------------------------------

Interpretation:
  N = 100000 (matrix dimension)
  NB = 256 (block size)
  P×Q = 8 × 8 = 64 MPI processes
  Time = 1234.56 seconds
  Gflops = 5.40e+02 = 540 GFLOPS (0.54 TFLOPS)
  Check: (2/3 × 100000³) / 1234.56 s ≈ 5.4 × 10¹¹ FLOPS

Efficiency Calculation

HPL Efficiency = Achieved FLOPS / Theoretical Peak FLOPS

Typical efficiency:
  Well-optimized systems: 70-85%
  Average systems:        50-70%
  Unoptimized:           30-50%
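
As a worked example of the ratio, the sketch below derives a theoretical peak from hardware parameters and compares it with an achieved HPL rate. All numbers describe a hypothetical cluster, and the helper names are made up for illustration. The FLOPs-per-cycle figure in particular depends on the processor's vector width and number of FMA units.

```python
def theoretical_peak_gflops(nodes, cores_per_node, ghz, flops_per_cycle):
    """Peak rate: every core retiring its maximum FLOPs on every cycle."""
    return nodes * cores_per_node * ghz * flops_per_cycle


def hpl_efficiency(achieved_gflops, peak_gflops):
    """Fraction of theoretical peak that the HPL run delivered."""
    return achieved_gflops / peak_gflops


# Hypothetical cluster: 4 nodes x 16 cores at 2.5 GHz, 16 FP64 FLOPs/cycle/core.
peak = theoretical_peak_gflops(nodes=4, cores_per_node=16, ghz=2.5, flops_per_cycle=16)
eff = hpl_efficiency(achieved_gflops=1920.0, peak_gflops=peak)
print(f"peak = {peak:.0f} GFLOPS, efficiency = {eff:.0%}")
```

With these assumed figures the peak is 2560 GFLOPS and the run lands at 75%, inside the well-optimized band.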

Factors affecting efficiency:
  - Problem size (larger is better, as long as the matrix still fits in memory)
  - Network bandwidth and latency
  - Memory bandwidth
  - Software optimization level (chiefly the BLAS library)
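
Since problem size matters so much, a common starting point is to size the matrix so that it fills most of the available memory. The sketch below assumes a rule of thumb of about 80% of total RAM and rounds N down to a multiple of NB. Both the fraction and the helper name `suggest_hpl_n` are assumptions for illustration, not values prescribed by HPL.

```python
import math


def suggest_hpl_n(total_mem_bytes, nb, mem_fraction=0.8):
    """Largest N, a multiple of nb, whose N x N float64 matrix fits the budget."""
    budget_elems = int(total_mem_bytes * mem_fraction) // 8   # 8 bytes per float64
    n = math.isqrt(budget_elems)
    return (n // nb) * nb


# Hypothetical cluster: 4 nodes with 64 GiB each, block size NB = 256.
total = 4 * 64 * 2**30
n = suggest_hpl_n(total_mem_bytes=total, nb=256)
print(n, f"{8 * n * n / 2**30:.1f} GiB")
```

Leaving headroom matters because the operating system, MPI buffers, and HPL's own workspace also need memory. A run that starts swapping will be far slower than one with a slightly smaller N.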

HPCG

HPCG (High Performance Conjugate Gradients) was designed to complement HPL by exercising what HPL leaves out: sparse data structures, irregular memory access, and frequent communication.

Design Motivation

LINPACK's limitations:
  - Compute-intensive with regular memory access, which suits modern hardware unusually well
  - Many modern applications are memory-bound instead
  - High LINPACK efficiency doesn't mean high application efficiency

HPCG's goals:
  - Access patterns closer to real applications
  - Measure memory system performance
  - Reflect sparse matrix operations

Mathematical Background

HPCG uses the preconditioned conjugate gradient method to solve a sparse linear system:

Ax = b

Where A is a sparse, symmetric positive definite matrix (from a 27-point stencil on a 3D grid)

Main operations:
1. SpMV (Sparse Matrix-Vector multiply)
2. Vector operations (AXPY, dot product)
3. Multigrid preconditioning with a symmetric Gauss-Seidel smoother
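
The operations listed above can be seen working together in a small sketch. It is deliberately simplified: it omits the preconditioner, runs serially, and uses a 1D three-point stencil instead of HPCG's 3D 27-point one. The function names are illustrative. What it does show is the CSR storage format and the indirect `x[col_idx[k]]` loads that make SpMV memory-bound.

```python
def spmv(row_ptr, col_idx, vals, x):
    """y = A x for A stored in CSR form; note the indirect loads x[col_idx[k]]."""
    return [sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(row_ptr, col_idx, vals, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite CSR matrix."""
    x = [0.0] * len(b)
    r = b[:]                 # residual b - A x, with x = 0
    p = r[:]                 # search direction
    rr = dot(r, r)
    for it in range(max_iter):
        if rr ** 0.5 < tol:
            return x, it
        ap = spmv(row_ptr, col_idx, vals, p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)      # p = r + beta * p
        rr = rr_new
    return x, max_iter


def laplacian_1d_csr(n):
    """CSR arrays for the tridiagonal matrix with 2 on the diagonal, -1 beside it."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


n = 50
row_ptr, col_idx, vals = laplacian_1d_csr(n)
b = spmv(row_ptr, col_idx, vals, [1.0] * n)     # so the exact solution is all ones
x, iterations = conjugate_gradient(row_ptr, col_idx, vals, b)
print(iterations, max(abs(xi - 1.0) for xi in x))
```

Each iteration performs one SpMV, two dot products, and three AXPY updates. Every one of them does only a couple of arithmetic operations per value loaded, which is why this class of solver tracks memory bandwidth rather than peak FLOPS.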

Graph500

Graph500 measures graph analysis performance, reflecting data-intensive applications.

Why Graph500 Is Needed

Many important applications are graph-oriented:
  - Social network analysis
  - Web page ranking
  - Bioinformatics
  - Cybersecurity

Characteristics of these applications:
  - Irregular data access
  - Low computational density
  - High demands on memory latency and bandwidth

Benchmark Content

Graph500 contains three kernels:

1. Graph Construction
   - Build graph data structure from edge list
   - Measures data processing capability

2. BFS (Breadth-First Search)
   - Breadth-first search from each of 64 randomly chosen roots
   - Primary performance metric

3. SSSP (Single-Source Shortest Path)
   - Calculate shortest paths
   - More complex graph algorithm

Performance Metric: GTEPS

GTEPS = Giga Traversed Edges Per Second
      = Billions of edges traversed per second

Calculation:
  GTEPS = Edges_Traversed / Time_in_seconds / 10^9
  (Edges_Traversed = input edges in the component reached from the root)

Typical values:
  Top systems:    10,000+ GTEPS
  General HPC:    100-1000 GTEPS
  Single node:    1-10 GTEPS
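
The construction and BFS kernels, together with the TEPS calculation, can be sketched on a toy graph. This is a serial illustration under simplifying assumptions, not the reference implementation. The real benchmark generates a very large synthetic graph, validates each search tree, and times the runs. The function names are illustrative, and the elapsed time passed to `gteps` is invented.

```python
from collections import deque


def build_adjacency(num_vertices, edges):
    """Kernel 1 analogue: undirected adjacency lists from an edge list."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def bfs_parents(adj, root):
    """Kernel 2 analogue: BFS parent array; -1 marks unreached vertices."""
    parent = [-1] * len(adj)
    parent[root] = root
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                queue.append(v)
    return parent


def traversed_edges(edges, parent):
    """Input edges whose endpoints were both reached from the root."""
    return sum(1 for u, v in edges if parent[u] != -1 and parent[v] != -1)


def gteps(num_edges, seconds):
    return num_edges / seconds / 1e9


# Tiny hypothetical graph: vertices 0-3 form one component, 4-5 another.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (4, 5)]
parent = bfs_parents(build_adjacency(6, edges), root=0)
m = traversed_edges(edges, parent)
print(parent, m, gteps(m, seconds=2e-9))
```

The inner loop does almost no arithmetic. Its cost is the chain of dependent loads through `adj[u]` and `parent[v]`, which is why the benchmark stresses memory latency rather than floating-point units.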

Other HPC Benchmarks

STREAM

STREAM is the classic benchmark for sustainable memory bandwidth. It times four simple vector kernels on arrays sized well beyond the caches:

Copy:   a[i] = b[i]
Scale:  a[i] = q * b[i]
Add:    a[i] = b[i] + c[i]
Triad:  a[i] = b[i] + q * c[i]

Result units: GB/s or MB/s

// STREAM Triad core code
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    a[i] = b[i] + scalar * c[i];
}
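
Turning a kernel's measured time into a bandwidth figure is simple arithmetic, sketched here in Python. STREAM counts one array's worth of traffic for each array read or written per iteration, so Copy and Scale move two arrays and Add and Triad move three. The helper name and the timing are hypothetical.

```python
# Arrays touched per loop iteration (reads + writes), following STREAM's counting.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_bandwidth_gbs(kernel, n, seconds, bytes_per_element=8):
    """Sustained bandwidth in GB/s (10^9 bytes/s) for one pass of `kernel`."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * n * bytes_per_element
    return bytes_moved / seconds / 1e9


# Hypothetical timing: Triad over 100 million doubles in 0.024 s.
bw = stream_bandwidth_gbs("triad", n=100_000_000, seconds=0.024)
print(f"{bw:.1f} GB/s")
```

Triad performs two floating-point operations for every 24 bytes moved, so even a perfect run uses only a small fraction of a modern processor's arithmetic capability.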

NAS Parallel Benchmarks (NPB)

Parallel computing benchmark suite developed by NASA:

Benchmark    Description                 Characteristics
─────────────────────────────────────────────────────────
EP           Embarrassingly Parallel     No communication
MG           Multigrid                   Long/short range comm
CG           Conjugate Gradient          Irregular access
FT           FFT                         All-to-all comm
IS           Integer Sort                Random access
LU           Lower-Upper Gauss-Seidel    Regular comm
SP           Scalar Pentadiagonal        Regular comm
BT           Block Tridiagonal           Regular comm

OSU Micro-Benchmarks

Measures MPI communication performance:

Measurement items:
  - Point-to-point latency
  - Point-to-point bandwidth
  - Collective operations (AllReduce, Broadcast, etc.)
  - One-sided operations

Typical results:
  InfiniBand HDR:
    Latency: ~1 μs
    Bandwidth: ~200 Gb/s

  Ethernet 100G:
    Latency: ~5 μs
    Bandwidth: ~100 Gb/s
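
Latency and bandwidth matter at different message sizes. A simple first-order model, assumed here for illustration, estimates transfer time as latency plus message size divided by bandwidth. The sketch applies it to the typical figures above. It ignores protocol overhead, congestion, and software stack costs, so treat the output as rough guidance only.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_gbps):
    """Estimated one-way time in microseconds: latency plus bytes over bandwidth."""
    bytes_per_us = bandwidth_gbps * 1e9 / 8 / 1e6     # Gb/s -> bytes per microsecond
    return latency_us + message_bytes / bytes_per_us


# Typical figures quoted above: (latency in us, bandwidth in Gb/s).
networks = {"InfiniBand HDR": (1.0, 200.0), "Ethernet 100G": (5.0, 100.0)}

for size in (8, 64 * 1024, 16 * 1024 * 1024):
    for name, (lat, bw) in networks.items():
        t = transfer_time_us(size, lat, bw)
        print(f"{size:>9} B  {name:<15} {t:10.2f} us")
```

For an 8-byte message the time is essentially the latency, while for a 16 MiB message the bandwidth term dominates. Codes that exchange many small messages therefore care far more about the latency figure.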

Result Analysis and Reporting

Performance Analysis

When analyzing HPC benchmark results, consider:

1. Efficiency
   Achieved performance / Theoretical peak

2. Scalability
   How performance changes with node count

3. Bottleneck identification
   - High HPL but low HPCG → Memory bottleneck
   - Low Graph500 → Memory latency issues
   - High OSU latency → Network problems

4. Energy efficiency
   GFLOPS/W or GTEPS/W
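
Scalability and energy efficiency both reduce to simple ratios. The sketch below computes strong-scaling speedup and parallel efficiency from run times at several node counts, plus a GFLOPS/W figure. All timings and power numbers are hypothetical, and the function names are illustrative.

```python
def strong_scaling(runs):
    """Speedup and parallel efficiency relative to the smallest run.

    `runs` maps node count -> wall time in seconds for the same problem.
    """
    base_nodes = min(runs)
    base_time = runs[base_nodes]
    table = {}
    for nodes in sorted(runs):
        speedup = base_time / runs[nodes]
        efficiency = speedup / (nodes / base_nodes)
        table[nodes] = (speedup, efficiency)
    return table


def gflops_per_watt(gflops, watts):
    return gflops / watts


# Hypothetical timings for one fixed-size problem.
runs = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0}
table = strong_scaling(runs)
for nodes, (speedup, eff) in table.items():
    print(f"{nodes:>2} nodes: speedup {speedup:5.2f}, efficiency {eff:5.1%}")
print(gflops_per_watt(gflops=1920.0, watts=9600.0), "GFLOPS/W")
```

A steady drop in parallel efficiency as nodes are added usually points at communication or load imbalance, which is where the network micro-benchmarks become useful.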

Common Issues

1. Low HPL efficiency
   - Problem size not large enough
   - Block size not suitable
   - Network becoming bottleneck
   - BLAS library not optimized

2. Very low HPCG efficiency
   - This is normal (typically around 1-5% of theoretical peak, sometimes lower)
   - Reflects memory bandwidth limits, not a misconfigured system
   - Check memory configuration (populated channels, NUMA placement)

3. Poor Graph500 performance
   - Memory latency is key
   - NUMA configuration matters
   - Consider using huge pages
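
A roofline-style estimate helps explain why a few percent of peak is normal for memory-bound benchmarks. Attainable performance is bounded by the smaller of the compute peak and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The node parameters and kernel intensities below are assumed round numbers chosen for illustration, not measurements.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical node: 2560 GFLOPS peak, 200 GB/s memory bandwidth.
PEAK, BW = 2560.0, 200.0

# Assumed arithmetic intensities (FLOPs per byte moved), for illustration only.
kernels = {"sparse matrix-vector": 0.25, "dense matrix multiply": 32.0}

for name, intensity in kernels.items():
    bound = attainable_gflops(PEAK, BW, intensity)
    print(f"{name:<22} {bound:7.1f} GFLOPS  ({bound / PEAK:.1%} of peak)")
```

Under these assumptions the sparse kernel is capped at about 2% of peak no matter how well it is tuned, while the dense kernel can reach the compute roof. Raising a code's arithmetic intensity is the way to move it toward that roof.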

Future of HPC Benchmarks

Emerging Benchmarks

1. HPL-MxP (Mixed Precision)
   - Solves the HPL problem with low-precision arithmetic plus iterative refinement
   - Reflects AI hardware capabilities
   - Introduced in 2019 as HPL-AI, later renamed HPL-MxP

2. MLPerf HPC
   - AI applications on HPC
   - Scientific computing + machine learning

3. IO500
   - Storage system performance
   - Important for data-intensive applications

Trends

1. From FLOPS to multi-dimensional metrics
   - Performance isn't just compute speed
   - Memory, communication, energy efficiency all matter

2. Application-oriented
   - Real applications are more meaningful than synthetic benchmarks
   - Mini-apps (small proxies for full applications) are increasingly used

3. Energy efficiency first
   - Green500 is growing in importance
   - Power is becoming a primary design constraint

The End of Dr. Zhang's Story

Six months later, Dr. Zhang's research group completed their analysis. The "slower" new supercomputer wasn't slower after all—it just required different optimization.

The old system had high memory bandwidth per core, which matched their original code. The new system had more compute per core but less bandwidth, requiring them to restructure their algorithms to improve arithmetic intensity.

After optimization, the new system ran their climate simulation in 12 hours instead of 48—a 4× improvement. But they had to earn that improvement through months of algorithm work.

"LINPACK told us nothing about this," Dr. Zhang said at a department meeting. "The new machine had 10× the LINPACK score, but our speedup was 4× after significant effort. Someone using a different code might get 8×. Someone with an even more memory-bound code might get 0.5×."

"So what's the point of LINPACK?" a student asked.

"It's a common yardstick," Dr. Zhang replied. "It tells you something about the system, just not everything. The real lesson is that no single benchmark captures all aspects of performance. The TOP500 ranks supercomputers, but it doesn't rank how well they'll run your code."

Summary

HPC benchmarks provide standard methods for evaluating supercomputer performance, each from a different angle:

Major Benchmarks

HPL/LINPACK: Dense linear algebra, TOP500 foundation
HPCG: Sparse operations, closer to real applications
Graph500: Graph analysis, data-intensive
STREAM: Memory bandwidth

Key Insights

High HPL efficiency doesn't mean high application efficiency
HPCG typically reaches only about 1-5% of theoretical peak
Different benchmarks measure different aspects

Practical Recommendations

Use multiple benchmarks to evaluate systems
Focus on efficiency rather than absolute performance
Consider energy efficiency (Green500)
Combine with application benchmarks for evaluation
