Chapter 5: CPU Benchmarks

Part II: Tools


"Benchmarks are like statistics: you can prove anything with them if you try hard enough." — Unknown

The Dhrystone Revelation

In 1984, Reinhold Weicker published the Dhrystone benchmark, a short synthetic program designed to measure CPU integer performance. It was originally written in Ada; the C version that most people run today came later. Roughly four decades on, it's still widely used.

But Dhrystone has a fundamental problem. Let me start with a story.

A few years ago, I was evaluating two embedded processors. Vendor A claimed 3.0 DMIPS/MHz; Vendor B claimed 2.8 DMIPS/MHz. A looked faster, right?

We bought two development boards and ran Dhrystone:

Chip A: 3.1 DMIPS/MHz (matches spec)
Chip B: 2.9 DMIPS/MHz (matches spec)

Great, specs are accurate. Then we ran our actual application—an image processing pipeline:

Chip A: 45 fps
Chip B: 62 fps

Wait, Chip B is 38% faster? But A has higher DMIPS!

This is Dhrystone's problem.

Why Dhrystone Is Unreliable

Problem 1: Too Small, Fits in Cache

The entire Dhrystone program is only a few KB. On modern processors, its code fits entirely in the L1 instruction cache, and its data set is just as small. This means it measures the best case, not real-world performance.

Dhrystone code size: ~4 KB
L1 I-cache size:     32-64 KB

Result: 100% cache hit rate (unrealistic)

Problem 2: Compilers Can "Cheat"

Dhrystone's source code has computations that can be optimized away. Smart compilers can dramatically boost scores.

// A piece of Dhrystone code
Proc_1(Ptr_Val_Par)
{
    // This function's result might not be used
    // Compiler might optimize the entire function away
}

This is why DMIPS numbers sometimes include compiler versions:

"3.0 DMIPS/MHz (GCC 4.8, -O2)"
"4.2 DMIPS/MHz (Commercial Compiler X, -O3)"

Same chip, different compilers, 40% score difference. Are we measuring CPU or compiler?
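
The 40% figure is simply the relative difference between the two quoted scores. A minimal Python sketch of that arithmetic (the function name `relative_gain` is my own, purely for illustration):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the fractional gain of `improved` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Scores quoted above for the same chip with two different compilers.
gcc_score = 3.0         # DMIPS/MHz, GCC 4.8 -O2
commercial_score = 4.2  # DMIPS/MHz, Commercial Compiler X -O3

gain = relative_gain(gcc_score, commercial_score)
print(f"Compiler-induced difference: {gain:.0%}")
```

The same helper applies to any pair of scores, which makes it easy to check a vendor's "X% faster" claim against the raw numbers.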

Problem 3: Doesn't Represent Real Workloads

Dhrystone was designed in 1984, based on "typical" instruction distributions of that era. Modern programs are completely different:

  • More memory access
  • More complex control flow
  • Larger working sets
  • More SIMD and floating-point operations

Using Dhrystone to predict modern application performance is like using 1984 traffic data to predict today's congestion.

CoreMark: The Modern Alternative

EEMBC (Embedded Microprocessor Benchmark Consortium) released CoreMark in 2009 as a Dhrystone replacement.

CoreMark's Improvements

1. Prevents Compiler Cheating

CoreMark results are validated. If the compiler optimizes away computations, validation fails.

// CoreMark uses CRC to validate results
crc = crc_calc(result);
if (crc != EXPECTED_CRC) {
    // Compiler cheated, result invalid
}
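
The principle is easy to try outside of CoreMark. The sketch below is not CoreMark's code: it uses a toy workload and Python's `zlib.crc32` rather than CoreMark's own CRC routine, and all names are hypothetical. It only shows how folding every result into a checksum makes skipped or altered work detectable.

```python
import zlib


def toy_workload(n: int) -> list[int]:
    """Stand-in for a benchmark kernel: returns n computed byte values."""
    return [(i * i + 7) % 251 for i in range(n)]


def checksum(values: list[int]) -> int:
    """Fold every result into a CRC so that no result can be silently dropped."""
    return zlib.crc32(bytes(values))


def run_validated(n: int, expected_crc: int) -> bool:
    """Run the workload and report whether its output matches the reference."""
    return checksum(toy_workload(n)) == expected_crc


# The reference CRC is recorded once from a trusted run of the same workload.
reference_crc = checksum(toy_workload(1000))
print("valid run accepted:", run_validated(1000, reference_crc))

# Simulate a run in which one result was computed incorrectly.
tampered = toy_workload(1000)
tampered[0] ^= 1
print("tampered run accepted:", checksum(tampered) == reference_crc)
```

Because the final check depends on every value, an optimizer (or a careless port) cannot drop part of the computation without the mismatch showing up.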

2. Larger Code Footprint

CoreMark is about 16-32 KB—larger than Dhrystone, but may still fit in L1 cache.

3. More Modern Workload Mix

Includes list processing, matrix operations, state machines—closer to modern applications.

CoreMark's Limitations

CoreMark is better than Dhrystone, but still has limits:

  1. Still synthetic — not a real application
  2. Still small — mainly measures cache-hot performance
  3. Single score — can't distinguish different workload types

SPEC CPU: The Industry Gold Standard

For serious CPU performance evaluation, SPEC CPU is the industry standard.

What Is SPEC CPU

SPEC (Standard Performance Evaluation Corporation) maintains several benchmark suites. SPEC CPU includes:

  • SPECint: Integer operations (compilers, compression, database engines, etc.)
  • SPECfp: Floating-point operations (scientific computing, simulation, etc.)

Each suite contains ten or more programs derived from real applications, not synthetic code.

SPEC CPU 2006 Composition

SPECint 2006 (Integer)
----------------------
400.perlbench      Perl interpreter
401.bzip2          Compression
403.gcc            C compiler
429.mcf            Combinatorial optimization
445.gobmk          AI: Go game
456.hmmer          Search gene sequence
458.sjeng          AI: Chess
462.libquantum     Quantum computing simulation
464.h264ref        Video compression
471.omnetpp        Network simulation
473.astar          Path-finding
483.xalancbmk      XML processing

SPECfp 2006 (Floating Point)
----------------------------
410.bwaves         Fluid dynamics
416.gamess         Quantum chemistry
433.milc           Physics: QCD
434.zeusmp         Physics: CFD
... (and more)

SPEC 2006 is still widely used in academia because:

  • Many published papers use 2006 as a baseline
  • Rich historical data for comparison
  • Some benchmarks are classic representatives of a workload type (mcf for memory-bound code, gcc for compute-bound code)

SPEC CPU 2017 Composition

SPECint 2017 Rate (Integer)
----------------------------
500.perlbench_r    Perl interpreter
502.gcc_r          C compiler
505.mcf_r          Route planning
520.omnetpp_r      Network simulation
523.xalancbmk_r    XML processing
525.x264_r         Video compression
531.deepsjeng_r    AI game playing
541.leela_r        Monte Carlo Go
548.exchange2_r    AI puzzle solving
557.xz_r           Data compression

SPECfp 2017 Rate (Floating Point)
---------------------------------
503.bwaves_r       Fluid dynamics
507.cactuBSSN_r    Physics
508.namd_r         Molecular dynamics
510.parest_r       Biomedical imaging
511.povray_r       Ray tracing
... (and more)

2017 version improvements:

  • Larger working sets (reflecting modern applications)
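  • Separate rate versions (throughput: many copies run concurrently) and speed versions (time to finish a single copy, with some benchmarks parallelized using OpenMP)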
  • Removed some outdated benchmarks
  • Added AI/ML-related workloads (like leela)

Why SPEC Is More Trustworthy

1. Real Applications

These aren't synthetic code written for benchmarking. They're actually used software.

2. Strict Execution Rules

  • Must run complete workloads (no cherry-picking)
  • Must report complete environment configuration
  • Results submitted to SPEC are reviewed before SPEC publishes them

3. Composite Score from Multiple Workloads

A single workload can be specifically optimized. But optimizing a dozen different applications simultaneously requires genuine architectural improvements.
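
SPEC's composite scores are geometric means of per-benchmark ratios (reference time divided by measured time). The sketch below uses made-up run times, with benchmark names borrowed from the lists above, to show how little a large gain on one benchmark moves the composite. The times and the `composite_score` helper are assumptions for illustration, not SPEC's tooling.

```python
from statistics import geometric_mean


def composite_score(ref_seconds: dict[str, float],
                    run_seconds: dict[str, float]) -> float:
    """Geometric mean of per-benchmark speed ratios (reference / measured)."""
    ratios = [ref_seconds[name] / run_seconds[name] for name in ref_seconds]
    return geometric_mean(ratios)


# Hypothetical times in seconds, for illustration only.
reference = {"gcc": 1000.0, "mcf": 1000.0, "xz": 1000.0, "x264": 1000.0}
baseline = {"gcc": 500.0, "mcf": 500.0, "xz": 500.0, "x264": 500.0}
# Same machine, but one benchmark has been tuned to run 4x faster.
tuned_one = dict(baseline, x264=125.0)

base_score = composite_score(reference, baseline)
tuned_score = composite_score(reference, tuned_one)
print(f"baseline composite:  {base_score:.2f}")
print(f"one benchmark tuned: {tuned_score:.2f}")
print(f"composite gain:      {tuned_score / base_score - 1:.0%}")
```

With four benchmarks, a 4x speedup on one of them raises the composite by only the fourth root of 4. With a dozen benchmarks the effect is smaller still.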

SPEC's Downsides

  1. Expensive — Commercial licensing isn't cheap
  2. Time-consuming — Running the full suite can take days
  3. Complex — Requires expertise to set up and interpret correctly

For embedded systems and everyday comparisons, SPEC may be overkill.

Whetstone: The Floating-Point Veteran

Whetstone is a floating-point benchmark released in 1972—even older than Dhrystone. It measures MWIPS (Millions of Whetstone Instructions Per Second).

Why People Still Use It

  1. Historical data — Decades of data for comparison
  2. Simple — Runs in minutes
  3. Floating-point focus — If you only care about FP performance

Why You Shouldn't Use It

Same problems as Dhrystone: too old, too small, too easy to optimize.

Modern alternatives are LINPACK (used for HPC rankings such as the TOP500) or SPECfp.

How to Use CPU Benchmarks Correctly

Rule 1: Know What You're Measuring

Each benchmark measures different things:

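Benchmark    Primary Measurement             Use Case
──────────────────────────────────────────────────────────────────────────
Dhrystone    Integer ops (small program)     Quick comparison, embedded
CoreMark     Integer ops (more modern)       Embedded, MCU
SPEC CPU     Real application performance    Servers, desktops
Whetstone    Floating-point (old)            Historical comparison
LINPACK      Linear algebra                  HPC
──────────────────────────────────────────────────────────────────────────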

Rule 2: Don't Just Look at a Single Number

Bad:  "Chip A: 5000 CoreMark"

Good: "Chip A: 5000 CoreMark @ 1GHz
       - CPU: ARM Cortex-A72, 48KB L1-I, 32KB L1-D, 1MB L2
       - Compiler: GCC 11.2 -O3 -mcpu=cortex-a72
       - CoreMark/MHz: 5.0"

A single number hides too much information. Reports should include hardware specs, compiler version, and flags.
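
One way to make this a habit is to never store a score without its context. Here is a minimal sketch, assuming a hypothetical `BenchmarkReport` record of my own design; the field names are illustrative and the values are taken from the "Good" example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkReport:
    """One benchmark result together with the context needed to interpret it."""
    chip: str
    score: float        # raw CoreMark score
    clock_mhz: float
    cpu: str
    caches: str
    compiler: str
    flags: str

    @property
    def score_per_mhz(self) -> float:
        """Frequency-normalized score."""
        return self.score / self.clock_mhz

    def summary(self) -> str:
        return (
            f"{self.chip}: {self.score:.0f} CoreMark @ {self.clock_mhz:.0f} MHz "
            f"({self.score_per_mhz:.1f} CoreMark/MHz)\n"
            f"  CPU: {self.cpu}, {self.caches}\n"
            f"  Compiler: {self.compiler} {self.flags}"
        )


report = BenchmarkReport(
    chip="Chip A", score=5000, clock_mhz=1000,
    cpu="ARM Cortex-A72", caches="1MB L2",
    compiler="GCC 11.2", flags="-O3 -mcpu=cortex-a72",
)
print(report.summary())
```

Because every field is required, a result cannot be recorded without its compiler and flags.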

Rule 3: Ensure Identical Conditions When Comparing

Bad:  "Chip A (3.0 GHz): 15000 CoreMark
       Chip B (2.5 GHz): 12000 CoreMark
       Conclusion: A is faster"

Good: "Chip A: 5000 CoreMark/GHz
       Chip B: 4800 CoreMark/GHz
       Conclusion: At same frequency, A is 4% faster"

Normalize to per-MHz or per-watt for fair comparison.
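
The normalization in the "Good" example can be reproduced in a few lines. This is a sketch using the numbers quoted above; the helper name `per_ghz` is my own.

```python
def per_ghz(score: float, clock_ghz: float) -> float:
    """Normalize a raw score to a 1 GHz clock."""
    return score / clock_ghz


chips = {
    "Chip A": {"score": 15000, "clock_ghz": 3.0},
    "Chip B": {"score": 12000, "clock_ghz": 2.5},
}

normalized = {name: per_ghz(c["score"], c["clock_ghz"]) for name, c in chips.items()}
raw_advantage = chips["Chip A"]["score"] / chips["Chip B"]["score"] - 1
clock_advantage = normalized["Chip A"] / normalized["Chip B"] - 1

for name, value in normalized.items():
    print(f"{name}: {value:.0f} CoreMark/GHz")
print(f"Raw score advantage of A: {raw_advantage:.0%}")
print(f"Per-clock advantage of A: {clock_advantage:.1%}")
```

The raw scores suggest a 25% gap, but most of that is clock speed. Per clock, the two chips are within a few percent of each other.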

Rule 4: Cross-Validate with Multiple Benchmarks

Back to my opening story—Chip A had higher DMIPS, but Chip B was faster in practice.

If we had run more benchmarks:

Chip A:
  Dhrystone: 3.1 DMIPS/MHz
  CoreMark:  3.2 CM/MHz
  Memory BW: 1.5 GB/s

Chip B:
  Dhrystone: 2.9 DMIPS/MHz
  CoreMark:  3.0 CM/MHz
  Memory BW: 3.2 GB/s      ← Big difference here!

Chip B's memory bandwidth was more than 2× that of A (3.2 vs. 1.5 GB/s). Our image processing pipeline was memory-bound, so B was faster.

A single benchmark is never enough.
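
To see why bandwidth can outweigh a small compute advantage, consider a crude bottleneck model in which the time per frame is the larger of the compute time and the memory transfer time. Only the bandwidth figures below come from our measurements. The per-frame costs and the compute rates are assumptions I chose for illustration, so the model will not reproduce the measured 45 and 62 fps; it only shows the direction of the effect.

```python
def frame_time(ops: float, bytes_moved: float,
               ops_per_sec: float, bytes_per_sec: float) -> tuple[float, str]:
    """Return (seconds per frame, limiting resource) for one chip."""
    compute_s = ops / ops_per_sec
    memory_s = bytes_moved / bytes_per_sec
    if memory_s >= compute_s:
        return memory_s, "memory"
    return compute_s, "compute"


# Hypothetical per-frame cost of the pipeline (assumed, not measured).
OPS_PER_FRAME = 20e6
BYTES_PER_FRAME = 60e6

# Bandwidth values are from the measurements above; the compute rates are
# assumed values that merely preserve Chip A's small per-clock advantage.
chips = {
    "Chip A": {"ops_per_sec": 3.1e9, "bytes_per_sec": 1.5e9},
    "Chip B": {"ops_per_sec": 2.9e9, "bytes_per_sec": 3.2e9},
}

results = {
    name: frame_time(OPS_PER_FRAME, BYTES_PER_FRAME,
                     c["ops_per_sec"], c["bytes_per_sec"])
    for name, c in chips.items()
}
for name, (seconds, bound) in results.items():
    print(f"{name}: {1 / seconds:.1f} fps ({bound}-bound)")
```

Under these assumptions both chips are limited by memory, so Chip A's extra compute throughput never gets a chance to matter.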

Rule 5: Be Careful with Cross-Architecture Comparisons

Dhrystone/CoreMark scores across different CPU architectures can't be directly compared:

Typical DMIPS/MHz and CoreMark/MHz Reference Values (vary with compiler and optimization)
──────────────────────────────────────────────────────────────────────────
Architecture             DMIPS/MHz    CoreMark/MHz
──────────────────────────────────────────────────────────────────────────
ARM Cortex-M0+           0.95         2.4
ARM Cortex-M3            1.25         3.3
ARM Cortex-M4            1.25         3.4
ARM Cortex-M7            2.14         5.0
ARM Cortex-A53           2.3          5.5
ARM Cortex-A72           4.7          8.0
──────────────────────────────────────────────────────────────────────────
RISC-V RV32IMC           1.2-1.8      2.5-3.5
SiFive E31 (RV32IMAC)    1.61         3.1
SiFive E76 (RV32IMAFC)   2.36         4.5
SiFive U74 (RV64GC)      2.5          5.0
──────────────────────────────────────────────────────────────────────────
x86 Skylake              ~5.0         ~8.0
x86 Zen 3                ~5.5         ~9.0
──────────────────────────────────────────────────────────────────────────

Note: These numbers are highly dependent on compiler version, optimization level, and ISA extensions. The same RISC-V core can vary 30% across different compilers.
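
For reference, a DMIPS figure is the raw Dhrystones-per-second result divided by 1757, the score of the VAX 11/780 that serves as the conventional "1 MIPS" machine. Dividing again by the clock in MHz gives DMIPS/MHz. A minimal sketch of the conversion follows; the measurement in it is hypothetical.

```python
VAX_11_780_DHRYSTONES_PER_SEC = 1757  # conventional "1 MIPS" reference machine


def dmips(dhrystones_per_sec: float) -> float:
    """Convert a raw Dhrystone result to DMIPS."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC


def dmips_per_mhz(dhrystones_per_sec: float, clock_mhz: float) -> float:
    """Normalize DMIPS by clock frequency."""
    return dmips(dhrystones_per_sec) / clock_mhz


# Hypothetical measurement: 544,670 Dhrystones/s on a core clocked at 100 MHz.
measured = 544_670
print(f"{dmips(measured):.0f} DMIPS, {dmips_per_mhz(measured, 100):.2f} DMIPS/MHz")
```

The division by clock frequency is what makes the figure a per-cycle efficiency measure, and it is also why it says nothing about how high a given core can actually be clocked.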

Practical Tips for Running Benchmarks

Setting Up the Environment

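# 1. Lock the CPU frequency (pin min and max to the same value)
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -d 2.0GHz -u 2.0GHz

# 2. Disable turbo/boost (the sysfs path depends on the cpufreq driver)
#    acpi-cpufreq:
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
#    intel_pstate:
# echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# 3. Pin to one core and raise priority in a single invocation
sudo nice -n -20 taskset -c 0 ./coremark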

Record Complete Environment

## Benchmark Environment

- **CPU**: Intel Core i7-10700 @ 2.9 GHz (locked)
- **Memory**: 32GB DDR4-3200
- **OS**: Ubuntu 22.04, kernel 5.15.0
- **Compiler**: GCC 11.2.0
- **Flags**: -O3 -march=native
- **CoreMark Version**: 1.01
- **Iterations**: 300000 (runtime ~12 seconds; CoreMark requires at least 10 seconds for a valid run)
- **Date**: 2024-01-15
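
Part of this record can be captured automatically so it is never forgotten. The sketch below uses Python's standard `platform` module. The `capture_environment` helper is hypothetical, and the compiler and flags are passed in by hand because nothing here detects them.

```python
import json
import platform
from datetime import date


def capture_environment(compiler: str, flags: str) -> dict[str, str]:
    """Collect the facts a reader needs to reproduce a benchmark run."""
    return {
        "machine": platform.machine(),
        "processor": platform.processor(),
        "os": platform.system(),
        "kernel": platform.release(),
        "python": platform.python_version(),
        "compiler": compiler,  # supplied by the caller, not detected
        "flags": flags,        # supplied by the caller, not detected
        "date": date.today().isoformat(),
    }


env = capture_environment(compiler="GCC 11.2.0", flags="-O3 -march=native")
print(json.dumps(env, indent=2))
```

Saving this JSON next to every result file costs nothing and answers most "what was this run on?" questions later. Details such as locked frequency and memory configuration still have to be noted by hand.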

Run Multiple Times, Report Statistics

CoreMark Results (10 runs):
  Mean:   24567.3 iterations/sec
  StdDev: 123.4 (0.5%)
  Min:    24312
  Max:    24789
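
A report like this takes only a few lines to produce with Python's `statistics` module. The run scores below are hypothetical, chosen only to illustrate the calculation, and `summarize` is my own helper name.

```python
import statistics


def summarize(samples: list[float]) -> dict[str, float]:
    """Summary statistics for repeated benchmark runs."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    return {
        "mean": mean,
        "stdev": stdev,
        "cv_percent": 100 * stdev / mean,  # spread relative to the mean
        "min": min(samples),
        "max": max(samples),
    }


# Hypothetical scores (iterations/sec) from five runs, for illustration only.
runs = [24500.0, 24600.0, 24550.0, 24450.0, 24650.0]
stats = summarize(runs)
print(f"Mean:   {stats['mean']:.1f} iterations/sec")
print(f"StdDev: {stats['stdev']:.1f} ({stats['cv_percent']:.1f}%)")
print(f"Min:    {stats['min']:.0f}")
print(f"Max:    {stats['max']:.0f}")
```

The relative spread is the number to watch: if it climbs well above a percent or so on a locked-frequency machine, something in the environment is probably interfering with the runs.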

Back to That Image Processing Project

When we discovered Chip A had higher Dhrystone scores but worse real performance, I learned an important lesson:

Benchmarks are tools, not answers.

We ultimately chose Chip B because our application was memory-bound. If our application had been compute-bound, we might have chosen Chip A.

The correct approach is:

  1. First understand your workload characteristics (CPU-bound? Memory-bound? I/O-bound?)
  2. Choose appropriate benchmarks to evaluate
  3. Cross-validate with multiple benchmarks
  4. Finally, test on your actual application

No benchmark can replace testing on your actual application.

Summary

CPU benchmarks are tools for evaluating processor performance, but each has limitations:

Dhrystone

  • Pros: Fast, universal, lots of historical data
  • Cons: Too small, can be compiler-optimized, doesn't represent modern workloads
  • Use for: Quick embedded system comparisons

CoreMark

  • Pros: More modern than Dhrystone, anti-cheat design
  • Cons: Still synthetic, still small
  • Use for: Embedded systems, MCU evaluation

SPEC CPU

  • Pros: Real applications, strict rules, industry standard
  • Cons: Expensive, time-consuming, complex
  • Use for: Formal server/desktop system evaluation

Correct Usage

  • Know what each benchmark measures
  • Don't just look at a single number
  • Ensure identical comparison conditions
  • Cross-validate with multiple benchmarks
  • Ultimately test on your actual application