Chapter 5: CPU Benchmarks

Part II: Tools

"Benchmarks are like statistics: you can prove anything with them if you try hard enough." — Unknown

The Dhrystone Revelation

In 1984, Reinhold Weicker published the Dhrystone benchmark, a short program designed to measure CPU integer performance. It was originally written in Ada, but today it is almost always run in its C version. Four decades later, it's still widely used.

But Dhrystone has a fundamental problem. Let me start with a story.

A few years ago, I was evaluating two embedded processors. Vendor A claimed 3.0 DMIPS/MHz; Vendor B claimed 2.8 DMIPS/MHz. A looked faster, right?

We bought two development boards and ran Dhrystone:

Chip A: 3.1 DMIPS/MHz (matches spec)
Chip B: 2.9 DMIPS/MHz (matches spec)

Great, specs are accurate. Then we ran our actual application—an image processing pipeline:

Chip A: 45 fps
Chip B: 62 fps

Wait, Chip B is 38% faster? But A has higher DMIPS!

This is Dhrystone's problem.

Why Dhrystone Is Unreliable

Problem 1: Too Small, Fits in Cache

The entire Dhrystone program is only a few KB. On modern processors, it fits entirely in the L1 instruction cache, and its data fits in the L1 data cache. After the first pass it therefore measures a best case with no cache misses, not real-world performance.

Dhrystone code size: ~4 KB
L1 I-cache size:     32-64 KB

Result: 100% cache hit rate (unrealistic)
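One way to see why a cache-resident benchmark is a best case is to sweep the working-set size and watch the cost per access change. The following is a minimal sketch, not part of Dhrystone: the function name `ns_per_access` and the 64-byte stride (an assumed cache-line size) are illustrative choices. Because the walk is sequential, hardware prefetchers will hide part of the miss cost, so treat the numbers as a rough indication only.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Walks a buffer of `bytes` bytes `passes` times, touching one byte per
 * 64-byte stride (assumed cache-line size), and returns the average CPU
 * time per access in nanoseconds. Returns -1.0 on bad arguments or if
 * the allocation fails. */
double ns_per_access(size_t bytes, int passes)
{
    const size_t stride = 64;
    if (bytes < stride || passes <= 0)
        return -1.0;

    volatile unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return -1.0;

    /* Touch every byte first so page faults are not part of the timing. */
    for (size_t i = 0; i < bytes; i++)
        buf[i] = (unsigned char)i;

    unsigned long sum = 0;
    size_t accesses = 0;
    clock_t start = clock();
    for (int p = 0; p < passes; p++) {
        for (size_t i = 0; i < bytes; i += stride) {
            sum += buf[i];   /* volatile read: the compiler must perform it */
            accesses++;
        }
    }
    clock_t end = clock();

    (void)sum;
    free((void *)buf);
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)accesses;
}
```

Calling this for sizes from 4 KB up to a few hundred MB, with `passes` scaled so that each run lasts long enough for `clock()` to resolve, should show the per-access cost rising once the buffer no longer fits in each cache level. A 4 KB program like Dhrystone never leaves the cheapest region of that curve.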

Problem 2: Compilers Can "Cheat"

Dhrystone's source code contains computations whose results are never used afterwards. An optimizing compiler can legally remove that work, which boosts the score dramatically without the CPU being any faster.

// Simplified excerpt in the style of the Dhrystone source
void Proc_1(Rec_Pointer Ptr_Val_Par)
{
    // Work done here may produce results that the caller never reads,
    // so an optimizing compiler may inline the call and delete the work.
}

This is why DMIPS numbers sometimes include compiler versions:

"3.0 DMIPS/MHz (GCC 4.8, -O2)"
"4.2 DMIPS/MHz (Commercial Compiler X, -O3)"

Same chip, different compilers, 40% score difference. Are we measuring CPU or compiler?
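The effect is easy to reproduce outside Dhrystone. The sketch below is an illustration with hypothetical names (`kernel`, `run_unchecked`, `run_checked`, `sink`), none of which come from the Dhrystone source. It contrasts a timing loop whose result is discarded with one whose result is stored to a `volatile` variable.

```c
#include <stdint.h>

/* Hypothetical stand-in for a benchmark kernel: sum of i*i for i in [0, n). */
static uint32_t kernel(uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += i * i;
    return acc;
}

/* Result discarded: with optimization enabled, the compiler is free to
 * delete the calls, so timing this loop may measure almost nothing. */
void run_unchecked(uint32_t iterations, uint32_t n)
{
    for (uint32_t it = 0; it < iterations; it++)
        (void)kernel(n);
}

/* The total is written to a volatile object, so the store must happen
 * and the value has to be produced somehow. */
volatile uint32_t sink;

uint32_t run_checked(uint32_t iterations, uint32_t n)
{
    uint32_t total = 0;
    for (uint32_t it = 0; it < iterations; it++)
        total += kernel(n + it);
    sink = total;
    return total;
}
```

Keeping the result live does not stop the compiler from simplifying the computation itself, for example by replacing the loop with a closed-form expression or folding it when `n` is a compile-time constant. Inputs that are only known at run time are needed as well, which is one reason the same chip can report such different scores under different compilers.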

Problem 3: Doesn't Represent Real Workloads

Dhrystone was designed in 1984, based on "typical" instruction distributions of that era. Modern programs are completely different:

More memory access
More complex control flow
Larger working sets
More SIMD and floating-point operations

Using Dhrystone to predict modern application performance is like using 1984 traffic data to predict today's congestion.

CoreMark: The Modern Alternative

EEMBC (Embedded Microprocessor Benchmark Consortium) released CoreMark in 2009 as a Dhrystone replacement.

CoreMark's Improvements

1. Prevents Compiler Cheating

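CoreMark validates its results with a CRC. Because every computed value feeds the checksum, the compiler cannot discard the work, and a run whose checksum does not match the expected value is reported as invalid.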

// Simplified illustration of CRC validation (not the actual CoreMark source)
crc = crc_calc(result);
if (crc != EXPECTED_CRC) {
    // Work was skipped or miscompiled: the result is invalid
}
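To make the idea concrete, here is a small self-contained sketch of checksum validation. It is not CoreMark's code: it uses the common CRC-16/CCITT-FALSE variant (polynomial 0x1021, initial value 0xFFFF), and the names `crc16_ccitt`, `toy_workload`, `GOLDEN_CRC` and `run_and_validate` are hypothetical. The toy workload emits the ASCII digits 1 to 9 so that the golden value can be the published check value for that CRC variant, 0x29B1.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Toy workload (hypothetical): writes the ASCII digits '1'..'9'. */
void toy_workload(uint8_t out[9])
{
    for (int i = 0; i < 9; i++)
        out[i] = (uint8_t)('1' + i);
}

/* Check value of CRC-16/CCITT-FALSE for the string "123456789". */
#define GOLDEN_CRC 0x29B1u

/* Returns 1 if the workload output matches the golden checksum, else 0. */
int run_and_validate(void)
{
    uint8_t out[9];
    toy_workload(out);
    return crc16_ccitt(out, sizeof out) == GOLDEN_CRC;
}
```

Because the checksum depends on every output byte, a build that skips or alters part of the workload would produce a different CRC and fail the comparison. A real benchmark would record the golden value from a trusted reference run rather than from a textbook constant.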

2. Larger Code Footprint

CoreMark is about 16-32 KB—larger than Dhrystone, but may still fit in L1 cache.

3. More Modern Workload Mix

Includes list processing, matrix operations, state machines—closer to modern applications.

CoreMark's Limitations

CoreMark is better than Dhrystone, but still has limits:

Still synthetic — not a real application
Still small — mainly measures cache-hot performance
Single score — can't distinguish different workload types

SPEC CPU: The Industry Gold Standard

For serious CPU performance evaluation, SPEC CPU is the industry standard.

What Is SPEC CPU

SPEC (Standard Performance Evaluation Corporation) maintains several benchmark suites. SPEC CPU includes:

SPECint: Integer operations (compilers, compression, database engines, etc.)
SPECfp: Floating-point operations (scientific computing, simulation, etc.)

Each suite contains a dozen or more real applications, not synthetic code.

SPEC CPU 2006 Composition

SPECint 2006 (Integer)
----------------------
400.perlbench      Perl interpreter
401.bzip2          Compression
403.gcc            C compiler
429.mcf            Combinatorial optimization
445.gobmk          AI: Go game
456.hmmer          Search gene sequence
458.sjeng          AI: Chess
462.libquantum     Quantum computing simulation
464.h264ref        Video compression
471.omnetpp        Network simulation
473.astar          Path-finding
483.xalancbmk      XML processing

SPECfp 2006 (Floating Point)
----------------------------
410.bwaves         Fluid dynamics
416.gamess         Quantum chemistry
433.milc           Physics: QCD
434.zeusmp         Physics: CFD
... (and more)

SPEC 2006 is still widely used in academia because:

Many published papers use 2006 as baseline
Rich historical data for comparison
Some benchmarks (like mcf, gcc) are classic memory-bound and compute-bound representatives

SPEC CPU 2017 Composition

SPECint 2017 Rate (Integer)
----------------------------
500.perlbench_r    Perl interpreter
502.gcc_r          C compiler
505.mcf_r          Route planning
520.omnetpp_r      Network simulation
523.xalancbmk_r    XML processing
525.x264_r         Video compression
531.deepsjeng_r    AI game playing
541.leela_r        Monte Carlo Go
548.exchange2_r    AI puzzle solving
557.xz_r           Data compression

SPECfp 2017 Rate (Floating Point)
---------------------------------
503.bwaves_r       Fluid dynamics
507.cactuBSSN_r    Physics
508.namd_r         Molecular dynamics
510.parest_r       Biomedical imaging
511.povray_r       Ray tracing
... (and more)

2017 version improvements:

Larger working sets (reflecting modern applications)
Separate rate versions (several concurrent copies, measuring throughput) and speed versions (one copy, optionally multi-threaded via OpenMP)
Removed some outdated benchmarks
Added newer AI workloads (like leela)

Why SPEC Is More Trustworthy

1. Real Applications

These aren't synthetic code written for benchmarking. They're actually used software.

2. Strict Execution Rules

Must run complete workloads (no cherry-picking)
Must report complete environment configuration
Results published on SPEC's website are reviewed by SPEC first

3. Composite Score from Multiple Workloads

A single workload can be specifically optimized. But optimizing a dozen different applications simultaneously requires genuine architectural improvements.

SPEC's Downsides

Expensive — Commercial licensing isn't cheap
Time-consuming — Running the full suite can take days
Complex — Requires expertise to set up and interpret correctly

For embedded systems and everyday comparisons, SPEC may be overkill.

Whetstone: The Floating-Point Veteran

Whetstone is a floating-point benchmark released in 1972—even older than Dhrystone. It measures MWIPS (Millions of Whetstone Instructions Per Second).

Why People Still Use It

Historical data — Decades of data for comparison
Simple — Runs in minutes
Floating-point focus — If you only care about FP performance

Why You Shouldn't Use It

Same problems as Dhrystone: too old, too small, too easy to optimize.

Modern alternatives are LINPACK (for HPC rankings) or SPECfp.

How to Use CPU Benchmarks Correctly

Rule 1: Know What You're Measuring

Each benchmark measures different things:

Benchmark    Primary Measurement             Use Case
Dhrystone    Integer ops (small program)     Quick comparison, embedded
CoreMark     Integer ops (more modern)       Embedded, MCU
SPEC CPU     Real application performance    Servers, desktops
Whetstone    Floating-point (old)            Historical comparison
LINPACK      Linear algebra                  HPC

Rule 2: Don't Just Look at a Single Number

Bad:  "Chip A: 5000 CoreMark"

Good: "Chip A: 5000 CoreMark @ 1GHz
       - CPU: ARM Cortex-A72, 48KB L1-I, 32KB L1-D, 1MB L2
       - Compiler: GCC 11.2 -O3 -mcpu=cortex-a72
       - CoreMark/MHz: 5.0"

A single number hides too much information. Reports should include hardware specs, compiler version, and flags.

Rule 3: Ensure Identical Conditions When Comparing

Bad:  "Chip A (3.0 GHz): 15000 CoreMark
       Chip B (2.5 GHz): 12000 CoreMark
       Conclusion: A is faster"

Good: "Chip A: 5000 CoreMark/GHz
       Chip B: 4800 CoreMark/GHz
       Conclusion: At same frequency, A is 4% faster"

Normalize by clock frequency (per MHz or per GHz) or by power (per watt) so that the comparison reflects the design rather than the clock setting.
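The arithmetic behind the comparison above is simple enough to sketch. The two helper names below, `score_per_ghz` and `percent_faster`, are illustrative and not taken from any benchmark tool.

```c
/* Benchmark score divided by clock frequency in GHz. */
double score_per_ghz(double score, double freq_ghz)
{
    return score / freq_ghz;
}

/* How much faster `a` is than `b`, in percent (positive when a > b). */
double percent_faster(double a, double b)
{
    return (a / b - 1.0) * 100.0;
}
```

With the figures from the example, `score_per_ghz(15000.0, 3.0)` gives 5000 and `score_per_ghz(12000.0, 2.5)` gives 4800, and `percent_faster(5000.0, 4800.0)` comes to roughly 4.2, which the text rounds to 4%. The same helper applied to the frame rates in the opening story, `percent_faster(62.0, 45.0)`, gives about 37.8.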

Rule 4: Cross-Validate with Multiple Benchmarks

Back to my opening story—Chip A had higher DMIPS, but Chip B was faster in practice.

If we had run more benchmarks:

Chip A:
  Dhrystone: 3.1 DMIPS/MHz
  CoreMark:  3.2 CM/MHz
  Memory BW: 1.5 GB/s

Chip B:
  Dhrystone: 2.9 DMIPS/MHz
  CoreMark:  3.0 CM/MHz
  Memory BW: 3.2 GB/s      ← Big difference here!

Chip B's memory bandwidth was more than 2× that of A. Our image processing pipeline was memory-bound, so B was faster.

A single benchmark is never enough.
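A memory bandwidth figure like the one in this comparison can be approximated with a simple copy test. The sketch below is a rough stand-in for a proper tool such as STREAM, and the name `copy_bandwidth_gbs` is hypothetical. It assumes the buffers are much larger than the last-level cache, otherwise it measures cache bandwidth instead.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copies `bytes` bytes from one buffer to another `reps` times and returns
 * the achieved rate in GB/s, counting bytes read plus bytes written.
 * Returns -1.0 on bad arguments, allocation failure, or a run too short
 * for the clock to resolve. */
double copy_bandwidth_gbs(size_t bytes, int reps)
{
    if (bytes == 0 || reps <= 0)
        return -1.0;

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 1, bytes);   /* touch both buffers before timing */
    memset(dst, 0, bytes);

    /* Calling through a volatile function pointer keeps the compiler
     * from eliding or merging the copies. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;

    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        copy(dst, src, bytes);
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    free(src);
    free(dst);
    if (seconds <= 0.0)
        return -1.0;
    return 2.0 * (double)bytes * (double)reps / seconds / 1e9;
}
```

Running this with, say, 256 MB buffers on both boards would give a first indication of whether memory bandwidth differs enough to matter. For a memory-bound pipeline, that number is likely to predict frame rate better than DMIPS does.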

Rule 5: Be Careful with Cross-Architecture Comparisons

Dhrystone/CoreMark scores across different CPU architectures can't be directly compared:

Typical DMIPS/MHz and CoreMark/MHz Reference Values (varies with compiler and optimization)
──────────────────────────────────────────────────────────────────────────
Architecture            DMIPS/MHz    CoreMark/MHz
──────────────────────────────────────────────────────────────────────────
ARM Cortex-M0+          0.95         2.4
ARM Cortex-M3           1.25         3.3
ARM Cortex-M4           1.25         3.4
ARM Cortex-M7           2.14         5.0
ARM Cortex-A53          2.3          5.5
ARM Cortex-A72          4.7          8.0
──────────────────────────────────────────────────────────────────────────
RISC-V RV32IMC          1.2-1.8      2.5-3.5
SiFive E31 (RV32IMAC)   1.61         3.1
SiFive E76 (RV32IMAFC)  2.36         4.5
SiFive U74 (RV64GC)     2.5          5.0
──────────────────────────────────────────────────────────────────────────
x86 Skylake             ~5.0         ~8.0
x86 Zen 3               ~5.5         ~9.0
──────────────────────────────────────────────────────────────────────────

Note: These numbers are highly dependent on compiler version, optimization level, and ISA extensions. The same RISC-V core can vary 30% across different compilers.

Practical Tips for Running Benchmarks

Setting Up the Environment

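# 1. Lock CPU frequency (set min = max under the performance governor;
#    the -f option only works with the userspace governor)
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -d 2.0GHz -u 2.0GHz

# 2. Disable turbo (the path depends on the cpufreq driver)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost          # acpi-cpufreq and similar
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo  # intel_pstate

# 3. Pin to one CPU and raise priority in the same invocation
sudo nice -n -20 taskset -c 0 ./coremark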

Record Complete Environment

## Benchmark Environment

- **CPU**: Intel Core i7-10700 @ 2.9 GHz (locked)
- **Memory**: 32GB DDR4-3200
- **OS**: Ubuntu 22.04, kernel 5.15.0
- **Compiler**: GCC 11.2.0
- **Flags**: -O3 -march=native
- **CoreMark Version**: 1.01
- **Iterations**: 30000 (runtime ~10 seconds)
- **Date**: 2024-01-15

Run Multiple Times, Report Statistics

CoreMark Results (10 runs):
  Mean:   24567.3 iterations/sec
  StdDev: 123.4 (0.5%)
  Min:    24312
  Max:    24789
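A summary like this takes only a few lines of code to produce. The sketch below uses hypothetical names (`run_stats`, `summarize`, `sqrt_newton`) and computes the sample standard deviation with an n - 1 divisor. The Newton square root is there only so the sketch links without the math library; in real code `sqrt()` from `<math.h>` would be the natural choice.

```c
#include <stddef.h>

struct run_stats {
    double mean;
    double stddev;   /* sample standard deviation (n - 1 divisor) */
    double min;
    double max;
};

/* Newton's method square root; stands in for sqrt() from <math.h>. */
static double sqrt_newton(double x)
{
    if (x <= 0.0)
        return 0.0;
    double guess = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 60; i++)
        guess = 0.5 * (guess + x / guess);
    return guess;
}

/* Summarizes `n` benchmark scores; requires n >= 2. */
struct run_stats summarize(const double *scores, size_t n)
{
    struct run_stats s = { 0.0, 0.0, scores[0], scores[0] };
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += scores[i];
        if (scores[i] < s.min) s.min = scores[i];
        if (scores[i] > s.max) s.max = scores[i];
    }
    s.mean = sum / (double)n;

    double squares = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = scores[i] - s.mean;
        squares += d * d;
    }
    s.stddev = sqrt_newton(squares / (double)(n - 1));
    return s;
}
```

Dividing `stddev` by `mean` gives the relative spread shown in parentheses in the results above. If that figure is more than a percent or two, the environment is probably not as locked down as intended.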

Back to That Image Processing Project

When we discovered Chip A had higher Dhrystone scores but worse real performance, I learned an important lesson:

Benchmarks are tools, not answers.

We ultimately chose Chip B because our application was memory-bound. If our application had been compute-bound, we might have chosen Chip A.

The correct approach is:

First understand your workload characteristics (CPU-bound? Memory-bound? I/O-bound?)
Choose appropriate benchmarks to evaluate
Cross-validate with multiple benchmarks
Finally, test on your actual application

No benchmark can replace testing on your actual application.

Summary

CPU benchmarks are tools for evaluating processor performance, but each has limitations:

Dhrystone

Pros: Fast, universal, lots of historical data
Cons: Too small, can be compiler-optimized, doesn't represent modern workloads
Use for: Quick embedded system comparisons

CoreMark

Pros: More modern than Dhrystone, anti-cheat design
Cons: Still synthetic, still small
Use for: Embedded systems, MCU evaluation

SPEC CPU

Pros: Real applications, strict rules, industry standard
Cons: Expensive, time-consuming, complex
Use for: Formal server/desktop system evaluation

Correct Usage

Know what each benchmark measures
Don't just look at a single number
Ensure identical comparison conditions
Cross-validate with multiple benchmarks
Ultimately test on your actual application

Performance and Benchmarking