Chapter 5: CPU Benchmarks
Part II: Tools
"Benchmarks are like statistics: you can prove anything with them if you try hard enough." — Unknown
The Dhrystone Revelation
In 1984, Reinhold Weicker published the Dhrystone benchmark. Originally written in Ada and later ported to C, it's a short program designed to measure CPU integer performance. More than thirty years later, the C version is still widely used.
But Dhrystone has a fundamental problem. Let me start with a story.
A few years ago, I was evaluating two embedded processors. Vendor A claimed 3.0 DMIPS/MHz; Vendor B claimed 2.8 DMIPS/MHz. A looked faster, right?
We bought two development boards and ran Dhrystone:
Chip A: 3.1 DMIPS/MHz (matches spec)
Chip B: 2.9 DMIPS/MHz (matches spec)
Great, specs are accurate. Then we ran our actual application—an image processing pipeline:
Chip A: 45 fps
Chip B: 62 fps
Wait, Chip B is 38% faster? But A has higher DMIPS!
This is Dhrystone's problem.
Why Dhrystone Is Unreliable
Problem 1: Too Small, Fits in Cache
The entire Dhrystone program is only a few KB. On modern processors, it fits entirely in L1 instruction cache. This means it measures "best case," not real-world performance.
Dhrystone code size: ~4 KB
L1 I-cache size: 32-64 KB
Result: nearly 100% I-cache hit rate after warm-up (unrealistic)
Problem 2: Compilers Can "Cheat"
Dhrystone's source code has computations that can be optimized away. Smart compilers can dramatically boost scores.
// A piece of Dhrystone code
Proc_1(Ptr_Val_Par)
{
// This function's result might not be used
// Compiler might optimize the entire function away
}
This is why DMIPS numbers sometimes include compiler versions:
"3.0 DMIPS/MHz (GCC 4.8, -O2)"
"4.2 DMIPS/MHz (Commercial Compiler X, -O3)"
Same chip, different compilers, 40% score difference. Are we measuring CPU or compiler?
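The effect is easy to reproduce outside Dhrystone. The following is a minimal sketch, not Dhrystone source: the names `work_discarded`, `work_observed`, and `sink` are invented for illustration. It contrasts a loop whose result is thrown away with one whose result is stored to a `volatile` variable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example, not taken from Dhrystone. */

/* The result is never used, so an optimizing compiler is free to
 * delete the whole loop. Timing this function may measure nothing. */
static void work_discarded(uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += i * i;
    (void)acc;
}

/* A store to a volatile object is an observable side effect, so the
 * final value of acc has to be produced somehow. */
static volatile uint32_t sink;

static uint32_t work_observed(uint32_t n)
{
    /* A zero-iteration timing loop is almost certainly a harness bug. */
    assert(n > 0);
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += i * i;   /* unsigned overflow wraps, which is well defined */
    sink = acc;
    return acc;
}
```

Even the second version is not bulletproof: if `n` is a compile-time constant, the compiler may still precompute the answer and store a constant. Benchmarks that want to rule this out usually make the inputs opaque as well, for example by reading them through `volatile` variables or from the command line.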
Problem 3: Doesn't Represent Real Workloads
Dhrystone was designed in 1984, based on "typical" instruction distributions of that era. Modern programs are completely different:
- More memory access
- More complex control flow
- Larger working sets
- More SIMD and floating-point operations
Using Dhrystone to predict modern application performance is like using 1984 traffic data to predict today's congestion.
CoreMark: The Modern Alternative
EEMBC (Embedded Microprocessor Benchmark Consortium) released CoreMark in 2009 as a Dhrystone replacement.
CoreMark's Improvements
1. Prevents Compiler Cheating
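CoreMark results are validated. The workload's outputs are folded into a CRC that is checked at the end of the run, and the inputs are not known at compile time, so the compiler cannot precompute or discard the work. If the computation is skipped or altered, validation fails.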
// CoreMark uses CRC to validate results
crc = crc_calc(result);
if (crc != EXPECTED_CRC) {
// Compiler cheated, result invalid
}
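To make the idea concrete, here is a small sketch of CRC-based result checking. It uses a generic bitwise CRC-16/ARC routine chosen for illustration; it is not copied from the CoreMark sources, and the names `crc16_arc` and `results_valid` are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/ARC: reflected polynomial 0xA001, initial value 0.
 * A generic CRC used for illustration, not CoreMark's own routine. */
static uint16_t crc16_arc(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (uint16_t)((crc >> 1) ^ 0xA001u);
            else
                crc = (uint16_t)(crc >> 1);
        }
    }
    return crc;
}

/* Returns 1 when the bytes produced by the workload hash to the value
 * recorded from a known-good run, and 0 otherwise. */
static int results_valid(const uint8_t *results, size_t len,
                         uint16_t expected_crc)
{
    assert(results != NULL && len > 0);
    return crc16_arc(results, len) == expected_crc;
}
```

Because every output byte feeds the checksum, dropping or reordering any part of the computation changes the CRC, and the run is rejected instead of producing an inflated score.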
2. Larger Code Footprint
CoreMark is about 16-32 KB—larger than Dhrystone, but may still fit in L1 cache.
3. More Modern Workload Mix
Includes list processing, matrix operations, state machines—closer to modern applications.
CoreMark's Limitations
CoreMark is better than Dhrystone, but still has limits:
- Still synthetic — not a real application
- Still small — mainly measures cache-hot performance
- Single score — can't distinguish different workload types
SPEC CPU: The Industry Gold Standard
For serious CPU performance evaluation, SPEC CPU is the industry standard.
What Is SPEC CPU
SPEC (Standard Performance Evaluation Corporation) maintains several benchmark suites. SPEC CPU includes:
- SPECint: Integer operations (compilers, compression, database engines, etc.)
- SPECfp: Floating-point operations (scientific computing, simulation, etc.)
Each suite contains a dozen or more real applications, not synthetic code.
SPEC CPU 2006 Composition
SPECint 2006 (Integer)
----------------------
400.perlbench Perl interpreter
401.bzip2 Compression
403.gcc C compiler
429.mcf Combinatorial optimization
445.gobmk AI: Go game
456.hmmer Search gene sequence
458.sjeng AI: Chess
462.libquantum Quantum computing simulation
464.h264ref Video compression
471.omnetpp Network simulation
473.astar Path-finding
483.xalancbmk XML processing
SPECfp 2006 (Floating Point)
----------------------------
410.bwaves Fluid dynamics
416.gamess Quantum chemistry
433.milc Physics: QCD
434.zeusmp Physics: CFD
... (and more)
SPEC 2006 is still widely used in academia because:
- Many published papers use 2006 as baseline
- Rich historical data for comparison
- Some benchmarks (like mcf and gcc) are classic memory-bound and compute-bound representatives
SPEC CPU 2017 Composition
SPECint 2017 Rate (Integer)
----------------------------
500.perlbench_r Perl interpreter
502.gcc_r C compiler
505.mcf_r Route planning
520.omnetpp_r Network simulation
523.xalancbmk_r XML processing
525.x264_r Video compression
531.deepsjeng_r AI game playing
541.leela_r Monte Carlo Go
548.exchange2_r AI puzzle solving
557.xz_r Data compression
SPECfp 2017 Rate (Floating Point)
---------------------------------
503.bwaves_r Fluid dynamics
507.cactuBSSN_r Physics
508.namd_r Molecular dynamics
510.parest_r Biomedical imaging
511.povray_r Ray tracing
... (and more)
2017 version improvements:
- Larger working sets (reflecting modern applications)
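- Better multi-core coverage: rate versions run many concurrent copies to measure throughput, while speed versions time a single copy that may use OpenMP threads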
- Removed some outdated benchmarks
- Added AI-related workloads (like leela)
Why SPEC Is More Trustworthy
1. Real Applications
These aren't synthetic code written for benchmarking. They're actually used software.
2. Strict Execution Rules
- Must run complete workloads (no cherry-picking)
- Must report complete environment configuration
- Results submitted to SPEC are reviewed before they are published on its website
3. Composite Score from Multiple Workloads
A single workload can be specifically optimized. But optimizing a dozen different applications simultaneously requires genuine architectural improvements.
SPEC's Downsides
- Expensive — Commercial licensing isn't cheap
- Time-consuming — Running the full suite can take days
- Complex — Requires expertise to set up and interpret correctly
For embedded systems and everyday comparisons, SPEC may be overkill.
Whetstone: The Floating-Point Veteran
Whetstone is a floating-point benchmark released in 1972—even older than Dhrystone. It measures MWIPS (Millions of Whetstone Instructions Per Second).
Why People Still Use It
- Historical data — Decades of data for comparison
- Simple — Runs in minutes
- Floating-point focus — If you only care about FP performance
Why You Shouldn't Use It
Same problems as Dhrystone: too old, too small, too easy to optimize.
Modern alternatives are LINPACK (for HPC rankings) or SPECfp.
How to Use CPU Benchmarks Correctly
Rule 1: Know What You're Measuring
Each benchmark measures different things:
| Benchmark | Primary Measurement | Use Case |
|---|---|---|
| Dhrystone | Integer ops (small program) | Quick comparison, embedded |
| CoreMark | Integer ops (more modern) | Embedded, MCU |
| SPEC CPU | Real application performance | Servers, desktops |
| Whetstone | Floating-point (old) | Historical comparison |
| LINPACK | Linear algebra | HPC |
Rule 2: Don't Just Look at a Single Number
Bad: "Chip A: 5000 CoreMark"
Good: "Chip A: 5000 CoreMark @ 1GHz
- CPU: ARM Cortex-A72, 48KB L1-I, 32KB L1-D, 1MB L2
- Compiler: GCC 11.2 -O3 -mcpu=cortex-a72
- CoreMark/MHz: 5.0"
A single number hides too much information. Reports should include hardware specs, compiler version, and flags.
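One way to keep such a report honest is to let the build fill in what it can. The sketch below assumes a GCC- or Clang-style toolchain, which predefines `__VERSION__`. The `bench_report` struct, the `format_report` function, and the `COMPILER_VERSION` macro are hypothetical names for this example, and the flags string still has to be supplied by your build system.

```c
#include <assert.h>
#include <stdio.h>

/* __VERSION__ is predefined by GCC and Clang; other compilers may not
 * provide it, hence the fallback. */
#ifdef __VERSION__
#define COMPILER_VERSION __VERSION__
#else
#define COMPILER_VERSION "unknown compiler"
#endif

/* Hypothetical record; the field names are illustrative. */
struct bench_report {
    const char *chip;    /* e.g. "Chip A" */
    const char *flags;   /* compiler flags used for this build */
    double score;        /* CoreMark iterations per second */
    double clock_mhz;    /* locked core clock in MHz */
};

/* Writes a multi-line report into buf and returns snprintf's result:
 * the number of characters the full report needs, excluding the
 * terminating null byte. */
static int format_report(char *buf, size_t len,
                         const struct bench_report *r)
{
    assert(buf != NULL && r != NULL && r->clock_mhz > 0.0);
    return snprintf(buf, len,
                    "%s: %.0f CoreMark @ %.0f MHz\n"
                    "- Compiler: %s\n"
                    "- Flags: %s\n"
                    "- CoreMark/MHz: %.2f\n",
                    r->chip, r->score, r->clock_mhz,
                    COMPILER_VERSION, r->flags,
                    r->score / r->clock_mhz);
}
```

Cache sizes and other hardware details cannot be discovered portably from C, so they still need to be recorded by hand or read from the platform (for example from `lscpu` output on Linux).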
Rule 3: Ensure Identical Conditions When Comparing
Bad: "Chip A (3.0 GHz): 15000 CoreMark
Chip B (2.5 GHz): 12000 CoreMark
Conclusion: A is faster"
Good: "Chip A: 5000 CoreMark/GHz
Chip B: 4800 CoreMark/GHz
Conclusion: At same frequency, A is 4% faster"
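Normalize to a common basis (per MHz, per GHz, or per watt) so that you compare architectural efficiency rather than clock speed. If both chips will actually ship at those clocks, report the absolute score as well, because that is what your application will see.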
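The arithmetic behind this rule is small enough to sketch. The helper names `score_per_mhz` and `percent_faster` are made up for this example.

```c
#include <assert.h>

/* Score per MHz: takes clock frequency out of the comparison. */
static double score_per_mhz(double score, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return score / clock_mhz;
}

/* Percentage by which a exceeds b (negative when a is lower). */
static double percent_faster(double a, double b)
{
    assert(b > 0.0);
    return (a / b - 1.0) * 100.0;
}
```

With the numbers above, `score_per_mhz(15000.0, 3000.0)` gives 5.0 and `score_per_mhz(12000.0, 2500.0)` gives 4.8, and `percent_faster` on those two values comes out at roughly 4.2%, which is the "4% faster" in the good example. The same helper applied to the frame rates from the opening story, `percent_faster(62.0, 45.0)`, yields about 37.8%.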
Rule 4: Cross-Validate with Multiple Benchmarks
Back to my opening story—Chip A had higher DMIPS, but Chip B was faster in practice.
If we had run more benchmarks:
Chip A:
Dhrystone: 3.1 DMIPS/MHz
CoreMark: 3.2 CM/MHz
Memory BW: 1.5 GB/s
Chip B:
Dhrystone: 2.9 DMIPS/MHz
CoreMark: 3.0 CM/MHz
Memory BW: 3.2 GB/s ← Big difference here!
Chip B's memory bandwidth was more than 2× that of A (3.2 GB/s versus 1.5 GB/s). Our image processing pipeline was memory-bound, so B was faster.
A single benchmark is never enough.
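If you don't have a memory benchmark at hand, even a crude copy test can reveal a gap like this one. The following is a rough sketch, not a substitute for an established tool such as STREAM. The function name `copy_bandwidth_gbs` is hypothetical, it times `memcpy` with the standard `clock()` function, and it counts both the bytes read and the bytes written.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Best observed copy bandwidth in GB/s over `repeats` runs, or -1.0 if
 * allocation fails. Traffic is counted as bytes read plus bytes written. */
static double copy_bandwidth_gbs(size_t bytes, int repeats)
{
    assert(bytes > 0 && repeats > 0);
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, bytes);   /* touch every page before timing */
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;   /* keeps the copy observable */
    double best = 0.0;
    for (int r = 0; r < repeats; r++) {
        clock_t t0 = clock();
        memcpy(dst, src, bytes);
        clock_t t1 = clock();
        sink ^= dst[bytes - 1];
        double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (seconds > 0.0) {
            double gbs = 2.0 * (double)bytes / seconds / 1e9;
            if (gbs > best)
                best = gbs;
        }
    }
    (void)sink;
    free(src);
    free(dst);
    return best;
}
```

Choose a buffer size several times larger than the last-level cache, otherwise the test measures cache bandwidth instead of DRAM bandwidth. A call such as `copy_bandwidth_gbs((size_t)256 << 20, 5)` is a reasonable starting point on a desktop-class machine; on a small embedded board the size has to fit in available RAM.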
Rule 5: Be Careful with Cross-Architecture Comparisons
Dhrystone/CoreMark scores across different CPU architectures can't be directly compared:
Typical DMIPS/MHz and CoreMark/MHz reference values (varies with compiler and optimization)
| Architecture | DMIPS/MHz | CoreMark/MHz |
|---|---|---|
| ARM Cortex-M0+ | 0.95 | 2.4 |
| ARM Cortex-M3 | 1.25 | 3.3 |
| ARM Cortex-M4 | 1.25 | 3.4 |
| ARM Cortex-M7 | 2.14 | 5.0 |
| ARM Cortex-A53 | 2.3 | 5.5 |
| ARM Cortex-A72 | 4.7 | 8.0 |
| RISC-V RV32IMC | 1.2-1.8 | 2.5-3.5 |
| SiFive E31 (RV32IMAC) | 1.61 | 3.1 |
| SiFive E76 (RV32IMAFC) | 2.36 | 4.5 |
| SiFive U74 (RV64GC) | 2.5 | 5.0 |
| x86 Skylake | ~5.0 | ~8.0 |
| x86 Zen 3 | ~5.5 | ~9.0 |
Note: These numbers are highly dependent on compiler version, optimization level, and ISA extensions. The same RISC-V core can vary 30% across different compilers.
Practical Tips for Running Benchmarks
Setting Up the Environment
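# 1. Lock CPU frequency: set the governor, then pin min and max to the same value
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -d 2.0GHz -u 2.0GHz
# 2. Disable turbo (this path applies to drivers such as acpi-cpufreq;
#    with intel_pstate, write 1 to /sys/devices/system/cpu/intel_pstate/no_turbo)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# 3. Pin to one CPU and raise priority in a single invocation
sudo nice -n -20 taskset -c 0 ./coremark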
Record Complete Environment
## Benchmark Environment
- **CPU**: Intel Core i7-10700 @ 2.9 GHz (locked)
- **Memory**: 32GB DDR4-3200
- **OS**: Ubuntu 22.04, kernel 5.15.0
- **Compiler**: GCC 11.2.0
- **Flags**: -O3 -march=native
- **CoreMark Version**: 1.01
- **Iterations**: 300000 (runtime ~12 seconds)
- **Date**: 2024-01-15
Run Multiple Times, Report Statistics
CoreMark Results (10 runs):
Mean: 24567.3 iterations/sec
StdDev: 123.4 (0.5%)
Min: 24312
Max: 24789
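A summary like the one above takes only a few lines to compute. This is a minimal sketch: the `run_stats` struct and the `summarize` and `sqrt_newton` functions are names invented here. It reports the sample standard deviation (n - 1 denominator) and uses a small Newton's-method square root so that it builds without libm, which some bare-metal toolchains lack. On a hosted system, `sqrt` from `<math.h>` is the more natural choice.

```c
#include <assert.h>
#include <stddef.h>

struct run_stats {
    double mean;
    double stddev;   /* sample standard deviation (n - 1 denominator) */
    double min;
    double max;
};

/* Newton's method square root, so the sketch needs no libm.
 * Returns 0 for non-positive input. The fixed iteration count is
 * adequate for typical benchmark variances, not for extreme values. */
static double sqrt_newton(double v)
{
    if (v <= 0.0)
        return 0.0;
    double x = v > 1.0 ? v : 1.0;   /* start at or above the root */
    for (int i = 0; i < 64; i++)
        x = 0.5 * (x + v / x);
    return x;
}

/* Summarizes n benchmark scores; at least two runs are required for
 * the sample standard deviation to be defined. */
static struct run_stats summarize(const double *samples, size_t n)
{
    assert(samples != NULL && n >= 2);
    struct run_stats s = { 0.0, 0.0, samples[0], samples[0] };
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
    }
    s.mean = sum / (double)n;

    double sq = 0.0;   /* sum of squared deviations from the mean */
    for (size_t i = 0; i < n; i++) {
        double d = samples[i] - s.mean;
        sq += d * d;
    }
    s.stddev = sqrt_newton(sq / (double)(n - 1));
    return s;
}
```

The percentage shown next to StdDev in the report is the relative standard deviation, `100.0 * s.stddev / s.mean`. If it climbs above a percent or two, it is usually worth rechecking frequency locking and background load before trusting the mean.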
Back to That Image Processing Project
When we discovered Chip A had higher Dhrystone scores but worse real performance, I learned an important lesson:
Benchmarks are tools, not answers.
We ultimately chose Chip B because our application was memory-bound. If our application had been compute-bound, we might have chosen Chip A.
The correct approach is:
- First understand your workload characteristics (CPU-bound? Memory-bound? I/O-bound?)
- Choose appropriate benchmarks to evaluate
- Cross-validate with multiple benchmarks
- Finally, test on your actual application
No benchmark can replace testing on your actual application.
Summary
CPU benchmarks are tools for evaluating processor performance, but each has limitations:
Dhrystone
- Pros: Fast, universal, lots of historical data
- Cons: Too small, can be compiler-optimized, doesn't represent modern workloads
- Use for: Quick embedded system comparisons
CoreMark
- Pros: More modern than Dhrystone, anti-cheat design
- Cons: Still synthetic, still small
- Use for: Embedded systems, MCU evaluation
SPEC CPU
- Pros: Real applications, strict rules, industry standard
- Cons: Expensive, time-consuming, complex
- Use for: Formal server/desktop system evaluation
Correct Usage
- Know what each benchmark measures
- Don't just look at a single number
- Ensure identical comparison conditions
- Cross-validate with multiple benchmarks
- Ultimately test on your actual application