Chapter 20: Benchmark Case Studies
Part V: Case Studies
“There are three kinds of lies: lies, damned lies, and benchmarks.” — Adapted from Mark Twain
It was 2:00 AM when Sarah Chen, lead architect at a processor startup, received the email that would change her company’s trajectory. A competitor had published a detailed technical analysis dismantling their flagship product’s performance claims. The headline was brutal: “Marketing Hype vs. Reality: How Vendor X Inflated Benchmark Scores by 300%.”
The problem wasn’t that their processor was slow—it was actually quite good. The problem was the benchmark they’d chosen to showcase it: Dhrystone. Their competitor had shown, line by line, how modern compilers could optimize away most of Dhrystone’s work, making the scores meaningless. Worse, they demonstrated that on real workloads—the kind customers actually run—the performance advantage evaporated.
Sarah spent the next week doing what she should have done months earlier: understanding what benchmarks actually measure. This chapter is the result of that investigation, examining two industry-standard benchmarks—Dhrystone and Coremark—to understand not just how to run them, but what they reveal about processor performance and, more importantly, what they hide.
20.1 Why Benchmarks Matter (and Why They Fail)
The Purpose of Benchmarks
In an ideal world, we’d measure processor performance by running every customer’s actual workload. In reality, we need standardized tests that:
- Represent real work: Reflect actual application behavior
- Are reproducible: Give consistent results across runs
- Are portable: Run on different architectures
- Are understandable: Clearly show what’s being measured
The challenge is that these goals often conflict. Make a benchmark too simple, and it doesn’t represent real work. Make it too complex, and it’s not reproducible or understandable.
How Benchmarks Fail
Benchmarks fail in predictable ways:
Compiler optimization: The compiler recognizes the benchmark pattern and optimizes it away. You’re measuring the compiler’s cleverness, not the processor’s performance.
Narrow workload: The benchmark tests only one aspect of performance (e.g., integer arithmetic) while real applications use a mix of operations.
Unrealistic data: The benchmark uses small, cache-friendly datasets while real applications work with large, scattered data.
Gaming the benchmark: Vendors optimize specifically for the benchmark, not for real workloads.
Let’s see how these failures manifest in practice.
20.2 Dhrystone: A Historical Lesson
Origins and Intent
Dhrystone was created in 1984 by Reinhold Weicker as a synthetic benchmark to measure integer performance. The name is a play on “Whetstone,” an earlier floating-point benchmark.
Design goals:
- Measure typical integer operations
- Be small enough to fit in cache
- Be simple to port
- Avoid floating-point (many embedded processors lacked FPUs)
Workload composition (from the original paper):
- 53% assignments
- 32% control flow (if/else, loops)
- 15% procedure calls
- String operations
- Record (struct) copying
What Dhrystone Actually Does
Let’s look at the core of Dhrystone (simplified):
typedef struct record {
    struct record *ptr_comp;
    int discr;
    int enum_comp;
    int int_comp;
    char str_comp[31];
} Rec_Type, *Rec_Pointer;
void Proc_1(Rec_Pointer ptr_val_par) {
    Rec_Pointer next_record = ptr_val_par->ptr_comp;
    // Structure assignment
    *ptr_val_par->ptr_comp = *ptr_val_par;
    ptr_val_par->int_comp = 5;
    next_record->int_comp = ptr_val_par->int_comp;
    next_record->ptr_comp = ptr_val_par->ptr_comp;
    // Procedure call
    Proc_3(&next_record->ptr_comp);
    // Conditional
    if (next_record->discr == 0) {
        next_record->int_comp = 6;
        Proc_6(ptr_val_par->enum_comp, &next_record->enum_comp);
        next_record->ptr_comp = ptr_val_par->ptr_comp;
        Proc_7(next_record->int_comp, 10, &next_record->int_comp);
    } else {
        *ptr_val_par = *ptr_val_par->ptr_comp;
    }
}
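Character comparison in a loop (both arguments to Func_1 are compile-time constants):
void Proc_2(int *int_par_ref) {
    int int_loc;
    char char_loc = '\0'; // initialized so the loop test never reads an indeterminate value
    int_loc = *int_par_ref + 10;
    do {
        if (Func_1('A', 'C') == 0) {
            char_loc = 'A';
            int_loc++;
        }
    } while (char_loc != 'A');
    *int_par_ref = int_loc;
}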
The Fatal Flaws
Problem 1: Dead Code Elimination
Modern compilers can prove that much of Dhrystone’s work has no observable effect:
// Compiler sees:
int x = 5;
x = x + 10;
x = x * 2;
// Result never used
// Compiler generates:
// (nothing - entire computation eliminated)
Problem 2: Constant Propagation
// Source code:
if (Func_1('A', 'C') == 0) {
// ...
}
// Compiler knows 'A' and 'C' are constants
// Evaluates Func_1 at compile time
// Replaces entire if statement with constant branch
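To make the constant-propagation problem concrete, the sketch below contrasts a call with literal arguments against one whose arguments are read through volatile objects. The names `func_1_like`, `branch_with_constants`, `branch_with_runtime_inputs`, `input_a`, and `input_c` are hypothetical and only modeled on the excerpt above; this is not Dhrystone source. An optimizing compiler is free to fold the first caller down to a constant, while the volatile reads in the second have to be performed at run time.

```c
/* Hypothetical stand-in for Func_1: returns 0 when the two
   characters differ and 1 when they match. */
int func_1_like(char a, char b) {
    return (a != b) ? 0 : 1;
}

/* Both arguments are literals, so an optimizing compiler may evaluate
   the call at compile time and keep only the taken branch. */
int branch_with_constants(void) {
    if (func_1_like('A', 'C') == 0) {
        return 1;
    }
    return 0;
}

/* Reads of volatile objects must actually be performed, so the
   comparison below cannot be folded into a constant. */
volatile char input_a = 'A';
volatile char input_c = 'C';

int branch_with_runtime_inputs(void) {
    char a = input_a;
    char c = input_c;
    if (func_1_like(a, c) == 0) {
        return 1;
    }
    return 0;
}
```

Both functions return the same value; comparing the generated assembly at -O2 is one way to check whether the first one still contains a call or a compare.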
Problem 3: Unrealistic Data Access
Dhrystone’s data fits entirely in L1 cache (a few KB). Real applications have cache misses. Dhrystone measures best-case performance, not typical performance.
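Problem 4: Predictable Pointer Access
Dhrystone uses pointers, but it follows the same few links in the same order on every iteration. The targets stay resident in cache and the pattern is trivially predictable, so caches and hardware prefetchers hide the load latency that real pointer-chasing code has to pay.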
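For contrast, here is a minimal sketch of a traversal that is hard to prefetch: the nodes are linked in a pseudo-random order, so each load address depends on the previous load. The type `node` and the functions `link_shuffled` and `chase` are hypothetical names for illustration, and the small linear congruential generator is just an assumed source of shuffle indices.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    struct node *next;
    uint32_t value;
} node;

/* Link nodes[0..n-1] into a single cycle whose visiting order is a
   pseudo-random permutation. order[] is caller-provided scratch space. */
void link_shuffled(node *nodes, size_t *order, size_t n, uint32_t seed) {
    if (n == 0) {
        return;
    }
    for (size_t i = 0; i < n; i++) {
        order[i] = i;
        nodes[i].value = (uint32_t)i;
    }
    /* Fisher-Yates shuffle driven by a small LCG (unsigned, so it wraps). */
    for (size_t i = n - 1; i > 0; i--) {
        seed = seed * 1664525u + 1013904223u;
        size_t j = (size_t)(seed >> 8) % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    for (size_t i = 0; i < n; i++) {
        nodes[order[i]].next = &nodes[order[(i + 1) % n]];
    }
}

/* Follow the chain for a fixed number of steps. Each address comes
   from the previous load, so the loads cannot overlap. */
uint64_t chase(const node *start, size_t steps) {
    uint64_t sum = 0;
    const node *p = start;
    for (size_t s = 0; s < steps; s++) {
        sum += p->value;
        p = p->next;
    }
    return sum;
}
```

With a node array much larger than the last-level cache, timing `chase` against a sequentially linked version of the same array gives a rough picture of the latency that Dhrystone never exercises.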
The Compiler Optimization Disaster
Here’s what happens with -O3 optimization:
$ gcc -O0 dhrystone.c -o dhry_O0
$ gcc -O3 dhrystone.c -o dhry_O3
$ ./dhry_O0
Dhrystones per second: 500,000
$ ./dhry_O3
Dhrystones per second: 5,000,000
10x speedup from compiler flags alone! You’re not measuring the processor—you’re measuring the compiler’s ability to recognize and eliminate Dhrystone’s patterns.
Different compilers give wildly different results:
- GCC 10.2: 4.2 DMIPS/MHz
- Clang 12: 5.1 DMIPS/MHz
- ICC 21: 5.8 DMIPS/MHz
Same processor, different scores. The benchmark is broken.
What We Learn from Dhrystone
Dhrystone teaches us what not to do:
- ❌ Don’t use predictable, constant inputs
- ❌ Don’t allow dead code elimination
- ❌ Don’t use unrealistically small datasets
- ❌ Don’t focus on a single operation type
But it also teaches us what benchmarks should do—which brings us to Coremark.
20.3 Coremark: A Modern Approach
Design Philosophy
Coremark was created in 2009 by EEMBC (Embedded Microprocessor Benchmark Consortium) specifically to address Dhrystone’s flaws.
Design goals:
- Resist compiler optimization
- Represent diverse real-world operations
- Be portable across architectures
- Have clear, enforceable run rules
The Four Workloads
Coremark consists of four distinct workloads, each testing different aspects of processor performance:
Workload 1: Linked List Operations
typedef struct list_data_s {
    int16_t data16;
    int16_t idx;
} list_data;
typedef struct list_head_s {
    struct list_head_s *next;
    struct list_data_s *info;
} list_head;
// Find element in list
list_head *core_list_find(list_head *list, list_data *info) {
    if (info->idx >= 0) {
        while (list && (list->info->idx != info->idx))
            list = list->next;
        return list;
    } else {
        while (list && ((list->info->data16 & 0xff) != info->data16))
            list = list->next;
        return list;
    }
}
// Reverse list
list_head *core_list_reverse(list_head *list) {
    list_head *next = NULL, *tmp;
    while (list) {
        tmp = list->next;
        list->next = next;
        next = list;
        list = tmp;
    }
    return next;
}
What it tests:
- Pointer chasing (serially dependent loads that expose cache-miss latency)
- Unpredictable branches
- List traversal patterns (Chapter 5)
Why it resists optimization:
- List contents determined at runtime
- Search criteria vary
- Results are used (CRC’d at end)
Workload 2: Matrix Operations
typedef int16_t mat_elem;
typedef mat_elem *matrix_row;
// Matrix multiply (simplified)
void core_bench_matrix(mat_params *A, int16_t seed) {
    uint32_t N = A->N;
    matrix_row *C = A->C;
    matrix_row *A_mat = A->A;
    matrix_row *B = A->B;
    // C = A * B
    for (uint32_t i = 0; i < N; i++) {
        for (uint32_t j = 0; j < N; j++) {
            mat_elem temp = 0;
            for (uint32_t k = 0; k < N; k++) {
                temp += A_mat[i][k] * B[k][j];
            }
            C[i][j] = temp;
        }
    }
}
What it tests:
- Arithmetic intensity
- Cache blocking opportunities
- Memory access patterns (Chapter 4)
Why it resists optimization:
- Matrix size determined at runtime
- Results verified with checksum
- Multiple operations prevent constant folding
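The points above (runtime-determined contents, checksummed results) can be sketched in a self-contained form. This is an illustration under stated assumptions, not Coremark's code: the flat row-major layout, the 32-bit accumulator, the generator constants, and the names `matrix_fill`, `matrix_multiply`, and `matrix_checksum` are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill an n x n row-major matrix from a seed. The LCG constants are an
   assumption for this sketch; values are kept small (0..127). */
void matrix_fill(int16_t *m, size_t n, uint32_t seed) {
    for (size_t i = 0; i < n * n; i++) {
        seed = seed * 1664525u + 1013904223u;
        m[i] = (int16_t)((seed >> 16) & 0x7F);
    }
}

/* c = a * b, accumulating each dot product in 32 bits so that
   intermediate sums are not truncated to 16 bits. */
void matrix_multiply(const int16_t *a, const int16_t *b, int32_t *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;
            for (size_t k = 0; k < n; k++) {
                acc += (int32_t)a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* Fold every output element into one value. Because the result depends
   on all of c, the compiler cannot drop any part of the multiply. */
uint32_t matrix_checksum(const int32_t *c, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n * n; i++) {
        sum = sum * 31u + (uint32_t)c[i];
    }
    return sum;
}
```

A harness would fill both inputs from a seed read at run time, multiply, and compare the checksum against a value recorded from a trusted run with the same seed.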
Workload 3: State Machine
enum CORE_STATE {
    CORE_START = 0,
    CORE_INVALID,
    CORE_S1,
    CORE_S2,
    CORE_INT,
    CORE_FLOAT,
    CORE_EXPONENT,
    CORE_SCIENTIFIC,
    NUM_CORE_STATES
};
// State machine for parsing numbers (isdigit comes from <ctype.h>)
enum CORE_STATE core_state_transition(uint8_t **instr, uint32_t *transition_count) {
    uint8_t *str = *instr;
    uint8_t ch;
    enum CORE_STATE state = CORE_START;
    for (; *str && state != CORE_INVALID; str++) {
        ch = *str;
        (*transition_count)++;
        switch (state) {
        case CORE_START:
            if (isdigit(ch)) {
                state = CORE_INT;
            } else if (ch == '+' || ch == '-') {
                state = CORE_S1;
            } else if (ch == '.') {
                state = CORE_FLOAT;
            } else {
                state = CORE_INVALID;
            }
            break;
        case CORE_S1:
            if (isdigit(ch)) {
                state = CORE_INT;
            } else if (ch == '.') {
                state = CORE_FLOAT;
            } else {
                state = CORE_INVALID;
            }
            break;
        case CORE_INT:
            if (ch == '.') {
                state = CORE_FLOAT;
            } else if (!isdigit(ch)) {
                state = CORE_INVALID;
            }
            break;
        // ... more states
        }
    }
    *instr = str;
    return state;
}
What it tests:
- Branch prediction
- Switch statement performance
- String processing (Chapter 14)
Why it resists optimization:
- Input strings vary
- State transitions unpredictable
- Transition count prevents elimination
Workload 4: CRC Calculation
uint16_t crcu16(uint16_t newval, uint16_t crc) {
    uint8_t i;
    for (i = 0; i < 16; i++) {
        if ((crc & 0x8000) != 0) {
            crc = (crc << 1) ^ 0x1021;
        } else {
            crc = crc << 1;
        }
        if ((newval & 0x8000) != 0) {
            crc ^= 0x1021;
        }
        newval = newval << 1;
    }
    return crc;
}
// CRC all results
uint16_t core_bench_crc(void *memblock, uint32_t size) {
    uint16_t crc = 0;
    uint8_t *data = (uint8_t *)memblock;
    for (uint32_t i = 0; i < size; i++) {
        crc = crcu16(data[i], crc);
    }
    return crc;
}
What it tests:
- Bit manipulation
- Loop optimization
- Data-dependent operations (Chapter 13)
Why it resists optimization:
- CRC depends on all previous data
- Cannot be parallelized easily
- Result must match known value
Preventing Compiler Optimization
Coremark uses several techniques to prevent dead code elimination:
1. Runtime-determined inputs:
// Not this (compiler can optimize):
int data[100] = {1, 2, 3, ...};
// But this (runtime-determined):
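void init_data(int *data, uint32_t seed) {
    for (int i = 0; i < 100; i++) {
        data[i] = (int)((seed * (uint32_t)i) & 0xFF);
        // Unsigned arithmetic wraps on overflow; signed int overflow here would be undefined behavior
        seed = (seed * 1103515245u + 12345u) & 0x7FFFFFFFu;
    }
}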
2. Result verification:
// All results are CRC'd
uint16_t final_crc = 0;
final_crc = crcu16(list_result, final_crc);
final_crc = crcu16(matrix_result, final_crc);
final_crc = crcu16(state_result, final_crc);
// Must match known value
if (final_crc != EXPECTED_CRC) {
    printf("ERROR: Invalid results!\n");
    return -1;
}
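3. Volatile storage (Coremark's reference port also declares its seed variables volatile so the inputs cannot be constant-folded):
// Stores to a volatile object must be performed, so the values feeding them stay live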
volatile uint16_t results[4];
results[0] = list_crc;
results[1] = matrix_crc;
results[2] = state_crc;
results[3] = crc_crc;
Run Rules
Coremark has strict run rules to ensure fair comparison:
- Minimum run time: Must run for at least 10 seconds
- No source modifications: Core algorithms cannot be changed
- Validation: Results must match known CRC values
- Reporting: Must report iterations/second and iterations/MHz
- Compiler flags: Must be disclosed
Example valid run:
CoreMark 1.0 : 12500.00 / GCC 10.2.0 -O3 -march=rv64gc / Heap
CoreMark/MHz: 5.00
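The 10-second minimum means the iteration count has to be chosen per machine. One way to do that, sketched here with a hypothetical helper `iterations_for_min_runtime` (an illustration, not part of the Coremark sources), is to time a short trial run and scale the count, with a 10% margin as an arbitrary safety factor:

```c
#include <stdint.h>

/* Estimate how many iterations are needed to run for at least
   min_seconds, given a short timed trial. Returns 0 when the trial is
   unusable (no iterations or no measurable time), in which case the
   caller should repeat the trial with more iterations. */
uint32_t iterations_for_min_runtime(uint32_t trial_iterations,
                                    double trial_seconds,
                                    double min_seconds) {
    if (trial_iterations == 0 || trial_seconds <= 0.0) {
        return 0;
    }
    double seconds_per_iteration = trial_seconds / (double)trial_iterations;
    double needed = min_seconds / seconds_per_iteration;
    /* Add a 10% margin plus one so rounding never lands below the minimum. */
    double with_margin = needed * 1.1 + 1.0;
    if (with_margin >= (double)UINT32_MAX) {
        return UINT32_MAX;
    }
    return (uint32_t)with_margin;
}
```

For example, a trial of 1,000 iterations that took 0.5 seconds needs 20,000 iterations for 10 seconds, and the helper returns roughly 22,000 once the margin is applied.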
20.4 Performance Analysis
Understanding the Scores
Dhrystone reports DMIPS (Dhrystone MIPS):
- DMIPS = (Dhrystones/sec) / 1757
- 1757 is the Dhrystones-per-second score of the VAX 11/780, the reference machine nominally rated at 1 MIPS
- DMIPS/MHz normalizes for clock frequency
Coremark reports iterations/second:
- Higher is better
- CoreMark/MHz normalizes for clock frequency
- Typical range: 2.5-5.5 CoreMark/MHz
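The arithmetic behind both scores is small enough to write down. The helpers below are a sketch with hypothetical names (`dmips`, `score_per_mhz`); the only constant taken from the text is the 1757 reference value.

```c
/* Convert Dhrystones per second to DMIPS using the VAX 11/780
   reference score of 1757 Dhrystones per second. */
double dmips(double dhrystones_per_sec) {
    return dhrystones_per_sec / 1757.0;
}

/* Normalize any score (DMIPS or CoreMark iterations per second)
   by the clock frequency in MHz. */
double score_per_mhz(double score, double clock_mhz) {
    return score / clock_mhz;
}
```

As a worked example, 5,000,000 Dhrystones per second is 5,000,000 / 1757 ≈ 2845.8 DMIPS; on an assumed 1000 MHz clock that is about 2.85 DMIPS/MHz. For the sample run above, 12500.00 iterations per second reported as 5.00 CoreMark/MHz implies a 2500 MHz clock, since 12500 / 2500 = 5.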
What Affects Coremark Scores?
1. Compiler optimization:
# -O0 (no optimization)
CoreMark/MHz: 1.2
# -O2 (standard optimization)
CoreMark/MHz: 4.5
# -O3 (aggressive optimization)
CoreMark/MHz: 5.0
# -O3 -funroll-loops
CoreMark/MHz: 5.2
2. ISA extensions:
# RV64GC (base)
CoreMark/MHz: 4.8
# RV64GC + B extension (bit manipulation)
CoreMark/MHz: 5.1
# RV64GC + V extension (vector) - scalar mode
CoreMark/MHz: 5.0
3. Cache configuration:
16 KB I$ + 16 KB D$: 4.2 CoreMark/MHz
32 KB I$ + 32 KB D$: 4.8 CoreMark/MHz
64 KB I$ + 64 KB D$: 5.0 CoreMark/MHz
4. Memory latency:
SRAM (1 cycle): 5.2 CoreMark/MHz
DRAM (100 cycles): 3.8 CoreMark/MHz
Typical Scores (Public Data)
Based on EEMBC’s published results and academic papers:
Embedded processors (RV32):
- Simple in-order: 2.5-3.0 CoreMark/MHz
- With caches: 3.0-3.5 CoreMark/MHz
Application processors (RV64):
- In-order, single-issue: 3.5-4.0 CoreMark/MHz
- In-order, dual-issue: 4.0-4.5 CoreMark/MHz
- Out-of-order: 4.5-5.5 CoreMark/MHz
For comparison (x86/ARM):
- ARM Cortex-A53: 3.5 CoreMark/MHz
- ARM Cortex-A72: 4.5 CoreMark/MHz
- Intel Atom: 4.0 CoreMark/MHz
- Intel Core i7: 5.0+ CoreMark/MHz
What Coremark Doesn’t Measure
Coremark is better than Dhrystone, but it’s not perfect:
Missing workloads:
- ❌ Floating-point operations
- ❌ Vector/SIMD operations
- ❌ System calls
- ❌ I/O operations
- ❌ Multi-threading
Unrealistic aspects:
- Small dataset (fits in cache)
- No OS overhead
- No interrupts
- Deterministic execution
What it measures well:
- ✅ Integer arithmetic
- ✅ Pointer chasing
- ✅ Branch prediction
- ✅ Compiler effectiveness
- ✅ Cache performance (for small datasets)
20.5 Benchmark Design Principles
Lessons from History
Comparing Dhrystone and Coremark teaches us how to design good benchmarks:
| Principle | Dhrystone | Coremark |
|---|---|---|
| Diverse workloads | ❌ Mostly assignments | ✅ 4 distinct workloads |
| Resist optimization | ❌ Easily optimized | ✅ Multiple techniques |
| Runtime inputs | ❌ Compile-time constants | ✅ Seed-based generation |
| Result verification | ❌ Weak | ✅ CRC validation |
| Run rules | ❌ Informal | ✅ Strict, enforceable |
| Portability | ✅ Good | ✅ Excellent |
| Understandability | ✅ Simple | ⚠️ More complex |
Designing Your Own Benchmark
When you need to create a benchmark for your specific use case:
1. Identify the workload:
// Don't benchmark generic "performance"
// Benchmark specific operations:
// ❌ Too generic
void benchmark_processor(void);
// ✅ Specific workload
void benchmark_packet_processing(void);
void benchmark_image_filtering(void);
void benchmark_crypto_operations(void);
2. Use realistic data:
// ❌ Unrealistic
int data[100] = {1, 2, 3, 4, ...}; // Fits in cache
// ✅ Realistic
#define DATA_SIZE (1024 * 1024) // 1 Mi elements (4 MB with 4-byte int)
int *data = malloc(DATA_SIZE * sizeof(int));
init_random_data(data, DATA_SIZE, seed);
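The snippet above calls `init_random_data` without defining it. One possible implementation, assuming the `(data, count, seed)` parameter order used in the call and choosing a 32-bit xorshift generator purely for illustration, could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical implementation of init_random_data: fills data[] with
   non-negative pseudo-random values from a 32-bit xorshift generator.
   The same seed always produces the same contents, which keeps runs
   reproducible while still hiding the values from the compiler when
   the seed is only known at run time. */
void init_random_data(int *data, size_t count, uint32_t seed) {
    uint32_t state = seed ? seed : 1u; /* xorshift state must not be zero */
    for (size_t i = 0; i < count; i++) {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        data[i] = (int)(state & 0x7FFFFFFFu);
    }
}
```

Reading the seed from a command-line argument or a volatile variable keeps the compiler from precomputing the array.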
3. Prevent optimization:
// ❌ Compiler can eliminate
int sum = 0;
for (int i = 0; i < n; i++) {
    sum += data[i];
}
// sum never used
// ✅ Force computation
volatile int result;
int sum = 0;
for (int i = 0; i < n; i++) {
    sum += data[i];
}
result = sum; // Volatile prevents elimination
4. Validate results:
// ✅ Checksum validation
uint32_t expected_crc = compute_expected_crc(seed);
uint32_t actual_crc = run_benchmark(data, size);
if (actual_crc != expected_crc) {
    fprintf(stderr, "ERROR: Benchmark validation failed!\n");
    fprintf(stderr, "Expected: 0x%08x, Got: 0x%08x\n",
            expected_crc, actual_crc);
    return -1;
}
5. Report methodology:
printf("=== Benchmark Results ===\n");
printf("Workload: Packet processing\n");
printf("Data size: %d packets\n", num_packets);
printf("Iterations: %d\n", iterations);
printf("Compiler: %s %s\n", COMPILER_NAME, COMPILER_VERSION);
printf("Flags: %s\n", COMPILER_FLAGS);
printf("Time: %.2f ms\n", elapsed_ms);
printf("Throughput: %.2f Mpps\n", packets_per_sec / 1e6);
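The report above uses `COMPILER_NAME`, `COMPILER_VERSION`, and `COMPILER_FLAGS` without saying where they come from. A possible way to populate them, assuming GCC or Clang (both predefine `__VERSION__`, and Clang also predefines `__clang_version__`), is sketched below. The flags are not visible to the program itself, so this sketch assumes the build passes them in with something like `-DCOMPILER_FLAGS='"-O2"'`; `format_toolchain` is a hypothetical helper.

```c
#include <stdio.h>

/* Derive compiler name and version from predefined macros. */
#if defined(__clang__)
#define COMPILER_NAME "clang"
#define COMPILER_VERSION __clang_version__
#elif defined(__GNUC__)
#define COMPILER_NAME "gcc"
#define COMPILER_VERSION __VERSION__
#else
#define COMPILER_NAME "unknown"
#define COMPILER_VERSION "unknown"
#endif

/* The build system is assumed to supply the flags as a string macro. */
#ifndef COMPILER_FLAGS
#define COMPILER_FLAGS "(not recorded)"
#endif

/* Write the toolchain lines of the report into buf.
   Returns the snprintf result (characters needed, excluding the NUL). */
int format_toolchain(char *buf, size_t len) {
    return snprintf(buf, len, "Compiler: %s %s\nFlags: %s\n",
                    COMPILER_NAME, COMPILER_VERSION, COMPILER_FLAGS);
}
```

Recording these values in the binary means a published result cannot silently drift from the build that produced it.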
Common Pitfalls
Pitfall 1: Measuring the wrong thing:
// ❌ Measures malloc, not computation
start_timer();
int *data = malloc(size);
compute(data, size);
free(data);
stop_timer();
// ✅ Measures only computation
int *data = malloc(size);
start_timer();
compute(data, size);
stop_timer();
free(data);
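`start_timer` and `stop_timer` are left undefined in these snippets. A minimal sketch for POSIX systems, assuming `clock_gettime` with `CLOCK_MONOTONIC` is available, is shown below; `timer_elapsed_ms` is a hypothetical accessor added for illustration. A monotonic clock is used so that wall-clock adjustments during the run do not distort the measurement.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* request clock_gettime on strict C builds */
#endif
#include <stdint.h>
#include <time.h>

static struct timespec timer_start_ts;
static int64_t timer_elapsed_ns;

/* Record the starting timestamp. */
void start_timer(void) {
    clock_gettime(CLOCK_MONOTONIC, &timer_start_ts);
}

/* Record the elapsed time since the last start_timer() call. */
void stop_timer(void) {
    struct timespec end;
    clock_gettime(CLOCK_MONOTONIC, &end);
    int64_t sec = (int64_t)end.tv_sec - (int64_t)timer_start_ts.tv_sec;
    int64_t nsec = (int64_t)end.tv_nsec - (int64_t)timer_start_ts.tv_nsec;
    timer_elapsed_ns = sec * 1000000000LL + nsec;
}

/* Elapsed time of the last start/stop pair, in milliseconds. */
double timer_elapsed_ms(void) {
    return (double)timer_elapsed_ns / 1e6;
}
```

On a bare-metal target without an operating system, the same interface would typically be backed by a cycle counter instead.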
Pitfall 2: Insufficient warm-up:
// ❌ First run includes cold cache
for (int i = 0; i < 100; i++) {
    start_timer();
    benchmark();
    stop_timer();
}
// ✅ Warm up first
for (int i = 0; i < 10; i++) {
    benchmark(); // Warm-up, don't measure
}
for (int i = 0; i < 100; i++) {
    start_timer();
    benchmark();
    stop_timer();
}
Pitfall 3: Ignoring variance:
// ❌ Single measurement
double time = measure_once();
printf("Time: %.2f ms\n", time);
// ✅ Statistical analysis
double times[100];
for (int i = 0; i < 100; i++) {
    times[i] = measure_once();
}
printf("Mean: %.2f ms\n", mean(times, 100));
printf("Median: %.2f ms\n", median(times, 100));
printf("Std dev: %.2f ms\n", stddev(times, 100));
printf("Min: %.2f ms\n", min(times, 100));
printf("Max: %.2f ms\n", max(times, 100));
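The statistics helpers in that snippet are not defined in the text. A small sketch of three of them follows, under different (hypothetical) names so they are not mistaken for a library API: `sample_mean`, `sample_variance`, and `median_in_place`. The standard deviation is the square root of the variance (`sqrt` from `<math.h>`); it is left out here only to keep the sketch free of a math-library dependency.

```c
#include <stddef.h>
#include <stdlib.h>

/* Arithmetic mean; returns 0.0 for an empty sample. */
double sample_mean(const double *x, size_t n) {
    if (n == 0) {
        return 0.0;
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum / (double)n;
}

/* Unbiased sample variance (divides by n - 1); 0.0 when n < 2. */
double sample_variance(const double *x, size_t n) {
    if (n < 2) {
        return 0.0;
    }
    double m = sample_mean(x, n);
    double sum_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - m;
        sum_sq += d * d;
    }
    return sum_sq / (double)(n - 1);
}

/* Three-way comparison of doubles for qsort. */
static int compare_doubles(const void *a, const void *b) {
    double da = *(const double *)a;
    double db = *(const double *)b;
    return (da > db) - (da < db);
}

/* Median of the sample. Sorts the caller's array in place. */
double median_in_place(double *x, size_t n) {
    if (n == 0) {
        return 0.0;
    }
    qsort(x, n, sizeof x[0], compare_doubles);
    return (n % 2) ? x[n / 2] : 0.5 * (x[n / 2 - 1] + x[n / 2]);
}
```

The median is often the more useful headline number for timings, because a single interrupted run can pull the mean upward while leaving the median nearly unchanged.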
20.6 Summary
Key Takeaways
Dhrystone is obsolete:
- Modern compilers optimize away most of the work
- Scores vary wildly between compilers
- Doesn’t represent real workloads
- Use only for historical comparison
Coremark is better, but not perfect:
- Resists compiler optimization through multiple techniques
- Represents diverse integer workloads
- Has strict, enforceable run rules
- But: small dataset, no FP/SIMD, no OS overhead
Benchmark design principles:
- Use diverse, realistic workloads
- Prevent dead code elimination
- Use runtime-determined inputs
- Validate results
- Report full methodology
- Understand limitations
Benchmarks are tools, not goals:
- A high Coremark score doesn’t guarantee good performance on your workload
- Understand what the benchmark measures
- Supplement with application-specific benchmarks
- Profile real applications
The Bigger Picture
This chapter examined two benchmarks in detail, but the lessons apply broadly:
From Chapter 3 (Benchmarking): Statistical rigor matters. Run multiple iterations, report variance, control for confounding factors.
From Chapter 2 (Memory Hierarchy): Cache behavior dominates performance. Benchmarks with unrealistic data access patterns (like Dhrystone) miss this.
From Chapters 5, 11, 13, 14: Real applications use diverse data structures. Good benchmarks (like Coremark) test multiple patterns.
Looking forward: As you design systems, remember that optimization targets matter. Optimizing for a benchmark is easy. Optimizing for real workloads—with their messy, unpredictable access patterns and diverse operations—is the real challenge.
Practical Advice
When evaluating processors:
- Look beyond the headline number
- Ask: “What benchmark? What compiler? What flags?”
- Run your own workload if possible
- Understand the benchmark’s limitations
When designing benchmarks:
- Start with real application traces
- Identify the critical operations
- Create a minimal reproducible test
- Validate against the real application
- Document everything
When reporting results:
- Full disclosure: hardware, compiler, flags
- Statistical analysis: mean, median, variance
- Methodology: warm-up, iterations, validation
- Limitations: what the benchmark doesn’t measure
Sarah Chen’s company learned these lessons the hard way. After the public embarrassment, they switched to Coremark and, more importantly, developed application-specific benchmarks based on actual customer workloads. Their next product launch focused not on benchmark scores, but on real-world performance improvements—and customers noticed.
The best benchmark is the one that matches your workload. Everything else is just a proxy.