Chapter 33: How to Benchmark

Part IX: Synthesis


"The only thing worse than no data is bad data." — W. Edwards Deming (attributed)

The Perfect Checklist

The story happened on a Friday afternoon.

My colleague Emily had just joined the performance analysis team. She spent an entire week running benchmarks and prepared a professional-looking report: beautiful charts, detailed data, clear conclusions.

"This is the performance analysis of the new version," she said confidently. "It's 23% faster than the old version."

I looked at her report and asked one question: "How many times did you run it?"

"Once," she said. "The results were stable, and the charts look nice."

"Did you do warm-up?"

"What's warm-up?"

I sighed. This is a mistake every newcomer makes. Not because they're not smart, but because benchmarking looks too simple—run a program once, record the time, done.

But in reality, correct benchmarking is a science that requires rigorous methodology.

This chapter consolidates everything we've learned in the previous 15 chapters into a complete "How to Benchmark" guide.

The Benchmarking Checklist

Years of experience tell me that good benchmarks need to answer these questions:

┌─────────────────────────────────────────────────────────────────┐
│                   Benchmarking Checklist                        │
├─────────────────────────────────────────────────────────────────┤
│ □ 1. What are you measuring? (clearly define the metric)        │
│ □ 2. Is the environment controlled? (fixed freq, no turbo)      │
│ □ 3. Did you warm up? (let cache, branch predictor stabilize)   │
│ □ 4. How many runs? (N ≥ 10, preferably ≥ 30)                   │
│ □ 5. Are statistics complete? (median, stddev, CI)              │
│ □ 6. Are results reproducible? (can someone else get same data) │
│ □ 7. Is comparison fair? (same env, same load, same method)     │
└─────────────────────────────────────────────────────────────────┘

Let's analyze each one.

Step 1: Clearly Define What You're Measuring

This sounds obvious, but it's one of the most commonly overlooked steps.

Wrong example:

"I want to test how fast my program is."

This sentence is meaningless. What does "fast" mean?

Correct example:

"I want to measure the median latency (in nanoseconds)
 of a single hash table lookup with 10,000 key-value pairs."

A clear metric definition should include:

| Element | Example |
|---------|---------|
| Operation | Hash table lookup |
| Scale | 10,000 entries |
| Unit | Nanoseconds per lookup |
| Statistic | Median with 95% confidence interval |

Step 2: Control the Test Environment

Environmental variation is the main source of unstable benchmark results.

Linux Environment Setup

# 1. Fix CPU frequency (the no_turbo file below exists only with the intel_pstate driver)
sudo cpupower frequency-set -g performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# 2. Isolate CPU cores (avoid scheduler interference)
# In grub config: isolcpus=2,3
taskset -c 2 ./benchmark  # Bind to isolated core

# 3. Disable ASLR (reduce variance from address randomization)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# 4. Clear page cache (if testing I/O)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Environment Checklist

□ CPU frequency: Fixed (not powersave/ondemand)
□ Turbo boost: Disabled
□ Hyperthreading: Decide based on test purpose
□ Background processes: Minimized
□ NUMA: Confirm memory affinity
□ Temperature: Stable (avoid thermal throttling)
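
Part of this checklist can be verified by the benchmark itself before it starts measuring. The following is a minimal sketch, assuming a Linux system that exposes the usual cpufreq sysfs files; the helper names (read_first_line, governor_is_performance, check_environment) are my own and not part of any library. It covers only the governor and turbo items, and it reports "unknown" rather than guessing when a file is missing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a text file into buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
static int read_first_line(const char *path, char *buf, size_t size) {
    assert(buf != NULL && size > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    char *ok = fgets(buf, (int)size, f);
    fclose(f);
    if (ok == NULL) {
        return -1;
    }
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the governor name is exactly "performance", 0 otherwise. */
static int governor_is_performance(const char *governor) {
    return strcmp(governor, "performance") == 0;
}

/* Prints one line per check and returns the number of checks that
 * failed or could not be performed (0 means both items look fine). */
static int check_environment(void) {
    char line[64];
    int problems = 0;

    if (read_first_line("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                        line, sizeof line) != 0) {
        printf("governor: unknown (cpufreq not exposed)\n");
        problems++;
    } else if (!governor_is_performance(line)) {
        printf("governor: %s (expected performance)\n", line);
        problems++;
    } else {
        printf("governor: performance\n");
    }

    /* This file exists only when the intel_pstate driver is in use. */
    if (read_first_line("/sys/devices/system/cpu/intel_pstate/no_turbo",
                        line, sizeof line) != 0) {
        printf("turbo: unknown (intel_pstate not in use)\n");
        problems++;
    } else if (strcmp(line, "1") != 0) {
        printf("turbo: enabled (expected disabled)\n");
        problems++;
    } else {
        printf("turbo: disabled\n");
    }
    return problems;
}
```

Calling check_environment() at the top of main and refusing to run when it returns nonzero is one way to keep a misconfigured machine from silently producing numbers.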

Step 3: Warm-up — The Most Forgotten Step

The first time any code executes, there's a lot of "cold start" overhead:

  • Instruction cache miss: Code not yet loaded into cache
  • Data cache miss: Data not yet loaded into cache
  • Branch predictor: Hasn't learned branch patterns yet
  • Page fault: Memory pages not yet mapped
  • JIT compilation: If VM language, first run needs compilation

If you measure first execution time, you're measuring "cold start performance," not "steady state performance."

Warm-up Strategy

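#include <stdint.h>

#define WARMUP_ITERATIONS 1000
#define MEASURED_ITERATIONS 10000

// Assumed to be defined elsewhere: the code being measured and a cycle counter
void operation_under_test(void);
uint64_t get_cycles(void);

void benchmark(void) {
    // Phase 1: Warm-up (discard these results)
    for (int i = 0; i < WARMUP_ITERATIONS; i++) {
        operation_under_test();
    }

    // Phase 2: Measurement
    uint64_t times[MEASURED_ITERATIONS];
    for (int i = 0; i < MEASURED_ITERATIONS; i++) {
        uint64_t start = get_cycles();
        operation_under_test();
        uint64_t end = get_cycles();
        times[i] = end - start;
    }

    // times[] now holds the raw samples; summarize them as described in Step 4
}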
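
The listing above calls get_cycles() without defining it. As a rough sketch, here is one portable stand-in built on clock_gettime(CLOCK_MONOTONIC). Note the assumption: it returns nanoseconds, not CPU cycles, so the name is kept only to match the listing. The measure and sample_operation helpers are hypothetical and exist purely to show the warm-up and measurement phases as a reusable function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for get_cycles(): monotonic time in nanoseconds (not cycles). */
static uint64_t get_cycles(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical workload: the volatile sink keeps the loop from being removed. */
static volatile uint64_t sink;
static void sample_operation(void) {
    for (int i = 0; i < 100; i++) {
        sink += (uint64_t)i;
    }
}

/* Run op `warmup` times untimed, then store `n` timed samples in times[]. */
static void measure(void (*op)(void), int warmup, uint64_t *times, int n) {
    assert(op != NULL && times != NULL && warmup >= 0 && n > 0);
    for (int i = 0; i < warmup; i++) {
        op();  /* results discarded on purpose */
    }
    for (int i = 0; i < n; i++) {
        uint64_t start = get_cycles();
        op();
        times[i] = get_cycles() - start;
    }
}
```

A call such as measure(sample_operation, 1000, times, 10000) mirrors the two phases of the listing. For very short operations the timer overhead itself can dominate a single sample, so timing a batch of calls per sample may be the better choice.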

Step 4: Statistics - Running Once Is Not Enough

This is the mistake Emily made: running only once.

Why isn't once enough?

Even in a perfectly controlled environment, measurements still have variance:

  • Minor OS scheduler interference
  • Cache state differences
  • Hardware timer precision limits
  • Power management adjustments

Minimum Sample Size

| Purpose | Recommended Sample Size (N) |
|---------|----------------------------|
| Quick check | N ≥ 10 |
| Formal report | N ≥ 30 |
| Publication | N ≥ 100 |

Choose the Right Statistics

❌ Only report mean
   "Average latency: 150 ns"

✅ Report complete statistics
   "Latency: median = 145 ns, mean = 152 ns
    stddev = 23 ns, 95% CI = [141, 163] ns
    min = 120 ns, max = 310 ns"

Why use median instead of mean?

Outliers distort the mean. If 99 measurements are 100 ns but one is 10,000 ns (due to a context switch), the mean jumps to 199 ns, almost double the typical value, while the median stays at 100 ns. The median is robust to a small number of outliers.
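
The outlier example is easy to check by hand, and just as easy to check in code. The sketch below uses hypothetical helper names (mean, median, fill_outlier_example) and builds the data set described above: 99 samples of 100 ns plus one 10,000 ns outlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Arithmetic mean of x[0..n-1]. */
static double mean(const double *x, size_t n) {
    assert(x != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum / (double)n;
}

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double da = *(const double *)a;
    double db = *(const double *)b;
    return (da > db) - (da < db);
}

/* Median of x[0..n-1]. Sorts a private copy so the caller's data is untouched.
 * Returns -1.0 if the copy cannot be allocated. */
static double median(const double *x, size_t n) {
    assert(x != NULL && n > 0);
    double *copy = malloc(n * sizeof *copy);
    if (copy == NULL) {
        return -1.0;
    }
    memcpy(copy, x, n * sizeof *copy);
    qsort(copy, n, sizeof *copy, cmp_double);
    double m = (n % 2 == 1) ? copy[n / 2]
                            : 0.5 * (copy[n / 2 - 1] + copy[n / 2]);
    free(copy);
    return m;
}

/* The example from the text: 99 samples of 100 ns and one 10,000 ns outlier. */
static void fill_outlier_example(double x[100]) {
    for (int i = 0; i < 99; i++) {
        x[i] = 100.0;
    }
    x[99] = 10000.0;
}
```

On this data mean() gives 199 and median() gives 100: a single bad sample moves the mean by 99% and the median not at all.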

Step 5: What If Variance Is Too High?

If your coefficient of variation (CV = stddev / mean) exceeds 5%, results may be unreliable.
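
As a sketch of how that threshold could be checked automatically, the helper below (cv_exceeds, a name I made up) compares squared quantities instead of computing the standard deviation directly: for a positive mean, stddev / mean > limit is the same condition as variance > (limit × mean)². It assumes at least two samples and a positive mean, and it uses the sample variance with an n − 1 denominator.

```c
#include <assert.h>
#include <stddef.h>

/* Sample variance (n - 1 denominator) of x[0..n-1]; also stores the mean. */
static double sample_variance(const double *x, size_t n, double *mean_out) {
    assert(x != NULL && mean_out != NULL && n >= 2);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    double mean = sum / (double)n;
    double squares = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - mean;
        squares += d * d;
    }
    *mean_out = mean;
    return squares / (double)(n - 1);
}

/* Returns 1 if CV = stddev / mean is above `limit` (e.g. 0.05), else 0.
 * Works on squares, so no square root is needed. */
static int cv_exceeds(const double *x, size_t n, double limit) {
    assert(limit > 0.0);
    double mean;
    double variance = sample_variance(x, n, &mean);
    assert(mean > 0.0);
    double bound = limit * mean;
    return variance > bound * bound;
}
```

For the samples {100, 101, 99, 100, 100} the check passes (CV below 1%), while {100, 150, 50, 100, 100} fails it by a wide margin.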

Diagnostic Steps

1. Check environment
   - Is CPU frequency changing?
   - Are background processes running?
   - Is thermal throttling occurring?

2. Check program
   - Is there dynamic memory allocation? (malloc/free has high variance)
   - Are there I/O operations?
   - Are there system calls?

3. Check measurement method
   - Is timer resolution sufficient?
   - Is there timer wrap-around?

Techniques to Reduce Variance

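#define _GNU_SOURCE  // must precede all includes; enables CPU_SET and pthread_setaffinity_np
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

// 1. Use inline assembly barrier to prevent compiler reordering
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

// 2. Use the CPU timestamp counter instead of wall clock (x86-64, GCC/Clang).
//    Despite the function name, this issues RDTSCP, which waits for earlier
//    instructions to complete and overwrites ECX (hence the "rcx" clobber).
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtscp" : "=a"(lo), "=d"(hi) :: "rcx");
    return ((uint64_t)hi << 32) | lo;
}

// 3. Pin thread to specific CPU (call once at startup, inside a function)
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(2, &cpuset);  // Use CPU 2
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);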
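
A related use of the same empty-asm trick is keeping the optimizer from deleting the work you are trying to measure. The sketch below is one possible way to do it, assuming GCC or Clang on a 64-bit target; keep_alive_u64, sum_array, and run_kept are hypothetical names. The "r" input tells the compiler the value is used, and the "memory" clobber tells it memory may have changed, so the next iteration has to read the array again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marks `value` as used without emitting any instruction (GCC/Clang extension). */
static inline void keep_alive_u64(uint64_t value) {
    __asm__ __volatile__("" : : "r"(value) : "memory");
}

/* Example workload: sum of an array. */
static uint64_t sum_array(const uint32_t *data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Runs the workload `iterations` times, keeping every result alive so the
 * compiler cannot drop or merge the calls. Returns the last result. */
static uint64_t run_kept(const uint32_t *data, size_t n, int iterations) {
    assert(data != NULL && iterations > 0);
    uint64_t result = 0;
    for (int i = 0; i < iterations; i++) {
        result = sum_array(data, n);
        keep_alive_u64(result);
    }
    return result;
}
```

Without the keep_alive_u64 call, an optimizing compiler is free to compute the sum once, or not at all if the result is never used, and the "benchmark" then measures an empty loop.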

Step 6: Fair Comparison

When comparing two systems or algorithms, everything except the thing under test must be held equal.

Common Unfair Comparisons

| Trap | Problem |
|------|---------|
| Different compilers | GCC vs Clang optimize differently |
| Different optimization levels | -O0 vs -O3 huge difference |
| Different data sizes | Small data fits in cache, large doesn't |
| Different hardware | Comparing different CPUs needs normalization |
| Different warm-up | One has warm-up, one doesn't |

Correct Approach

# Ensure same compilation environment
gcc --version  # Record version
CFLAGS="-O3 -march=native"  # Same optimization options

# Ensure same execution environment
uname -r       # Record kernel version
grep -m1 "model name" /proc/cpuinfo  # Record CPU (first match is enough)

# Ensure same test data
sha256sum test_data.bin  # Verify data integrity
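
The benchmark binary can also record part of this information itself, so the numbers never get separated from the environment that produced them. A minimal sketch, assuming a POSIX system and a GCC or Clang build (the predefined __VERSION__ macro is compiler-specific); describe_environment is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Writes a one-line environment record into buf.
 * Returns the snprintf result (length of the full record), or -1 if uname fails. */
static int describe_environment(char *buf, size_t size) {
    assert(buf != NULL && size > 0);
    struct utsname u;
    if (uname(&u) != 0) {
        return -1;
    }
    /* __VERSION__ is the version string of the compiler that built this file. */
    return snprintf(buf, size, "kernel=%s %s arch=%s compiler=%s",
                    u.sysname, u.release, u.machine, __VERSION__);
}
```

Printing this line at the top of every result file costs nothing. It does not capture compiler flags or the CPU model, so those still need to be logged by the build and run scripts.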

Step 7: Document Everything

Your report should allow another person to reproduce your results.

Report Template

## Test Environment

### Hardware
- CPU: Intel Core i7-12700K @ 3.6 GHz (fixed, turbo disabled)
- Memory: 32 GB DDR5-4800
- Storage: Samsung 980 Pro NVMe

### Software
- OS: Ubuntu 22.04 LTS (kernel 6.5.0)
- Compiler: GCC 12.3.0
- Flags: -O3 -march=native -flto

### Environment Settings
- CPU governor: performance
- Turbo boost: disabled
- Hyperthreading: disabled
- ASLR: disabled
- Isolated cores: 2-3

## Methodology
- Warm-up: 1,000 iterations
- Measured: 10,000 iterations
- Repetitions: 30 independent runs
- Statistics: median with 95% CI

## Results

| Metric | Value | 95% CI |
|--------|-------|--------|
| Latency | 145 ns | ±8 ns |
| Throughput | 6.9 M ops/s | ±0.3 M |

## Reproduction Steps
git clone <repo> && cd benchmark
./setup_env.sh    # Setup environment
./run_benchmark.sh  # Run tests

The Anti-Patterns — Things to Never Do

1. Cherry-picking

❌ Ran 10 times, only report the best one
✅ Report statistical summary of all results

2. Hiding Variance

❌ "5% faster" (when variance is actually 20%)
✅ "5% ± 3% faster, statistically significant"
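
One quick screen before claiming "faster" is to check whether the two confidence intervals overlap at all. The sketch below uses made-up names (struct interval, intervals_overlap, percent_reduction) and should be read as a conservative sanity check, not a replacement for a proper significance test: intervals that do not overlap are a strong hint of a real difference, while overlapping intervals mean more analysis is needed, not that the difference is zero.

```c
#include <assert.h>

/* A closed interval [lo, hi], e.g. a 95% confidence interval in ns. */
struct interval {
    double lo;
    double hi;
};

/* Returns 1 if the two intervals share at least one point, 0 otherwise. */
static int intervals_overlap(struct interval a, struct interval b) {
    assert(a.lo <= a.hi && b.lo <= b.hi);
    return a.lo <= b.hi && b.lo <= a.hi;
}

/* Relative reduction of `candidate` versus `baseline`, in percent.
 * Positive means the candidate is lower (faster, for latency). */
static double percent_reduction(double baseline, double candidate) {
    assert(baseline > 0.0);
    return 100.0 * (baseline - candidate) / baseline;
}
```

For example, an old interval of [141, 163] ns and a new one of [118, 130] ns do not overlap, so reporting the improvement is reasonable; [141, 163] against [150, 170] overlaps, and a "5% faster" headline would be hiding the variance.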

3. Unfair Baseline

❌ Compare your optimized program vs competitor's default config
✅ Both use default config, or both are optimized

4. Ignoring Cold Start

❌ Measurement includes first execution (cache miss, page fault)
✅ Clearly distinguish cold start and steady state performance

Emily's Story Ending

I helped Emily redesign her benchmark:

  1. Added warm-up: 1000 iterations
  2. Increased sample size: From 1 to 30 runs
  3. Fixed environment: Fixed CPU frequency, disabled turbo
  4. Calculated statistics: median, stddev, CI

New results:

Old conclusion: "New version is 23% faster"

New conclusion: "New version median latency reduced by 18%
                (95% CI: 15% - 21%)
                All results consistent across N=30 measurements
                Performance improvement statistically significant (p < 0.001)"

The number got smaller, but more credible.

Summary

The Benchmarking Checklist:

  1. Define clearly: What metric, what scale, what unit
  2. Control environment: Fixed frequency, isolated cores, minimal noise
  3. Warm up: Don't measure cold start unless that's what you want
  4. Run enough times: N ≥ 30 for serious work
  5. Report statistics: Median, stddev, confidence interval
  6. Compare fairly: Same environment, same workload, same methodology
  7. Document everything: Make it reproducible

The Golden Rule:

If someone cannot reproduce your benchmark results, your benchmark is worthless.

Next chapter, we'll discuss how to systematically perform performance optimization once you have correct data.