Chapter 1: Why Benchmarking Is Hard

Part I: Foundations

"There are three kinds of lies: lies, damned lies, and benchmarks." — Adapted from Benjamin Disraeli

The Meeting Before Launch

It was a Monday morning, two weeks before product launch. I sat in a conference room listening to Kevin, our marketing manager, describe what he needed.

"We need some performance numbers," he said, "for the datasheet and press release. Customers want to know how much faster our new chip is compared to the previous generation."

Fair enough. I was the performance engineer—this was literally my job.

"No problem," I said. "I can run some benchmarks. Should take about a week for a proper analysis."

Kevin frowned. "A week? We just need a few numbers. Tony got us the data in one day last time."

Tony was the engineer who handled the previous generation. He'd since left the company. I found his benchmark scripts and decided to run them first.

The results shocked me.

Those "Perfect" Numbers

Tony's scripts produced this output:

New Chip vs Old Chip Performance Comparison
============================================
Integer Operations:   +47%
Floating Point:       +62%
Memory Bandwidth:     +35%
Overall Score:        +48%

Too perfect. Every number showed improvement, and the gains were right in that "impressive but believable" range.

But I noticed a few oddities:

No variance data — Each test had exactly one number, no standard deviation
No environment description — No mention of test conditions
No raw data — Only final conclusions, no underlying measurements

I decided to dig deeper.

Running It Again

I re-ran the same benchmark ten times. The results:

Run 1:   +52%
Run 2:   +31%
Run 3:   +47%
Run 4:   +68%
Run 5:   +29%
Run 6:   +41%
Run 7:   +55%
Run 8:   +33%
Run 9:   +44%
Run 10:  +38%

The variance was enormous. The best run showed +68%, the worst +29%—more than a 2× difference.

Tony's reported +47% was within this range, but he'd only run it once and happened to hit a favorable number. This wasn't fraud, but it wasn't accurate either.
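
The spread of those ten runs is easy to quantify. Here is a minimal Python sketch using only the standard library; the variable names are mine, and the input is the ten improvement percentages listed above.

```python
import statistics

# Improvement percentages from the ten runs listed above.
runs = [52, 31, 47, 68, 29, 41, 55, 33, 44, 38]

mean = statistics.mean(runs)      # arithmetic mean of the ten runs
stdev = statistics.stdev(runs)    # sample standard deviation (n - 1 denominator)
spread = max(runs) / min(runs)    # ratio of the best run to the worst run

print(f"mean  = {mean:.1f}%")
print(f"stdev = {stdev:.1f} percentage points")
print(f"range = {min(runs)}% to {max(runs)}% ({spread:.2f}x)")
```

The mean comes out at about 43.8% with a sample standard deviation of roughly 12 percentage points, so a single run can land almost anywhere between the high 20s and the high 60s.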

Worse, when I checked the test environments:

The new chip was tested in a 25°C air-conditioned lab
The old chip was tested in a 35°C regular office
The new chip had been freshly rebooted before testing
The old chip had been running for three days before testing

This wasn't a fair comparison at all.

I Can Make the Numbers Say Anything

That evening, I ran an experiment. I wanted to know: if I deliberately manipulated test conditions, how wide could I make the performance gap?

Best case (conditions favoring the new chip):

New chip: cold start, fresh reboot, all background processes killed, CPU frequency locked to maximum
Old chip: warm, running for days, multiple background processes, CPU in power-saving mode

Result: +89%

Worst case (conditions reversed):

Result: +12%

Same hardware, same benchmark program, and the performance difference ranged from +12% to +89% depending purely on how I set up the test environment.

This is what makes benchmarking terrifying: numbers don't lie, but numbers can be manipulated.

I Told Kevin the Truth

The next day, I scheduled a meeting with Kevin.

"I have good news and bad news," I said.

"Bad news first."

"The +47% figure isn't reliable. The test environments were inconsistent, and there was no statistical analysis. If we publish that number, tech journalists will tear it apart."

Kevin's face fell. "And the good news?"

"The good news is the new chip really is faster. Under controlled, fair testing conditions, the performance improvement is somewhere between +25% and +35%, with 95% confidence. That's a number we can defend."

Kevin was quiet for a moment. "+25% doesn't sound as impressive as +47%."

"But +25% is real. +47% was a lucky single run."

In the end, we used +30% (the middle of our confidence interval) and added a footnote to the datasheet describing our test methodology.

That decision taught me a lesson: honest benchmarks may not look as impressive, but at least they won't blow up in your face later.

Why Benchmarking Is So Hard

This experience taught me the fundamental challenge of benchmarking: too many factors affect measurement results, and our intuition consistently overlooks them.

Let me walk through the major factors that influence benchmark results:

1. System Noise

Your computer never does just one thing. Background processes, kernel threads, and interrupt handlers are all competing for CPU time.

$ perf stat -r 10 ./my_benchmark

Performance counter stats for './my_benchmark' (10 runs):

    1,234,567 cycles    ( +- 15.2% )

System noise alone can produce a run-to-run spread of about 15%, and that's on a "quiet" system.
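
You can observe the same effect without perf. The sketch below is a rough wall-clock analogue, not a reimplementation of what perf stat does internally: it times a workload several times and reports the standard deviation relative to the mean. The function names and the `busy_loop` workload are hypothetical stand-ins.

```python
import statistics
import time

def time_runs(workload, runs=10):
    """Time `workload` repeatedly and return elapsed times in nanoseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        workload()
        samples.append(time.perf_counter_ns() - start)
    return samples

def relative_spread(samples):
    """Sample standard deviation as a fraction of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

def busy_loop():
    # Hypothetical stand-in workload: sum the first 100,000 integers.
    total = 0
    for i in range(100_000):
        total += i
    return total

samples = time_runs(busy_loop, runs=10)
print(f"mean = {statistics.mean(samples):.0f} ns, "
      f"spread = {relative_spread(samples):.1%}")
```

Run it a few times on a busy laptop and then on an idle machine; the spread figure itself will move around, which is rather the point.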

2. CPU Frequency Scaling

Modern CPUs don't run at fixed frequencies. They boost when cold, throttle when hot, and save power when idle.

Run 1 (cold):   1,000 μs @ 4.2 GHz
Run 2 (warm):   1,150 μs @ 3.8 GHz
Run 3 (hot):    1,400 μs @ 3.2 GHz
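
A quick way to see how much of that slowdown is the clock itself is to convert each run to an approximate cycle count. This sketch assumes the frequency stayed constant for the duration of each run, which is a simplification; the `cycles` helper is my own name.

```python
# (label, elapsed time in microseconds, clock frequency in GHz) from the runs above.
runs = [("cold", 1000, 4.2), ("warm", 1150, 3.8), ("hot", 1400, 3.2)]

def cycles(time_us, freq_ghz):
    """Approximate cycle count: 1 GHz is 1,000 cycles per microsecond."""
    return time_us * freq_ghz * 1_000

base_time = runs[0][1]
base_cycles = cycles(runs[0][1], runs[0][2])
for label, time_us, freq_ghz in runs:
    c = cycles(time_us, freq_ghz)
    print(f"{label:5s} time x{time_us / base_time:.2f}  cycles x{c / base_cycles:.2f}")
```

Wall-clock time grows by 40% from the cold run to the hot run, while the implied cycle count grows by only about 7%. Most of the difference is the clock, not the code.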

Statistics 101: Three Things You Must Know

After seeing these factors, you can see why a single measurement isn't enough. Let me introduce three statistical concepts every performance engineer should know.

Mean: The Most Common Lie

Mean is the most commonly reported statistic—and often the most misleading.

Consider these two benchmark results:

Benchmark A: 100, 100, 100, 100, 100
Mean: 100 μs

Benchmark B: 50, 50, 50, 50, 300
Mean: 100 μs

Same mean, completely different behavior. Benchmark B has a tail latency problem, but the mean hides it.

Lesson: Never report just the mean. Always include variance or percentiles.
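
To make the difference visible, here is a small Python sketch that reports the median and the worst case alongside the mean for the two benchmarks above. The `summarize` helper is a name I made up for illustration.

```python
import statistics

benchmark_a = [100, 100, 100, 100, 100]   # microseconds
benchmark_b = [50, 50, 50, 50, 300]       # microseconds

def summarize(samples):
    """Return the mean, median, and worst case of a list of timings."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }

print("A:", summarize(benchmark_a))
print("B:", summarize(benchmark_b))
```

Both means are 100 μs, but Benchmark B has a median of 50 μs and a worst case of 300 μs. The median and the maximum expose what the mean averages away.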

Variance and Standard Deviation

Variance measures how spread out your data is. Standard deviation (σ) is the square root of variance, with the same units as your measurement.

Benchmark A: σ = 0 μs (perfectly consistent)
Benchmark B: σ = 100 μs (high variance)

Rule of thumb: if σ exceeds 5% of the mean, your measurements are too noisy.
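
The rule of thumb is simple to automate. The sketch below uses the population standard deviation (`statistics.pstdev`), which is what yields σ = 100 μs for Benchmark B; the sample standard deviation (`statistics.stdev`) would give about 111.8 μs for the same data. The helper names and the `threshold` parameter are my own.

```python
import statistics

def noise_ratio(samples):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(samples) / statistics.mean(samples)

def too_noisy(samples, threshold=0.05):
    """Apply the rule of thumb: flag data where sigma exceeds 5% of the mean."""
    return noise_ratio(samples) > threshold

benchmark_a = [100, 100, 100, 100, 100]   # microseconds
benchmark_b = [50, 50, 50, 50, 300]       # microseconds

for name, samples in (("A", benchmark_a), ("B", benchmark_b)):
    sigma = statistics.pstdev(samples)
    print(f"Benchmark {name}: sigma = {sigma:.0f} us, too noisy: {too_noisy(samples)}")
```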

Confidence Intervals

When you say "my optimization is 15% faster," how sure are you?

A confidence interval tells you where the true value likely falls. A 95% confidence interval means: if you repeated this experiment 100 times and computed an interval each time, about 95 of those intervals would contain the true value.

Performance improvement: 15% (95% CI: 8% to 22%)

This says: "I'm 95% confident the real improvement is between 8% and 22%."

If your confidence interval crosses zero, you can't claim any improvement at all:

Performance improvement: 5% (95% CI: -3% to 13%)

That might just be noise, not signal.
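
Here is one way to compute such an interval, as a sketch rather than a definitive recipe. It uses the normal approximation (mean ± 1.96 standard errors); with only a handful of runs, a Student's t critical value would give a somewhat wider interval. The sample data is hypothetical, made up purely for illustration.

```python
import math
import statistics

def confidence_interval_95(samples):
    """Approximate 95% CI for the mean: mean +/- 1.96 * s / sqrt(n)."""
    n = len(samples)
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)
    margin = 1.96 * std_err
    return mean - margin, mean + margin

def is_significant(samples):
    """True when the whole interval lies on one side of zero."""
    low, high = confidence_interval_95(samples)
    return low > 0 or high < 0

# Hypothetical per-run improvements in percent, made up for illustration.
improvements = [12, 18, 9, 21, 15, 14, 17, 11, 19, 14]
low, high = confidence_interval_95(improvements)
print(f"improvement: {statistics.mean(improvements):.1f}% "
      f"(95% CI: {low:.1f}% to {high:.1f}%)")
print("significant:", is_significant(improvements))
```

If `is_significant` returns False, the interval straddles zero and the honest report is "no measurable difference."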

How Many Runs Do I Need?

One of the most common questions: how many times should I run my benchmark?

The answer depends on variance. Here's a practical approach:

Step 1: Run 10 times, calculate the standard deviation.

Step 2: Use this formula to estimate the required sample size:

n = (z × σ / E)²

where:
  n = required sample size
  z = 1.96 (for 95% confidence)
  σ = standard deviation
  E = acceptable margin of error

Example: Your benchmark has σ = 100 μs, and you want the error within ±10 μs:

n = (1.96 × 100 / 10)² = 19.6² ≈ 384.2, which rounds up to 385 samples

You need about 400 runs to get reliable results.

Step 3: If you can't run that many, either relax your error margin or reduce variance by controlling the test environment.
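
The formula from Step 2 fits in one line of Python. In this sketch, `required_runs` is my own name, and the result is rounded up because you can't do a fraction of a run.

```python
import math

def required_runs(sigma, margin, z=1.96):
    """Sample size from n = (z * sigma / E)^2, rounded up to a whole run."""
    return math.ceil((z * sigma / margin) ** 2)

print(required_runs(sigma=100, margin=10))   # the worked example above
print(required_runs(sigma=100, margin=20))   # relax the error margin to ±20 μs
print(required_runs(sigma=50, margin=10))    # halve the noise instead
```

The worked example needs 385 runs. Doubling the acceptable margin or halving σ each brings that down to 97, which is why controlling the environment pays off so quickly.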

Warm-up: The Hidden Requirement

Watch what happens when I run a benchmark 100 times consecutively and plot the results:

Run   Time (μs)
1     5,234    ← cold start
2     3,891
3     2,456
4     1,234
5     1,198
...
50    1,201
100   1,199    ← steady state

The first few runs are outliers caused by JIT compilation, cache warming, and branch predictor training. They don't represent steady-state performance.

Solution: Always include warm-up runs, then discard that data:

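#define WARMUP_RUNS   10    // illustrative; raise until results stabilize
#define MEASURED_RUNS 100   // illustrative

double run_benchmark(void);  // assumed to return the elapsed time in μs

void measure(double times[MEASURED_RUNS]) {
    // Warm-up phase (discard these)
    for (int i = 0; i < WARMUP_RUNS; i++) {
        run_benchmark();
    }

    // Measurement phase (keep these)
    for (int i = 0; i < MEASURED_RUNS; i++) {
        times[i] = run_benchmark();
    }
}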

How many warm-up runs? Enough that subsequent results stabilize. Plot the data—you'll see when it converges.
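
If you'd rather not eyeball a plot, a simple heuristic can find the convergence point for you. The sketch below looks for the first window of consecutive runs whose spread stays within a tolerance. The window size and the 5% tolerance are arbitrary assumptions on my part, and the input is just the seven timings shown in the table above.

```python
def find_steady_state(times, window=3, tolerance=0.05):
    """Return the index of the first run that starts a stable window.

    A window is stable when (max - min) / min is within `tolerance`.
    Returns None if no window qualifies.
    """
    for start in range(len(times) - window + 1):
        chunk = times[start:start + window]
        if (max(chunk) - min(chunk)) / min(chunk) <= tolerance:
            return start
    return None

# The seven timings shown in the table above (elided runs omitted).
times_us = [5234, 3891, 2456, 1234, 1198, 1201, 1199]

warmup_runs = find_steady_state(times_us)
measured = times_us[warmup_runs:] if warmup_runs is not None else []
print(f"discard the first {warmup_runs} runs, keep {len(measured)}")
```

For this data the heuristic discards the first three runs. Treat the answer as a starting point and still look at the plot, because a workload with periodic spikes can fool a fixed window.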

Back to That Meeting

Looking back at the pre-launch meeting, here's what proper analysis of Tony's data would have shown:

Tony's method (1 run):
  Result: +47%
  Confidence: Unknown
  Reproducibility: Unverified

Proper method (100 runs, controlled environment):
  Mean: +30%
  σ: 4.2%
  95% CI: [+25%, +35%]
  Reproducibility: ✓

+47% became +30%. Less impressive, but true.

More importantly, this number was defensible. When tech journalists or competitors challenged us, we could produce complete methodology and raw data.

That's the value of proper benchmarking: not prettier numbers, but trustworthy numbers.

Guidelines for Reliable Benchmarking

Based on these experiences, here are my guidelines:

Guideline 1: Control the Environment

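# Select the "performance" governor so the CPU requests its highest frequency
sudo cpupower frequency-set -g performance

# Disable turbo boost (this path is provided by the acpi-cpufreq driver;
# with intel_pstate, write 1 to /sys/devices/system/cpu/intel_pstate/no_turbo instead)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost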

# Pin to a specific CPU core
taskset -c 2 ./benchmark  # bind to core 2
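
It's easy to forget one of these steps, so I like to have the benchmark harness check the environment before it starts. The following is a Linux-oriented sketch under stated assumptions: it reads the standard cpufreq sysfs file for the governor and uses `os.sched_getaffinity` where available. The function names and the warning wording are hypothetical.

```python
import os
from pathlib import Path

def read_governor(cpu=0, root="/sys/devices/system/cpu"):
    """Return the cpufreq governor for `cpu`, or None if it cannot be read."""
    path = Path(root) / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    try:
        return path.read_text().strip()
    except OSError:
        return None

def allowed_cpus():
    """Return the set of CPUs this process may run on, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(0)
    return None

def environment_warnings(cpu=0, root="/sys/devices/system/cpu"):
    """Collect human-readable warnings about an uncontrolled environment."""
    warnings = []
    governor = read_governor(cpu, root)
    if governor != "performance":
        warnings.append(f"cpu{cpu} governor is {governor!r}, expected 'performance'")
    cpus = allowed_cpus()
    if cpus is not None and len(cpus) != 1:
        warnings.append(f"process is not pinned to one core (allowed: {sorted(cpus)})")
    return warnings

for warning in environment_warnings():
    print("WARNING:", warning)
```

Launching the script under `taskset -c 2` with the performance governor selected should make both warnings disappear on a typical Linux system.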

Guideline 2: Warm Up Before Measuring

Always discard the first runs. How many depends on your workload—measure until results stabilize.

Guideline 3: Report Variance, Not Just Mean

✗ Bad:   "Latency: 1.2 ms"
✓ Good:  "Latency: 1.2 ms (σ = 0.1 ms, n = 1000)"
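
Producing the "good" format takes only a few lines. In this sketch, `format_latency` is a hypothetical helper and the sample values are made up for illustration; it reports the sample standard deviation alongside the mean and the run count.

```python
import statistics

def format_latency(samples_ms):
    """Format latency samples as 'mean (σ = stdev, n = count)'."""
    mean = statistics.mean(samples_ms)
    sigma = statistics.stdev(samples_ms)
    return f"Latency: {mean:.1f} ms (σ = {sigma:.1f} ms, n = {len(samples_ms)})"

# Hypothetical samples in milliseconds, made up for illustration.
print(format_latency([1.1, 1.3, 1.2, 1.1, 1.3]))
```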

To summarize:

Measure multiple times and compute variance
Warm up before measuring
Report uncertainty using confidence intervals
Control the environment to reduce noise
Stay skeptical of your own results
