Chapter 33: How to Benchmark
Part IX: Synthesis
"The only thing worse than no data is bad data." — W. Edwards Deming (attributed)
The Perfect Checklist
The story happened on a Friday afternoon.
My colleague Emily had just joined the performance analysis team. She spent an entire week running benchmarks and prepared a professional-looking report: beautiful charts, detailed data, clear conclusions.
"This is the performance analysis of the new version," she said confidently. "It's 23% faster than the old version."
I looked at her report and asked one question: "How many times did you run it?"
"Once," she said. "The results were stable, and the charts look nice."
"Did you do warm-up?"
"What's warm-up?"
I sighed. Nearly every newcomer makes this mistake, not because they aren't smart, but because benchmarking looks deceptively simple: run a program once, record the time, done.
But in reality, correct benchmarking is a science that requires rigorous methodology.
This chapter consolidates everything we've learned in the previous 15 chapters into a complete "How to Benchmark" guide.
The Benchmarking Checklist
Years of experience tell me that good benchmarks need to answer these questions:
┌─────────────────────────────────────────────────────────────────┐
│ Benchmarking Checklist │
├─────────────────────────────────────────────────────────────────┤
│ □ 1. What are you measuring? (clearly define the metric) │
│ □ 2. Is the environment controlled? (fixed freq, no turbo) │
│ □ 3. Did you warm up? (let cache, branch predictor stabilize) │
│ □ 4. How many runs? (N ≥ 10, preferably ≥ 30) │
│ □ 5. Are statistics complete? (median, stddev, CI) │
│ □ 6. Are results reproducible? (can someone else get same data) │
│ □ 7. Is comparison fair? (same env, same load, same method) │
└─────────────────────────────────────────────────────────────────┘
Let's analyze each one.
Step 1: Clearly Define What You're Measuring
This sounds obvious, but it's the most commonly overlooked step.
Wrong example:
"I want to test how fast my program is."
This sentence is meaningless. What does "fast" mean?
Correct example:
"I want to measure the average latency (in nanoseconds)
of a single hash table lookup with 10,000 key-value pairs."
A clear metric definition should include:
| Element | Example |
|---|---|
| Operation | Hash table lookup |
| Scale | 10,000 entries |
| Unit | Nanoseconds per lookup |
| Statistic | Median with 95% confidence interval |
Step 2: Control the Test Environment
Environmental variation is the main source of unstable benchmark results.
Linux Environment Setup
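# 1. Fix CPU frequency (performance governor keeps clocks high; no_turbo removes boost spikes)
sudo cpupower frequency-set -g performance
# Intel-only path, present when the intel_pstate driver is active
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# 2. Isolate CPU cores (avoid scheduler interference)
# Add isolcpus=2,3 to the kernel command line in the GRUB config, then reboot
taskset -c 2 ./benchmark # Bind to isolated core
# 3. Disable ASLR system-wide (reduce variance from address randomization)
# Restore the previous value (usually 2) when you are done
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
# 4. Drop the page cache (only if you are testing cold I/O)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches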
Environment Checklist
□ CPU frequency: Fixed (not powersave/ondemand)
□ Turbo boost: Disabled
□ Hyperthreading: Decide based on test purpose
□ Background processes: Minimized
□ NUMA: Confirm memory affinity
□ Temperature: Stable (avoid thermal throttling)
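Part of this checklist can be automated as a pre-flight check before each run. The sketch below assumes a Linux system that exposes the cpufreq sysfs files; it reads the governor of one CPU and accepts only `performance`. The function names `read_sysfs_line`, `governor_is_fixed`, and `check_cpu_governor` are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a file into buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
int read_sysfs_line(const char *path, char *buf, size_t size) {
    assert(buf != NULL && size > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    if (fgets(buf, (int)size, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* In this sketch a governor counts as "fixed" only if it is "performance". */
int governor_is_fixed(const char *governor) {
    return strcmp(governor, "performance") == 0;
}

/* Returns 1 if the given CPU uses the performance governor, 0 if it does
 * not, and -1 if the governor could not be read (for example, when no
 * cpufreq driver is loaded or the code runs inside a container). */
int check_cpu_governor(int cpu) {
    char path[128];
    char governor[64];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    if (read_sysfs_line(path, governor, sizeof(governor)) != 0) {
        return -1;
    }
    return governor_is_fixed(governor);
}
```

A benchmark harness could call `check_cpu_governor(2)` at startup and refuse to run, or at least print a warning, when the result is not 1.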
Step 3: Warm-up — The Most Forgotten Step
The first time any code executes, there's a lot of "cold start" overhead:
- Instruction cache miss: Code not yet loaded into cache
- Data cache miss: Data not yet loaded into cache
- Branch predictor: Hasn't learned branch patterns yet
- Page fault: Memory pages not yet mapped
- JIT compilation: In a VM-based language, early runs are interpreted or spend time compiling
If you measure first execution time, you're measuring "cold start performance," not "steady state performance."
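To make the distinction concrete, here is a minimal sketch that records the very first call separately from a call made after some untimed warm-up. It assumes a POSIX system with `clock_gettime`; `now_ns`, `time_one_call`, `cold_vs_warm`, and the `example_op` workload are hypothetical names, not part of any library.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds. */
uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time a single call of op, in nanoseconds. */
uint64_t time_one_call(void (*op)(void)) {
    uint64_t start = now_ns();
    op();
    return now_ns() - start;
}

/* Measure the first call (cold), run `warmup` untimed calls, then measure
 * one more call as a single steady-state sample. */
void cold_vs_warm(void (*op)(void), int warmup,
                  uint64_t *cold_ns, uint64_t *warm_ns) {
    assert(op != NULL && warmup >= 0);
    *cold_ns = time_one_call(op);  /* includes cache misses and page faults */
    for (int i = 0; i < warmup; i++) {
        op();                      /* results deliberately discarded */
    }
    *warm_ns = time_one_call(op);
}

/* Hypothetical workload: sums a 1 MiB buffer. The volatile sink keeps the
 * compiler from discarding the loop's result. */
static unsigned char buffer[1 << 20];
volatile unsigned long sink;

void example_op(void) {
    unsigned long sum = 0;
    for (size_t i = 0; i < sizeof(buffer); i++) {
        sum += buffer[i];
    }
    sink = sum;
}
```

Calling `cold_vs_warm(example_op, 1000, &cold, &warm)` yields two numbers that answer two different questions. A single warm sample is still only one sample, so a real measurement would collect many of them.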
Warm-up Strategy
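#include <stdint.h>

#define WARMUP_ITERATIONS 1000
#define MEASURED_ITERATIONS 10000

// get_cycles() and operation_under_test() are defined elsewhere
uint64_t get_cycles(void);
void operation_under_test(void);

void benchmark(void) {
    // Phase 1: Warm-up (discard these results)
    for (int i = 0; i < WARMUP_ITERATIONS; i++) {
        operation_under_test();
    }
    // Phase 2: Measurement
    static uint64_t times[MEASURED_ITERATIONS]; // static keeps 80 KB off the stack
    for (int i = 0; i < MEASURED_ITERATIONS; i++) {
        uint64_t start = get_cycles();
        operation_under_test();
        uint64_t end = get_cycles();
        times[i] = end - start;
    }
    // times[] now holds one steady-state sample per iteration
}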
## Step 4: Statistics — Running Once Is Not Enough
This is the mistake Emily made: running only once.
**Why isn't once enough?**
Even in a perfectly controlled environment, measurements still have variance:
- Minor OS scheduler interference
- Cache state differences
- Hardware timer precision limits
- Power management adjustments
### Minimum Sample Size
| Purpose | Recommended Sample Size (N) |
|---------|----------------------------|
| Quick check | N ≥ 10 |
| Formal report | N ≥ 30 |
| Publication | N ≥ 100 |
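One way to see why larger N helps: under a normal approximation, the half-width of a 95% confidence interval for the mean shrinks with the square root of the sample size. As a worked illustration, assume a sample standard deviation of s = 23 ns. The factor 1.96 is the normal critical value; for small samples a Student-t value (about 2.26 for N = 10) would widen the interval somewhat.

```latex
\begin{aligned}
\text{half-width} &= 1.96 \cdot \frac{s}{\sqrt{N}}, \qquad s = 23\ \text{ns} \\
N = 10:  &\quad 1.96 \cdot \frac{23}{\sqrt{10}}  \approx 14.3\ \text{ns} \\
N = 30:  &\quad 1.96 \cdot \frac{23}{\sqrt{30}}  \approx 8.2\ \text{ns} \\
N = 100: &\quad 1.96 \cdot \frac{23}{\sqrt{100}} \approx 4.5\ \text{ns}
\end{aligned}
```

Going from 10 to 100 samples costs ten times the measurement effort but narrows the interval only by a factor of about 3.2, which is why the recommended sizes grow quickly with the required rigor.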
### Choose the Right Statistics
```text
❌ Only report mean
"Average latency: 150 ns"
✅ Report complete statistics
"Latency: median = 145 ns, mean = 152 ns
stddev = 23 ns, 95% CI = [141, 163] ns
min = 120 ns, max = 310 ns"
```
Why use median instead of mean?
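Outliers distort the mean. If 99 measurements are 100 ns but one is 10,000 ns (due to a context switch), the mean jumps to 199 ns, nearly double the typical value, while the median stays at 100 ns. The median is not perfect, but it is far more robust to a handful of outliers.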
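The example above can be checked directly. This is a minimal sketch; the function names `mean_u64`, `median_u64`, and `fill_outlier_example` are chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples. */
double mean_u64(const uint64_t *samples, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += (double)samples[i];
    }
    return sum / (double)n;
}

/* Median of n samples. Sorts a private copy so the caller's array keeps
 * its original order. Returns 0.0 if the copy cannot be allocated. */
double median_u64(const uint64_t *samples, size_t n) {
    assert(n > 0);
    uint64_t *sorted = malloc(n * sizeof(*sorted));
    if (sorted == NULL) {
        return 0.0;
    }
    memcpy(sorted, samples, n * sizeof(*sorted));
    qsort(sorted, n, sizeof(*sorted), cmp_u64);
    double m = (n % 2 == 1)
        ? (double)sorted[n / 2]
        : ((double)sorted[n / 2 - 1] + (double)sorted[n / 2]) / 2.0;
    free(sorted);
    return m;
}

/* The data set from the text: 99 samples of 100 ns and one of 10,000 ns. */
void fill_outlier_example(uint64_t out[100]) {
    for (int i = 0; i < 99; i++) {
        out[i] = 100;
    }
    out[99] = 10000;
}
```

For this data set `mean_u64` returns 199 and `median_u64` returns 100: one bad sample out of a hundred nearly doubles the mean and leaves the median untouched.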
Step 5: What If Variance Is Too High?
As a rule of thumb, if your coefficient of variation (CV = stddev / mean) exceeds about 5%, treat the results as unreliable until you have found the source of the noise.
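A check like this is easy to build into a harness. The sketch below uses the sample standard deviation (dividing by N - 1) and assumes a positive mean, which holds for latencies; the name `cv_exceeds` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the coefficient of variation (sample stddev / mean) of the
 * samples is greater than `threshold` (0.05 means 5%), otherwise 0. */
int cv_exceeds(const double *samples, size_t n, double threshold) {
    assert(samples != NULL && n >= 2 && threshold >= 0.0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
    }
    double mean = sum / (double)n;
    assert(mean > 0.0);

    double squared_dev = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = samples[i] - mean;
        squared_dev += d * d;
    }
    double variance = squared_dev / (double)(n - 1);

    /* stddev / mean > t is equivalent to variance > (t * mean)^2 when the
     * mean is positive, so the comparison needs no square root. */
    double limit = threshold * mean;
    return variance > limit * limit;
}
```

Comparing squared quantities is a small design choice that keeps the function free of a math-library dependency; taking `sqrt(variance) / mean` and comparing against the threshold gives the same answer.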
Diagnostic Steps
1. Check environment
- Is CPU frequency changing?
- Are background processes running?
- Is thermal throttling occurring?
2. Check program
- Is there dynamic memory allocation? (malloc/free has high variance)
- Are there I/O operations?
- Are there system calls?
3. Check measurement method
- Is timer resolution sufficient?
- Is there timer wrap-around?
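For the timer-resolution question, a quick probe is often enough. The sketch below assumes a POSIX system and uses `clock_getres` and `clock_gettime` on `CLOCK_MONOTONIC`; the function names `reported_resolution_ns` and `observed_step_ns` are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict -std modes */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Resolution the OS reports for CLOCK_MONOTONIC, in ns (0 on error). */
uint64_t reported_resolution_ns(void) {
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC, &res) != 0) {
        return 0;
    }
    return (uint64_t)res.tv_sec * 1000000000ull + (uint64_t)res.tv_nsec;
}

/* Smallest non-zero difference seen between two consecutive clock reads,
 * over `tries` attempts. This includes the cost of the clock call itself,
 * so it reflects the practical granularity rather than the nominal one. */
uint64_t observed_step_ns(int tries) {
    assert(tries > 0);
    uint64_t best = 0;
    for (int i = 0; i < tries; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        do {
            clock_gettime(CLOCK_MONOTONIC, &b);
        } while (a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec);
        int64_t delta = (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL
                      + (int64_t)(b.tv_nsec - a.tv_nsec);
        uint64_t step = (uint64_t)delta;
        if (best == 0 || step < best) {
            best = step;
        }
    }
    return best;
}
```

If the operation under test takes about as long as `observed_step_ns` reports, or less, individual timings are mostly timer noise; time a batch of operations and divide instead.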
Techniques to Reduce Variance
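#define _GNU_SOURCE // needed for CPU_SET and pthread_setaffinity_np
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

// 1. Compiler barrier: stops the compiler (not the CPU) from reordering
//    memory accesses across this point
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

// 2. Use the x86 time-stamp counter instead of the wall clock.
//    Despite the name, this issues RDTSCP, which waits for earlier
//    instructions to finish before reading the counter. On modern CPUs the
//    TSC ticks at a constant rate, not at the current core clock.
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtscp" : "=a"(lo), "=d"(hi) :: "rcx");
    return ((uint64_t)hi << 32) | lo;
}

// 3. Pin the calling thread to a specific CPU (returns 0 on success)
static int pin_to_cpu(int cpu) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset); // e.g. pin_to_cpu(2) to use CPU 2
    return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}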
Step 6: Fair Comparison
When comparing two systems or algorithms, you must ensure "all else being equal."
Common Unfair Comparisons
| Trap | Problem |
|---|---|
| Different compilers | GCC vs Clang optimize differently |
| Different optimization levels | -O0 vs -O3 huge difference |
| Different data sizes | Small data fits in cache, large doesn't |
| Different hardware | Comparing different CPUs needs normalization |
| Different warm-up | One has warm-up, one doesn't |
Correct Approach
# Ensure same compilation environment
gcc --version # Record version
CFLAGS="-O3 -march=native" # Same optimization options
# Ensure same execution environment
uname -r # Record kernel version
grep -m1 "model name" /proc/cpuinfo # Record CPU
# Ensure same test data
sha256sum test_data.bin # Verify data integrity
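Recording these details by hand is easy to forget, so it can help to have the benchmark binary describe itself. The sketch below assumes GCC or Clang (both define the `__VERSION__` and `__OPTIMIZE__` macros) and a POSIX system for `uname`; the function name `describe_environment` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Writes a one-line description of the compiler and kernel into buf.
 * Returns 0 on success, -1 if uname() fails or buf is too small. */
int describe_environment(char *buf, size_t size) {
    assert(buf != NULL && size > 0);

    struct utsname u;
    if (uname(&u) != 0) {
        return -1;
    }

    /* __OPTIMIZE__ is defined by GCC and Clang whenever any optimization
     * level above -O0 is active; it does not say which level. */
#ifdef __OPTIMIZE__
    const char *build = "optimized";
#else
    const char *build = "unoptimized";
#endif

    int n = snprintf(buf, size, "compiler=%s build=%s kernel=%s %s arch=%s",
                     __VERSION__, build, u.sysname, u.release, u.machine);
    if (n < 0 || (size_t)n >= size) {
        return -1;
    }
    return 0;
}
```

Printing this line at the top of every result file ties each set of numbers to the toolchain and kernel that produced it. The exact flags still have to come from the build system, since the compiler does not expose them as a macro.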
Step 7: Document Everything
Your report should allow another person to reproduce your results.
Report Template
## Test Environment
### Hardware
- CPU: Intel Core i7-12700K @ 3.6 GHz (fixed, turbo disabled)
- Memory: 32 GB DDR5-4800
- Storage: Samsung 980 Pro NVMe
### Software
- OS: Ubuntu 22.04 LTS (kernel 6.5.0)
- Compiler: GCC 12.3.0
- Flags: -O3 -march=native -flto
### Environment Settings
- CPU governor: performance
- Turbo boost: disabled
- Hyperthreading: disabled
- ASLR: disabled
- Isolated cores: 2-3
## Methodology
- Warm-up: 1,000 iterations
- Measured: 10,000 iterations
- Repetitions: 30 independent runs
- Statistics: median with 95% CI
## Results
| Metric | Value | 95% CI |
|--------|-------|--------|
| Latency | 145 ns | ±8 ns |
| Throughput | 6.9 M ops/s | ±0.3 M |
## Reproduction Steps
git clone <repo> && cd benchmark
./setup_env.sh # Setup environment
./run_benchmark.sh # Run tests
The Anti-Patterns — Things to Never Do
1. Cherry-picking
❌ Ran 10 times, only report the best one
✅ Report statistical summary of all results
2. Hiding Variance
❌ "5% faster" (when variance is actually 20%)
✅ "5% ± 3% faster, statistically significant"
3. Unfair Baseline
❌ Compare your optimized program vs competitor's default config
✅ Both use default config, or both are optimized
4. Ignoring Cold Start
❌ Measurement includes first execution (cache miss, page fault)
✅ Clearly distinguish cold start and steady state performance
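To avoid the second anti-pattern, an improvement should be reported together with an interval. The text does not prescribe how to compute one; the sketch below uses a percentile bootstrap, which is one common choice that makes no normality assumption. All names here (`median_reduction_ci` and its helpers) are hypothetical, and the small xorshift generator is there only to make runs repeatable for a given seed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of v[0..n-1]; sorts v in place. */
static double median_inplace(double *v, size_t n) {
    qsort(v, n, sizeof(*v), cmp_double);
    return (n % 2 == 1) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* xorshift64 pseudo-random generator; state must be non-zero. */
static uint64_t next_random(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Draw n values from src with replacement into dst; return their median. */
static double resampled_median(const double *src, size_t n,
                               double *dst, uint64_t *rng) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[next_random(rng) % n];
    }
    return median_inplace(dst, n);
}

/* Percentile-bootstrap 95% interval for the percentage reduction in median
 * latency from old_s to new_s (positive means the new version is faster).
 * Assumes positive latencies. Returns 0 on success, -1 if memory runs out. */
int median_reduction_ci(const double *old_s, size_t n_old,
                        const double *new_s, size_t n_new,
                        size_t resamples, uint64_t seed,
                        double *lo_pct, double *hi_pct) {
    assert(n_old > 0 && n_new > 0 && resamples >= 40 && seed != 0);
    size_t n_max = (n_old > n_new) ? n_old : n_new;
    double *scratch = malloc(n_max * sizeof(*scratch));
    double *estimates = malloc(resamples * sizeof(*estimates));
    if (scratch == NULL || estimates == NULL) {
        free(scratch);
        free(estimates);
        return -1;
    }
    uint64_t rng = seed;
    for (size_t r = 0; r < resamples; r++) {
        double m_old = resampled_median(old_s, n_old, scratch, &rng);
        double m_new = resampled_median(new_s, n_new, scratch, &rng);
        estimates[r] = 100.0 * (m_old - m_new) / m_old;
    }
    qsort(estimates, resamples, sizeof(*estimates), cmp_double);
    *lo_pct = estimates[(size_t)(0.025 * (double)resamples)];
    *hi_pct = estimates[(size_t)(0.975 * (double)resamples)];
    free(scratch);
    free(estimates);
    return 0;
}
```

If the resulting interval includes zero, the data do not support a claim that one version is faster. A call such as `median_reduction_ci(old_s, 30, new_s, 30, 10000, 12345, &lo, &hi)` needs only the raw samples from both versions, which is another reason to keep every measurement rather than the best one.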
Emily's Story Ending
I helped Emily redesign her benchmark:
- Added warm-up: 1000 iterations
- Increased sample size: From 1 to 30 runs
- Fixed environment: Fixed CPU frequency, disabled turbo
- Calculated statistics: median, stddev, CI
New results:
Old conclusion: "New version is 23% faster"
New conclusion: "New version median latency reduced by 18%
(95% CI: 15% - 21%)
All results consistent across N=30 measurements
Performance improvement statistically significant (p < 0.001)"
The number got smaller, but more credible.
Summary
The Benchmarking Checklist:
- Define clearly: What metric, what scale, what unit
- Control environment: Fixed frequency, isolated cores, minimal noise
- Warm up: Don't measure cold start unless that's what you want
- Run enough times: N ≥ 30 for serious work
- Report statistics: Median, stddev, confidence interval
- Compare fairly: Same environment, same workload, same methodology
- Document everything: Make it reproducible
The Golden Rule:
If someone cannot reproduce your benchmark results, your benchmark is worthless.
In the next chapter, we'll discuss how to optimize performance systematically once you have correct data.