Chapter 2: Setting Up Your Measurement Environment
Part I: Foundations
"Measure what is measurable, and make measurable what is not so." — Galileo Galilei
The Unreproducible Bug
"I ran it ten times, and I got a different number each time."
This was the first thing Jason, a new engineer I was mentoring, said during his first performance analysis task. He was measuring a sorting algorithm, and his results varied by 40% between runs.
"What's your measurement environment?" I asked.
"Just my laptop."
I glanced at his screen—Slack was open, Chrome had twenty-something tabs, Spotify was playing music, and a Docker container was running in the background.
"That's your problem," I said. "You're not measuring your program. You're measuring your program plus Slack, plus Chrome, plus Spotify, plus Docker, plus whatever mood your laptop is in."
System Noise: The Invisible Enemy
Modern operating systems are time-shared. Even when you're running a single program, the OS is doing many things in the background:
- Kernel threads: Memory management, I/O scheduling, network processing
- Interrupts: Hardware interrupts from network cards, USB, timers
- Background daemons: Cron jobs, logging, file indexing
- Power management: CPU frequency scaling, thermal throttling
This "noise" steals CPU cycles from your program—unpredictably.
Let me show you a simple experiment. This is the same benchmark run 100 times on a "busy" system:
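| Run | Time (μs) | Notes |
|-----|-----------|-------|
| 1 | 1,234 | |
| 2 | 1,198 | |
| 3 | 5,678 | ← Context switch? |
| 4 | 1,201 | |
| 5 | 1,245 | |
| ... | | |
| 47 | 8,901 | ← Interrupt storm? |
| ... | | |
| 100 | 1,199 | |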
Mean: 1,456 μs, but median: only 1,215 μs. Those outliers completely distort the average.
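To see how a couple of outliers pull the mean away from the median, here is a minimal sketch in C. It uses only the seven run times visible in the table above (the elided rows are unknown), so its numbers will not match the 100-run figures. The function names `mean_us` and `median_us` are illustrative, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples. */
double mean_us(const double *t, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += t[i];
    }
    return sum / (double)n;
}

/* Median of n samples. Note: sorts the caller's array in place. */
double median_us(double *t, size_t n) {
    assert(n > 0);
    qsort(t, n, sizeof t[0], cmp_double);
    return (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
}
```

Called on `{1234, 1198, 5678, 1201, 1245, 8901, 1199}`, `mean_us` returns roughly 2,951 μs while `median_us` returns 1,234 μs: two bad runs out of seven more than double the mean and leave the median untouched.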
Step 1: Reduce System Noise
To get reproducible measurements, you must first reduce noise sources.
Kill Unnecessary Programs
This is basic, but often overlooked:
# Check what's running
ps aux | head -20
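# Kill the usual CPU hogs
pkill chrome
pkill slack
pkill spotify
# Docker's daemon is managed by systemd and may be restarted if killed; stop the service instead
sudo systemctl stop docker.socket docker.service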
Disable Background Services
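# Unit names vary by distribution; list yours with: systemctl list-units --type=service
# Stop cron (the unit is called crond on some distributions)
sudo systemctl stop cron
# Stop logging daemon (careful: this stops system logs)
sudo systemctl stop rsyslog
# Stop indexing services
sudo systemctl stop tracker-miner-fs # GNOME
sudo systemctl stop mlocate # updatedb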
Set Up CPU Isolation
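Linux can reserve certain CPU cores for your benchmark by keeping the scheduler from placing ordinary tasks on them. Interrupts and some per-CPU kernel threads can still run there, so treat isolation as a large reduction in noise rather than a guarantee:
# Add to the kernel command line (requires reboot)
isolcpus=2,3
# Or use cpusets (via the cset tool) for dynamic isolation
sudo cset shield -c 2,3 -k on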
Then pin your benchmark to these isolated cores:
taskset -c 2 ./my_benchmark
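If you would rather pin from inside the benchmark than rely on remembering `taskset`, Linux exposes the same mechanism through `sched_setaffinity`. The following is a minimal sketch, assuming Linux with glibc; the helper names `pin_to_cpu` and `first_allowed_cpu` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.
   Returns 0 on success, -1 on failure (errno is set by the kernel). */
int pin_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* 0 = calling thread */
}

/* Lowest-numbered CPU the calling thread may currently run on, or -1. */
int first_allowed_cpu(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}
```

Calling `pin_to_cpu(2)` at the top of `main` has the same effect as launching with `taskset -c 2`, and it fails with -1 if CPU 2 does not exist or is not available to the process.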
CPU Frequency: The Hidden Variable
Modern CPUs don't run at fixed frequencies. They boost when cold, throttle when hot, and save power when idle. Great for laptops, terrible for benchmarking.
Turbo Boost: Friend or Foe?
When it has thermal and power headroom, the processor can temporarily run above its base frequency. But as temperature rises, frequency drops back down.
| Time (s) | Frequency | Benchmark Time |
|----------|-----------|----------------|
| 0 | 4.2 GHz | 952 μs |
| 10 | 4.0 GHz | 1,000 μs |
| 30 | 3.6 GHz | 1,111 μs |
| 60 | 3.2 GHz | 1,250 μs |
Same benchmark, but the run at 60 s is about 31% slower than the first (1,250 μs vs. 952 μs) because thermal throttling dropped the clock from 4.2 GHz to 3.2 GHz.
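The numbers in the table are what you would expect from a fixed amount of work executed at a falling clock rate. The sketch below assumes a purely core-bound workload of about 4 million cycles (an inference from the table, not a figure stated in it); the function names are illustrative.

```c
#include <assert.h>

/* Time in microseconds to execute `cycles` CPU cycles at `ghz` GHz.
   1 GHz corresponds to 1000 cycles per microsecond. */
double time_us(double cycles, double ghz) {
    assert(ghz > 0.0);
    return cycles / (ghz * 1000.0);
}

/* Percent slowdown of a later run relative to the first run. */
double slowdown_percent(double first_us, double later_us) {
    assert(first_us > 0.0);
    return 100.0 * (later_us - first_us) / first_us;
}
```

With 4 million cycles, `time_us` gives about 952 μs at 4.2 GHz and 1,250 μs at 3.2 GHz, and `slowdown_percent(952, 1250)` is about 31.3. Memory-bound code would scale less cleanly, since DRAM latency does not follow the core clock.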
Solution: Lock the CPU Frequency
# Check current frequency governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set to performance mode (maximum frequency)
sudo cpupower frequency-set -g performance
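# Or set a specific frequency (needs the userspace governor; not supported by every driver, e.g. intel_pstate)
sudo cpupower frequency-set -f 2.0GHz
# Disable turbo boost (generic cpufreq interface, e.g. acpi-cpufreq)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# Or, with the intel_pstate driver:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo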
Important: Locking to "maximum frequency" isn't always best. If your benchmark runs long, the CPU may thermal-throttle anyway. Choose a frequency the system can sustain indefinitely.
Verify Frequency Stability
# Monitor CPU frequency in real-time
watch -n 0.5 "grep MHz /proc/cpuinfo"
# Or use turbostat for detailed stats
sudo turbostat --interval 1
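Watching a terminal works for a quick check, but for long runs it can help to record the frequency next to each timing sample so that a throttled run is visible in the data. A minimal sketch, assuming the Linux cpufreq sysfs file `scaling_cur_freq` (reported in kHz) is present; the function name `read_cur_freq_khz` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Current frequency of the given CPU in kHz, as reported by cpufreq,
   or -1 if the sysfs file is missing or unreadable. */
long read_cur_freq_khz(int cpu) {
    assert(cpu >= 0);
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1; /* no cpufreq support, e.g. inside some VMs */
    }
    long khz = -1;
    if (fscanf(f, "%ld", &khz) != 1) {
        khz = -1;
    }
    fclose(f);
    return khz;
}
```

Reading the value once before and once after each measurement, and discarding samples where the two differ noticeably, is one simple way to use it.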
Cache State: Cold vs. Warm
CPU cache is another hidden variable. The same program can be 10× slower with a cold cache than a warm one.
The Problem
Consider this simple array sum:
long sum_array(int *arr, size_t n) {
long sum = 0;
for (size_t i = 0; i < n; i++) {
sum += arr[i];
}
return sum;
}
On first execution, the data isn't in cache, so every cache line has to be fetched from main memory:
First run (cold cache): 5,234 μs
Second run (warm cache): 523 μs
A 10× difference! If you only measure once, which result did you capture?
Solution: Explicitly Choose Cold or Warm
Option 1: Measure warm cache (steady state)
This is usually what you want—performance during normal operation:
// Warm-up runs (discard results)
for (int i = 0; i < WARMUP; i++) {
sum_array(arr, n);
}
// Measurement runs (record results)
for (int i = 0; i < RUNS; i++) {
times[i] = measure(sum_array, arr, n);
}
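The loops above call a `measure` helper that the text does not define. One possible version is sketched here, assuming a POSIX system with `clock_gettime`; the signature is inferred from the call `measure(sum_array, arr, n)`, and the `sink` variable is an addition to keep the compiler from discarding the timed call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Stores the result so the optimizer cannot remove the call being timed. */
static volatile long sink;

/* Same array sum as in the text, repeated so this block stands alone. */
long sum_array(int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}

/* Elapsed time of one call to fn(arr, n), in microseconds.
   CLOCK_MONOTONIC is not affected by wall-clock adjustments. */
double measure(long (*fn)(int *, size_t), int *arr, size_t n) {
    assert(fn != NULL);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = fn(arr, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e6
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

For very short functions the cost of the two `clock_gettime` calls becomes significant, in which case timing a batch of calls and dividing is the usual workaround.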
Option 2: Measure cold cache (worst case)
Sometimes you need first-run performance, like startup latency:
for (int i = 0; i < RUNS; i++) {
// Flush the array from cache (flush_array is defined below)
flush_array(arr, n * sizeof(int));
// Measure cold cache performance
times[i] = measure(sum_array, arr, n);
}
How to flush the cache:
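// Method 1: Use the clflush instruction (x86 only; you need to know the addresses)
#include <emmintrin.h> // _mm_clflush, _mm_mfence
#include <stdint.h>
void flush_array(void *ptr, size_t size) {
    // Start at the beginning of the first cache line so that the last
    // line of an unaligned buffer is not skipped.
    uintptr_t start = (uintptr_t)ptr & ~(uintptr_t)63;
    uintptr_t end = (uintptr_t)ptr + size;
    for (uintptr_t a = start; a < end; a += 64) { // 64 = cache line size
        _mm_clflush((const void *)a);
    }
    _mm_mfence(); // wait for the flushes to complete
}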
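// Method 2: Walk a large "trash" array to evict old data
void evict_cache(void) {
    // 32MB; must be larger than your last-level (L3) cache
    static volatile char trash[32 * 1024 * 1024];
    // Write rather than read: untouched zero-filled pages can all map to one
    // shared physical page on Linux, so reading them would evict almost nothing.
    for (size_t i = 0; i < sizeof(trash); i += 64) {
        trash[i]++;
    }
}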
Key principle: Whatever you choose, be explicit and consistent. Don't mix approaches between runs.
ASLR and Memory Layout
Address Space Layout Randomization (ASLR) is a security feature. Each time you run a program, the addresses of the stack, heap, and shared libraries are randomized.
How does this affect performance measurements?
Cache Conflicts
CPU caches use certain address bits to determine which cache set data belongs to. If your data structures happen to land at addresses that conflict, cache efficiency drops dramatically.
Because of ASLR, the same program may have different cache behavior on each run.
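To make the address-to-set mapping concrete, here is a small sketch for a hypothetical 32 KiB, 8-way cache with 64-byte lines. The geometry and function names are assumptions for illustration; real L2 and L3 caches are indexed by physical address and may hash the index bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cache geometry: 32 KiB, 8-way set associative, 64-byte lines. */
enum {
    LINE_SIZE = 64,
    WAYS = 8,
    CACHE_SIZE = 32 * 1024,
    NUM_SETS = CACHE_SIZE / (LINE_SIZE * WAYS)
};
static_assert(NUM_SETS == 64, "32 KiB / (64 B * 8 ways) should give 64 sets");

/* Set index of an address: drop the offset within the line,
   then keep the low log2(NUM_SETS) bits of the line number. */
unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Two addresses compete for the same WAYS slots when their set indices match. */
int same_set(uintptr_t a, uintptr_t b) {
    return cache_set(a) == cache_set(b);
}
```

In this geometry, addresses 4,096 bytes apart (for example `0x10000` and `0x11000`) land in the same set, while `0x10000` and `0x10040` land in neighboring sets. More than eight hot lines in one set evict each other even though the rest of the cache is empty, and whether your data ends up in that situation depends on where it happens to be placed.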
Solutions
Option 1: Disable ASLR
# Disable system-wide until reboot (restore with: echo 2 | sudo tee /proc/sys/kernel/randomize_va_space)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
# Disable for a specific program only
setarch $(uname -m) -R ./my_benchmark
Option 2: Run enough iterations to average it out
If you run enough times, ASLR's effects average out. But this requires more runs and produces higher variance.
My recommendation: Disable ASLR for benchmarking. We want reproducibility, not security.
NUMA: The Multi-Socket Trap
If you're benchmarking on a multi-socket server, there's another pitfall: NUMA (Non-Uniform Memory Access).
In NUMA systems, each CPU socket has its own "local" memory. Accessing local memory is fast; accessing remote memory is slow.
CPU 0 accessing local memory: 100 ns
CPU 0 accessing remote memory: 300 ns
If your program runs on CPU 0 but its data is allocated in CPU 1's memory, performance tanks.
Solution: Pin Both CPU and Memory
# Bind to node 0's CPUs and memory
numactl --cpunodebind=0 --membind=0 ./my_benchmark
# Or use interleave mode (distribute evenly)
numactl --interleave=all ./my_benchmark
Putting It All Together: Benchmark Environment Checklist
Based on everything above, here's my benchmark environment setup script:
#!/bin/bash
# benchmark_setup.sh - Create a reproducible benchmark environment
echo "=== Setting up benchmark environment ==="
# 1. Check for root privileges
if [ "$EUID" -ne 0 ]; then
echo "Please run as root"
exit 1
fi
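# 2. Stop background services (unit names vary by distribution)
echo "Stopping background services..."
for svc in cron rsyslog NetworkManager; do # drop NetworkManager if you need networking
    systemctl stop "$svc" 2>/dev/null || echo "  could not stop $svc (skipping)"
done
# 3. Set CPU frequency and disable turbo boost
echo "Setting CPU frequency..."
cpupower frequency-set -g performance
if [ -w /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo # intel_pstate driver
elif [ -w /sys/devices/system/cpu/cpufreq/boost ]; then
    echo 0 > /sys/devices/system/cpu/cpufreq/boost # generic cpufreq interface
else
    echo "  no turbo boost control found (skipping)"
fi
# 4. Disable ASLR
echo "Disabling ASLR..."
echo 0 > /proc/sys/kernel/randomize_va_space
# 5. Display current status
echo ""
echo "=== Current status ==="
echo "CPU frequency: $(grep -m1 MHz /proc/cpuinfo)"
echo "Turbo boost (cpufreq/boost): $(cat /sys/devices/system/cpu/cpufreq/boost 2>/dev/null || echo 'N/A')"
echo "No turbo (intel_pstate): $(cat /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || echo 'N/A')"
echo "ASLR: $(cat /proc/sys/kernel/randomize_va_space)"
echo "Isolated CPUs: $(cat /sys/devices/system/cpu/isolated)"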
echo ""
echo "Ready for benchmarking!"
echo "Remember to run: taskset -c <isolated_cpu> ./your_benchmark"
Back to Jason's Problem
Remember Jason's 40% run-to-run variation?
I had him make these changes:
- Close all unnecessary programs
- Lock CPU frequency
- Add warm-up runs
- Pin to a specific CPU core
Results:
Before:
Mean: 1,234 μs
Std Dev: 512 μs (41.5%)
After:
Mean: 1,198 μs
Std Dev: 12 μs (1.0%)
The relative standard deviation (standard deviation divided by the mean) dropped from 41.5% to 1.0%. Now that's a measurement you can trust.
"So there was nothing wrong with my program," Jason said. "It was my measurement environment."
"Exactly," I said. "You've just learned the most important lesson in performance analysis: Before you measure your program, measure your measurement environment."
Summary
A reliable measurement environment is the foundation of correct benchmarking. This chapter covered:
System Noise
- Kill unnecessary programs and background services
- Use CPU isolation to reduce kernel interference
- Pin benchmarks to fixed CPU cores with taskset
CPU Frequency
- Lock frequency to avoid turbo boost and thermal throttling
- Choose a frequency the system can sustain
- Verify stability throughout the entire test
Cache State
- Explicitly choose cold cache or warm cache measurement
- Be consistent—don't mix approaches
ASLR and NUMA
- Disable ASLR for reproducibility
- On NUMA systems, bind both CPU and memory to the same node