Chapter 2: Setting Up Your Measurement Environment

Part I: Foundations


"Measure what is measurable, and make measurable what is not so." — Galileo Galilei

The Unreproducible Bug

"I ran it ten times, and I got a different number each time."

This was the first thing Jason, a new engineer I was mentoring, said during his first performance analysis task. He was measuring a sorting algorithm, and his results varied by 40% between runs.

"What's your measurement environment?" I asked.

"Just my laptop."

I glanced at his screen—Slack was open, Chrome had twenty-something tabs, Spotify was playing music, and a Docker container was running in the background.

"That's your problem," I said. "You're not measuring your program. You're measuring your program plus Slack, plus Chrome, plus Spotify, plus Docker, plus whatever mood your laptop is in."

System Noise: The Invisible Enemy

Modern operating systems are time-shared. Even when you're running a single program, the OS is doing many things in the background:

  • Kernel threads: Memory management, I/O scheduling, network processing
  • Interrupts: Hardware interrupts from network cards, USB, timers
  • Background daemons: Cron jobs, logging, file indexing
  • Power management: CPU frequency scaling, thermal throttling

This "noise" steals CPU cycles from your program—unpredictably.
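
One way to see this from inside a program is to ask the kernel how often the process was preempted. The sketch below is a minimal illustration, assuming a POSIX system where `getrusage` fills in the `ru_nivcsw` counter (Linux does). The helper name `involuntary_switches` is hypothetical and not part of any library.

```c
#include <sys/resource.h>

/* Hypothetical helper: number of involuntary context switches
 * (times the scheduler preempted us) charged to this process so far.
 * Returns -1 if the counters cannot be read. */
long involuntary_switches(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        return -1;
    }
    return ru.ru_nivcsw;
}
```

Calling it immediately before and after a timed run and subtracting gives the number of preemptions during that run. A nonzero difference is a hint that the sample was disturbed.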

Let me show you a simple experiment. This is the same benchmark run 100 times on a "busy" system:

Run    Time (μs)    Notes
1      1,234
2      1,198
3      5,678        ← Context switch?
4      1,201
5      1,245
...
47     8,901        ← Interrupt storm?
...
100    1,199

The mean is 1,456 μs, but the median is only 1,215 μs. A handful of outliers pull the mean about 20% above a typical run.
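
To make the difference concrete, here is a small sketch of both statistics. The function names are illustrative, and both functions assume `n` is at least 1.

```c
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean: every sample, including outliers, counts equally. */
double mean(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum / (double)n;
}

/* Median: sorts v in place, then returns the middle value
 * (or the average of the two middle values when n is even). */
double median(double *v, size_t n) {
    qsort(v, n, sizeof v[0], cmp_double);
    return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}
```

For the five samples visible at the top of the table (1,234, 1,198, 5,678, 1,201, 1,245), `mean` returns 2,111.2 while `median` returns 1,234. A single slow run is enough to move the mean far from what most runs actually took.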

Step 1: Reduce System Noise

To get reproducible measurements, you must first reduce noise sources.

Kill Unnecessary Programs

This is basic, but often overlooked:

# Check what's running
ps aux | head -20

# Kill the usual CPU hogs
pkill chrome
pkill slack
pkill spotify

# Docker runs as a system service, so stop the daemon instead of killing it
sudo systemctl stop docker

Disable Background Services

# Stop cron
sudo systemctl stop cron

# Stop logging daemon (careful—this stops system logs)
sudo systemctl stop rsyslog

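# Stop indexing services (unit names vary by distribution and version)
systemctl --user stop tracker-miner-fs   # GNOME indexer runs as a per-user service
systemctl list-timers | grep -i locate   # updatedb runs from a timer, not a daemon;
                                         # stop whichever timer this lists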

Set Up CPU Isolation

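Linux can reserve CPU cores for your benchmark by keeping the scheduler from placing ordinary tasks on them. Per-CPU kernel threads and interrupts can still run there, so isolation reduces interference rather than eliminating it:

# Add to the kernel command line (requires reboot)
isolcpus=2,3

# Or use cpusets through the cset tool for isolation without a reboot
sudo cset shield -c 2,3 -k on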

Then pin your benchmark to these isolated cores:

taskset -c 2 ./my_benchmark
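
The same pinning can be done from inside the benchmark. This is a minimal sketch, assuming Linux with glibc (the `sched_setaffinity` call and the `CPU_*` macros are Linux-specific). The wrapper name `pin_to_cpu` is hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (for example, when the CPU
 * does not exist or is outside the set we are allowed to use). */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}
```

Calling `pin_to_cpu(2)` at the top of `main` has the same effect as launching with `taskset -c 2`, with the advantage that nobody can forget to type the prefix.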

CPU Frequency: The Hidden Variable

Modern CPUs don't run at fixed frequencies. They boost when cold, throttle when hot, and save power when idle. Great for laptops, terrible for benchmarking.

Turbo Boost: Friend or Foe?

When there is thermal and power headroom, typically when only a few cores are busy, the processor can raise its clock above the rated base frequency. But as temperature rises, frequency drops back down.

Time (s)    Frequency    Benchmark Time
0           4.2 GHz      952 μs
10          4.0 GHz      1,000 μs
30          3.6 GHz      1,111 μs
60          3.2 GHz      1,250 μs

Same benchmark, but the run at 60 s takes about 31% longer than the first one because the clock has dropped from 4.2 GHz to 3.2 GHz.
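
The numbers in the table follow from simple arithmetic if we assume the benchmark is compute-bound and always executes the same number of cycles. That assumption is mine, for illustration, and the function name is hypothetical.

```c
/* Time in microseconds to execute `cycles` clock cycles at `ghz` GHz.
 * 1 GHz is 1,000 cycles per microsecond. */
double runtime_us(double cycles, double ghz) {
    return cycles / (ghz * 1000.0);
}
```

With 4,000,000 cycles, `runtime_us` gives 1,000 μs at 4.0 GHz and 1,250 μs at 3.2 GHz, matching the table. Nothing about the program changed; only the clock did.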

Solution: Lock the CPU Frequency

# Check current frequency governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

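# Set to performance mode (requests the highest available frequency)
sudo cpupower frequency-set -g performance

# Or set a specific frequency (needs the userspace governor, which is usually not available with intel_pstate)
sudo cpupower frequency-set -f 2.0GHz

# Disable turbo boost (for cpufreq drivers that expose this file, such as acpi-cpufreq)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# Or with the intel_pstate driver:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo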

Important: Locking to "maximum frequency" isn't always best. If your benchmark runs long, the CPU may thermal-throttle anyway. Choose a frequency the system can sustain indefinitely.

Verify Frequency Stability

# Monitor CPU frequency in real-time
watch -n 0.5 "grep MHz /proc/cpuinfo"

# Or use turbostat for detailed stats
sudo turbostat --interval 1
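
A benchmark harness can also check the frequency itself. The sketch below assumes the Linux cpufreq sysfs interface, where files such as `/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq` hold the current frequency in kHz. Both function names and the tolerance parameter are hypothetical.

```c
#include <stdio.h>

/* Read a single integer (kHz) from a cpufreq sysfs file.
 * Returns -1 if the file is missing or cannot be parsed. */
long read_khz(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    long khz = -1;
    if (fscanf(f, "%ld", &khz) != 1) {
        khz = -1;
    }
    fclose(f);
    return khz;
}

/* Returns 1 if two readings differ by at most tolerance_pct percent
 * of the first reading, 0 otherwise (including invalid readings). */
int freq_stable(long before_khz, long after_khz, double tolerance_pct) {
    if (before_khz <= 0 || after_khz <= 0) {
        return 0;
    }
    double diff = (double)(after_khz - before_khz);
    if (diff < 0) {
        diff = -diff;
    }
    return diff * 100.0 / (double)before_khz <= tolerance_pct;
}
```

Reading the file for the pinned core before and after the measurement loop, and discarding the batch when `freq_stable` returns 0, catches throttling that would otherwise pass unnoticed. Two readings only bracket the run, so this complements the monitoring tools above instead of replacing them.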

Cache State: Cold vs. Warm

CPU cache is another hidden variable. The same program can be 10× slower with a cold cache than a warm one.

The Problem

Consider this simple array sum:

long sum_array(int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}

On first execution, the data isn't in cache, so every cache line has to be fetched from main memory:

First run (cold cache):   5,234 μs
Second run (warm cache):  523 μs

A 10× difference! If you only measure once, which result did you capture?

Solution: Explicitly Choose Cold or Warm

Option 1: Measure warm cache (steady state)

This is usually what you want—performance during normal operation:

// Warm-up runs (discard results)
for (int i = 0; i < WARMUP; i++) {
    sum_array(arr, n);
}

// Measurement runs (record results)
for (int i = 0; i < RUNS; i++) {
    times[i] = measure(sum_array, arr, n);
}
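
The loop above relies on a `measure` helper that the text does not define. One possible version is sketched here, assuming a POSIX system with `clock_gettime`. The `sum_fn` typedef and the `measure_sink` variable are names introduced only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* The array-sum function from the text, repeated so the sketch compiles alone. */
long sum_array(int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}

typedef long (*sum_fn)(int *, size_t);

/* Storing the result in a volatile keeps the compiler from
 * discarding the call whose time we want. */
volatile long measure_sink;

/* Wall-clock time of one call to fn, in microseconds. */
double measure(sum_fn fn, int *arr, size_t n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    measure_sink = fn(arr, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e6 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

`CLOCK_MONOTONIC` is used because it never jumps backwards when the system clock is adjusted. Keep in mind that warm-up only produces a warm cache if the array actually fits in the cache.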

Option 2: Measure cold cache (worst case)

Sometimes you need first-run performance, like startup latency:

for (int i = 0; i < RUNS; i++) {
    // Flush the array from cache (flush_array is defined below)
    flush_array(arr, n * sizeof(arr[0]));

    // Measure cold cache performance
    times[i] = measure(sum_array, arr, n);
}

How to flush the cache:

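// Method 1: Use the clflush instruction (x86 only; need to know addresses)
#include <emmintrin.h>  // _mm_clflush, _mm_mfence
#include <stddef.h>
#include <stdint.h>

void flush_array(void *ptr, size_t size) {
    // Start at the beginning of the first cache line so that the last
    // line of an unaligned buffer is not skipped
    uintptr_t p = (uintptr_t)ptr & ~(uintptr_t)63;  // 64 = cache line size
    uintptr_t end = (uintptr_t)ptr + size;
    for (; p < end; p += 64) {
        _mm_clflush((const void *)p);
    }
    _mm_mfence();
}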

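// Method 2: Write to a large "trash" array to evict old data
void evict_cache(void) {
    // Size this larger than your last-level cache; 32MB is an assumption
    static volatile char trash[32 * 1024 * 1024];
    // Write rather than read: untouched zero-filled pages can all map to
    // one shared physical page, so reading them would evict almost nothing
    for (size_t i = 0; i < sizeof(trash); i += 64) {
        trash[i]++;
    }
}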

Key principle: Whatever you choose, be explicit and consistent. Don't mix approaches between runs.

ASLR and Memory Layout

Address Space Layout Randomization (ASLR) is a security feature. Each time you run a program, the addresses of the stack, heap, and shared libraries are randomized.

How does this affect performance measurements?

Cache Conflicts

CPU caches use certain address bits to determine which cache set data belongs to. If your data structures happen to land at addresses that conflict, cache efficiency drops dramatically.

Because of ASLR, the same program may have different cache behavior on each run.
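
As a simplified model of how an address picks a cache set, consider the sketch below. It ignores virtual-to-physical translation, and the cache geometry in the example is an assumption chosen for round numbers, not a description of any particular CPU. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Set index for an address in a set-associative cache:
 * drop the offset within the line, then wrap around the number of sets. */
size_t cache_set_index(uintptr_t addr, size_t line_size, size_t num_sets) {
    return (size_t)(addr / line_size) % num_sets;
}
```

Take a 1 MiB, 16-way cache with 64-byte lines, which has 1,024 sets. Addresses 0x10000 and 0x20000 are exactly 64 KiB apart and both land in set 0, so they compete for the same 16 ways. Shift the second buffer by one 4 KiB page to 0x21000 and it lands in set 64 instead. A different layout on each run means a different pattern of collisions.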

Solutions

Option 1: Disable ASLR

# Temporarily disable system-wide (lasts until reboot; write 2 to restore the usual default)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# Disable for a specific program only
setarch $(uname -m) -R ./my_benchmark

Option 2: Run enough iterations to average it out

If you launch the program enough times, ASLR's effects average out, because each launch gets a fresh layout. But this requires more runs, and the individual results show higher variance.

My recommendation: Disable ASLR for benchmarking. We want reproducibility, not security.

NUMA: The Multi-Socket Trap

If you're benchmarking on a multi-socket server, there's another pitfall: NUMA (Non-Uniform Memory Access).

In NUMA systems, each CPU socket has its own "local" memory. Accessing local memory is fast; accessing remote memory is slow.

CPU 0 accessing local memory:   100 ns
CPU 0 accessing remote memory:  300 ns

If your program runs on CPU 0 but its data is allocated in memory attached to the other socket, performance tanks.

Solution: Pin Both CPU and Memory

# Bind to node 0's CPUs and memory
numactl --cpunodebind=0 --membind=0 ./my_benchmark

# Or use interleave mode (distribute evenly)
numactl --interleave=all ./my_benchmark
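
To see what interleaving costs, here is a back-of-the-envelope sketch using the illustrative latencies above. It assumes accesses are spread uniformly over the pages; the function name is hypothetical.

```c
/* Average access latency when remote_fraction of all accesses
 * (between 0.0 and 1.0) go to remote memory. */
double avg_latency_ns(double local_ns, double remote_ns, double remote_fraction) {
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns;
}
```

On a two-node machine, interleaving puts about half the pages on the remote node, so `avg_latency_ns(100, 300, 0.5)` gives 200 ns. That is slower than fully local memory but at least the same on every run, which is the property a benchmark needs.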

Putting It All Together: Benchmark Environment Checklist

Based on everything above, here's my benchmark environment setup script:

#!/bin/bash
# benchmark_setup.sh - Create a reproducible benchmark environment

echo "=== Setting up benchmark environment ==="

# 1. Check for root privileges
if [ "$EUID" -ne 0 ]; then
    echo "Please run as root"
    exit 1
fi

# 2. Stop background services
echo "Stopping background services..."
systemctl stop cron
systemctl stop rsyslog
# systemctl stop NetworkManager  # uncomment only if networking is not needed; this can cut off remote sessions

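# 3. Set CPU frequency
echo "Setting CPU frequency..."
cpupower frequency-set -g performance
# Disable turbo boost; the control file depends on the cpufreq driver
if [ -w /sys/devices/system/cpu/cpufreq/boost ]; then
    echo 0 > /sys/devices/system/cpu/cpufreq/boost
elif [ -w /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
fi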

# 4. Disable ASLR
echo "Disabling ASLR..."
echo 0 > /proc/sys/kernel/randomize_va_space

# 5. Display current status
echo ""
echo "=== Current status ==="
echo "CPU frequency: $(grep -m1 MHz /proc/cpuinfo)"
echo "Turbo boost: $(cat /sys/devices/system/cpu/cpufreq/boost 2>/dev/null || echo 'N/A')"
echo "ASLR: $(cat /proc/sys/kernel/randomize_va_space)"
echo "Isolated CPUs: $(cat /sys/devices/system/cpu/isolated)"

echo ""
echo "Ready for benchmarking!"
echo "Remember to run: taskset -c <isolated_cpu> ./your_benchmark"

Back to Jason's Problem

Remember Jason's 40% run-to-run variation?

I had him make these changes:

  1. Close all unnecessary programs
  2. Lock CPU frequency
  3. Add warm-up runs
  4. Pin to a specific CPU core

Results:

Before:
  Mean: 1,234 μs
  Std Dev: 512 μs (41.5%)

After:
  Mean: 1,198 μs
  Std Dev: 12 μs (1.0%)

The relative standard deviation dropped from 41.5% to 1.0%. Now that's a measurement you can trust.
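
The percentages in parentheses are the standard deviation divided by the mean, often called the coefficient of variation. A minimal sketch of that calculation follows. It assumes the sample standard deviation (dividing by n - 1), which the text does not specify, and the function names are hypothetical.

```c
#include <stddef.h>

/* Newton's method square root, so the sketch builds without linking
 * the math library; ordinary code would call sqrt() from <math.h>. */
static double sqrt_newton(double x) {
    if (x <= 0.0) {
        return 0.0;
    }
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 100; i++) {
        r = 0.5 * (r + x / r);
    }
    return r;
}

/* Sample standard deviation as a percentage of the mean. Needs n >= 2. */
double cv_percent(const double *v, size_t n) {
    double mean = 0.0;
    for (size_t i = 0; i < n; i++) {
        mean += v[i];
    }
    mean /= (double)n;

    double ss = 0.0;  /* sum of squared deviations from the mean */
    for (size_t i = 0; i < n; i++) {
        ss += (v[i] - mean) * (v[i] - mean);
    }
    return 100.0 * sqrt_newton(ss / (double)(n - 1)) / mean;
}
```

Because it is a ratio, this figure can be compared across benchmarks with very different run times, which a raw standard deviation in microseconds cannot.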

"So there was nothing wrong with my program," Jason said. "It was my measurement environment."

"Exactly," I said. "You've just learned the most important lesson in performance analysis: Before you measure your program, measure your measurement environment."

Summary

A reliable measurement environment is the foundation of correct benchmarking. This chapter covered:

System Noise

  • Kill unnecessary programs and background services
  • Use CPU isolation to keep other tasks off your benchmark cores
  • Pin benchmarks to fixed CPU cores with taskset

CPU Frequency

  • Lock frequency to avoid turbo boost and thermal throttling
  • Choose a frequency the system can sustain
  • Verify stability throughout the entire test

Cache State

  • Explicitly choose cold cache or warm cache measurement
  • Be consistent—don't mix approaches

ASLR and NUMA

  • Disable ASLR for reproducibility
  • On NUMA systems, bind both CPU and memory to the same node