Chapter 3: Measurement Methodology

Part I: Foundations

"Not everything that can be counted counts, and not everything that counts can be counted." — William Bruce Cameron

A Tale of Two Timers

It was a simple enough task: measure how long a function takes to execute.

My colleague Lisa used gettimeofday():

struct timeval start, end;
gettimeofday(&start, NULL);
do_something();
gettimeofday(&end, NULL);
long elapsed = (end.tv_sec - start.tv_sec) * 1000000 +
               (end.tv_usec - start.tv_usec);
printf("Time: %ld μs\n", elapsed);

I used clock_gettime(CLOCK_MONOTONIC):

struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
do_something();
clock_gettime(CLOCK_MONOTONIC, &end);
long elapsed = (end.tv_sec - start.tv_sec) * 1000000000L +
               (end.tv_nsec - start.tv_nsec);
printf("Time: %ld ns\n", elapsed);

We measured the same function but got different results.

Lisa's results bounced between 1,000 and 1,200 microseconds. Mine stayed stable around 1,050 microseconds.

Stranger still: after a sysadmin adjusted NTP (Network Time Protocol), Lisa's measurement came back negative—the function ended before it started.

"That's impossible," she said. "Time doesn't run backwards."

But the timer she was using can.

Wall Clock vs. Monotonic Clock

This is the first lesson in understanding timers: not all clocks are suitable for measuring elapsed time.

Wall Clock (`gettimeofday`, `CLOCK_REALTIME`)

A wall clock represents "what time is it right now?" It can be adjusted by NTP, changed manually by users, or affected by daylight saving time.

12:00:00.000  - Start measurement
12:00:01.000  - NTP adjusts time back to 11:59:59.500
11:59:59.600  - End measurement

Elapsed time = -0.4 seconds (negative!)

Wall clocks are for recording when something happened, not how long it took.

Monotonic Clock (`CLOCK_MONOTONIC`)

A monotonic clock is guaranteed to only move forward. It won't be adjusted by NTP (or will only be slewed gradually, never jumping).

Monotonic: 1000.000 - Start measurement
Monotonic: 1001.050 - End measurement

Elapsed time = 1.050 seconds (always positive)

Rule: When measuring elapsed time, always use a monotonic clock.

Timer Precision vs. Timer Resolution

Another common confusion: precision versus resolution.

Resolution: Smallest Reportable Unit

struct timespec res;
clock_getres(CLOCK_MONOTONIC, &res);
printf("Resolution: %ld ns\n", res.tv_nsec);
// Typically outputs: 1 ns

This tells you the timer can report values in 1-nanosecond increments. But this does not mean you can accurately measure 1-nanosecond intervals.

Precision: What You Can Actually Trust

Calling the timer itself takes time:

// Measure timer overhead
struct timespec t1, t2;
clock_gettime(CLOCK_MONOTONIC, &t1);
clock_gettime(CLOCK_MONOTONIC, &t2);
printf("Timer overhead: %ld ns\n",
       (t2.tv_sec - t1.tv_sec) * 1000000000L + t2.tv_nsec - t1.tv_nsec);

On my system, this overhead is about 20-50 nanoseconds. This means:

Measuring a 1 μs event: 2-5% error
Measuring a 100 ns event: 20-50% error
Measuring a 10 ns event: completely unreliable

Rule: The interval you're measuring should be at least 100× the timer overhead for less than 1% error.

CPU Cycles: The Most Precise Timer

When you need to measure very short intervals—tens to hundreds of CPU cycles—read the cycle counter directly.

x86: RDTSC

#include <x86intrin.h>

uint64_t start = __rdtsc();
do_something_very_fast();
uint64_t end = __rdtsc();

printf("Cycles: %lu\n", end - start);

ARM: CNTVCT_EL0

uint64_t read_cycles(void) {
    uint64_t val;
    asm volatile("mrs %0, cntvct_el0" : "=r" (val));
    return val;
}

RISC-V: rdcycle

uint64_t read_cycles(void) {
    uint64_t val;
    asm volatile("rdcycle %0" : "=r" (val));
    return val;
}

Handling Outliers

Real benchmark data will have outliers—extreme high or low values. The question is: should you remove them?

Why Outliers Happen

Interrupts: System interrupts preempt your program
Context switches: OS swaps your program out
Page faults: First access to a new memory page
GC pauses: If your language has garbage collection
Thermal throttling: CPU overheats and slows down

Two Approaches

Conservative: Keep all data

If your goal is understanding "real world" behavior, outliers are part of reality.

# Report full percentiles
p50 = np.percentile(data, 50)
p90 = np.percentile(data, 90)
p99 = np.percentile(data, 99)
p999 = np.percentile(data, 99.9)

This approach is common for latency SLAs ("99% of requests complete within 10ms").

Aggressive: Remove outliers

If your goal is understanding algorithm performance, outliers are noise.

# Remove outliers beyond 3 standard deviations
mean = np.mean(data)
std = np.std(data)
filtered = [x for x in data if abs(x - mean) < 3 * std]

Or use a more robust method:

# Use IQR (Interquartile Range)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
filtered = [x for x in data if q1 - 1.5*iqr <= x <= q3 + 1.5*iqr]

My recommendation: Do both. Report filtered results as the primary data, but also report full percentiles so readers can see the outlier situation.

Statistical Significance: Is Your Improvement Real?

You made an optimization. Before: 1,000 μs. After: 980 μs. Is that a 2% improvement?

Not necessarily. It might just be noise.

Hypothesis Testing

The formal approach uses hypothesis testing:

Null hypothesis (H0): No difference between groups
Alternative hypothesis (H1): There is a difference
Run a statistical test (e.g., t-test)
Check p-value: If p < 0.05, reject H0

from scipy import stats

before = [1000, 1020, 980, 1010, 990, ...]
after = [980, 970, 990, 960, 985, ...]

t_stat, p_value = stats.ttest_ind(before, after)
print(f"p-value: {p_value}")

if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("Difference might be noise")

Effect Size

Even if a difference is "statistically significant," it doesn't mean it's "important." With 10,000 samples, even a 0.1% difference can be "significant."

Effect size tells you how big the difference is:

# Cohen's d
def cohens_d(before, after):
    pooled_std = np.sqrt((np.std(before)**2 + np.std(after)**2) / 2)
    return (np.mean(after) - np.mean(before)) / pooled_std

d = cohens_d(before, after)
# |d| < 0.2: small
# 0.2 <= |d| < 0.8: medium
# |d| >= 0.8: large

Practical Significance

Finally, ask yourself: does this improvement matter to users?

1,000 μs → 980 μs (2%): Probably unnoticeable
100 ms → 50 ms (50%): Users will feel it
10 s → 5 s (50%): Users will thank you

Rule: Statistical significance is necessary but not sufficient. You also need practical significance.

Warm-up Revisited

In Chapter 1 we mentioned warm-up. Here's a deeper look at determining how many warm-up iterations you need.

Visual Method: Plot It

The most intuitive approach is plotting each run's time:

Run#  Time (μs)
1     5,234    **********************
2     3,891    ****************
3     2,456    **********
4     1,334    *****
5     1,256    *****
6     1,243    *****
7     1,238    *****
...
50    1,241    *****

You can see runs 1-4 are warm-up; from run 5 onward, results stabilize.

Automatic Method: CV Convergence

Calculate the rolling coefficient of variation (CV). When it stabilizes, warm-up is complete:

def find_warmup(times, threshold=0.05, window=10):
    """
    Find where warm-up ends.
    When rolling CV drops below threshold, consider it stable.
    """
    for i in range(window, len(times)):
        recent = times[i-window:i]
        cv = np.std(recent) / np.mean(recent)
        if cv < threshold:
            return i - window
    return len(times) // 2  # fallback

Rules of Thumb

CPU-bound computation: Usually 3-5 warm-up runs (JIT, cache warming)
Memory-intensive workload: May need 10-20 runs (page tables, TLB)
JIT languages (Java, JavaScript): May need hundreds to thousands

Important: Always verify your assumptions. Plot the data—don't blindly trust rules of thumb.

A Simple Measurement Framework

Let's put all these concepts together into a reusable framework:

#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

typedef struct {
    double *samples;
    size_t count;
    double mean;
    double std;
    double median;
    double p95;
    double p99;
} BenchmarkResult;

double get_time_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int compare_double(const void *a, const void *b) {
    double da = *(const double *)a;
    double db = *(const double *)b;
    return (da > db) - (da < db);
}

void analyze_result(BenchmarkResult *r) {
    qsort(r->samples, r->count, sizeof(double), compare_double);

    // Mean
    double sum = 0;
    for (size_t i = 0; i < r->count; i++) sum += r->samples[i];
    r->mean = sum / r->count;

    // Standard deviation
    double sq_sum = 0;
    for (size_t i = 0; i < r->count; i++) {
        sq_sum += (r->samples[i] - r->mean) * (r->samples[i] - r->mean);
    }
    r->std = sqrt(sq_sum / r->count);

    // Percentiles
    r->median = r->samples[r->count / 2];
    r->p95 = r->samples[(size_t)(r->count * 0.95)];
    r->p99 = r->samples[(size_t)(r->count * 0.99)];
}

void print_result(BenchmarkResult *r, const char *name) {
    printf("%s:\n", name);
    printf("  Mean:   %.2f ns (σ = %.2f, CV = %.2f%%)\n",
           r->mean, r->std, (r->std / r->mean) * 100);
    printf("  Median: %.2f ns\n", r->median);
    printf("  P95:    %.2f ns, P99: %.2f ns\n", r->p95, r->p99);
}

Usage:

#define WARMUP 100
#define RUNS 1000

void benchmark_example(void) {
    BenchmarkResult result;
    result.samples = malloc(RUNS * sizeof(double));
    result.count = RUNS;

    // Warm-up (discard)
    for (int i = 0; i < WARMUP; i++) do_something();

    // Measurement
    for (int i = 0; i < RUNS; i++) {
        double start = get_time_ns();
        do_something();
        result.samples[i] = get_time_ns() - start;
    }

    analyze_result(&result);
    print_result(&result, "do_something()");
    free(result.samples);
}

Back to Lisa's Problem

Remember Lisa getting negative elapsed time with gettimeofday()?

That day she learned:

Use the right timer: CLOCK_MONOTONIC won't go backwards due to NTP
Understand timer overhead: Don't try to measure intervals too short
CPU time vs. wall time: Choose based on your goal

A week later, she wrote a complete benchmark framework that became our team's standard tool.

"I had no idea timers were this complicated," she said.

"They are," I said. "But once you understand the pitfalls, you can write benchmarks that anyone can trust."

Summary

Correct measurement methodology is the core of reliable benchmarking. This chapter covered:

Timer Selection

Use CLOCK_MONOTONIC, not wall clocks
Understand timer overhead; don't measure intervals too short
For cycle-level precision, use CPU cycle counters

CPU Time vs. Wall Time

Wall time: user-perceived latency
CPU time: actual CPU usage
Their ratio tells you CPU utilization

Outlier Handling

Conservative: keep all data, report percentiles
Aggressive: remove outliers, but document your method
Recommendation: do both

Statistical Significance

Use hypothesis testing to confirm differences are real
Check effect size to confirm differences matter
Consider practical significance

Warm-up

Discard unstable initial runs
Use visual or automatic methods to determine warm-up count
Always verify your assumptions

Performance and Benchmarking