Chapter 3: Measurement Methodology
Part I: Foundations
"Not everything that can be counted counts, and not everything that counts can be counted." — William Bruce Cameron
A Tale of Two Timers
It was a simple enough task: measure how long a function takes to execute.
My colleague Lisa used gettimeofday():
struct timeval start, end;
gettimeofday(&start, NULL);
do_something();
gettimeofday(&end, NULL);
long elapsed = (end.tv_sec - start.tv_sec) * 1000000 +
(end.tv_usec - start.tv_usec);
printf("Time: %ld μs\n", elapsed);
I used clock_gettime(CLOCK_MONOTONIC):
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
do_something();
clock_gettime(CLOCK_MONOTONIC, &end);
long elapsed = (end.tv_sec - start.tv_sec) * 1000000000L +
(end.tv_nsec - start.tv_nsec);
printf("Time: %ld ns\n", elapsed);
We measured the same function but got different results.
Lisa's results bounced between 1,000 and 1,200 microseconds. Mine stayed stable around 1,050 microseconds.
Stranger still: after a sysadmin adjusted NTP (Network Time Protocol), Lisa's measurement came back negative—the function ended before it started.
"That's impossible," she said. "Time doesn't run backwards."
But the timer she was using can.
Wall Clock vs. Monotonic Clock
This is the first lesson in understanding timers: not all clocks are suitable for measuring elapsed time.
Wall Clock (gettimeofday, CLOCK_REALTIME)
A wall clock represents "what time is it right now?" It can be adjusted by NTP, changed manually by users, or affected by daylight saving time.
12:00:00.000 - Start measurement
12:00:01.000 - NTP adjusts time back to 11:59:59.500
11:59:59.600 - End measurement
Elapsed time = -0.4 seconds (negative!)
Wall clocks are for recording when something happened, not how long it took.
Monotonic Clock (CLOCK_MONOTONIC)
A monotonic clock is guaranteed to only move forward. It won't be adjusted by NTP (or will only be slewed gradually, never jumping).
Monotonic: 1000.000 - Start measurement
Monotonic: 1001.050 - End measurement
Elapsed time = 1.050 seconds (always positive)
Rule: When measuring elapsed time, always use a monotonic clock.
Timer Precision vs. Timer Resolution
Another common confusion: precision versus resolution.
Resolution: Smallest Reportable Unit
struct timespec res;
clock_getres(CLOCK_MONOTONIC, &res);
printf("Resolution: %ld ns\n", res.tv_nsec);
// Typically outputs: 1 ns
This tells you the timer can report values in 1-nanosecond increments. But this does not mean you can accurately measure 1-nanosecond intervals.
Precision: What You Can Actually Trust
Calling the timer itself takes time:
// Measure timer overhead
struct timespec t1, t2;
clock_gettime(CLOCK_MONOTONIC, &t1);
clock_gettime(CLOCK_MONOTONIC, &t2);
printf("Timer overhead: %ld ns\n",
(t2.tv_sec - t1.tv_sec) * 1000000000L + t2.tv_nsec - t1.tv_nsec);
On my system, this overhead is about 20-50 nanoseconds. This means:
- Measuring a 1 μs event: 2-5% error
- Measuring a 100 ns event: 20-50% error
- Measuring a 10 ns event: completely unreliable
Rule: The interval you're measuring should be at least 100× the timer overhead for less than 1% error.
CPU Cycles: The Most Precise Timer
When you need to measure very short intervals—tens to hundreds of CPU cycles—read the cycle counter directly.
x86: RDTSC
#include <x86intrin.h>
uint64_t start = __rdtsc();
do_something_very_fast();
uint64_t end = __rdtsc();
printf("Cycles: %lu\n", end - start);
ARM: CNTVCT_EL0
uint64_t read_cycles(void) {
uint64_t val;
asm volatile("mrs %0, cntvct_el0" : "=r" (val));
return val;
}
RISC-V: rdcycle
uint64_t read_cycles(void) {
uint64_t val;
asm volatile("rdcycle %0" : "=r" (val));
return val;
}
Handling Outliers
Real benchmark data will have outliers—extreme high or low values. The question is: should you remove them?
Why Outliers Happen
- Interrupts: System interrupts preempt your program
- Context switches: OS swaps your program out
- Page faults: First access to a new memory page
- GC pauses: If your language has garbage collection
- Thermal throttling: CPU overheats and slows down
Two Approaches
Conservative: Keep all data
If your goal is understanding "real world" behavior, outliers are part of reality.
# Report full percentiles
p50 = np.percentile(data, 50)
p90 = np.percentile(data, 90)
p99 = np.percentile(data, 99)
p999 = np.percentile(data, 99.9)
This approach is common for latency SLAs ("99% of requests complete within 10ms").
Aggressive: Remove outliers
If your goal is understanding algorithm performance, outliers are noise.
# Remove outliers beyond 3 standard deviations
mean = np.mean(data)
std = np.std(data)
filtered = [x for x in data if abs(x - mean) < 3 * std]
Or use a more robust method:
# Use IQR (Interquartile Range)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
filtered = [x for x in data if q1 - 1.5*iqr <= x <= q3 + 1.5*iqr]
My recommendation: Do both. Report filtered results as the primary data, but also report full percentiles so readers can see the outlier situation.
Statistical Significance: Is Your Improvement Real?
You made an optimization. Before: 1,000 μs. After: 980 μs. Is that a 2% improvement?
Not necessarily. It might just be noise.
Hypothesis Testing
The formal approach uses hypothesis testing:
- Null hypothesis (H0): No difference between groups
- Alternative hypothesis (H1): There is a difference
- Run a statistical test (e.g., t-test)
- Check p-value: If p < 0.05, reject H0
from scipy import stats
before = [1000, 1020, 980, 1010, 990, ...]
after = [980, 970, 990, 960, 985, ...]
t_stat, p_value = stats.ttest_ind(before, after)
print(f"p-value: {p_value}")
if p_value < 0.05:
print("Difference is statistically significant")
else:
print("Difference might be noise")
Effect Size
Even if a difference is "statistically significant," it doesn't mean it's "important." With 10,000 samples, even a 0.1% difference can be "significant."
Effect size tells you how big the difference is:
# Cohen's d
def cohens_d(before, after):
pooled_std = np.sqrt((np.std(before)**2 + np.std(after)**2) / 2)
return (np.mean(after) - np.mean(before)) / pooled_std
d = cohens_d(before, after)
# |d| < 0.2: small
# 0.2 <= |d| < 0.8: medium
# |d| >= 0.8: large
Practical Significance
Finally, ask yourself: does this improvement matter to users?
- 1,000 μs → 980 μs (2%): Probably unnoticeable
- 100 ms → 50 ms (50%): Users will feel it
- 10 s → 5 s (50%): Users will thank you
Rule: Statistical significance is necessary but not sufficient. You also need practical significance.
Warm-up Revisited
In Chapter 1 we mentioned warm-up. Here's a deeper look at determining how many warm-up iterations you need.
Visual Method: Plot It
The most intuitive approach is plotting each run's time:
Run# Time (μs)
1 5,234 **********************
2 3,891 ****************
3 2,456 **********
4 1,334 *****
5 1,256 *****
6 1,243 *****
7 1,238 *****
...
50 1,241 *****
You can see runs 1-4 are warm-up; from run 5 onward, results stabilize.
Automatic Method: CV Convergence
Calculate the rolling coefficient of variation (CV). When it stabilizes, warm-up is complete:
def find_warmup(times, threshold=0.05, window=10):
"""
Find where warm-up ends.
When rolling CV drops below threshold, consider it stable.
"""
for i in range(window, len(times)):
recent = times[i-window:i]
cv = np.std(recent) / np.mean(recent)
if cv < threshold:
return i - window
return len(times) // 2 # fallback
Rules of Thumb
- CPU-bound computation: Usually 3-5 warm-up runs (JIT, cache warming)
- Memory-intensive workload: May need 10-20 runs (page tables, TLB)
- JIT languages (Java, JavaScript): May need hundreds to thousands
Important: Always verify your assumptions. Plot the data—don't blindly trust rules of thumb.
A Simple Measurement Framework
Let's put all these concepts together into a reusable framework:
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
typedef struct {
double *samples;
size_t count;
double mean;
double std;
double median;
double p95;
double p99;
} BenchmarkResult;
double get_time_ns(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1e9 + ts.tv_nsec;
}
int compare_double(const void *a, const void *b) {
double da = *(const double *)a;
double db = *(const double *)b;
return (da > db) - (da < db);
}
void analyze_result(BenchmarkResult *r) {
qsort(r->samples, r->count, sizeof(double), compare_double);
// Mean
double sum = 0;
for (size_t i = 0; i < r->count; i++) sum += r->samples[i];
r->mean = sum / r->count;
// Standard deviation
double sq_sum = 0;
for (size_t i = 0; i < r->count; i++) {
sq_sum += (r->samples[i] - r->mean) * (r->samples[i] - r->mean);
}
r->std = sqrt(sq_sum / r->count);
// Percentiles
r->median = r->samples[r->count / 2];
r->p95 = r->samples[(size_t)(r->count * 0.95)];
r->p99 = r->samples[(size_t)(r->count * 0.99)];
}
void print_result(BenchmarkResult *r, const char *name) {
printf("%s:\n", name);
printf(" Mean: %.2f ns (σ = %.2f, CV = %.2f%%)\n",
r->mean, r->std, (r->std / r->mean) * 100);
printf(" Median: %.2f ns\n", r->median);
printf(" P95: %.2f ns, P99: %.2f ns\n", r->p95, r->p99);
}
Usage:
#define WARMUP 100
#define RUNS 1000
void benchmark_example(void) {
BenchmarkResult result;
result.samples = malloc(RUNS * sizeof(double));
result.count = RUNS;
// Warm-up (discard)
for (int i = 0; i < WARMUP; i++) do_something();
// Measurement
for (int i = 0; i < RUNS; i++) {
double start = get_time_ns();
do_something();
result.samples[i] = get_time_ns() - start;
}
analyze_result(&result);
print_result(&result, "do_something()");
free(result.samples);
}
Back to Lisa's Problem
Remember Lisa getting negative elapsed time with gettimeofday()?
That day she learned:
- Use the right timer:
CLOCK_MONOTONICwon't go backwards due to NTP - Understand timer overhead: Don't try to measure intervals too short
- CPU time vs. wall time: Choose based on your goal
A week later, she wrote a complete benchmark framework that became our team's standard tool.
"I had no idea timers were this complicated," she said.
"They are," I said. "But once you understand the pitfalls, you can write benchmarks that anyone can trust."
Summary
Correct measurement methodology is the core of reliable benchmarking. This chapter covered:
Timer Selection
- Use
CLOCK_MONOTONIC, not wall clocks - Understand timer overhead; don't measure intervals too short
- For cycle-level precision, use CPU cycle counters
CPU Time vs. Wall Time
- Wall time: user-perceived latency
- CPU time: actual CPU usage
- Their ratio tells you CPU utilization
Outlier Handling
- Conservative: keep all data, report percentiles
- Aggressive: remove outliers, but document your method
- Recommendation: do both
Statistical Significance
- Use hypothesis testing to confirm differences are real
- Check effect size to confirm differences matter
- Consider practical significance
Warm-up
- Discard unstable initial runs
- Use visual or automatic methods to determine warm-up count
- Always verify your assumptions