Appendix A: Benchmark Automation


"If you can't automate it, you can't scale it." — DevOps Proverb

Why Automation Is Needed

Manual benchmark execution has several problems:

Problems:
1. Human error (forgetting to set environment, wrong parameters)
2. Non-reproducible (different conditions each run)
3. Time-consuming (requires manual waiting and recording)
4. Hard to track (results scattered everywhere)

Solution:
  Automate benchmark workflow
  Integrate into CI/CD pipeline
  Automatically detect performance regressions

CI/CD Integration

GitHub Actions Example

# .github/workflows/benchmark.yml
name: Performance Benchmark

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  benchmark:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - name: Setup environment
      run: |
        # Lock CPU frequency (if possible)
        sudo cpupower frequency-set -g performance || true

        # Disable turbo boost
        echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo || true

    - name: Build
      run: |
        mkdir build && cd build
        cmake -DCMAKE_BUILD_TYPE=Release ..
        make -j$(nproc)

    - name: Run benchmarks
      run: |
        cd build
        ./benchmark --benchmark_format=json > benchmark_results.json

    - name: Upload results
      uses: actions/upload-artifact@v4
      with:
        name: benchmark-results
        path: build/benchmark_results.json

    - name: Compare with baseline
      run: |
        python scripts/compare_benchmarks.py \
          --baseline baseline.json \
          --current build/benchmark_results.json \
          --threshold 5

Dedicated Benchmark Runner

For serious performance testing, use a dedicated machine:

# Using self-hosted runner
jobs:
  benchmark:
    runs-on: [self-hosted, benchmark-machine]

    steps:
    - name: Ensure isolation
      run: |
        # Ensure no other programs running
        sudo systemctl stop cron
        sudo systemctl stop unattended-upgrades

        # Set CPU affinity
        taskset -c 0-3 ./benchmark

Google Benchmark Integration

Basic Setup

#include <benchmark/benchmark.h>

static void BM_VectorPushBack(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v;
        for (int i = 0; i < state.range(0); ++i) {
            v.push_back(i);
        }
    }
    state.SetComplexityN(state.range(0));
}

BENCHMARK(BM_VectorPushBack)
    ->Range(8, 8<<10)
    ->Complexity(benchmark::oN);

BENCHMARK_MAIN();

JSON Output

# Output JSON format
./benchmark --benchmark_format=json --benchmark_out=results.json

# Example output
{
  "context": {
    "date": "2024-01-15T10:30:00+08:00",
    "host_name": "benchmark-server",
    "executable": "./benchmark",
    "num_cpus": 8,
    "mhz_per_cpu": 3600,
    "cpu_scaling_enabled": false
  },
  "benchmarks": [
    {
      "name": "BM_VectorPushBack/8",
      "real_time": 45.2,
      "cpu_time": 44.8,
      "iterations": 15234567
    }
  ]
}

Comparison Tool

# Use Google Benchmark's comparison tool
pip install google-benchmark

# Compare two runs
compare.py benchmarks baseline.json current.json

# Output
Benchmark                    Time       CPU
--------------------------------------------
BM_VectorPushBack/8        -0.05     -0.04
BM_VectorPushBack/64       +0.12     +0.11  # Warning: regression
BM_VectorPushBack/512      -0.02     -0.03

Regression Detection

Statistical Method

import numpy as np
from scipy import stats

def detect_regression(baseline, current, threshold=0.05, alpha=0.05):
    """
    Use statistical test to detect performance regression

    Args:
        baseline: List of baseline measurements
        current: List of current measurements
        threshold: Acceptable performance change ratio
        alpha: Significance level

    Returns:
        (is_regression, p_value, change_percent)
    """
    # Calculate percent change
    baseline_mean = np.mean(baseline)
    current_mean = np.mean(current)
    change_percent = (current_mean - baseline_mean) / baseline_mean * 100

    # Perform t-test
    t_stat, p_value = stats.ttest_ind(baseline, current)

    # Determine if significant regression
    is_regression = (
        p_value < alpha and
        change_percent > threshold * 100
    )

    return is_regression, p_value, change_percent


# Usage example
baseline = [100.2, 101.5, 99.8, 100.1, 100.9]
current = [108.3, 109.1, 107.5, 108.8, 109.2]

is_reg, p, change = detect_regression(baseline, current)
print(f"Regression: {is_reg}, p-value: {p:.4f}, change: {change:.1f}%")
# Regression: True, p-value: 0.0001, change: 8.2%

Automation Script

#!/usr/bin/env python3
"""benchmark_compare.py - Compare benchmark results and detect regressions"""

import json
import sys
from pathlib import Path

def load_results(filepath):
    """Load Google Benchmark JSON results"""
    with open(filepath) as f:
        data = json.load(f)
    return {b['name']: b for b in data['benchmarks']}

def compare_results(baseline_path, current_path, threshold=5.0):
    """Compare two benchmark results"""
    baseline = load_results(baseline_path)
    current = load_results(current_path)

    regressions = []
    improvements = []

    for name, curr in current.items():
        if name not in baseline:
            continue

        base = baseline[name]
        change = (curr['real_time'] - base['real_time']) / base['real_time'] * 100

        if change > threshold:
            regressions.append((name, change))
        elif change < -threshold:
            improvements.append((name, change))

    return regressions, improvements

def main():
    if len(sys.argv) != 4:
        print("Usage: benchmark_compare.py <baseline> <current> <threshold>")
        sys.exit(1)

    baseline_path = sys.argv[1]
    current_path = sys.argv[2]
    threshold = float(sys.argv[3])

    regressions, improvements = compare_results(
        baseline_path, current_path, threshold
    )

    if regressions:
        print("❌ Performance Regressions Detected:")
        for name, change in regressions:
            print(f"  {name}: +{change:.1f}%")
        sys.exit(1)

    if improvements:
        print("✅ Performance Improvements:")
        for name, change in improvements:
            print(f"  {name}: {change:.1f}%")

    print("✅ No regressions detected")
    sys.exit(0)

if __name__ == "__main__":
    main()

Result Storage and Tracking

Database Storage

import sqlite3
from datetime import datetime

def init_db(db_path):
    """Initialize benchmark database"""
    conn = sqlite3.connect(db_path)
    conn.execute('''
        CREATE TABLE IF NOT EXISTS benchmarks (
            id INTEGER PRIMARY KEY,
            timestamp TEXT,
            commit_hash TEXT,
            benchmark_name TEXT,
            real_time REAL,
            cpu_time REAL,
            iterations INTEGER,
            UNIQUE(commit_hash, benchmark_name)
        )
    ''')
    conn.commit()
    return conn

def store_results(conn, commit_hash, results):
    """Store benchmark results"""
    timestamp = datetime.now().isoformat()

    for benchmark in results['benchmarks']:
        conn.execute('''
            INSERT OR REPLACE INTO benchmarks
            (timestamp, commit_hash, benchmark_name, real_time, cpu_time, iterations)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            timestamp,
            commit_hash,
            benchmark['name'],
            benchmark['real_time'],
            benchmark['cpu_time'],
            benchmark['iterations']
        ))

    conn.commit()

def get_history(conn, benchmark_name, limit=100):
    """Get benchmark history"""
    cursor = conn.execute('''
        SELECT timestamp, commit_hash, real_time
        FROM benchmarks
        WHERE benchmark_name = ?
        ORDER BY timestamp DESC
        LIMIT ?
    ''', (benchmark_name, limit))

    return cursor.fetchall()

Visualization

import matplotlib.pyplot as plt
import pandas as pd

def plot_benchmark_history(history, benchmark_name):
    """Plot benchmark history trend"""
    df = pd.DataFrame(history, columns=['timestamp', 'commit', 'time'])
    df['timestamp'] = pd.to_datetime(df['timestamp'])

    plt.figure(figsize=(12, 6))
    plt.plot(df['timestamp'], df['time'], marker='o')
    plt.xlabel('Date')
    plt.ylabel('Time (ns)')
    plt.title(f'Benchmark History: {benchmark_name}')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(f'{benchmark_name}_history.png')

Advanced Techniques

Environment Consistency Check

#!/bin/bash
# check_environment.sh - Check benchmark environment

echo "=== Environment Check ==="

# CPU frequency
echo "CPU Frequency:"
cat /proc/cpuinfo | grep "MHz" | head -1

# CPU Governor
echo "CPU Governor:"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Turbo Boost
echo "Turbo Boost:"
cat /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || echo "N/A"

# System load
echo "System Load:"
uptime

# Memory
echo "Memory:"
free -h

# Background processes
echo "Background Processes:"
ps aux | wc -l

Multiple Runs with Statistics

# Multiple runs in CI
- name: Run benchmarks (multiple iterations)
  run: |
    for i in {1..5}; do
      ./benchmark --benchmark_format=json > results_$i.json
    done

    # Merge results
    python scripts/merge_results.py results_*.json > final_results.json

Performance Budget

# performance_budget.yaml
benchmarks:
  BM_VectorPushBack/1024:
    max_time_ns: 50000
    max_memory_kb: 100

  BM_HashTableLookup:
    max_time_ns: 100

  BM_SortArray/10000:
    max_time_ns: 1000000
def check_budget(results, budget):
    """Check if performance exceeds budget"""
    violations = []

    for name, limits in budget['benchmarks'].items():
        if name not in results:
            continue

        result = results[name]

        if 'max_time_ns' in limits:
            if result['real_time'] > limits['max_time_ns']:
                violations.append(
                    f"{name}: {result['real_time']:.0f}ns > {limits['max_time_ns']}ns"
                )

    return violations

Summary

Key elements of benchmark automation:

CI/CD Integration

  • GitHub Actions / GitLab CI
  • Dedicated benchmark runner
  • Environment consistency

Automatic Regression Detection

  • Statistical testing
  • Threshold configuration
  • Automatic alerts

Result Management

  • Database storage
  • Historical tracking
  • Visualization

Best Practices

  • Multiple runs for statistics
  • Fixed environment settings
  • Performance budgets
  • Automated reporting