Appendix A: Benchmark Automation

"If you can't automate it, you can't scale it." — DevOps Proverb

Why Automation Is Needed

Manual benchmark execution has several problems:

Problems:
1. Human error (forgetting to set environment, wrong parameters)
2. Non-reproducible (different conditions each run)
3. Time-consuming (requires manual waiting and recording)
4. Hard to track (results scattered everywhere)

Solution:
  Automate benchmark workflow
  Integrate into CI/CD pipeline
  Automatically detect performance regressions

CI/CD Integration

GitHub Actions Example

# .github/workflows/benchmark.yml
name: Performance Benchmark

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  benchmark:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - name: Setup environment
      run: |
        # Lock CPU frequency (if possible)
        sudo cpupower frequency-set -g performance || true

        # Disable turbo boost
        echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo || true

    - name: Build
      run: |
        mkdir build && cd build
        cmake -DCMAKE_BUILD_TYPE=Release ..
        make -j$(nproc)

    - name: Run benchmarks
      run: |
        cd build
        ./benchmark --benchmark_format=json > benchmark_results.json

    - name: Upload results
      uses: actions/upload-artifact@v4
      with:
        name: benchmark-results
        path: build/benchmark_results.json

    - name: Compare with baseline
      run: |
        python scripts/compare_benchmarks.py \
          --baseline baseline.json \
          --current build/benchmark_results.json \
          --threshold 5

Dedicated Benchmark Runner

For serious performance testing, use a dedicated machine:

# Using self-hosted runner
jobs:
  benchmark:
    runs-on: [self-hosted, benchmark-machine]

    steps:
    - name: Ensure isolation
      run: |
        # Ensure no other programs running
        sudo systemctl stop cron
        sudo systemctl stop unattended-upgrades

        # Set CPU affinity
        taskset -c 0-3 ./benchmark

Google Benchmark Integration

Basic Setup

#include <benchmark/benchmark.h>

static void BM_VectorPushBack(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v;
        for (int i = 0; i < state.range(0); ++i) {
            v.push_back(i);
        }
    }
    state.SetComplexityN(state.range(0));
}

BENCHMARK(BM_VectorPushBack)
    ->Range(8, 8<<10)
    ->Complexity(benchmark::oN);

BENCHMARK_MAIN();

JSON Output

# Output JSON format
./benchmark --benchmark_format=json --benchmark_out=results.json

# Example output
{
  "context": {
    "date": "2024-01-15T10:30:00+08:00",
    "host_name": "benchmark-server",
    "executable": "./benchmark",
    "num_cpus": 8,
    "mhz_per_cpu": 3600,
    "cpu_scaling_enabled": false
  },
  "benchmarks": [
    {
      "name": "BM_VectorPushBack/8",
      "real_time": 45.2,
      "cpu_time": 44.8,
      "iterations": 15234567
    }
  ]
}

Comparison Tool

# Use Google Benchmark's comparison tool
pip install google-benchmark

# Compare two runs
compare.py benchmarks baseline.json current.json

# Output
Benchmark                    Time       CPU
--------------------------------------------
BM_VectorPushBack/8        -0.05     -0.04
BM_VectorPushBack/64       +0.12     +0.11  # Warning: regression
BM_VectorPushBack/512      -0.02     -0.03

Regression Detection

Statistical Method

import numpy as np
from scipy import stats

def detect_regression(baseline, current, threshold=0.05, alpha=0.05):
    """
    Use statistical test to detect performance regression

    Args:
        baseline: List of baseline measurements
        current: List of current measurements
        threshold: Acceptable performance change ratio
        alpha: Significance level

    Returns:
        (is_regression, p_value, change_percent)
    """
    # Calculate percent change
    baseline_mean = np.mean(baseline)
    current_mean = np.mean(current)
    change_percent = (current_mean - baseline_mean) / baseline_mean * 100

    # Perform t-test
    t_stat, p_value = stats.ttest_ind(baseline, current)

    # Determine if significant regression
    is_regression = (
        p_value < alpha and
        change_percent > threshold * 100
    )

    return is_regression, p_value, change_percent


# Usage example
baseline = [100.2, 101.5, 99.8, 100.1, 100.9]
current = [108.3, 109.1, 107.5, 108.8, 109.2]

is_reg, p, change = detect_regression(baseline, current)
print(f"Regression: {is_reg}, p-value: {p:.4f}, change: {change:.1f}%")
# Regression: True, p-value: 0.0001, change: 8.2%

Automation Script

#!/usr/bin/env python3
"""benchmark_compare.py - Compare benchmark results and detect regressions"""

import json
import sys
from pathlib import Path

def load_results(filepath):
    """Load Google Benchmark JSON results"""
    with open(filepath) as f:
        data = json.load(f)
    return {b['name']: b for b in data['benchmarks']}

def compare_results(baseline_path, current_path, threshold=5.0):
    """Compare two benchmark results"""
    baseline = load_results(baseline_path)
    current = load_results(current_path)

    regressions = []
    improvements = []

    for name, curr in current.items():
        if name not in baseline:
            continue

        base = baseline[name]
        change = (curr['real_time'] - base['real_time']) / base['real_time'] * 100

        if change > threshold:
            regressions.append((name, change))
        elif change < -threshold:
            improvements.append((name, change))

    return regressions, improvements

def main():
    if len(sys.argv) != 4:
        print("Usage: benchmark_compare.py <baseline> <current> <threshold>")
        sys.exit(1)

    baseline_path = sys.argv[1]
    current_path = sys.argv[2]
    threshold = float(sys.argv[3])

    regressions, improvements = compare_results(
        baseline_path, current_path, threshold
    )

    if regressions:
        print("❌ Performance Regressions Detected:")
        for name, change in regressions:
            print(f"  {name}: +{change:.1f}%")
        sys.exit(1)

    if improvements:
        print("✅ Performance Improvements:")
        for name, change in improvements:
            print(f"  {name}: {change:.1f}%")

    print("✅ No regressions detected")
    sys.exit(0)

if __name__ == "__main__":
    main()

Result Storage and Tracking

Database Storage

import sqlite3
from datetime import datetime

def init_db(db_path):
    """Initialize benchmark database"""
    conn = sqlite3.connect(db_path)
    conn.execute('''
        CREATE TABLE IF NOT EXISTS benchmarks (
            id INTEGER PRIMARY KEY,
            timestamp TEXT,
            commit_hash TEXT,
            benchmark_name TEXT,
            real_time REAL,
            cpu_time REAL,
            iterations INTEGER,
            UNIQUE(commit_hash, benchmark_name)
        )
    ''')
    conn.commit()
    return conn

def store_results(conn, commit_hash, results):
    """Store benchmark results"""
    timestamp = datetime.now().isoformat()

    for benchmark in results['benchmarks']:
        conn.execute('''
            INSERT OR REPLACE INTO benchmarks
            (timestamp, commit_hash, benchmark_name, real_time, cpu_time, iterations)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            timestamp,
            commit_hash,
            benchmark['name'],
            benchmark['real_time'],
            benchmark['cpu_time'],
            benchmark['iterations']
        ))

    conn.commit()

def get_history(conn, benchmark_name, limit=100):
    """Get benchmark history"""
    cursor = conn.execute('''
        SELECT timestamp, commit_hash, real_time
        FROM benchmarks
        WHERE benchmark_name = ?
        ORDER BY timestamp DESC
        LIMIT ?
    ''', (benchmark_name, limit))

    return cursor.fetchall()

Visualization

import matplotlib.pyplot as plt
import pandas as pd

def plot_benchmark_history(history, benchmark_name):
    """Plot benchmark history trend"""
    df = pd.DataFrame(history, columns=['timestamp', 'commit', 'time'])
    df['timestamp'] = pd.to_datetime(df['timestamp'])

    plt.figure(figsize=(12, 6))
    plt.plot(df['timestamp'], df['time'], marker='o')
    plt.xlabel('Date')
    plt.ylabel('Time (ns)')
    plt.title(f'Benchmark History: {benchmark_name}')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(f'{benchmark_name}_history.png')

Advanced Techniques

Environment Consistency Check

#!/bin/bash
# check_environment.sh - Check benchmark environment

echo "=== Environment Check ==="

# CPU frequency
echo "CPU Frequency:"
cat /proc/cpuinfo | grep "MHz" | head -1

# CPU Governor
echo "CPU Governor:"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Turbo Boost
echo "Turbo Boost:"
cat /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || echo "N/A"

# System load
echo "System Load:"
uptime

# Memory
echo "Memory:"
free -h

# Background processes
echo "Background Processes:"
ps aux | wc -l

Multiple Runs with Statistics

# Multiple runs in CI
- name: Run benchmarks (multiple iterations)
  run: |
    for i in {1..5}; do
      ./benchmark --benchmark_format=json > results_$i.json
    done

    # Merge results
    python scripts/merge_results.py results_*.json > final_results.json

Performance Budget

# performance_budget.yaml
benchmarks:
  BM_VectorPushBack/1024:
    max_time_ns: 50000
    max_memory_kb: 100

  BM_HashTableLookup:
    max_time_ns: 100

  BM_SortArray/10000:
    max_time_ns: 1000000

def check_budget(results, budget):
    """Check if performance exceeds budget"""
    violations = []

    for name, limits in budget['benchmarks'].items():
        if name not in results:
            continue

        result = results[name]

        if 'max_time_ns' in limits:
            if result['real_time'] > limits['max_time_ns']:
                violations.append(
                    f"{name}: {result['real_time']:.0f}ns > {limits['max_time_ns']}ns"
                )

    return violations

Summary

Key elements of benchmark automation:

CI/CD Integration

GitHub Actions / GitLab CI
Dedicated benchmark runner
Environment consistency

Automatic Regression Detection

Statistical testing
Threshold configuration
Automatic alerts

Result Management

Database storage
Historical tracking
Visualization

Best Practices

Multiple runs for statistics
Fixed environment settings
Performance budgets
Automated reporting

Performance and Benchmarking