Chapter 35: CI/CD for Performance

Part IX: Synthesis


"What gets measured gets managed." — Peter Drucker

"What gets automated gets repeated." — DevOps wisdom

The "Nobody Noticed" Performance Regression

Six months ago, our API latency was 50ms.

Today, it's 150ms.

Nobody noticed. No alerts. No tickets.

How did this happen?

I ran git log and found 847 commits in those six months. Each commit degraded performance by an average of about 0.12 ms, a difference no human would notice.

Accumulated, though: 847 × 0.12 ms ≈ 100 ms.

This is the horror of "Gradual Performance Regression." Like boiling a frog slowly, it gets a little slower each day, until one day customers start complaining.

The solution? Integrate performance testing into the CI/CD pipeline, so every commit goes through performance checks.
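
To make the arithmetic concrete, here is a minimal sketch that replays the story's numbers against two kinds of gate. The 5% gate and the function name are assumptions for illustration; only the 847 commits, 0.12 ms step, and 50 ms starting latency come from the story above.

```python
# Numbers from the story above; the 5% gate is an assumed threshold.
COMMITS = 847
STEP_MS = 0.12
START_MS = 50.0
THRESHOLD = 0.05  # fail if relative slowdown exceeds 5%

def first_failure(compare_to_fixed_baseline: bool):
    """Return the commit number at which the gate first fails, or None."""
    latency = START_MS
    for commit in range(1, COMMITS + 1):
        previous = latency
        latency += STEP_MS
        # Either compare against the original 50 ms or only against the parent commit.
        reference = START_MS if compare_to_fixed_baseline else previous
        if (latency - reference) / reference > THRESHOLD:
            return commit
    return None

print("gate vs. previous commit:", first_failure(compare_to_fixed_baseline=False))
print("gate vs. fixed baseline: ", first_failure(compare_to_fixed_baseline=True))
```

A gate that only compares each commit with its parent never fires, because no single step exceeds 0.24%. A gate that compares against a fixed baseline fires at commit 21, long before the slowdown reaches 100 ms. This is why the pipeline below stores history instead of looking at one commit at a time.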

Why Performance CI/CD Is Needed

┌─────────────────────────────────────────────────────────────────┐
│                     Why Automate Performance Testing?           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Catch regressions early                                     │
│     └── Find problems before PR merge, not in production        │
│                                                                 │
│  2. Track trends over time                                      │
│     └── Historical data reveals gradual degradation             │
│                                                                 │
│  3. Reproducible measurements                                   │
│     └── Fixed environment eliminates "fast on my machine"       │
│                                                                 │
│  4. Shift left                                                  │
│     └── Find early, fix early, lower cost                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The Performance CI Pipeline

A complete performance CI pipeline includes these stages:

┌──────────────────────────────────────────────────────────────────┐
│                    Performance CI Pipeline                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│   │ Trigger │───▶│ Setup   │───▶│ Run     │───▶│ Compare │      │
│   │ (PR)    │    │ Env     │    │ Bench   │    │ Results │      │
│   └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│                                        │              │          │
│                                        ▼              ▼          │
│                                  ┌─────────┐    ┌─────────┐      │
│                                  │ Store   │    │ Report  │      │
│                                  │ Data    │    │ (PR)    │      │
│                                  └─────────┘    └─────────┘      │
│                                        │                         │
│                                        ▼                         │
│                                  ┌─────────┐                     │
│                                  │ Alert   │                     │
│                                  │ (Slack) │                     │
│                                  └─────────┘                     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Step 1: Dedicated Test Environment

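This is the most important step. If the environment is noisy, measurement variance hides the very regressions the pipeline is supposed to catch.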

Why Not Use GitHub-hosted Runners?

┌─────────────────────────────────────────────────────────────────┐
│                    Shared vs Dedicated Infrastructure           │
├────────────────────────────────────┬────────────────────────────┤
│         Shared Cloud Runner        │      Dedicated Machine     │
├────────────────────────────────────┼────────────────────────────┤
│ ❌ CPU "steal time" uncontrollable │ ✅ Full control of hardware│
│ ❌ Interference from other tenants │ ✅ No external interference│
│ ❌ VM config may differ each time  │ ✅ Completely consistent   │
│ ❌ Cannot fix CPU frequency        │ ✅ Can lock turbo, governor│
│ ❌ Variance up to 20-50%           │ ✅ Variance controlled 1-3%│
└────────────────────────────────────┴────────────────────────────┘

Environment Setup Script

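#!/bin/bash
# setup_perf_env.sh - Set up the performance test environment
set -euo pipefail

# 1. Fix CPU frequency: performance governor, turbo boost off
sudo cpupower frequency-set -g performance
# no_turbo exists only when the intel_pstate driver is active
if [ -e /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
    echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
fi

# 2. Disable ASLR
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# 3. Clear page cache
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# 4. CPU affinity: this script only announces the cores. The benchmark
#    itself must be pinned (for example with `taskset -c 2,3`), and the
#    cores must be isolated at boot (for example with isolcpus=2,3).
echo "Benchmark will run on isolated CPUs 2-3"

# 5. Verify settings
echo "Environment configured:"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /proc/sys/kernel/randomize_va_space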

Step 2: Benchmark Suite Design

Not every test belongs in CI. Balance run time against coverage:

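┌─────────────────┬───────────┬───────────────────────────────┐
│ Type            │ Time      │ Purpose                       │
├─────────────────┼───────────┼───────────────────────────────┤
│ Smoke tests     │ < 1 min   │ Every commit, quick feedback  │
│ Core benchmarks │ 5-15 min  │ Every PR, critical paths      │
│ Full suite      │ 30-60 min │ Nightly, complete coverage    │
│ Soak tests      │ Hours     │ Weekly, find memory leaks     │
└─────────────────┴───────────┴───────────────────────────────┘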

Benchmark Code Example (Go)

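// benchmark_test.go
package bench // placeholder package name

import "testing"

// buildHashTable and generateRandomKeys are defined elsewhere in the package.
func BenchmarkHashLookup(b *testing.B) {
    table := buildHashTable(10000)
    keys := generateRandomKeys(1000)

    b.ResetTimer() // exclude setup from the measurement
    for i := 0; i < b.N; i++ {
        for _, key := range keys {
            // Lookup is an assumed method name on the table type.
            _ = table.Lookup(key)
        }
    }
}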

Setting Thresholds

```yaml
# .github/perf-thresholds.yml
thresholds:
  # Change relative to baseline
  regression_threshold: 5%    # Fail if regression exceeds 5%
  improvement_threshold: 10%  # Manual review if improvement exceeds 10%

  # Absolute limits
  max_latency_p99: 100ms
  min_throughput: 1000 req/s

  # Statistical significance
  min_samples: 30
  confidence_level: 0.95
```
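
As a rough sketch of how the relative thresholds might be applied, the following classifies a lower-is-better metric as fail, review, or pass. The dictionary layout and the names `parse_percent` and `classify` are assumptions for illustration, and the thresholds are hard-coded here rather than loaded from the file.

```python
from statistics import median

# Values mirror the thresholds file above; the dict layout is an assumption.
THRESHOLDS = {
    "regression_threshold": "5%",
    "improvement_threshold": "10%",
}

def parse_percent(text: str) -> float:
    """Turn a string such as '5%' into the fraction 0.05."""
    return float(text.strip().rstrip("%")) / 100.0

def classify(baseline, current, thresholds=THRESHOLDS) -> str:
    """Classify a lower-is-better metric as 'fail', 'review', or 'pass'."""
    base = median(baseline)
    change = (median(current) - base) / base
    if change > parse_percent(thresholds["regression_threshold"]):
        return "fail"    # slower than the allowed regression
    if -change > parse_percent(thresholds["improvement_threshold"]):
        return "review"  # suspiciously large improvement, check the benchmark
    return "pass"

print(classify([100, 100, 100], [106, 107, 108]))
print(classify([100, 100, 100], [85, 86, 87]))
```

Flagging large improvements for review is deliberate: a benchmark that suddenly runs much faster has often stopped measuring the work it was meant to measure.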

Statistical Significance Testing

Don't just compare means! Use statistical tests to determine if differences are significant:

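import numpy as np
from scipy import stats

def is_significant_regression(baseline, current, threshold=0.05):
    """
    Use the Mann-Whitney U test to decide whether `current` is significantly
    slower than `baseline`. Both are sequences of lower-is-better samples
    (for example latencies); `threshold` is the significance level.
    Returns (is_regression, percent_change_in_median, p_value).
    """
    # Mann-Whitney U test (non-parametric, no normality assumption).
    # alternative='less' tests whether baseline is stochastically less
    # than current, i.e. whether current is slower (a regression).
    statistic, p_value = stats.mannwhitneyu(
        baseline, current,
        alternative='less'
    )

    if p_value < threshold:
        # Effect size: relative change in the median
        median_diff = np.median(current) - np.median(baseline)
        pct_diff = median_diff / np.median(baseline) * 100
        return True, pct_diff, p_value

    return False, 0, p_value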
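
To see why comparing means alone misleads, consider this small sketch with made-up samples. The two runs are identical except for a single outlier, such as one request that hit a pause on the runner.

```python
from statistics import mean, median

# Made-up latency samples in ms; the runs differ by a single outlier.
baseline = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
current = [100, 101, 99, 100, 102, 98, 100, 101, 99, 250]

def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100

mean_change = pct_change(mean(baseline), mean(current))
median_change = pct_change(median(baseline), median(current))

print(f"mean:   {mean_change:+.1f}%")
print(f"median: {median_change:+.1f}%")
```

The mean moves by 15% while the median does not move at all. A gate based on the mean would fail this change even though nine of ten samples are unchanged, which is one reason rank-based tests and medians are the safer default for benchmark data.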

Step 4: GitHub Actions Integration

Complete Workflow Example

# .github/workflows/performance.yml
name: Performance Tests

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Nightly at 2 AM

jobs:
  benchmark:
    runs-on: [self-hosted, perf-runner]  # Dedicated runner

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for comparison

      - name: Setup environment
        run: |
          sudo ./scripts/setup_perf_env.sh

      - name: Build
        run: |
          make build-release

      - name: Run benchmarks
        run: |
          ./scripts/run_benchmarks.sh --output results.json

      - name: Compare with baseline
        id: compare
        run: |
          python scripts/compare_results.py \
            --current results.json \
            --baseline benchmarks/baseline.json \
            --threshold 5 \
            --output comparison.md

      - name: Comment on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('comparison.md', 'utf8');
            // Await the call so API errors fail this step
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

      # Requires compare_results.py to write "regression=true" to $GITHUB_OUTPUT
      - name: Fail on regression
        if: steps.compare.outputs.regression == 'true'
        run: exit 1
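
The workflow relies on scripts/compare_results.py, which is not shown in this chapter. The following is a minimal sketch of what such a script might look like, assuming each results file is a flat JSON object mapping benchmark names to a lower-is-better number. The file format and the report layout are assumptions; writing `name=value` lines to the file named by the GITHUB_OUTPUT environment variable is how GitHub Actions steps expose outputs.

```python
import argparse
import json
import os

def compare(baseline: dict, current: dict, threshold_pct: float):
    """Return (markdown_report, regression_found) for lower-is-better metrics."""
    lines = ["| Benchmark | Baseline | Current | Change |", "|---|---|---|---|"]
    regression = False
    # Only benchmarks present in both files can be compared.
    for name in sorted(baseline.keys() & current.keys()):
        change = (current[name] - baseline[name]) / baseline[name] * 100
        regression = regression or change > threshold_pct
        lines.append(f"| {name} | {baseline[name]} | {current[name]} | {change:+.1f}% |")
    return "\n".join(lines) + "\n", regression

def main(argv=None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--current", required=True)
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--threshold", type=float, default=5.0)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)

    with open(args.baseline) as f:
        baseline = json.load(f)
    with open(args.current) as f:
        current = json.load(f)

    report, regression = compare(baseline, current, args.threshold)
    with open(args.output, "w") as f:
        f.write(report)

    # Expose the result as a step output when running under GitHub Actions.
    output_path = os.environ.get("GITHUB_OUTPUT")
    if output_path:
        with open(output_path, "a") as f:
            f.write(f"regression={'true' if regression else 'false'}\n")
```

Ending the file with an `if __name__ == "__main__": main()` guard makes it callable exactly as in the workflow step. A production version would likely use a statistical test on raw samples instead of comparing one number per benchmark.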

Step 5: Long-term Tracking and Visualization

Trend Chart

HashLookup Latency (ns) - Last 30 Days

160 ┤
    │
150 ┤     ╭─╮
    │    ╭╯ ╰╮
140 ┤───╯    ╰──────────────────────────────────── baseline
    │
130 ┤                    ╭──────────────────────╮
    │                   ╭╯                      ╰─
120 ┤                  ╭╯
    │                 ╭╯
110 ┼─────────────────┴─────────────────────────────
    └──────────────────────────────────────────────▶
     Day 1                                    Day 30
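
A chart like this needs stored history. Here is a minimal sketch of one way to keep it, assuming a JSON Lines file with one measurement per line. The file layout, field names, and the seven-run window are all assumptions for illustration.

```python
import json
from statistics import median

def append_result(path: str, commit: str, benchmark: str, value_ns: float) -> None:
    """Append one measurement as a JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps({"commit": commit, "benchmark": benchmark, "ns": value_ns}) + "\n")

def load_series(path: str, benchmark: str) -> list:
    """Return the stored values for one benchmark, oldest first."""
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return [row["ns"] for row in rows if row["benchmark"] == benchmark]

def drift_pct(series: list, window: int = 7) -> float:
    """Percent change of the latest window's median versus the first window's median."""
    if len(series) < 2 * window:
        raise ValueError("need at least two full windows of data")
    early = median(series[:window])
    recent = median(series[-window:])
    return (recent - early) / early * 100
```

Comparing window medians instead of single points keeps one noisy night from triggering an alert, while still exposing the slow drift that per-commit gates miss.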

Common Pitfalls and Solutions

1. Flaky Benchmarks

Problem: The same commit produces different benchmark results on each run

Solutions:
- Increase warm-up iterations
- Increase sample size
- Use median instead of mean
- Set variance threshold, re-run if exceeded
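
The last two solutions can be combined in a small harness. This sketch discards warm-up runs, reports the median, and re-runs the whole sample set when the coefficient of variation is too high. The function names and the 3% default are assumptions, and `run_once` stands for whatever executes one benchmark iteration.

```python
from statistics import mean, median, stdev

def coefficient_of_variation(samples) -> float:
    """Standard deviation relative to the mean."""
    return stdev(samples) / mean(samples)

def measure_stable(run_once, samples=30, warmup=5, max_cv=0.03, max_attempts=3):
    """Return (median, attempt) for the first sample set whose CV is within max_cv.

    run_once is any zero-argument callable returning one measurement.
    """
    for _ in range(warmup):
        run_once()  # warm-up results are discarded
    for attempt in range(1, max_attempts + 1):
        data = [run_once() for _ in range(samples)]
        if coefficient_of_variation(data) <= max_cv:
            return median(data), attempt
    raise RuntimeError(f"still noisy after {max_attempts} attempts")
```

Raising an error after the final attempt is a design choice: a benchmark that cannot produce a stable reading should be reported as unstable, not silently compared against the baseline.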

2. Environment Drift

Problem: An OS update on the runner invalidates the baseline

Solutions:
- Use Docker containers to fix environment
- Periodically rebuild baseline
- Record environment fingerprint
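
An environment fingerprint can be as simple as a hash over a few facts about the machine. The sketch below is one possible shape, assuming a Linux runner. The set of tracked files is an assumption based on the settings the setup script touches, and files that do not exist are recorded as None.

```python
import hashlib
import json
import platform
from pathlib import Path

# Settings touched by the setup script; missing files are recorded as None.
TRACKED_FILES = [
    "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
    "/sys/devices/system/cpu/intel_pstate/no_turbo",
    "/proc/sys/kernel/randomize_va_space",
]

def read_or_none(path: str):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

def environment_fingerprint() -> dict:
    """Collect environment facts and a short digest that changes when they do."""
    facts = {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for path in TRACKED_FILES:
        facts[path] = read_or_none(path)
    digest = hashlib.sha256(json.dumps(facts, sort_keys=True).encode()).hexdigest()
    return {"facts": facts, "digest": digest[:16]}

print(environment_fingerprint()["digest"])
```

Storing the digest next to every result makes drift visible: when the digest changes between two runs, a difference in their numbers may come from the environment and the baseline should be rebuilt.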

3. Over-sensitivity

Problem: A 1% change triggers an alert, producing too many false positives

Solutions:
- Raise threshold (5% is reasonable starting point)
- Use statistical significance testing
- Set cooldown period

4. Test Time Too Long

Problem: Full benchmark suite takes 2 hours

Solutions:
- Layer: smoke test (every commit) + full suite (nightly)
- Only run affected benchmarks
- Parallelize execution
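
Running only the affected benchmarks needs some mapping from changed files to benchmarks. The sketch below uses glob patterns for that. The paths and the second benchmark name are hypothetical, and a real project would derive the mapping from its own layout or dependency graph.

```python
from fnmatch import fnmatch

# Hypothetical mapping from source path patterns to benchmark names.
BENCHMARK_MAP = {
    "pkg/hash/*": ["BenchmarkHashLookup"],
    "pkg/net/*": ["BenchmarkRequestRoundTrip"],
}

def affected_benchmarks(changed_files, mapping=BENCHMARK_MAP):
    """Return the sorted benchmark names whose patterns match any changed file."""
    selected = set()
    for path in changed_files:
        for pattern, benchmarks in mapping.items():
            if fnmatch(path, pattern):
                selected.update(benchmarks)
    return sorted(selected)

print(affected_benchmarks(["pkg/hash/table.go", "README.md"]))
```

Selective runs trade coverage for speed, so they work best alongside a nightly full suite that catches regressions the mapping fails to predict.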

Summary

Performance CI/CD Checklist:

  1. Dedicated environment: Use self-hosted runners with fixed configuration
  2. Layered testing: Smoke tests for every commit, full suite nightly
  3. Statistical comparison: Don't just compare means, use proper tests
  4. Automated reporting: Comment on PRs with clear results
  5. Historical tracking: Store data for trend analysis
  6. Smart alerting: Balance sensitivity with noise

The Key Insight:

Performance is not a feature you add at the end. It's a property you maintain continuously.

The Goal:

Every commit is performance-tested.
Every regression is caught before merge.
Every trend is visible to the team.

In the next chapter, we begin the next part of the book and explore future trends and emerging technologies in performance analysis.