Chapter 35: CI/CD for Performance

Part IX: Synthesis

"What gets measured gets managed." — Peter Drucker

"What gets automated gets repeated." — DevOps wisdom

The "Nobody Noticed" Performance Regression

Six months ago, our API latency was 50ms.

Today, it's 150ms.

Nobody noticed. No alerts. No tickets.

How did this happen?

I ran git log and found 847 commits in those six months. Each commit degraded performance by an average of 0.12 ms, a difference imperceptible to humans.

But accumulated: 847 × 0.12 ms ≈ 100 ms.
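
To see why a per-commit check alone would not have caught this, here is a minimal sketch of the arithmetic. It assumes every commit adds exactly 0.12 ms (the story only gives an average), and `simulate_drift` is an illustrative name, not part of any tool.

```python
from typing import List


def simulate_drift(start_ms: float, per_commit_ms: float, commits: int) -> List[float]:
    """Latency after each commit, assuming a constant additive slowdown per commit."""
    return [start_ms + per_commit_ms * n for n in range(commits + 1)]


latencies = simulate_drift(start_ms=50.0, per_commit_ms=0.12, commits=847)

# Largest change any single commit makes relative to the commit before it.
max_step_pct = max((after - before) / before * 100
                   for before, after in zip(latencies, latencies[1:]))

# Total change relative to the starting point six months earlier.
total_pct = (latencies[-1] - latencies[0]) / latencies[0] * 100

print(f"final latency:    {latencies[-1]:.2f} ms")
print(f"largest step:     {max_step_pct:.2f} %")
print(f"total regression: {total_pct:.1f} %")
```

Under these assumptions no single commit moves latency by more than about 0.24%, far below any practical per-commit gate, while the total change is roughly 203%. Comparing each commit only against its parent will therefore miss the drift; comparing against a longer-lived baseline will not.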

This is the horror of "Gradual Performance Regression." Like boiling a frog slowly, it gets a little slower each day, until one day customers start complaining.

The solution? Integrate performance testing into the CI/CD pipeline, so every commit goes through performance checks.

Why Performance CI/CD Is Needed

┌─────────────────────────────────────────────────────────────────┐
│                     Why Automate Performance Testing?           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Catch regressions early                                     │
│     └── Find problems before PR merge, not in production        │
│                                                                 │
│  2. Track trends over time                                      │
│     └── Historical data reveals gradual degradation             │
│                                                                 │
│  3. Reproducible measurements                                   │
│     └── Fixed environment eliminates "fast on my machine"       │
│                                                                 │
│  4. Shift left                                                  │
│     └── Find early, fix early, lower cost                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The Performance CI Pipeline

A complete performance CI pipeline includes these stages:

┌──────────────────────────────────────────────────────────────────┐
│                    Performance CI Pipeline                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│   │ Trigger │───▶│ Setup   │───▶│ Run     │───▶│ Compare │      │
│   │ (PR)    │    │ Env     │    │ Bench   │    │ Results │      │
│   └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│                                        │              │          │
│                                        ▼              ▼          │
│                                  ┌─────────┐    ┌─────────┐      │
│                                  │ Store   │    │ Report  │      │
│                                  │ Data    │    │ (PR)    │      │
│                                  └─────────┘    └─────────┘      │
│                                        │                         │
│                                        ▼                         │
│                                  ┌─────────┐                     │
│                                  │ Alert   │                     │
│                                  │ (Slack) │                     │
│                                  └─────────┘                     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Step 1: Dedicated Test Environment

This is the most important step: if the environment is noisy, measurement variance hides the regressions you are trying to catch.

Why Not Use GitHub-hosted Runners?

┌─────────────────────────────────────────────────────────────────┐
│                    Shared vs Dedicated Infrastructure           │
├────────────────────────────────────┬────────────────────────────┤
│         Shared Cloud Runner        │      Dedicated Machine     │
├────────────────────────────────────┼────────────────────────────┤
│ ❌ CPU "steal time" uncontrollable │ ✅ Full control of hardware│
│ ❌ Interference from other tenants │ ✅ No external interference│
│ ❌ VM config may differ each time  │ ✅ Completely consistent   │
│ ❌ Cannot fix CPU frequency        │ ✅ Can lock turbo, governor│
│ ❌ Variance up to 20-50%           │ ✅ Variance controlled 1-3%│
└────────────────────────────────────┴────────────────────────────┘
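
One way to check which column of the table a given runner falls into is to run the same unchanged benchmark several times and look at the spread. The sketch below reads the table's "variance" as the coefficient of variation (standard deviation over mean), which is an assumption; the function names and the sample timings are hypothetical.

```python
import statistics
from typing import Sequence


def coefficient_of_variation_pct(samples: Sequence[float]) -> float:
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(samples) / statistics.fmean(samples) * 100


def runner_is_quiet(samples: Sequence[float], max_cv_pct: float = 3.0) -> bool:
    """True if repeated runs of the same binary vary by no more than max_cv_pct."""
    return coefficient_of_variation_pct(samples) <= max_cv_pct


# Hypothetical timings (ms) from running one unchanged benchmark five times.
dedicated = [100.0, 102.0, 98.0, 101.0, 99.0]
shared = [100.0, 140.0, 80.0, 120.0, 60.0]

print(f"dedicated: {coefficient_of_variation_pct(dedicated):.1f}% quiet={runner_is_quiet(dedicated)}")
print(f"shared:    {coefficient_of_variation_pct(shared):.1f}% quiet={runner_is_quiet(shared)}")
```

The 3% default mirrors the upper end of the "controlled 1-3%" figure in the table. A runner that cannot stay under it on unchanged code cannot reliably detect a 5% regression.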

Environment Setup Script

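#!/bin/bash
# setup_perf_env.sh - Set up the performance test environment
set -euo pipefail

# 1. Fix CPU frequency: performance governor, turbo boost off
sudo cpupower frequency-set -g performance
# no_turbo exists only when the intel_pstate driver is active
if [ -e /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
    echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
fi

# 2. Disable ASLR so memory layout is the same on every run
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# 3. Clear page cache so every run starts from the same cache state
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# 4. CPU affinity is applied when the benchmark is launched, not here,
#    e.g. `taskset -c 2-3 <benchmark command>`. Keeping other tasks off
#    cores 2-3 additionally needs a boot parameter such as isolcpus=2-3.
echo "Benchmark will run on isolated CPUs 2-3"

# 5. Verify settings
echo "Environment configured:"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /proc/sys/kernel/randomize_va_space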

Step 2: Benchmark Suite Design

Not every test belongs in CI. Balance run time against coverage:

Type	Time	Purpose
Smoke tests	< 1 min	Every commit, quick feedback
Core benchmarks	5-15 min	Every PR, critical paths
Full suite	30-60 min	Nightly, complete coverage
Soak tests	Hours	Weekly, find memory leaks
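
The tiers in the table can be expressed as a small lookup that a CI entry point consults. This is only a sketch: the trigger labels are illustrative rather than tied to any CI system, and treating the tiers as cumulative (a nightly run also includes the smoke and core tiers) is an assumption the table does not state.

```python
from typing import Dict, List

# Which tiers run for which trigger, following the table above.
# Assumption: heavier triggers also run all lighter tiers.
TIERS_BY_TRIGGER: Dict[str, List[str]] = {
    "commit": ["smoke"],
    "pull_request": ["smoke", "core"],
    "nightly": ["smoke", "core", "full"],
    "weekly": ["smoke", "core", "full", "soak"],
}


def suites_for(trigger: str) -> List[str]:
    """Return the benchmark tiers for a trigger; unknown triggers get smoke only."""
    return TIERS_BY_TRIGGER.get(trigger, ["smoke"])


for trigger in ("commit", "pull_request", "nightly", "weekly"):
    print(trigger, "->", ", ".join(suites_for(trigger)))
```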

Benchmark Code Example (Go)

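// benchmark_test.go
func BenchmarkHashLookup(b *testing.B) {
    table := buildHashTable(10000)
    keys := generateRandomKeys(1000)

    b.ResetTimer() // exclude setup cost from the measurement
    for i := 0; i < b.N; i++ {
        for _, key := range keys {
            // The source listing is truncated here; Lookup is an assumed method name.
            _ = table.Lookup(key)
        }
    }
}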

Setting Thresholds

# .github/perf-thresholds.yml
thresholds:
  # Change relative to baseline
  regression_threshold: 5%    # Fail if regression exceeds 5%
  improvement_threshold: 10%  # Manual review if improvement exceeds 10%

  # Absolute limits
  max_latency_p99: 100ms
  min_throughput: 1000 req/s

  # Statistical significance
  min_samples: 30
  confidence_level: 0.95
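
As a rough illustration of how a comparison script might apply these thresholds, the sketch below classifies one latency result. The `Thresholds` class and `evaluate` function are hypothetical names, the values are copied from the config above, and latency is assumed to be the metric (lower is better).

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    regression_pct: float = 5.0        # fail above this slowdown
    improvement_pct: float = 10.0      # manual review above this speedup
    max_latency_p99_ms: float = 100.0  # absolute limit, independent of baseline


def evaluate(baseline_ms: float, current_ms: float, p99_ms: float,
             t: Thresholds = Thresholds()) -> str:
    """Classify a latency result as 'fail', 'review', or 'pass'."""
    change_pct = (current_ms - baseline_ms) / baseline_ms * 100
    if change_pct > t.regression_pct or p99_ms > t.max_latency_p99_ms:
        return "fail"
    if -change_pct > t.improvement_pct:
        # A large speedup may be real, or may mean the benchmark stopped doing work.
        return "review"
    return "pass"


print(evaluate(baseline_ms=100.0, current_ms=106.0, p99_ms=90.0))
print(evaluate(baseline_ms=100.0, current_ms=85.0, p99_ms=90.0))
```

The absolute limit is checked alongside the relative one so that a result can fail even when it matches a baseline that was already too slow.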

Statistical Significance Testing

Don't just compare means! Use statistical tests to determine if differences are significant:

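import numpy as np
from scipy import stats

def is_significant_regression(baseline, current, threshold=0.05):
    """
    Use the Mann-Whitney U test to determine whether there is a significant regression.

    baseline, current: samples of a lower-is-better metric such as latency.
    threshold: significance level (alpha) for the p-value, not a percentage change.
    Returns (is_regression, pct_diff, p_value).
    """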
    # Mann-Whitney U test (non-parametric)
    statistic, p_value = stats.mannwhitneyu(
        baseline, current,
        alternative='less'  # Test if current > baseline (regression)
    )

    if p_value < threshold:
        # Calculate effect size
        median_diff = np.median(current) - np.median(baseline)
        pct_diff = median_diff / np.median(baseline) * 100
        return True, pct_diff, p_value

    return False, 0, p_value

Step 4: GitHub Actions Integration

Complete Workflow Example

# .github/workflows/performance.yml
name: Performance Tests

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Nightly at 2 AM

jobs:
  benchmark:
    runs-on: [self-hosted, perf-runner]  # Dedicated runner

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for comparison

      - name: Setup environment
        run: |
          sudo ./scripts/setup_perf_env.sh

      - name: Build
        run: |
          make build-release

      - name: Run benchmarks
        run: |
          ./scripts/run_benchmarks.sh --output results.json

      - name: Compare with baseline
        id: compare
        run: |
          python scripts/compare_results.py \
            --current results.json \
            --baseline benchmarks/baseline.json \
            --threshold 5 \
            --output comparison.md

      - name: Comment on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('comparison.md', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

      - name: Fail on regression
        # Fires only if the compare step wrote `regression=true` to $GITHUB_OUTPUT
        if: steps.compare.outputs.regression == 'true'
        run: exit 1
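
The workflow calls `scripts/compare_results.py` without showing it. The following is a minimal sketch of what the core of such a script might look like, not the author's implementation. It assumes both JSON files are flat objects mapping a benchmark name to a lower-is-better median latency; that file shape and all function names are assumptions, and command-line parsing is left out.

```python
import json
import os
from typing import Dict, Tuple


def compare(baseline: Dict[str, float], current: Dict[str, float],
            threshold_pct: float) -> Tuple[str, bool]:
    """Build a Markdown report and say whether any benchmark regressed."""
    lines = ["| Benchmark | Baseline | Current | Change |",
             "|---|---|---|---|"]
    regressed = False
    for name in sorted(baseline.keys() & current.keys()):
        change = (current[name] - baseline[name]) / baseline[name] * 100
        if change > threshold_pct:
            regressed = True
        lines.append(f"| {name} | {baseline[name]:.1f} | {current[name]:.1f} | {change:+.1f}% |")
    return "\n".join(lines) + "\n", regressed


def write_step_output(regressed: bool) -> None:
    """Expose `regression` as a step output via the file named in GITHUB_OUTPUT."""
    path = os.environ.get("GITHUB_OUTPUT")
    if path:  # unset when the script runs outside GitHub Actions
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(f"regression={'true' if regressed else 'false'}\n")


def main(current_path: str, baseline_path: str, threshold_pct: float, output_path: str) -> None:
    with open(baseline_path, encoding="utf-8") as fh:
        baseline = json.load(fh)
    with open(current_path, encoding="utf-8") as fh:
        current = json.load(fh)
    report, regressed = compare(baseline, current, threshold_pct)
    with open(output_path, "w", encoding="utf-8") as fh:
        fh.write(report)
    write_step_output(regressed)
```

The script reports the regression through a step output instead of exiting non-zero, so the PR comment step still runs before the final step fails the job.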

Step 5: Long-term Tracking and Visualization

Trend Chart

HashLookup Latency (ns) - Last 30 Days

160 ┤
    │
150 ┤     ╭─╮
    │    ╭╯ ╰╮
140 ┤───╯    ╰──────────────────────────────────── baseline
    │
130 ┤                    ╭──────────────────────╮
    │                   ╭╯                      ╰─
120 ┤                  ╭╯
    │                 ╭╯
110 ┼─────────────────┴─────────────────────────────
    └──────────────────────────────────────────────▶
     Day 1                                    Day 30
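
Stored history is what makes gradual drift detectable. As a sketch of one possible check, the function below compares the median of the oldest few points with the median of the newest few. The window size, the function name `drift_pct`, and the sample data are all assumptions chosen for illustration.

```python
import statistics
from typing import Sequence


def drift_pct(history: Sequence[float], window: int = 7) -> float:
    """Percent change from the median of the oldest `window` points to the newest `window`."""
    if len(history) < 2 * window:
        raise ValueError("need at least two full windows of history")
    old = statistics.median(history[:window])
    new = statistics.median(history[-window:])
    return (new - old) / old * 100


# Hypothetical nightly medians (ns): each night is 0.5 ns slower than the last.
history = [140.0 + 0.5 * day for day in range(30)]

print(f"drift over 30 days: {drift_pct(history):.1f}%")
```

In this made-up series each night is well under 1% slower than the previous one, yet the windowed comparison shows a drift of about 8%, which a 5% threshold would flag.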

Common Pitfalls and Solutions

1. Flaky Benchmarks

Problem: Same commit, benchmark results differ each time

Solutions:
- Increase warm-up iterations
- Increase sample size
- Use median instead of mean
- Set variance threshold, re-run if exceeded
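
The four remedies above can be combined in one measurement helper. This is a sketch under stated assumptions: `run_once` is a hypothetical callable that executes the benchmark once and returns its time, and the defaults (5 warm-up runs, 30 samples, 3% spread, 3 attempts) are illustrative, not recommendations from any tool.

```python
import statistics
from typing import Callable, List


def measure(run_once: Callable[[], float], warmup: int = 5, samples: int = 30,
            max_cv_pct: float = 3.0, max_attempts: int = 3) -> float:
    """Return the median of `samples` runs, re-running while the spread is too wide."""
    for _ in range(max_attempts):
        for _ in range(warmup):
            run_once()  # result discarded: caches are still warming up
        data: List[float] = [run_once() for _ in range(samples)]
        cv_pct = statistics.stdev(data) / statistics.fmean(data) * 100
        if cv_pct <= max_cv_pct:
            return statistics.median(data)  # median resists outliers better than mean
    raise RuntimeError(f"benchmark too noisy after {max_attempts} attempts")
```

Raising after the last attempt, instead of returning a noisy number, keeps an unstable measurement from being recorded as a baseline.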

2. Environment Drift

Problem: Runner's OS update invalidates baseline

Solutions:
- Use Docker containers to fix environment
- Periodically rebuild baseline
- Record environment fingerprint
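
An environment fingerprint can be as simple as a hash over the facts most likely to shift results. The sketch below uses only what Python's `platform` module reports; which fields to include is an assumption, and a real setup would probably add compiler, CPU model, and governor settings.

```python
import hashlib
import json
import platform
from typing import Dict


def collect_environment() -> Dict[str, str]:
    """Gather facts that commonly change benchmark results when they drift."""
    return {
        "os": platform.system(),
        "kernel": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }


def fingerprint(env: Dict[str, str]) -> str:
    """Stable short hash of the environment, independent of key order."""
    canonical = json.dumps(env, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


print(fingerprint(collect_environment()))
```

Storing the fingerprint next to every result makes it possible to tell whether a step in the trend line coincides with an environment change; when the fingerprint changes, the baseline should be rebuilt before new results are compared against it.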

3. Over-sensitivity

Problem: 1% change triggers alert, too many false positives

Solutions:
- Raise threshold (5% is a reasonable starting point)
- Use statistical significance testing
- Set cooldown period
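
A cooldown can be read as "do not alert again for the same benchmark until some time has passed"; that reading is an assumption, as is the `AlertCooldown` class below. Time is passed in explicitly so the logic is easy to test.

```python
from typing import Dict


class AlertCooldown:
    """Suppress repeat alerts for the same benchmark within a cooldown period."""

    def __init__(self, cooldown_s: float) -> None:
        self.cooldown_s = cooldown_s
        self._last_alert: Dict[str, float] = {}

    def should_alert(self, benchmark: str, now_s: float) -> bool:
        """Return True and start a new cooldown unless one is still running."""
        last = self._last_alert.get(benchmark)
        if last is not None and now_s - last < self.cooldown_s:
            return False
        self._last_alert[benchmark] = now_s
        return True


cooldown = AlertCooldown(cooldown_s=3600)
print(cooldown.should_alert("HashLookup", now_s=0))
print(cooldown.should_alert("HashLookup", now_s=1800))
```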

4. Test Time Too Long

Problem: Full benchmark suite takes 2 hours

Solutions:
- Layer: smoke test (every commit) + full suite (nightly)
- Only run affected benchmarks
- Parallelize execution
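
Running only the affected benchmarks needs some mapping from changed files to benchmarks. The sketch below uses a hand-maintained prefix table; the directory names and the second benchmark name are hypothetical, and the list of changed files is assumed to come from something like `git diff --name-only`.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical mapping from source directories to the benchmarks that exercise them.
BENCHMARKS_BY_PATH: Dict[str, List[str]] = {
    "pkg/hash/": ["BenchmarkHashLookup"],
    "pkg/api/": ["BenchmarkRequestLatency"],
}


def affected_benchmarks(changed_files: Iterable[str]) -> Set[str]:
    """Return the benchmarks whose mapped directories contain a changed file."""
    selected: Set[str] = set()
    for path in changed_files:
        for prefix, benchmarks in BENCHMARKS_BY_PATH.items():
            if path.startswith(prefix):
                selected.update(benchmarks)
    return selected


print(sorted(affected_benchmarks(["pkg/hash/table.go", "README.md"])))
```

A table like this goes stale as code moves, so the nightly full suite remains the safety net for anything the mapping misses.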

Summary

Performance CI/CD Checklist:

- Dedicated environment: Use self-hosted runners with fixed configuration
- Layered testing: Smoke tests for every commit, full suite nightly
- Statistical comparison: Don't just compare means, use proper tests
- Automated reporting: Comment on PRs with clear results
- Historical tracking: Store data for trend analysis
- Smart alerting: Balance sensitivity with noise

The Key Insight:

Performance is not a feature you add at the end. It's a property you maintain continuously.

The Goal:

Every commit is performance-tested.
Every regression is caught before merge.
Every trend is visible to the team.

Next chapter, we'll enter Part VI and explore future trends and emerging technologies in performance analysis.
