Chapter 35: CI/CD for Performance
Part IX: Synthesis
"What gets measured gets managed." — Peter Drucker
"What gets automated gets repeated." — DevOps wisdom
The "Nobody Noticed" Performance Regression
Six months ago, our API latency was 50ms.
Today, it's 150ms.
Nobody noticed. No alerts. No tickets.
How did this happen?
I ran git log and found 847 commits in those six months. Each commit degraded performance by an average of 0.12 ms—a difference imperceptible to humans.
But accumulated: 847 × 0.12 ms ≈ 100 ms.
This is the horror of "Gradual Performance Regression." Like boiling a frog slowly, it gets a little slower each day, until one day customers start complaining.
The solution? Integrate performance testing into the CI/CD pipeline, so every commit goes through performance checks.
Why Performance CI/CD Is Needed
┌─────────────────────────────────────────────────────────────────┐
│                Why Automate Performance Testing?                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Catch regressions early                                     │
│     └── Find problems before PR merge, not in production        │
│                                                                 │
│  2. Track trends over time                                      │
│     └── Historical data reveals gradual degradation             │
│                                                                 │
│  3. Reproducible measurements                                   │
│     └── Fixed environment eliminates "fast on my machine"       │
│                                                                 │
│  4. Shift left                                                  │
│     └── Find early, fix early, lower cost                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
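Point 2 is also a statement about what each result is compared against. The following is a minimal sketch, assuming a hypothetical 5% gate and reusing the numbers from the opening story (50 ms start, 0.12 ms of drift per commit). It shows that comparing each commit only with its parent never trips the gate, while comparing with a fixed baseline does.

```python
def first_failing_commit(start_ms, drift_ms, commits, gate_pct, fixed_baseline):
    """Return the 1-based index of the first commit that trips the gate, or None.

    fixed_baseline=True compares every commit against the starting latency;
    fixed_baseline=False compares each commit against its parent only.
    """
    previous = start_ms
    for n in range(1, commits + 1):
        current = start_ms + n * drift_ms
        reference = start_ms if fixed_baseline else previous
        change_pct = (current - reference) / reference * 100
        if change_pct > gate_pct:
            return n
        previous = current
    return None


# Numbers from the opening story: 50 ms start, 847 commits, 0.12 ms each.
print(first_failing_commit(50.0, 0.12, 847, 5.0, fixed_baseline=False))
print(first_failing_commit(50.0, 0.12, 847, 5.0, fixed_baseline=True))
```

With these inputs the parent-only comparison never fails, because each step is at most 0.24% slower than the one before it. The fixed-baseline comparison fails at commit 21, long before the drift reaches 100 ms.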
The Performance CI Pipeline
A complete performance CI pipeline includes these stages:
┌──────────────────────────────────────────────────────────────────┐
│                     Performance CI Pipeline                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐        │
│  │ Trigger │───▶│  Setup  │───▶│   Run   │───▶│ Compare │        │
│  │  (PR)   │    │   Env   │    │  Bench  │    │ Results │        │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘        │
│                                     │              │             │
│                                     ▼              ▼             │
│                                ┌─────────┐    ┌─────────┐        │
│                                │  Store  │    │ Report  │        │
│                                │  Data   │    │  (PR)   │        │
│                                └─────────┘    └─────────┘        │
│                                                    │             │
│                                                    ▼             │
│                                               ┌─────────┐        │
│                                               │  Alert  │        │
│                                               │ (Slack) │        │
│                                               └─────────┘        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
Step 1: Dedicated Test Environment
This is the most important step.
Why Not Use GitHub-hosted Runners?
┌─────────────────────────────────────────────────────────────────┐
│               Shared vs Dedicated Infrastructure                │
├────────────────────────────────────┬────────────────────────────┤
│ Shared Cloud Runner │ Dedicated Machine │
├────────────────────────────────────┼────────────────────────────┤
│ ❌ CPU "steal time" uncontrollable │ ✅ Full control of hardware│
│ ❌ Interference from other tenants │ ✅ No external interference│
│ ❌ VM config may differ each time │ ✅ Completely consistent │
│ ❌ Cannot fix CPU frequency │ ✅ Can lock turbo, governor│
│ ❌ Variance up to 20-50% │ ✅ Variance controlled 1-3%│
└────────────────────────────────────┴────────────────────────────┘
Environment Setup Script
#!/bin/bash
# setup_perf_env.sh - Set up the performance test environment

# 1. Fix CPU frequency: performance governor, turbo boost off
sudo cpupower frequency-set -g performance
# This path exists only when the intel_pstate driver is active
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# 2. Disable ASLR so memory layout is identical between runs
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# 3. Clear the page cache so every run starts cold
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# 4. CPU affinity: this script only announces the cores. The benchmark
#    itself must be pinned at launch (for example with taskset -c 2,3).
echo "Benchmark will run on isolated CPUs 2-3"

# 5. Verify settings
echo "Environment configured:"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /proc/sys/kernel/randomize_va_space
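The final verification step only prints the settings, so a misconfigured runner would still go on to produce numbers. One possible guard, sketched here in Python, checks the same two files and reports any mismatch. The expected values are assumptions that mirror what the script writes, and `check_environment` is a hypothetical helper name.

```python
from pathlib import Path

# Paths come from the setup script; the expected values are assumptions
# matching what that script is meant to leave behind.
EXPECTED = {
    "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor": "performance",
    "/proc/sys/kernel/randomize_va_space": "0",
}


def check_environment(expected=EXPECTED, read=lambda p: Path(p).read_text()):
    """Return a list of human-readable mismatches; empty means the host is ready."""
    problems = []
    for path, want in expected.items():
        try:
            got = read(path).strip()
        except OSError as exc:
            # Missing files usually mean a different driver or a non-Linux host.
            problems.append(f"{path}: unreadable ({exc.__class__.__name__})")
            continue
        if got != want:
            problems.append(f"{path}: expected {want!r}, found {got!r}")
    return problems
```

A benchmark runner could call `check_environment()` first and refuse to start when the list is non-empty, so that a bad environment fails loudly instead of polluting the history.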
Step 2: Benchmark Suite Design
Not every test belongs in every CI run. Balance run time against coverage:
| Type | Time | Purpose |
|---|---|---|
| Smoke tests | < 1 min | Every commit, quick feedback |
| Core benchmarks | 5-15 min | Every PR, critical paths |
| Full suite | 30-60 min | Nightly, complete coverage |
| Soak tests | Hours | Weekly, find memory leaks |
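The tiers above can be wired to CI triggers with a small dispatch function. This is a sketch only: the event names are the GitHub Actions event names, while the mapping from event to tier and the `is_weekly_run` flag are assumptions chosen to match the table.

```python
# Tier names and time budgets follow the table above.
SUITES = {
    "smoke": {"budget_min": 1},
    "core": {"budget_min": 15},
    "full": {"budget_min": 60},
    "soak": {"budget_min": None},  # runs for hours; no fixed budget
}


def select_suites(event, is_weekly_run=False):
    """Map a CI trigger to the benchmark tiers it should run."""
    if event == "push":
        return ["smoke"]
    if event == "pull_request":
        return ["smoke", "core"]
    if event == "schedule":
        # Nightly runs get the full suite; the weekly run adds soak tests.
        return ["full", "soak"] if is_weekly_run else ["full"]
    raise ValueError(f"unknown event: {event}")
```

Keeping the mapping in one place makes it easy to see that the total time spent per pull request stays inside the 15-minute budget of the core tier.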
Benchmark Code Example (Go)
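// benchmark_test.go
// buildHashTable and generateRandomKeys are helpers defined elsewhere.
func BenchmarkHashLookup(b *testing.B) {
	table := buildHashTable(10000)
	keys := generateRandomKeys(1000)
	b.ResetTimer() // exclude setup cost from the measurement
	for i := 0; i < b.N; i++ {
		for _, key := range keys {
			// The source is truncated here; Lookup is an assumed method name.
			_ = table.Lookup(key)
		}
	}
}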
Setting Thresholds
```yaml
# .github/perf-thresholds.yml
thresholds:
  # Change relative to baseline
  regression_threshold: 5%     # Fail if regression exceeds 5%
  improvement_threshold: 10%   # Manual review if improvement exceeds 10%
  # Absolute limits
  max_latency_p99: 100ms
  min_throughput: 1000 req/s
  # Statistical significance
  min_samples: 30
  confidence_level: 0.95
```
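To show how the two relative thresholds might be applied, here is a small sketch for a lower-is-better metric such as latency. The function names are hypothetical, and the percentage strings are parsed in the same form they take in the file above.

```python
def parse_percent(text):
    """Turn a string such as '5%' into the float 5.0."""
    return float(text.strip().rstrip("%"))


def classify_change(baseline, current, regression_threshold="5%",
                    improvement_threshold="10%"):
    """Classify a latency change (lower is better) against the thresholds.

    Returns 'fail', 'review', or 'pass'.
    """
    change_pct = (current - baseline) / baseline * 100
    if change_pct > parse_percent(regression_threshold):
        return "fail"  # slower than the allowed regression
    if -change_pct > parse_percent(improvement_threshold):
        return "review"  # suspiciously large improvement
    return "pass"
```

Large improvements are routed to review, not accepted silently, because a sudden speedup can mean the benchmark stopped doing the work it is supposed to measure.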
Statistical Significance Testing
Don't just compare means! Use statistical tests to determine if differences are significant:
import numpy as np
from scipy import stats

def is_significant_regression(baseline, current, threshold=0.05):
    """
    Use a one-sided Mann-Whitney U test to decide whether `current` is
    significantly slower than `baseline`. `threshold` is the p-value cutoff.
    """
    # Mann-Whitney U test (non-parametric). alternative='less' tests whether
    # baseline is stochastically less than current, i.e. current > baseline.
    statistic, p_value = stats.mannwhitneyu(
        baseline, current,
        alternative='less'
    )
    if p_value < threshold:
        # Effect size: relative change in the median, in percent
        median_diff = np.median(current) - np.median(baseline)
        pct_diff = median_diff / np.median(baseline) * 100
        return True, pct_diff, p_value
    return False, 0, p_value
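A significance test answers whether the change is real, not how large it might be. One complementary option is a percentile bootstrap interval for the relative change in the median. This is a different technique from the Mann-Whitney test, not a replacement for it, and the sketch below uses only the standard library. The function name, resample count, and seed are assumptions.

```python
import random
import statistics


def bootstrap_median_diff_pct(baseline, current, resamples=2000,
                              confidence=0.95, seed=0):
    """Percentile-bootstrap interval for the relative change in median, in percent."""
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    diffs = []
    for _ in range(resamples):
        # Resample each group with replacement and compare the medians.
        b = statistics.median(rng.choices(baseline, k=len(baseline)))
        c = statistics.median(rng.choices(current, k=len(current)))
        diffs.append((c - b) / b * 100)
    diffs.sort()
    tail = (1 - confidence) / 2
    low = diffs[int(tail * resamples)]
    high = diffs[int((1 - tail) * resamples) - 1]
    return low, high
```

If the whole interval sits above the regression threshold, the regression is both real and large enough to block the merge. If it straddles zero, collecting more samples is probably the better response.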
Step 4: GitHub Actions Integration
Complete Workflow Example
# .github/workflows/performance.yml
name: Performance Tests
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Nightly at 2 AM (GitHub Actions cron runs in UTC)
jobs:
  benchmark:
    runs-on: [self-hosted, perf-runner]  # Dedicated runner
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for comparison
      - name: Setup environment
        run: |
          sudo ./scripts/setup_perf_env.sh
      - name: Build
        run: |
          make build-release
      - name: Run benchmarks
        run: |
          ./scripts/run_benchmarks.sh --output results.json
      - name: Compare with baseline
        id: compare
        run: |
          python scripts/compare_results.py \
            --current results.json \
            --baseline benchmarks/baseline.json \
            --threshold 5 \
            --output comparison.md
      - name: Comment on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('comparison.md', 'utf8');
            // await so the step fails if the API call is rejected
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });
      - name: Fail on regression
        # compare_results.py must write "regression=true" to $GITHUB_OUTPUT
        if: steps.compare.outputs.regression == 'true'
        run: exit 1
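The workflow relies on `scripts/compare_results.py`, which is not shown. The following is one possible shape for its core, offered as a sketch: it assumes each results file is a JSON object mapping a benchmark name to a list of samples, which is an invented format, and it omits argument parsing. The only GitHub-specific part is appending `regression=...` to the file named by the `GITHUB_OUTPUT` environment variable, which is how a step publishes outputs.

```python
import json
import os
import statistics


def compare(baseline, current, threshold_pct):
    """Compare per-benchmark medians; return (markdown_report, regression_found)."""
    lines = ["| Benchmark | Baseline | Current | Change |", "|---|---|---|---|"]
    regression = False
    for name in sorted(set(baseline) & set(current)):
        base = statistics.median(baseline[name])
        cur = statistics.median(current[name])
        change = (cur - base) / base * 100
        failed = change > threshold_pct
        regression = regression or failed
        mark = " (regression)" if failed else ""
        lines.append(f"| {name} | {base:.1f} | {cur:.1f} | {change:+.1f}%{mark} |")
    return "\n".join(lines) + "\n", regression


def main(current_path, baseline_path, threshold_pct, output_path):
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    report, regression = compare(baseline, current, threshold_pct)
    with open(output_path, "w") as f:
        f.write(report)
    # GitHub Actions reads step outputs from the file named by GITHUB_OUTPUT.
    github_output = os.environ.get("GITHUB_OUTPUT")
    if github_output:
        with open(github_output, "a") as f:
            f.write(f"regression={'true' if regression else 'false'}\n")
    return regression
```

Publishing the result as a step output, instead of exiting non-zero from the compare step, lets the PR comment be posted before the job fails.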
Step 5: Long-term Tracking and Visualization
Trend Chart
[Figure: HashLookup Latency (ns), Last 30 Days. Y-axis: latency from 110 to 160 ns. X-axis: Day 1 to Day 30. A horizontal baseline is marked at 140 ns.]
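Beyond eyeballing a chart, stored history can be checked for drift automatically. A minimal sketch, assuming one median latency per day is already stored in chronological order: compare the median of the first week with the median of the most recent week. The function name and the 7-day window are assumptions.

```python
import statistics


def drift_pct(daily_medians, window=7):
    """Relative change between the first and last `window` days, in percent."""
    if len(daily_medians) < 2 * window:
        raise ValueError("need at least two full windows of history")
    early = statistics.median(daily_medians[:window])
    recent = statistics.median(daily_medians[-window:])
    return (recent - early) / early * 100


# Synthetic history for illustration: 1 ns slower every day for 30 days.
history = [110 + day for day in range(30)]
print(f"{drift_pct(history):.1f}% drift over the stored window")
```

Using medians over a window keeps a single noisy day from triggering or hiding a trend alert.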
Common Pitfalls and Solutions
1. Flaky Benchmarks
Problem: Same commit, benchmark results differ each time
Solutions:
- Increase warm-up iterations
- Increase sample size
- Use median instead of mean
- Set variance threshold, re-run if exceeded
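The last two solutions can be combined in a few lines. This sketch measures the coefficient of variation of a batch of samples and re-runs the batch while it stays above a limit. `run_once` is a caller-supplied function returning one timing. The 3% limit echoes the dedicated-machine variance quoted earlier, and all names and defaults here are assumptions.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def measure_stable(run_once, samples=30, max_cv=0.03, max_attempts=3):
    """Collect samples, re-running the batch while variance is above max_cv.

    Returns (median, cv) of the first stable batch, or raises RuntimeError.
    """
    for _ in range(max_attempts):
        batch = [run_once() for _ in range(samples)]
        cv = coefficient_of_variation(batch)
        if cv <= max_cv:
            return statistics.median(batch), cv
    raise RuntimeError(f"still noisy after {max_attempts} attempts (cv={cv:.3f})")
```

Raising an error when the runs never settle is deliberate: a result that cannot be trusted should be reported as such, not compared against the baseline.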
2. Environment Drift
Problem: Runner's OS update invalidates baseline
Solutions:
- Use Docker containers to fix environment
- Periodically rebuild baseline
- Record environment fingerprint
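An environment fingerprint can be as simple as a hash over the facts that should invalidate a baseline when they change. The sketch below uses the standard `platform` module. Which facts to include is a judgment call, and the `extra` parameter is a hypothetical hook for things like a compiler version or container image digest.

```python
import hashlib
import json
import platform


def environment_fingerprint(extra=None):
    """Return (short_hash, facts) describing the machine the benchmarks ran on."""
    facts = {
        "machine": platform.machine(),
        "system": platform.system(),
        "kernel": platform.release(),
        "python": platform.python_version(),
    }
    facts.update(extra or {})  # e.g. compiler version, container image digest
    # Sorted keys give a stable serialization, so equal facts hash equally.
    canonical = json.dumps(facts, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16], facts
```

Storing the hash next to every result makes drift visible: when the fingerprint of a new run differs from the baseline's, the comparison can be flagged and the baseline rebuilt.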
3. Over-sensitivity
Problem: 1% change triggers alert, too many false positives
Solutions:
- Raise threshold (5% is reasonable starting point)
- Use statistical significance testing
- Set cooldown period
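A cooldown period needs very little state. The following sketch suppresses repeat alerts for the same benchmark within a fixed window. The class name is hypothetical, and timestamps are plain seconds supplied by the caller so the logic stays easy to test.

```python
class AlertCooldown:
    """Suppress repeat alerts for the same benchmark within a cooldown window."""

    def __init__(self, cooldown_seconds):
        self.cooldown_seconds = cooldown_seconds
        self._last_alert = {}  # benchmark name -> timestamp of last alert sent

    def should_alert(self, benchmark, now):
        last = self._last_alert.get(benchmark)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window
        self._last_alert[benchmark] = now
        return True
```

In a real pipeline the timestamps would have to persist between runs, for example alongside the stored benchmark history.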
4. Test Time Too Long
Problem: Full benchmark suite takes 2 hours
Solutions:
- Layer: smoke test (every commit) + full suite (nightly)
- Only run affected benchmarks
- Parallelize execution
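Running only affected benchmarks requires some mapping from changed files to benchmarks. A minimal sketch with a hand-maintained prefix table follows. The directory names and every benchmark name other than `BenchmarkHashLookup` are invented for illustration. The list of changed files could come from `git diff --name-only`.

```python
# Hypothetical mapping from source directories to the benchmarks covering them.
BENCHMARKS_BY_PATH = {
    "pkg/hash/": ["BenchmarkHashLookup"],
    "pkg/codec/": ["BenchmarkEncode", "BenchmarkDecode"],
}


def affected_benchmarks(changed_files, mapping=BENCHMARKS_BY_PATH):
    """Return the sorted benchmarks whose source prefix matches a changed file."""
    selected = set()
    for path in changed_files:
        for prefix, benchmarks in mapping.items():
            if path.startswith(prefix):
                selected.update(benchmarks)
    return sorted(selected)
```

A prefix table misses indirect dependencies, so the nightly full suite remains the safety net for anything this selection skips.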
Summary
Performance CI/CD Checklist:
- Dedicated environment: Use self-hosted runners with fixed configuration
- Layered testing: Smoke tests for every commit, full suite nightly
- Statistical comparison: Don't just compare means, use proper tests
- Automated reporting: Comment on PRs with clear results
- Historical tracking: Store data for trend analysis
- Smart alerting: Balance sensitivity with noise
The Key Insight:
Performance is not a feature you add at the end. It's a property you maintain continuously.
The Goal:
Every commit is performance-tested.
Every regression is caught before merge.
Every trend is visible to the team.
In the next chapter, we begin the next part of the book and explore future trends and emerging technologies in performance analysis.