Chapter 4: Presenting Results
Part I: Foundations
"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey
The Chart That Lost Us the Contract
This happened a few years ago. Our team spent three months optimizing a critical module and cut its execution time by 35%, from 1000 ms to 650 ms. We were excited to present the results to the client.
My colleague Mark prepared the presentation. He made a bar chart in Excel:
Performance Comparison
Old System ████████████████████████████████████████ 1000 ms
New System █████ 650 ms
Looks great, right? The new system's bar is clearly much shorter.
But that is exactly the problem.
The client's CTO frowned. "Wait, where does the Y-axis start?"
"At 600," Mark said. "It makes the difference more visible."
The room went silent for a few seconds.
"So actually," the CTO said, "you went from 1000 to 650. If the Y-axis started at zero, the difference would look like this:"
Performance Comparison (Y-axis from 0)
Old ████████████████████████████████████████████████████████████████████████████████ 1000 ms
New █████████████████████████████████████████████████████████████████ 650 ms
"Your 35% improvement is real," he said. "But your chart made it look closer to 90%. That makes me wonder if your other data has similar problems."
We didn't get that contract.
Common Misleading Chart Techniques
Mark's mistake is common. Here are techniques frequently used (intentionally or not) to exaggerate results:
1. Truncated Y-Axis
The most common trick. Start the Y-axis at a non-zero value, and small differences look huge.
Misleading (Y starts at 94):
A ██ 95
B ████████████ 100
Honest (Y starts at 0):
A ███████████████████████████████████████████████ 95
B ██████████████████████████████████████████████████ 100
A 5% difference looks like a 6× difference in the truncated version.
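The distortion can be put into numbers. The following Python sketch compares the ratio a reader sees in the bars with the true ratio of the values, using the figures from Mark's chart. The function name `apparent_ratio` is made up for this illustration.

```python
def apparent_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar lengths for values a and b.

    A bar is drawn from axis_start to its value, so its visible
    length is (value - axis_start), not the value itself.
    """
    if min(a, b) <= axis_start:
        raise ValueError("both values must lie above the axis start")
    return (a - axis_start) / (b - axis_start)


# Mark's chart: 1000 ms (old) vs 650 ms (new)
true_ratio = apparent_ratio(1000, 650)                   # axis starts at 0
drawn_ratio = apparent_ratio(1000, 650, axis_start=600)  # axis starts at 600

print(f"true ratio:  {true_ratio:.2f}x")
print(f"drawn ratio: {drawn_ratio:.2f}x")
```

With the axis starting at 600, the old bar is drawn eight times as long as the new one, although the measured times differ by a factor of only about 1.5.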
2. Selective Data Range
Show only the data points that favor you.
"Our product has been ahead for the last three months!"
(But we were behind for the previous six months—not shown)
3. Dual Y-Axis Abuse
Use two Y-axes with different scales to make unrelated trends appear correlated.
Left Y-axis: Sales (0-1000)
Right Y-axis: Temperature (20-25°C)
"Look! Temperature and sales are perfectly correlated!"
(It's just a coincidence from scale manipulation)
4. 3D Effects
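3D charts look fancy but distort visual proportions: perspective makes nearer bars and slices look larger than farther ones of the same value, and the added depth makes it hard to read a bar's height against the axis.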
5. Area vs. Length
When using circle sizes to represent values, people confuse area with diameter.
A = 100, B = 200
If using circle area:
A radius = 10
B radius = 14.14 (√2 times)
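Visually, B looks only slightly larger, but it's actually 2×.
The opposite mistake exaggerates: if the radius is scaled by the value instead (B radius = 20), B's area is 4× A's, so a 2× difference looks like 4×.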
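The arithmetic behind the circle example can be checked with a few lines of Python. This is a minimal sketch; `radius_for_value` is a hypothetical helper, and it assumes the chart is meant to make circle area proportional to the value.

```python
import math


def radius_for_value(value, ref_value, ref_radius):
    """Radius that makes circle AREA proportional to value.

    Area grows with the square of the radius, so the radius must
    grow with the square root of the value.
    """
    if value < 0 or ref_value <= 0:
        raise ValueError("values must be non-negative, reference positive")
    return ref_radius * math.sqrt(value / ref_value)


r_a = radius_for_value(100, ref_value=100, ref_radius=10)
r_b = radius_for_value(200, ref_value=100, ref_radius=10)
area_ratio = (math.pi * r_b ** 2) / (math.pi * r_a ** 2)

print(f"A radius = {r_a:.2f}, B radius = {r_b:.2f}")
print(f"area ratio B/A = {area_ratio:.2f}")
```

Whichever mapping is used, it helps to print the value next to each circle so readers do not have to estimate areas by eye.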
How to Present Benchmark Results Correctly
Rule 1: Start Y-Axis at Zero (Unless You Have Good Reason)
The only exception is when the data range is truly narrow, and you explicitly label it.
OK: "Note: Y-axis starts at 950 to show fine differences"
NOT OK: Sneakily starting at non-zero, hoping nobody notices
Rule 2: Show Uncertainty
Always include error bars. A bar chart without error bars is incomplete.
Performance (ms)
Mean [95% CI]
Algorithm A: 100 ████████████████████├──┤
Algorithm B: 95 ███████████████████├────┤
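If error bars overlap, the difference may not be statistically significant. Here the intervals for A and B overlap, so the 5 ms gap alone does not prove that B is faster.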
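One way to compute the numbers behind such error bars is sketched below. It uses a normal approximation for the confidence interval, which is an assumption that only holds reasonably well for larger sample counts; with a handful of runs, a t-based interval would be wider. The helper names and the sample values are made up for illustration.

```python
import math
import statistics


def mean_ci(samples, confidence=0.95):
    """Return (mean, low, high) using a normal-approximation interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    sem = statistics.stdev(samples) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95%
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem


def intervals_overlap(ci_a, ci_b):
    """True if the (mean, low, high) intervals share any point."""
    _, low_a, high_a = ci_a
    _, low_b, high_b = ci_b
    return low_a <= high_b and low_b <= high_a


# Illustrative latencies in ms (not real measurements)
runs_a = [100, 102, 98, 105, 97, 101, 99, 103]
runs_b = [95, 93, 97, 94, 96, 98, 92, 95]

for name, runs in (("A", runs_a), ("B", runs_b)):
    mean, low, high = mean_ci(runs)
    print(f"Algorithm {name}: {mean:.1f} ms [{low:.1f}, {high:.1f}], N = {len(runs)}")

print("overlap:", intervals_overlap(mean_ci(runs_a), mean_ci(runs_b)))
```

Overlap is only a rough visual check; when a decision depends on the result, a proper statistical test on the raw samples is the safer route.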
Rule 3: State Sample Size
"N = 1000 runs, 95% confidence interval"
A chart could come from 10 tests or 10,000 tests—the meaning is completely different.
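Rule 4: Provide Raw Data or Distribution
A single mean hides the shape of the data. Show the distribution, or provide the raw measurements so readers can analyze them themselves.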
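A distribution does not need a plotting library to be shown. The sketch below prints a text histogram in the same bar style used in this chapter; the function name, the bin width, and the sample values are assumptions chosen for the example.

```python
import math
from collections import Counter


def text_histogram(samples, bin_width, max_bar=40):
    """Return one text line per bin, including empty bins in between."""
    counts = Counter(math.floor(s / bin_width) for s in samples)
    peak = max(counts.values())
    lines = []
    for b in range(min(counts), max(counts) + 1):
        n = counts.get(b, 0)
        # Scale bars so the fullest bin is max_bar characters long
        bar = "█" * round(n * max_bar / peak)
        low = b * bin_width
        lines.append(f"{low:>5g}-{low + bin_width:<5g} {bar} {n}")
    return lines


# Illustrative latencies in ms (not real measurements)
latencies = [100, 101, 102, 104, 107, 110, 111, 131]
for line in text_histogram(latencies, bin_width=10, max_bar=30):
    print(line)
```

Empty bins are printed on purpose: a gap between the main cluster and a slow outlier is exactly the kind of detail a mean would hide.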
Choosing the Right Chart Type
Different data calls for different visualizations.
Bar Chart: Compare Discrete Categories
Good for: comparing algorithms, systems, configurations
Throughput (ops/sec)
Algorithm A ████████████████████████ 2400
Algorithm B ██████████████████ 1800
Algorithm C ██████████████████████████████ 3000
Note: Bar charts are for independent categories, not trends.
Line Chart: Show Trends
Good for: change over time, change with parameters
Latency vs Data Size
Latency
(ms)
  │                              ●
  │                          ●
  │                      ●
  │                  ●
  │              ●
  │          ●
  │      ●
  │  ●
  └────────────────────────────────
   1KB    10KB   100KB   1MB   10MB
              Data Size
Note: Line charts imply continuity. If your data is discrete, use bars.
Scatter Plot: Show Correlation or Distribution
Good for: relationships between variables, individual run results
Latency vs Throughput
Latency
  │                   ●
  │                ● ●
  │              ●●●
  │          ●●●●
  │      ●●●●
  │   ●●●
  │  ●●
  │ ●
  └──────────────────────
         Throughput
Box Plot: Compare Distributions
Good for: comparing spread across multiple groups
Latency by Configuration
  Config A   Config B   Config C
     │          │          │
     ○          │          │      ← outlier
     │          │          │
  ┌──┴──┐    ┌──┴──┐    ┌──┴──┐
  │     │    │     │    │     │
  ├─────┤    ├─────┤    ├─────┤   ← median
  │     │    │     │    │     │
  └──┬──┘    └──┬──┘    └──┬──┘
     │          │          │
     │          ○          │      ← outlier
Box plots show: median (center line), quartiles (box), range (whiskers), outliers (dots).
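The numbers a box plot draws can be computed directly. The sketch below assumes the common convention that whiskers extend to the most extreme points within 1.5 × IQR of the box and that anything beyond is an outlier; other conventions exist, and the function name is made up for this example.

```python
import statistics


def box_plot_stats(samples, whisker=1.5):
    """Quartiles, whisker ends, and outliers for one group of samples."""
    data = sorted(samples)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],     # lowest point that is not an outlier
        "whisker_high": inside[-1],   # highest point that is not an outlier
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Illustrative latencies in ms with one slow run (not real measurements)
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
for key, value in stats.items():
    print(f"{key}: {value}")
```

Stating the whisker convention in the caption avoids ambiguity, since different tools use different defaults.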
Heatmap: Multi-Dimensional Data
Good for: effect of two parameters on performance
Throughput Heatmap
                Thread Count
          1    2    4    8    16
        ┌────┬────┬────┬────┬────┐
  1KB   │ ░░ │ ▒▒ │ ▓▓ │ ▓▓ │ ▒▒ │
        ├────┼────┼────┼────┼────┤
  10KB  │ ░░ │ ▒▒ │ ▓▓ │ ██ │ ▓▓ │
        ├────┼────┼────┼────┼────┤
  100KB │ ░░ │ ▒▒ │ ▓▓ │ ██ │ ██ │
        └────┴────┴────┴────┴────┘
  Buffer
  Size      ░ Low  ▒ Med  ▓ High  █ Best
Log Scale: When to Use It
When data spans multiple orders of magnitude, linear scale makes small values invisible.
Linear Scale:
Algorithm A █ 1 ms
Algorithm B ██████████████████████████████████████████████████ 1000 ms
Log Scale:
Algorithm A ██████████ 1 ms
Algorithm B ██████████████████████████████████████████████████ 1000 ms
(3 orders of magnitude difference)
When to use log scale:
- Data spans 2+ orders of magnitude
- You care about "ratios" rather than "absolute differences"
- Comparing different scales (like latency percentiles: p50, p99, p99.9)
Caution: Log scale makes large differences look smaller. Ensure readers understand it's logarithmic.
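To see why the two scales look so different, the sketch below computes how many characters a bar gets on a linear and on a logarithmic axis. The function name and the axis limits (0 to 1000 ms linear, 0.1 to 1000 ms logarithmic) are assumptions for this example.

```python
import math


def bar_length(value, axis_min, axis_max, width=40, log=False):
    """Number of characters for a bar drawn from axis_min to value."""
    if log:
        if value <= 0 or axis_min <= 0:
            raise ValueError("a log axis needs strictly positive values")
        span = math.log10(axis_max) - math.log10(axis_min)
        frac = (math.log10(value) - math.log10(axis_min)) / span
    else:
        frac = (value - axis_min) / (axis_max - axis_min)
    # Clamp to the axis so out-of-range values do not overflow the chart
    return round(max(0.0, min(1.0, frac)) * width)


for name, ms in (("Algorithm A", 1), ("Algorithm B", 1000)):
    linear = bar_length(ms, 0, 1000)
    logarithmic = bar_length(ms, 0.1, 1000, log=True)
    print(f"{name}: linear {linear} chars, log {logarithmic} chars")
```

A log axis has no zero, so the bar lengths depend on where the axis starts. That is one more reason to label the axis range explicitly whenever a log scale is used.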
Fair Comparisons
Same Conditions
Comparisons must use identical conditions. If you change multiple variables, you don't know what caused the difference.
Bad:
"Algorithm A on Intel Xeon vs Algorithm B on AMD EPYC"
Good:
"Algorithm A vs Algorithm B, both on Intel Xeon E5-2690"
Baseline Choice
Your choice of baseline affects interpretation.
Scenario 1: A is baseline
A: 1.00× (baseline)
B: 1.35× faster
Scenario 2: B is baseline
A: 0.74× (26% slower)
B: 1.00× (baseline)
Same data, different narrative. Choose a reasonable baseline (usually "current system" or "industry standard") and be consistent.
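The two scenarios are the same measurement seen from two sides, which a short calculation makes explicit. The timings below (A = 135 ms, B = 100 ms) are hypothetical values chosen only to reproduce the 1.35× figure, and the helper names are made up.

```python
def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def time_change_percent(baseline_ms, candidate_ms):
    """Change in run time relative to the baseline (negative = less time)."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


a_ms, b_ms = 135.0, 100.0  # hypothetical timings

print(f"A as baseline: B runs at {speedup(a_ms, b_ms):.2f}x")
print(f"B as baseline: A runs at {speedup(b_ms, a_ms):.2f}x")
print(f"B needs {time_change_percent(a_ms, b_ms):+.1f}% time compared with A")
print(f"A needs {time_change_percent(b_ms, a_ms):+.1f}% time compared with B")
```

Note that "1.35× faster", "26% lower speed", and "26% less time" are different statements about the same pair of numbers, so a report should say which quantity and which baseline it uses.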
Avoid Cherry-Picking
Don't only show favorable test cases.
Bad:
"Our system is 3× faster!" (on one specific workload we optimized for)
Good:
"Our system is 3× faster on workload A, 1.2× faster on workload B,
and 0.9× (10% slower) on workload C"
Report all results honestly, including where you perform worse.
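A report generator can enforce this by printing every workload, regressions included. The sketch below uses the three speedups from the example above. The geometric mean at the end is an addition, not something this chapter prescribes; it is one common way to summarize ratios, and the function name is made up.

```python
import statistics


def describe(speedup):
    """Label a speedup factor the same way for wins and losses."""
    if speedup > 1.0:
        return f"{speedup:.1f}x faster"
    if speedup < 1.0:
        return f"{speedup:.1f}x ({(1.0 - speedup) * 100:.0f}% slower)"
    return "1.0x (no change)"


# Speedups from the example above, one entry per workload
results = {"workload A": 3.0, "workload B": 1.2, "workload C": 0.9}

for workload, factor in results.items():
    print(f"{workload}: {describe(factor)}")

# Optional single-number summary across all workloads
overall = statistics.geometric_mean(results.values())
print(f"geometric mean: {overall:.2f}x")
```

The summary number belongs next to the full table, never in place of it.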
Structure of a Benchmark Report
A complete benchmark report should include:
1. Executive Summary
One paragraph summarizing the key findings, written for readers who don't have time for the full report.
2. Test Environment
## Test Environment
- **Hardware**: Intel Xeon E5-2690 v4 @ 2.6GHz, 128GB RAM
- **OS**: Ubuntu 22.04 LTS, kernel 5.15.0
- **Compiler**: GCC 11.2 with -O3
- **Date**: 2024-01-15
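Part of this section can be filled in automatically so it never goes stale. The sketch below uses only Python's standard `platform` and `datetime` modules; the function name is made up, the compiler string is passed in by hand, and details such as RAM size or exact CPU model are not reliably available this way and would still need to be added manually.

```python
import datetime
import platform


def environment_report(compiler="unknown"):
    """Build a Test Environment block from what the system reports."""
    # platform.processor() can be empty on some systems; fall back to the
    # machine architecture in that case.
    cpu = platform.processor() or platform.machine() or "unknown"
    lines = [
        "## Test Environment",
        f"- **Hardware**: {cpu}",
        f"- **OS**: {platform.system()} {platform.release()}",
        f"- **Compiler**: {compiler}",
        f"- **Date**: {datetime.date.today().isoformat()}",
    ]
    return "\n".join(lines)


print(environment_report(compiler="GCC 11.2 with -O3"))
```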
3. Methodology
- How many runs?
- How many warm-up iterations?
- How were outliers handled?
- What statistical methods were used?
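The first two questions are easiest to answer when the measurement loop makes them explicit parameters. The following is a minimal sketch of such a loop, not a full benchmarking framework; `run_benchmark`, `workload`, and the default counts are assumptions for illustration.

```python
import time


def run_benchmark(fn, warmup=5, runs=30):
    """Call fn warmup times unmeasured, then return runs timings in ms."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization; results discarded
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return samples_ms


def workload():
    # Stand-in for the code under test
    return sum(i * i for i in range(10_000))


samples = run_benchmark(workload, warmup=3, runs=10)
print(f"N = {len(samples)} runs after 3 warm-up iterations")
print(f"min {min(samples):.3f} ms, max {max(samples):.3f} ms")
```

Returning every sample instead of a single average keeps the later decisions, such as outlier handling and the choice of statistics, visible and reviewable.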
4. Results
Charts plus data tables. Charts give a visual impression; tables give precise numbers.
5. Analysis
Explain the results. Why is A faster than B? Where are the bottlenecks?
6. Limitations
Honestly state test limitations.
## Limitations
- Tests performed on a single machine; results may vary on different hardware
- Only tested with synthetic workloads; real-world performance may differ
- Memory-bound workloads not covered in this benchmark
7. Raw Data
Provide raw data for readers to analyze themselves (in appendix or via link).
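A plain CSV file with one row per run is usually enough. The sketch below writes to an in-memory buffer so it runs anywhere; the column names, the function name, and the sample values are assumptions for this example, and a real report would write to a file instead.

```python
import csv
import io


def write_raw_data(stream, results):
    """Write one CSV row per individual run.

    results maps an algorithm name to its list of latencies in ms.
    """
    writer = csv.writer(stream)
    writer.writerow(["algorithm", "run", "latency_ms"])
    for name, samples in results.items():
        for run, value in enumerate(samples, start=1):
            writer.writerow([name, run, value])


# Illustrative values (not real measurements)
buffer = io.StringIO()
write_raw_data(buffer, {"A": [100.0, 102.0], "B": [95.0]})
print(buffer.getvalue())
```

Keeping one row per run, rather than pre-computed averages, lets readers redo the statistics with their own methods.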
Practical Tools
Simple Charts: gnuplot
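# Render to an 800x600 PNG file
set terminal png size 800,600
set output 'benchmark.png'
set title 'Algorithm Performance'
set xlabel 'Data Size'
set ylabel 'Time (ms)'
# Start the Y-axis at zero; let gnuplot choose the upper bound
set yrange [0:*]
set style data linespoints
# data.txt columns: 1 = data size, 2 = Algorithm A time, 3 = Algorithm B time
plot 'data.txt' using 1:2 title 'Algorithm A', \
     'data.txt' using 1:3 title 'Algorithm B'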
Statistical Charts: Python + matplotlib
import matplotlib.pyplot as plt

# Latency samples in microseconds; the lists are truncated here for brevity
data_a = [100, 102, 98, 105, 97]  # ...
data_b = [95, 93, 97, 94, 96]  # ...

fig, ax = plt.subplots()
# One box per algorithm: median, quartiles, whiskers, and outliers
ax.boxplot([data_a, data_b])
ax.set_xticklabels(['Algorithm A', 'Algorithm B'])
ax.set_ylabel('Latency (μs)')
ax.set_title('Latency Comparison')
plt.savefig('comparison.png', dpi=150)
Interactive: Jupyter Notebook
Jupyter Notebooks let you combine code, data, charts, and analysis text in one place—easy to reproduce and share.
Back to Mark's Story
After that failure, our team established visualization standards:
- Y-axis starts at zero (unless explicitly labeled)
- Always include error bars
- State sample size and test environment
- Provide raw data
- Report all results honestly, including bad ones
Six months later, we had another chance to present to the same client. This time our charts didn't look as "impressive," but the CTO said:
"This is how I want to see data presented. Your improvement is 35%, and your chart clearly shows 35%—no more, no less. That makes me trust your other data too."
We got that contract.
Summary
Presenting benchmark results correctly is as important as measuring correctly. This chapter covered:
Avoiding Misleading Charts
- Don't truncate Y-axis (unless clearly labeled)
- Don't cherry-pick data ranges
- Avoid 3D effects and misleading area comparisons
Correct Visualization
- Always show error bars
- State sample size and test conditions
- Provide raw data or distributions
Choosing the Right Chart
- Bar chart: compare discrete categories
- Line chart: show trends
- Scatter plot: correlation
- Box plot: compare distributions
- Heatmap: multi-dimensional data
Fair Comparisons
- Compare under identical conditions
- Choose a reasonable baseline
- Report all results, not just favorable ones
Complete Reports
- Executive summary
- Test environment and methodology
- Results and analysis
- Limitations
- Raw data