Chapter 4: Presenting Results

Part I: Foundations


"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey

The Chart That Lost Us the Contract

This happened a few years ago. Our team spent three months optimizing a critical module, achieving a 35% performance improvement. We were excited to present the results to the client.

My colleague Mark prepared the presentation. He made a bar chart in Excel:

Performance Comparison

Old System    ████████████████████████████████████████  1000 ms
New System    █████  650 ms

Looks great, right? The new system's bar is clearly much shorter.

That was exactly the problem.

The client's CTO frowned. "Wait, where does the Y-axis start?"

"At 600," Mark said. "It makes the difference more visible."

The room went silent for a few seconds.

"So actually," the CTO said, "you went from 1000 to 650. If the Y-axis started at zero, the difference would look like this:"

Performance Comparison (Y-axis from 0)

Old    ████████████████████████████████████████  1000 ms
New    ██████████████████████████  650 ms

"Your 35% improvement is real," he said. "But your chart made it look closer to 90%. That makes me wonder if your other data has similar problems."

We didn't get that contract.

Common Misleading Chart Techniques

Mark's mistake is common. Here are techniques frequently used (intentionally or not) to exaggerate results:

1. Truncated Y-Axis

The most common trick. Start the Y-axis at a non-zero value, and small differences look huge.

Misleading (Y starts at 94):
A    ██  95
B    ████████████  100

Honest (Y starts at 0):
A    ██████████████████████████████████████  95
B    ████████████████████████████████████████  100

A 5% difference looks like a 6× difference in the truncated version.
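The distortion can be quantified by comparing the ratio of the drawn bar lengths with the ratio of the actual values. The following is a minimal sketch applied to the numbers from Mark's chart; the function name `apparent_ratio` and its parameters are illustrative and not part of any library.

```python
def apparent_ratio(a: float, b: float, axis_start: float = 0.0) -> float:
    """Ratio of the drawn bar lengths (b to a) when the axis starts at axis_start."""
    if min(a, b) <= axis_start:
        raise ValueError("axis_start must be below both values")
    return (b - axis_start) / (a - axis_start)


# Mark's chart: 1000 ms vs 650 ms.
true_ratio = apparent_ratio(650, 1000)        # axis starts at zero
drawn_ratio = apparent_ratio(650, 1000, 600)  # axis truncated at 600

print(f"true ratio:  {true_ratio:.2f}x")
print(f"drawn ratio: {drawn_ratio:.2f}x")
```

When the two ratios differ by a large factor, the chart is telling a different story than the data.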

2. Selective Data Range

Show only the data points that favor you.

"Our product has been ahead for the last three months!"
(But we were behind for the previous six months—not shown)

3. Dual Y-Axis Abuse

Use two Y-axes with different scales to make unrelated trends appear correlated.

Left Y-axis: Sales (0-1000)
Right Y-axis: Temperature (20-25°C)

"Look! Temperature and sales are perfectly correlated!"
(It's just a coincidence from scale manipulation)

4. 3D Effects

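3D charts look fancy but distort visual proportions: perspective makes elements nearer the viewer look larger, and depth makes it hard to read a bar's height against the axis.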

5. Area vs. Length

When using circle sizes to represent values, people confuse area with diameter.

A = 100, B = 200

If using circle area:
  A radius = 10
  B radius = 14.14 (√2 times)

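Visually, B looks only slightly larger, but its value is actually 2×. The opposite mistake exaggerates: if the radius is scaled by the value instead (10 vs 20), B covers 4× the area for a 2× difference.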
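The arithmetic behind the circle example can be checked with a few lines of Python. This is a sketch only: `radius_for_area` and its `scale` parameter are hypothetical names, and the scale is chosen so that a value of 100 maps to a radius of 10, as in the example above.

```python
import math


def radius_for_area(value: float, scale: float = 1.0) -> float:
    """Radius of a circle whose AREA is proportional to value."""
    return scale * math.sqrt(value)


a, b = 100, 200

# Area encodes the value: the radius grows with the square root.
r_a, r_b = radius_for_area(a), radius_for_area(b)
print(f"radii: {r_a:.2f} and {r_b:.2f} (ratio {r_b / r_a:.2f})")

# Radius encodes the value: doubling the value quadruples the area.
area_ratio = (b / a) ** 2
print(f"area ratio when the radius is scaled by value: {area_ratio:.0f}")
```

Either way, the reader's impression depends on whether they compare areas or widths, which is why bar lengths are usually the safer encoding.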

How to Present Benchmark Results Correctly

Rule 1: Start Y-Axis at Zero (Unless You Have Good Reason)

The only exception is when the data range is truly narrow, and you explicitly label it.

OK: "Note: Y-axis starts at 950 to show fine differences"
NOT OK: Sneakily starting at non-zero, hoping nobody notices

Rule 2: Show Uncertainty

Always include error bars. A bar chart without error bars is incomplete.

Performance (ms)
                Mean   [95% CI]
Algorithm A:    100    ████████████████████├──┤
Algorithm B:     95    ███████████████████├────┤

If error bars overlap, the difference may not be significant.
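One way to obtain the interval behind an error bar is sketched below. It assumes a normal approximation (z = 1.96), which is reasonable for large sample counts; for a handful of runs a Student's t quantile would give a wider, more honest interval. The function names are illustrative, and the sample values are the ones used in the matplotlib example later in this chapter.

```python
import math
import statistics
from typing import Sequence, Tuple

Interval = Tuple[float, float, float]  # (mean, lower, upper)


def mean_ci95(samples: Sequence[float]) -> Interval:
    """Mean with an approximate 95% confidence interval (normal approximation)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(samples) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


def intervals_overlap(ci_a: Interval, ci_b: Interval) -> bool:
    """True if the two intervals share at least one point."""
    return ci_a[1] <= ci_b[2] and ci_b[1] <= ci_a[2]


ci_a = mean_ci95([100, 102, 98, 105, 97])
ci_b = mean_ci95([95, 93, 97, 94, 96])
print(f"A: {ci_a[0]:.1f} [{ci_a[1]:.1f}, {ci_a[2]:.1f}]")
print(f"B: {ci_b[0]:.1f} [{ci_b[1]:.1f}, {ci_b[2]:.1f}]")
print("overlap:", intervals_overlap(ci_a, ci_b))
```

The lower and upper values can be passed to a plotting library as error bars; the overlap check is only a quick screen, not a substitute for a statistical test.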

Rule 3: State Sample Size

"N = 1000 runs, 95% confidence interval"

A chart could come from 10 tests or 10,000 tests—the meaning is completely different.
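The reason sample size matters so much is that the uncertainty of a mean shrinks with the square root of the number of runs. The sketch below illustrates this under a normal approximation; the standard deviation of 5 ms is a hypothetical value picked only for illustration.

```python
import math


def ci_half_width(stdev: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width of the mean after n runs."""
    return z * stdev / math.sqrt(n)


stdev_ms = 5.0  # hypothetical run-to-run standard deviation
for n in (10, 100, 1000, 10_000):
    print(f"N = {n:>6}: mean ± {ci_half_width(stdev_ms, n):.2f} ms")
```

Going from 10 runs to 10,000 runs narrows the interval by a factor of about 32, so the same mean supports very different conclusions.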

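Rule 4: Provide Raw Data or Distribution

A mean alone hides the shape of the data. Show the distribution (a box plot or histogram, for example) or publish the raw measurements, so readers can check for outliers and skew themselves.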

Choosing the Right Chart Type

Different data needs different visualization.

Bar Chart: Compare Discrete Categories

Good for: comparing algorithms, systems, configurations

Throughput (ops/sec)

Algorithm A  ████████████████████████  2400
Algorithm B  ██████████████████  1800
Algorithm C  ██████████████████████████████  3000

Note: Bar charts are for independent categories, not trends.

Line Chart: Show Trends

Good for: change over time, change with parameters

Latency vs Data Size

Latency
(ms)
  │                                    ●
  │                               ●
  │                          ●
  │                     ●
  │                ●
  │           ●
  │      ●
  │ ●
  └────────────────────────────────
    1KB   10KB   100KB   1MB   10MB
              Data Size

Note: Line charts imply continuity. If your data is discrete, use bars.

Scatter Plot: Show Correlation or Distribution

Good for: relationships between variables, individual run results

Latency vs Throughput

Latency
  │  ●
  │    ●  ●
  │      ●●●
  │        ●●●●
  │          ●●●●
  │            ●●●
  │              ●●
  │                ●
  └──────────────────────
                  Throughput

Box Plot: Compare Distributions

Good for: comparing spread across multiple groups

Latency by Configuration

           Config A      Config B      Config C
              │             │              │
              ○             │              │    ← outlier
              │             │              │
           ┌──┴──┐       ┌──┴──┐        ┌──┴──┐
           │     │       │     │        │     │
           ├─────┤       ├─────┤        ├─────┤  ← median
           │     │       │     │        │     │
           └──┬──┘       └──┬──┘        └──┬──┘
              │             │              │
              │             ○              │    ← outlier

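Box plots show: median (center line), quartiles (box), range excluding outliers (whiskers), and outliers (dots). Whisker conventions vary; a common one extends each whisker to the most extreme point within 1.5× the interquartile range of the box.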
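The numbers a box plot draws can be computed directly. The sketch below assumes the common 1.5 × IQR rule for whiskers and the "inclusive" quartile method of Python's `statistics.quantiles`; plotting libraries may use slightly different quartile definitions. The function name and the sample latencies are hypothetical.

```python
import statistics
from typing import Dict, List, Sequence, Union


def box_plot_stats(samples: Sequence[float]) -> Dict[str, Union[float, List[float]]]:
    """Median, quartiles, whisker ends, and outliers (1.5 x IQR rule)."""
    data = sorted(samples)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Hypothetical latencies (ms) with one slow run.
stats = box_plot_stats([97, 98, 100, 102, 105, 140])
for name, value in stats.items():
    print(f"{name}: {value}")
```

The single slow run ends up in `outliers` instead of stretching the whisker, which is exactly what the dots in the plot represent.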

Heatmap: Multi-Dimensional Data

Good for: effect of two parameters on performance

Throughput Heatmap

Thread Count
         1    2    4    8    16
      ┌────┬────┬────┬────┬────┐
  1KB │ ░░ │ ▒▒ │ ▓▓ │ ▓▓ │ ▒▒ │
      ├────┼────┼────┼────┼────┤
 10KB │ ░░ │ ▒▒ │ ▓▓ │ ██ │ ▓▓ │
      ├────┼────┼────┼────┼────┤
100KB │ ░░ │ ▒▒ │ ▓▓ │ ██ │ ██ │
      └────┴────┴────┴────┴────┘
Buffer
Size         ░ Low  ▒ Med  ▓ High  █ Best

Log Scale: When to Use It

When data spans multiple orders of magnitude, linear scale makes small values invisible.

Linear Scale:
Algorithm A  █  1 ms
Algorithm B  ██████████████████████████████████████████████████  1000 ms

Log Scale:
Algorithm A  ██████████  1 ms
Algorithm B  ██████████████████████████████████████████████████  1000 ms
                   (3 orders of magnitude difference)

When to use log scale:

  • Data spans 2+ orders of magnitude
  • You care about "ratios" rather than "absolute differences"
  • Comparing different scales (like latency percentiles: p50, p99, p99.9)

Caution: Log scale makes large differences look smaller. Ensure readers understand it's logarithmic.
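The "2+ orders of magnitude" rule of thumb is easy to check before choosing an axis. The helper names below are illustrative, and the threshold is simply the rule from the list above, not a hard limit.

```python
import math
from typing import Sequence


def orders_of_magnitude(values: Sequence[float]) -> float:
    """Powers of ten separating the smallest and largest value."""
    if min(values) <= 0:
        raise ValueError("a log scale requires strictly positive values")
    return math.log10(max(values) / min(values))


def suggest_log_scale(values: Sequence[float], threshold: float = 2.0) -> bool:
    """Apply the rule of thumb: use a log scale at `threshold` or more orders."""
    return orders_of_magnitude(values) >= threshold


print(orders_of_magnitude([1, 1000]))  # Algorithm A and B from the example
print(suggest_log_scale([1, 1000]))
print(suggest_log_scale([95, 100]))
```

Zero or negative values cannot be placed on a log axis at all, which is why the helper rejects them.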

Fair Comparisons

Same Conditions

Comparisons must use identical conditions. If you change multiple variables, you don't know what caused the difference.

Bad:
"Algorithm A on Intel Xeon vs Algorithm B on AMD EPYC"

Good:
"Algorithm A vs Algorithm B, both on Intel Xeon E5-2690"

Baseline Choice

Your choice of baseline affects interpretation.

Scenario 1: A is baseline
  A: 1.00× (baseline)
  B: 1.35× faster

Scenario 2: B is baseline
  A: 0.74× (26% slower)
  B: 1.00× (baseline)

Same data, different narrative. Choose a reasonable baseline (usually "current system" or "industry standard") and be consistent.
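Both scenarios come from the same two measurements, which a short calculation makes explicit. The run times below (135 ms and 100 ms) are hypothetical, chosen only to reproduce the ratios shown above; `relative_speed` is an illustrative name.

```python
def relative_speed(baseline_ms: float, other_ms: float) -> float:
    """Speed of `other` relative to `baseline`; values above 1 mean faster."""
    return baseline_ms / other_ms


# Hypothetical run times chosen to reproduce the ratios above.
time_a_ms, time_b_ms = 135.0, 100.0

b_vs_a = relative_speed(time_a_ms, time_b_ms)  # Scenario 1: A is the baseline
a_vs_b = relative_speed(time_b_ms, time_a_ms)  # Scenario 2: B is the baseline

print(f"A as baseline: B = {b_vs_a:.2f}x")
print(f"B as baseline: A = {a_vs_b:.2f}x ({(1 - a_vs_b) * 100:.0f}% slower)")
```

Note that "35% faster" and "26% slower" describe the same pair of numbers; stating the baseline next to every ratio avoids the ambiguity.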

Avoid Cherry-Picking

Don't only show favorable test cases.

Bad:
"Our system is 3× faster!" (on one specific workload we optimized for)

Good:
"Our system is 3× faster on workload A, 1.2× faster on workload B,
 and 0.9× (10% slower) on workload C"

Report all results honestly, including where you perform worse.
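A report generator can make it hard to hide unfavorable results by formatting every workload the same way. The following sketch uses the three workloads from the "Good" example; the function name and output format are assumptions for illustration.

```python
def describe_speedup(workload: str, speedup: float) -> str:
    """Format one result, spelling out slowdowns instead of hiding them."""
    if speedup >= 1.0:
        return f"{workload}: {speedup:.1f}x faster"
    slowdown_pct = (1.0 - speedup) * 100
    return f"{workload}: {speedup:.1f}x ({slowdown_pct:.0f}% slower)"


# The three workloads from the "Good" example above.
results = {"workload A": 3.0, "workload B": 1.2, "workload C": 0.9}

for name, speedup in results.items():
    print(describe_speedup(name, speedup))
```

Iterating over the full result set, instead of hand-picking entries, keeps the regression on workload C in the report.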

Structure of a Benchmark Report

A complete benchmark report should include:

1. Executive Summary

One paragraph summarizing the key findings, for readers who don't have time to read the full report.

2. Test Environment

## Test Environment

- **Hardware**: Intel Xeon E5-2690 v4 @ 2.6GHz, 128GB RAM
- **OS**: Ubuntu 22.04 LTS, kernel 5.15.0
- **Compiler**: GCC 11.2 with -O3
- **Date**: 2024-01-15
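Part of this section can be collected automatically so it never drifts out of date. The sketch below uses only Python's standard `platform` and `datetime` modules; the function name and dictionary keys are illustrative. It does not capture details such as RAM size or compiler flags, which still need to be recorded by hand or with other tools.

```python
import datetime
import platform
from typing import Dict


def describe_environment() -> Dict[str, str]:
    """Collect basic facts about the machine running the benchmark."""
    return {
        "machine": platform.machine(),
        "processor": platform.processor() or "unknown",
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "date": datetime.date.today().isoformat(),
    }


for key, value in describe_environment().items():
    print(f"- **{key}**: {value}")
```

Storing this output next to the raw measurements makes it possible to tell later which machine produced which numbers.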

3. Methodology

  • How many runs?
  • How many warm-up iterations?
  • How were outliers handled?
  • What statistical methods were used?
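Two of these questions, the number of runs and the number of warm-up iterations, are easiest to answer when they are explicit parameters of the harness. The following is a minimal sketch, not a full benchmarking framework; `run_benchmark`, its defaults, and the sorting workload are all hypothetical.

```python
import time
from typing import Callable, List


def run_benchmark(fn: Callable[[], object], warmup: int = 5, runs: int = 30) -> List[float]:
    """Call fn() `warmup` times untimed, then time `runs` calls (milliseconds)."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization; timing is discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Hypothetical workload: sort 10,000 integers that start in reverse order.
samples_ms = run_benchmark(lambda: sorted(range(10_000, 0, -1)))
print(f"runs = {len(samples_ms)}, "
      f"min = {min(samples_ms):.3f} ms, max = {max(samples_ms):.3f} ms")
```

Returning every sample, instead of only a mean, leaves outlier handling and the choice of statistics as separate, documented steps.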

4. Results

Charts plus data tables. Charts give a visual impression; tables give precise numbers.

5. Analysis

Explain the results. Why is A faster than B? Where are the bottlenecks?

6. Limitations

Honestly state test limitations.

## Limitations

- Tests performed on a single machine; results may vary on different hardware
- Only tested with synthetic workloads; real-world performance may differ
- Memory-bound workloads not covered in this benchmark

7. Raw Data

Provide raw data for readers to analyze themselves (in appendix or via link).
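A plain CSV file with one row per measurement is usually enough for readers to redo the analysis. The sketch below uses Python's standard `csv` module; the column names, the function name, and the sample values are hypothetical.

```python
import csv
import io
from typing import Dict, Sequence, TextIO


def write_raw_samples(out: TextIO, samples: Dict[str, Sequence[float]]) -> None:
    """Write one row per measurement: configuration, run index, value."""
    writer = csv.writer(out)
    writer.writerow(["config", "run", "latency_ms"])
    for config, values in samples.items():
        for run, value in enumerate(values, start=1):
            writer.writerow([config, run, value])


# Hypothetical measurements, written to an in-memory buffer for demonstration.
buffer = io.StringIO()
write_raw_samples(buffer, {"config_a": [1.2, 1.3], "config_b": [0.9]})
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open("raw.csv", "w", newline="")` instead of the buffer.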

Practical Tools

Simple Charts: gnuplot

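# Write an 800x600 PNG
set terminal png size 800,600
set output 'benchmark.png'
set title 'Algorithm Performance'
set xlabel 'Data Size'
set ylabel 'Time (ms)'
# Start the Y-axis at zero; the upper bound is autoscaled
set yrange [0:*]
set style data linespoints
# Assumed data.txt layout: column 1 = data size, 2 = Algorithm A, 3 = Algorithm B
plot 'data.txt' using 1:2 title 'Algorithm A', \
     'data.txt' using 1:3 title 'Algorithm B'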

Statistical Charts: Python + matplotlib

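import matplotlib.pyplot as plt

# Sample measurements; the lists are truncated here, so substitute your full data
data_a = [100, 102, 98, 105, 97]
data_b = [95, 93, 97, 94, 96]

fig, ax = plt.subplots()
# One box per algorithm: median, quartiles, whiskers, and outliers
ax.boxplot([data_a, data_b])
# Boxes are drawn at x = 1, 2 by default; label them explicitly
ax.set_xticks([1, 2])
ax.set_xticklabels(['Algorithm A', 'Algorithm B'])
ax.set_ylabel('Latency (μs)')
ax.set_title('Latency Comparison')
fig.savefig('comparison.png', dpi=150)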

Interactive: Jupyter Notebook

Jupyter Notebooks let you combine code, data, charts, and analysis text in one place—easy to reproduce and share.

Back to Mark's Story

After that failure, our team established visualization standards:

  1. Y-axis starts at zero (unless explicitly labeled)
  2. Always include error bars
  3. State sample size and test environment
  4. Provide raw data
  5. Report all results honestly, including bad ones

Six months later, we had another chance to present to the same client. This time our charts didn't look as "impressive," but the CTO said:

"This is how I want to see data presented. Your improvement is 35%, and your chart clearly shows 35%—no more, no less. That makes me trust your other data too."

We got that contract.

Summary

Presenting benchmark results correctly is as important as measuring correctly. This chapter covered:

Avoiding Misleading Charts

  • Don't truncate Y-axis (unless clearly labeled)
  • Don't cherry-pick data ranges
  • Avoid 3D effects and misleading area comparisons

Correct Visualization

  • Always show error bars
  • State sample size and test conditions
  • Provide raw data or distributions

Choosing the Right Chart

  • Bar chart: compare discrete categories
  • Line chart: show trends
  • Scatter plot: correlation
  • Box plot: compare distributions
  • Heatmap: multi-dimensional data

Fair Comparisons

  • Compare under identical conditions
  • Choose a reasonable baseline
  • Report all results, not just favorable ones

Complete Reports

  • Executive summary
  • Test environment and methodology
  • Results and analysis
  • Limitations
  • Raw data