Chapter 11: Galactic Algorithms

Part III: Theory

"In theory, there is no difference between theory and practice. In practice, there is." — Yogi Berra

The Story of an O(n) Algorithm Losing to O(n²)

I once encountered a classic problem in an interview: find two numbers in an array that sum to a target value.

The candidate gave me an O(n²) brute-force solution. I smugly said, "This can be optimized to O(n) with a hash table."

The interviewer asked, "What if the array only has 10 elements?"

"Uh... O(n) is still faster, right?"

He shook his head. "Let's measure."

We measured. The O(n²) version was 3× faster.

Why? Because 10 elements in a nested loop means only 45 comparisons, while the hash table needs:

Hash computation (possibly more expensive than comparison)
Hash collision handling
Memory allocation
Cache misses (hash table isn't contiguous)

This is Big-O's blind spot: it only tells you behavior as n approaches infinity, but in reality n is often finite.

What Big-O Really Means

What Is Big-O

Big-O describes asymptotic behavior: the growth rate as n approaches infinity.

T(n) = 5n² + 3n + 100

As n → ∞:
- 100 can be ignored
- 3n can be ignored
- Coefficient 5 can be ignored

Conclusion: T(n) = O(n²)

What Big-O Ignores

1. Constant Factors

Algorithm A: T(n) = 1000n      → O(n)
Algorithm B: T(n) = n²         → O(n²)

Crossover point: 1000n = n²  →  n = 1000

When n < 1000, B is faster than A!

2. Lower-Order Terms

T(n) = n² + 1000000n

Big-O says this is O(n²), but when n < 1000000,
the linear term dominates execution time.

3. Actual Hardware Characteristics

Cache behavior
Branch prediction
SIMD friendliness
Memory allocation

Classic Example: Insertion Sort vs Merge Sort

Insertion Sort: O(n²)
Merge Sort:     O(n log n)

But:
- Insertion sort has small constant factors
- Merge sort needs extra space and copying
- Insertion sort is cache-friendly

Measured: n < 32, insertion sort is usually faster

This is why std::sort and Arrays.sort are hybrid sorts—quicksort/mergesort for large arrays, switching to insertion sort for small ones.

Galactic Algorithms: Theoretically Optimal Monsters

Some algorithms are "optimal" in Big-O terms but nobody actually uses them. These are called Galactic Algorithms—only faster at galactic-scale input sizes.

Example 1: Matrix Multiplication

Basic algorithm: O(n³)

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];

Strassen's algorithm: O(n^2.807)

In 1969, Strassen discovered you could use 7 multiplications instead of 8, recursively.

Faster algorithms:

Year	Algorithm	Complexity
1969	Strassen	O(n^2.807)
1990	Coppersmith-Winograd	O(n^2.376)
2014	Le Gall	O(n^2.3728639)
2020	Alman-Williams	O(n^2.3728596)

What's actually used: Strassen (sometimes) + highly optimized O(n³)

Why not use O(n^2.37) algorithms?

Coppersmith-Winograd's constant factor estimated at over 10^20

Need n > 10^20 to be faster than Strassen
A 10^20 × 10^20 matrix needs more memory than atoms in the universe

Example 2: Integer Multiplication

Grade school algorithm: O(n²)

Karatsuba: O(n^1.585) - 1960

Schönhage-Strassen: O(n log n log log n) - FFT-based

Constant Factors: The Ignored Elephant

A Tale of Cycles

Suppose we compare two algorithms:

Algorithm A: 10n operations, each takes 1 cycle
Algorithm B: n operations, each takes 100 cycles

A total: 10n cycles
B total: 100n cycles

A is 10× faster, even though it has 10× more "operations"

What operations take 100 cycles?

Operation	Approximate cycles
Integer addition	1
Integer multiplication	3-4
Integer division	20-80
L1 cache hit	4
L2 cache hit	12
L3 cache hit	40
DRAM access	100-300
Branch misprediction	15-20
System call	1000+

Real Example: Hash Table vs Array

// Version 1: Hash table lookup O(1)
int find_hash(hash_table *ht, int key) {
    int hash = compute_hash(key);        // ~10 cycles
    int idx = hash % ht->size;           // ~20 cycles (division!)
    // Possible collision handling...
    return ht->buckets[idx];             // Possible cache miss
}

// Version 2: Linear search O(n)
int find_linear(int *arr, int n, int key) {
    for (int i = 0; i < n; i++)          // Sequential access, cache-friendly
        if (arr[i] == key) return i;
    return -1;
}

Crossover point is around n = 10-50, depending on implementation details.

Cache Impact

// Version 1: Random access
int sum_random(int *arr, int *indices, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += arr[indices[i]];  // Possible cache miss every time
    return sum;
}

// Version 2: Sequential access
int sum_sequential(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += arr[i];  // Prefetcher can predict
    return sum;
}

Measured difference can be 10-50×, even though Big-O is O(n) for both.

Why Theory and Practice Diverge

Reason 1: Memory Hierarchy

Big-O assumes all memory accesses cost the same.

Theoretical model:
┌──────────────────────────────────┐
│              Memory              │
│         (uniform cost)           │
└──────────────────────────────────┘

Reality:
┌────────┐
│ L1 (4) │  ← 4 cycles
├────────┤
│ L2 (12)│  ← 12 cycles
├────────┤
│ L3 (40)│  ← 40 cycles
├────────┤
│RAM(200)│  ← 200 cycles
└────────┘

Memory hierarchy-friendly algorithms can beat theoretically faster ones.

Reason 2: Branch Prediction

// Version 1: With branches
int sum_positive(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (arr[i] > 0)
            sum += arr[i];
    return sum;
}

// Version 2: Branchless
int sum_positive_branchless(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int mask = arr[i] >> 31;  // 0 if positive, -1 if negative
        sum += arr[i] & ~mask;
    }
    return sum;
}

If positive/negative distribution is random, version 2 can be 2-3× faster (avoiding branch misprediction).

Reason 3: SIMD

// Scalar
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

// SIMD (AVX2)
for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);
    __m256 vb = _mm256_load_ps(&b[i]);
    __m256 vc = _mm256_add_ps(va, vb);
    _mm256_store_ps(&c[i], vc);
}

SIMD version can be 4-8× faster, but Big-O is O(n) for both.

Practical Advice

1. Don't Prematurely Optimize Complexity

Bad:  "This is O(n²), must change to O(n log n)!"
Good: "This is O(n²), what's the max n? 100? Doesn't matter."

2. Profile First, Then Optimize

Bad:  Spend a week switching to theoretically faster algorithm
Good: Use perf to find bottleneck is cache misses, adjust data layout

3. Know Your n

n Range	Optimization Strategy
n < 10	Write whatever, prioritize readability
n < 100	Simple algorithms, watch constant factors
n < 10000	Algorithm starts to matter
n > 100000	Algorithm very important
n > 10^9	Algorithm + parallelization + distributed

4. Consider Hybrid Methods

def smart_sort(arr):
    if len(arr) < 20:
        return insertion_sort(arr)
    else:
        return quicksort(arr)

5. Benchmark Real Workloads

Don't just test worst case or best case. Test:

Typical input sizes
Typical input distributions
On real hardware

Summary

Big-O Limitations

Ignores constant factors
Ignores lower-order terms
Assumes n → ∞
Doesn't consider hardware characteristics

Galactic Algorithms

Theoretically optimal
Practically unusable
Constant factors too large for the universe

Constant Factor Sources

Cache miss vs hit (10-100×)
Branch misprediction (15-20 cycles)
Division vs multiplication (10-20×)
System calls (1000+ cycles)

Practical Advice

Know your n
Profile first, then optimize
Consider hybrid methods
Test real workloads
Don't blindly trust Big-O

Remember

O(n) with cache misses < O(n²) with cache hits
...when n is small enough

O(n log n) Quicksort < O(n) Radix Sort
...when n is small enough

Theory is the map, measurement is the territory.
When map and territory disagree, trust the territory.

Performance and Benchmarking