Chapter 11: Galactic Algorithms
Part III: Theory
"In theory, there is no difference between theory and practice. In practice, there is." — Yogi Berra
The Story of an O(n) Algorithm Losing to O(n²)
I once encountered a classic problem in an interview: find two numbers in an array that sum to a target value.
The candidate gave me an O(n²) brute-force solution. I smugly said, "This can be optimized to O(n) with a hash table."
The interviewer asked, "What if the array only has 10 elements?"
"Uh... O(n) is still faster, right?"
He shook his head. "Let's measure."
We measured. The O(n²) version was 3× faster.
Why? Because 10 elements in a nested loop means only 45 comparisons, while the hash table needs:
- Hash computation (possibly more expensive than comparison)
- Hash collision handling
- Memory allocation
- Cache misses (hash table isn't contiguous)
This is Big-O's blind spot: it only tells you behavior as n approaches infinity, but in reality n is often finite.
What Big-O Really Means
What Is Big-O
Big-O describes asymptotic behavior: the growth rate as n approaches infinity.
T(n) = 5n² + 3n + 100
As n → ∞:
- 100 can be ignored
- 3n can be ignored
- Coefficient 5 can be ignored
Conclusion: T(n) = O(n²)
What Big-O Ignores
1. Constant Factors
Algorithm A: T(n) = 1000n → O(n)
Algorithm B: T(n) = n² → O(n²)
Crossover point: 1000n = n² → n = 1000
When n < 1000, B is faster than A!
2. Lower-Order Terms
T(n) = n² + 1000000n
Big-O says this is O(n²), but when n < 1000000,
the linear term dominates execution time.
3. Actual Hardware Characteristics
- Cache behavior
- Branch prediction
- SIMD friendliness
- Memory allocation
Classic Example: Insertion Sort vs Merge Sort
Insertion Sort: O(n²)
Merge Sort: O(n log n)
But:
- Insertion sort has small constant factors
- Merge sort needs extra space and copying
- Insertion sort is cache-friendly
Measured: n < 32, insertion sort is usually faster
This is why std::sort and Arrays.sort are hybrid sorts—quicksort/mergesort for large arrays, switching to insertion sort for small ones.
Galactic Algorithms: Theoretically Optimal Monsters
Some algorithms are "optimal" in Big-O terms but nobody actually uses them. These are called Galactic Algorithms—only faster at galactic-scale input sizes.
Example 1: Matrix Multiplication
Basic algorithm: O(n³)
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
for (int k = 0; k < n; k++)
C[i][j] += A[i][k] * B[k][j];
Strassen's algorithm: O(n^2.807)
In 1969, Strassen discovered you could use 7 multiplications instead of 8, recursively.
Faster algorithms:
| Year | Algorithm | Complexity |
|---|---|---|
| 1969 | Strassen | O(n^2.807) |
| 1990 | Coppersmith-Winograd | O(n^2.376) |
| 2014 | Le Gall | O(n^2.3728639) |
| 2020 | Alman-Williams | O(n^2.3728596) |
What's actually used: Strassen (sometimes) + highly optimized O(n³)
Why not use O(n^2.37) algorithms?
Coppersmith-Winograd's constant factor estimated at over 10^20
Need n > 10^20 to be faster than Strassen
A 10^20 × 10^20 matrix needs more memory than atoms in the universe
Example 2: Integer Multiplication
Grade school algorithm: O(n²)
Karatsuba: O(n^1.585) - 1960
Schönhage-Strassen: O(n log n log log n) - FFT-based
Constant Factors: The Ignored Elephant
A Tale of Cycles
Suppose we compare two algorithms:
Algorithm A: 10n operations, each takes 1 cycle
Algorithm B: n operations, each takes 100 cycles
A total: 10n cycles
B total: 100n cycles
A is 10× faster, even though it has 10× more "operations"
What operations take 100 cycles?
| Operation | Approximate cycles |
|---|---|
| Integer addition | 1 |
| Integer multiplication | 3-4 |
| Integer division | 20-80 |
| L1 cache hit | 4 |
| L2 cache hit | 12 |
| L3 cache hit | 40 |
| DRAM access | 100-300 |
| Branch misprediction | 15-20 |
| System call | 1000+ |
Real Example: Hash Table vs Array
// Version 1: Hash table lookup O(1)
int find_hash(hash_table *ht, int key) {
int hash = compute_hash(key); // ~10 cycles
int idx = hash % ht->size; // ~20 cycles (division!)
// Possible collision handling...
return ht->buckets[idx]; // Possible cache miss
}
// Version 2: Linear search O(n)
int find_linear(int *arr, int n, int key) {
for (int i = 0; i < n; i++) // Sequential access, cache-friendly
if (arr[i] == key) return i;
return -1;
}
Crossover point is around n = 10-50, depending on implementation details.
Cache Impact
// Version 1: Random access
int sum_random(int *arr, int *indices, int n) {
int sum = 0;
for (int i = 0; i < n; i++)
sum += arr[indices[i]]; // Possible cache miss every time
return sum;
}
// Version 2: Sequential access
int sum_sequential(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++)
sum += arr[i]; // Prefetcher can predict
return sum;
}
Measured difference can be 10-50×, even though Big-O is O(n) for both.
Why Theory and Practice Diverge
Reason 1: Memory Hierarchy
Big-O assumes all memory accesses cost the same.
Theoretical model:
┌──────────────────────────────────┐
│ Memory │
│ (uniform cost) │
└──────────────────────────────────┘
Reality:
┌────────┐
│ L1 (4) │ ← 4 cycles
├────────┤
│ L2 (12)│ ← 12 cycles
├────────┤
│ L3 (40)│ ← 40 cycles
├────────┤
│RAM(200)│ ← 200 cycles
└────────┘
Memory hierarchy-friendly algorithms can beat theoretically faster ones.
Reason 2: Branch Prediction
// Version 1: With branches
int sum_positive(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++)
if (arr[i] > 0)
sum += arr[i];
return sum;
}
// Version 2: Branchless
int sum_positive_branchless(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++) {
int mask = arr[i] >> 31; // 0 if positive, -1 if negative
sum += arr[i] & ~mask;
}
return sum;
}
If positive/negative distribution is random, version 2 can be 2-3× faster (avoiding branch misprediction).
Reason 3: SIMD
// Scalar
for (int i = 0; i < n; i++)
c[i] = a[i] + b[i];
// SIMD (AVX2)
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]);
__m256 vb = _mm256_load_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_store_ps(&c[i], vc);
}
SIMD version can be 4-8× faster, but Big-O is O(n) for both.
Practical Advice
1. Don't Prematurely Optimize Complexity
Bad: "This is O(n²), must change to O(n log n)!"
Good: "This is O(n²), what's the max n? 100? Doesn't matter."
2. Profile First, Then Optimize
Bad: Spend a week switching to theoretically faster algorithm
Good: Use perf to find bottleneck is cache misses, adjust data layout
3. Know Your n
| n Range | Optimization Strategy |
|---|---|
| n < 10 | Write whatever, prioritize readability |
| n < 100 | Simple algorithms, watch constant factors |
| n < 10000 | Algorithm starts to matter |
| n > 100000 | Algorithm very important |
| n > 10^9 | Algorithm + parallelization + distributed |
4. Consider Hybrid Methods
def smart_sort(arr):
if len(arr) < 20:
return insertion_sort(arr)
else:
return quicksort(arr)
5. Benchmark Real Workloads
Don't just test worst case or best case. Test:
- Typical input sizes
- Typical input distributions
- On real hardware
Summary
Big-O Limitations
- Ignores constant factors
- Ignores lower-order terms
- Assumes n → ∞
- Doesn't consider hardware characteristics
Galactic Algorithms
- Theoretically optimal
- Practically unusable
- Constant factors too large for the universe
Constant Factor Sources
- Cache miss vs hit (10-100×)
- Branch misprediction (15-20 cycles)
- Division vs multiplication (10-20×)
- System calls (1000+ cycles)
Practical Advice
- Know your n
- Profile first, then optimize
- Consider hybrid methods
- Test real workloads
- Don't blindly trust Big-O
Remember
O(n) with cache misses < O(n²) with cache hits
...when n is small enough
O(n log n) Quicksort < O(n) Radix Sort
...when n is small enough
Theory is the map, measurement is the territory.
When map and territory disagree, trust the territory.