Chapter 31: Case Study: Database Query Optimization

Part VIII: Case Studies


"The fastest query is the one you don't have to make." — Unknown DBA

The Story of "Cross-Datacenter Query" That Could Only Run 10 Times Per Second

Our service needed to read data from MySQL in another datacenter. Network latency was about 20ms (speed of light limitation).

Simple code:

def get_user_orders(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        items = db.query("SELECT * FROM items WHERE order_id = ?", order.id)
        order.items = items
    return user, orders

A user has 10 orders, each order has 5 items.

Total: 1 + 1 + 10 = 12 queries.

Each query 20ms RTT → total latency 240ms.

Can only handle 4 requests per second (single thread).

This is the classic N+1 query problem, amplified by network latency.

Network Latency: The Underestimated Killer

Speed of Light Limits

Location              Distance    Fiber Latency (one-way)  RTT
───────────────────────────────────────────────────────────────
Same datacenter       < 1 km      ~0.005 ms               ~0.01 ms
Same city             ~50 km      ~0.25 ms                ~0.5 ms
Cross-city            ~350 km     ~1.75 ms                ~3.5 ms
Cross-country         ~2000 km    ~10 ms                  ~20 ms
Cross-continent       ~10000 km   ~50 ms                  ~100 ms

This is a physical limit, cannot be optimized. The only solution is to reduce round trip count.

Little's Law Again

Throughput = Concurrency / Latency

If RTT = 20ms, single connection:
Throughput = 1 / 0.02 = 50 queries/sec

To reach 1000 queries/sec:
Concurrency = 1000 × 0.02 = 20 parallel connections

Problem Analysis

Original Code Problems

# Classic N+1 problem
users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    # One extra query per user
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
    user.orders = orders

# Total 101 queries!

Using EXPLAIN to Analyze

EXPLAIN SELECT * FROM orders WHERE user_id = 12345;

+----+-------------+--------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+--------+------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | orders | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+---------+-------------+

type = ALL → Full table scan! No index used.

Optimization Strategies

Optimization 1: Add Index

Most basic but most effective:

-- Check existing indexes
SHOW INDEX FROM orders;

-- Add missing index
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Verify
EXPLAIN SELECT * FROM orders WHERE user_id = 12345;

type = ref → Using index, scanning 10 rows (not 1M rows)

Optimization 2: Solve N+1 Problem

Method A: JOIN

-- Original: N+1 queries
SELECT * FROM users WHERE id = ?;
SELECT * FROM orders WHERE user_id = ?;

-- Optimized: 1 JOIN
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.id = ?;

Method B: Batch Query

# Original: N+1
users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)

# Optimized: 2 queries
users = db.query("SELECT * FROM users LIMIT 100")
user_ids = [u.id for u in users]
orders = db.query("SELECT * FROM orders WHERE user_id IN (?)", user_ids)

# Combine in application layer
orders_by_user = group_by(orders, 'user_id')
for user in users:
    user.orders = orders_by_user.get(user.id, [])

Method C: ORM Eager Loading

Optimization 4: Connection Pool

# Original: create connection for each query
def query(sql):
    conn = mysql.connect(host='db.server.com', ...)  # TCP + TLS handshake
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.fetchall()
    conn.close()
    return result

# Optimized: connection pool
from sqlalchemy import create_engine
engine = create_engine(
    'mysql://user:pass@db.server.com/mydb',
    pool_size=20,           # Keep 20 connections
    max_overflow=10,        # Up to 10 extra
    pool_recycle=3600,      # Recycle after 1 hour
    pool_pre_ping=True      # Test connection before use
)

Connection establishment cost:

TCP three-way handshake:  1 RTT (~20ms)
TLS handshake:            2 RTT (~40ms) for TLS 1.2
MySQL authentication:     1 RTT (~20ms)
──────────────────────────────────────────
Total:                    ~80ms per new connection

With connection pool, this cost is paid only once.

Caching Strategy

Multi-Layer Cache Architecture

┌─────────────┐
│ Application │
│   Cache     │ ← L1: In-process cache (fastest, small capacity)
└──────┬──────┘
       │
┌──────▼──────┐
│   Redis /   │ ← L2: Distributed cache (fast, medium capacity)
│  Memcached  │
└──────┬──────┘
       │
┌──────▼──────┐
│   Database  │ ← L3: Database (slow, large capacity)
│ Buffer Pool │
└─────────────┘

Cache-Aside Pattern

def get_user(user_id):
    # 1. Check cache first
    cache_key = f"user:{user_id}"
    cached = redis.get(cache_key)
    if cached:
        return deserialize(cached)

    # 2. Cache miss, query database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)

    # 3. Write to cache
    redis.setex(cache_key, 3600, serialize(user))  # 1 hour expiry

    return user

Cache Invalidation

# Invalidate on update
def update_user(user_id, data):
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    redis.delete(f"user:{user_id}")  # Delete cache

# Or: update cache on update
def update_user_with_cache(user_id, data):
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", 3600, serialize(user))

Practical Example: Optimizing Cross-Datacenter Query

Back to the original problem:

# Original: 12 queries, 240ms
def get_user_orders(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        items = db.query("SELECT * FROM items WHERE order_id = ?", order.id)
        order.items = items
    return user, orders

Optimized Version

def get_user_orders_optimized(user_id):
    # 1. Check cache first
    cache_key = f"user_orders:{user_id}"
    cached = redis.get(cache_key)
    if cached:
        return deserialize(cached)

    # 2. Single query to get all data
    result = db.execute("""
        SELECT u.*, o.*, i.*
        FROM users u
        LEFT JOIN orders o ON u.id = o.user_id
        LEFT JOIN items i ON o.id = i.order_id
        WHERE u.id = ?
    """, user_id)

    # 3. Assemble in application layer
    user, orders = assemble_result(result)

    # 4. Write to cache
    redis.setex(cache_key, 1800, serialize((user, orders)))

    return user, orders

Results

Original:
- 12 queries × 20ms = 240ms
- Throughput: 4 req/s (single thread)

Optimized:
- Cache hit: < 1ms (Redis in same datacenter)
- Cache miss: 1 query × 20ms = 20ms
- Throughput: 50+ req/s (single thread)
- With 90% cache hit rate: average ~3ms

Summary

Network Latency Is a Hard Limit

20ms RTT = max 50 queries/sec (single connection)
Solution: reduce round trips, increase parallelism

N+1 Problem

Symptom: N+1 queries
Solution: JOIN, batch query, ORM eager loading

Optimization Layers

LayerOptimization Method
QueryIndex, JOIN, batching
ConnectionConnection pool, multiplexing
ProtocolPipeline, compression
CacheL1/L2 cache, Cache-Aside
StorageBuffer pool, partitioning, SSD tuning
NetworkTCP tuning, BBR

Caching Strategies

Cache-Aside: Fill on read
Write-Through: Update on write
Write-Behind: Async write

Watch out for: penetration, breakdown, avalanche

Remember

Query count × RTT = minimum latency

1 good query > 10 simple queries
On high-latency networks, this difference is even more pronounced