Chapter 30: Case Study: Web Server Optimization

Part VIII: Case Studies


"Premature optimization is the root of all evil. But premature pessimization is the root of all slowness." — Adapted from Donald Knuth

The Story of "Fast SSD" That Was Still Slow

Our API server handled static file requests. We had the latest NVMe SSD—rated at 7 GB/s read speed, 1 million IOPS.

But we measured an average response time of 50 ms and a peak throughput of only 2,000 req/s.

"The SSD is so fast, how can it be this slow?"

After a week of debugging, we found the problem wasn't the SSD, but:

  1. Sync I/O: Each request blocked waiting for I/O completion
  2. Small files: Lots of 4KB requests, IOPS was the bottleneck
  3. Syscall overhead: Every read() is a syscall
  4. Context switches: Thread-per-request model

This chapter walks through analyzing a web server's performance using all the tools we've learned.

Scenario Setup

System Specs

Server:
- CPU: AMD EPYC 7543 (32 cores, 64 threads)
- RAM: 256 GB DDR4-3200
- Storage: Samsung PM9A3 NVMe SSD (7.68 TB)
  - Sequential Read: 6.9 GB/s
  - Random Read IOPS: 1,000,000 (4KB)
- Network: Mellanox ConnectX-6 (100 Gbps)
- OS: Ubuntu 22.04, Kernel 5.15

Application:
- Nginx + upstream API server
- Main workload: static files + JSON API
- Target: 50,000 req/s, P99 < 10ms

Initial State

# Benchmark with wrk
wrk -t12 -c400 -d30s http://server/api/users

Running 30s test @ http://server/api/users
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    45.23ms   67.89ms 523.12ms   87.65%
    Req/Sec   912.34    234.56    1.89k    72.34%
  328234 requests in 30.01s, 1.23GB read
Requests/sec:  10937.23
Transfer/sec:     42.01MB

Problems:

  • Throughput: 10,937 req/s (target 50,000)
  • P99 latency: estimated > 200ms (target < 10ms)
  • Gap: about 4.6× on throughput (50,000 / 10,937)

Step 1: Find the Bottleneck

CPU or I/O?

# Check CPU usage
mpstat -P ALL 1

# Result
CPU    %usr   %sys   %iowait   %idle
all    12.3   18.7     3.2     65.8

# CPU only ~31% used, lots of idle
# This is not CPU-bound

# Check I/O
iostat -x 1

Device         r/s     rkB/s   await  %util
nvme0n1     8234.00  32936.00    0.12   8.2%

# SSD only 8.2% utilized, not I/O-bound either

Conclusion: neither the CPU nor the SSD is saturated, and at 42 MB/s (about 0.34 Gbps) the 100 Gbps network link is nearly idle as well. The problem is in the "software layer."

Use perf to Find Where CPU Time Goes

perf record -g -p $(pgrep -f "api_server") -- sleep 30
perf report

# Result
  35.2%  api_server  libc.so.6       [.] __GI___poll
  18.7%  api_server  [kernel]        [k] system_call_fastpath
  12.3%  api_server  libc.so.6       [.] malloc
   8.9%  api_server  api_server      [.] json_serialize
   6.5%  api_server  libc.so.6       [.] __GI___read
   ...

Findings:

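  • 35% of samples in poll(): perf samples on-CPU time, so this is the cost of calling poll() very often rather than time spent blocked
  • 18.7% in the kernel's syscall entry path: too many system calls
  • 12.3% in malloc: frequent memory allocation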

Use strace to See Syscall Pattern

strace -c -p $(pgrep -f "api_server") -f

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 32.15    1.234567         2.1    587654           poll
 24.89    0.956789         1.8    531234           read
 18.76    0.721234         2.3    313567           write
 12.34    0.474123         3.2    148234           open
  8.21    0.315678         2.9    108876           close
  3.65    0.140234         1.5     93489           fstat

Each request approximately:

  • 1 poll
  • 1 open
  • 1+ read
  • 1+ write
  • 1 close
  • 1 fstat

At least 6 syscalls per request. At roughly 11,000 req/s, that is more than 65,000 syscalls per second.
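The six-call figure is a lower bound, and it can be cross-checked against the strace totals. The sketch below assumes exactly one open per request, so the open count stands in for the number of requests traced; the struct and function names are illustrative, not part of the server.

```c
#include <assert.h>
#include <stddef.h>

/* One row of `strace -c` output: syscall name and call count. */
struct syscall_count {
    const char *name;
    long calls;
};

/* Counts copied from the strace summary above. */
static const struct syscall_count STRACE_COUNTS[] = {
    {"poll", 587654}, {"read", 531234}, {"write", 313567},
    {"open", 148234}, {"close", 108876}, {"fstat", 93489},
};
static const size_t STRACE_ROWS = sizeof STRACE_COUNTS / sizeof STRACE_COUNTS[0];

/* Total calls across all rows. */
long total_calls(const struct syscall_count *rows, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += rows[i].calls;
    return total;
}

/* Average syscalls per request, given how many requests were traced. */
double calls_per_request(const struct syscall_count *rows, size_t n, long requests) {
    assert(requests > 0);
    return (double)total_calls(rows, n) / (double)requests;
}
```

Calling calls_per_request(STRACE_COUNTS, STRACE_ROWS, 148234) gives about 12 syscalls per request under that assumption, roughly double the minimum, with poll alone accounting for about four of them.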

Step 2: Optimize One by One

Optimization 1: Reduce Syscalls (sendfile)

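Original flow: for each static file the server calls read() to copy the data from the page cache into a user-space buffer, then write() to copy it back into the kernel's socket buffer. That is two syscalls and two copies per chunk. sendfile() hands the transfer to the kernel in a single call, so the data never enters user space.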
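A minimal sketch contrasting the read()/write() copy path with sendfile(), assuming Linux and a regular file as the source. The function names are hypothetical, and a real server would also handle EAGAIN on non-blocking sockets, which is omitted here.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a static file for reading; returns the descriptor or -1. */
int open_static_file(const char *path) {
    return open(path, O_RDONLY);
}

/* Copy path: every chunk costs two syscalls and passes through user space. */
ssize_t send_with_read_write(int out_fd, int in_fd, size_t count) {
    char buf[4096];
    size_t sent = 0;
    while (sent < count) {
        size_t want = count - sent < sizeof buf ? count - sent : sizeof buf;
        ssize_t got = read(in_fd, buf, want);      /* kernel -> user copy */
        if (got < 0) return -1;
        if (got == 0) break;                       /* end of file */
        ssize_t off = 0;
        while (off < got) {                        /* handle short writes */
            ssize_t put = write(out_fd, buf + off, (size_t)(got - off));
            if (put < 0) return -1;                /* user -> kernel copy failed */
            off += put;
        }
        sent += (size_t)got;
    }
    return (ssize_t)sent;
}

/* sendfile path: the kernel moves the data, no user-space buffer involved. */
ssize_t send_with_sendfile(int out_fd, int in_fd, size_t count) {
    assert(out_fd >= 0 && in_fd >= 0);
    off_t offset = 0;                              /* read from start of file */
    size_t sent = 0;
    while (sent < count) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, count - sent);
        if (n < 0) return -1;
        if (n == 0) break;                         /* end of file */
        sent += (size_t)n;
    }
    return (ssize_t)sent;
}
```

For a 4 KB file the copy path needs at least one read() and one write(), while the sendfile path needs a single call. Because sendfile() is given an explicit offset, it leaves the file position of in_fd untouched.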

Optimization 2: io_uring Instead of epoll

Traditional epoll pattern:

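// epfd, events[], MAX_EVENTS, buf, size, response, len and process()
// are assumed to be set up elsewhere.
while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  // syscall
    for (int i = 0; i < n; i++) {
        if (events[i].events & EPOLLIN) {
            int fd = events[i].data.fd;             // socket that became readable
            ssize_t got = read(fd, buf, size);      // syscall
            if (got <= 0) { close(fd); continue; }  // peer closed or error
            process(buf);
            write(fd, response, len);               // syscall
        }
    }
}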

Each I/O operation is a separate syscall.

Using io_uring:

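// Setup io_uring (liburing); fds[], bufs[], sizes[] and batch_size are assumed
struct io_uring ring;
if (io_uring_queue_init(256, &ring, 0) < 0) { /* handle error */ }

// Queue a batch of reads; nothing reaches the kernel yet
for (int i = 0; i < batch_size; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) break;  // submission queue is full
    io_uring_prep_read(sqe, fds[i], bufs[i], sizes[i], 0);
    io_uring_sqe_set_data(sqe, &fds[i]);  // tag to match the completion to its fd
}

// One syscall submits all queued I/O
int submitted = io_uring_submit(&ring);  // 1 syscall for N operations

// Reap completions; cqe->res is the byte count or -errno
for (int i = 0; i < submitted; i++) {
    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
    /* use cqe->res and io_uring_cqe_get_data(cqe) here */
    io_uring_cqe_seen(&ring, cqe);
}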

io_uring advantages:

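  • Batch submission: many operations per syscall
  • Submission and completion queues live in memory shared with the kernel, so request descriptors are not copied across the boundary
  • Supports zero-copy send (IORING_OP_SEND_ZC), but only on Linux 6.0 and later, so not on this server's 5.15 kernel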

Result

Before: 18,234 req/s
After:  32,456 req/s (+78%)

Optimization 3: Memory Pool (Reduce malloc)

Original: malloc/free for each request:

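// build_response() and response_size are assumed to be defined elsewhere.
void handle_request(int fd) {
    char *buffer = malloc(4096);             // malloc on every request
    char *response = malloc(response_size);  // malloc again
    if (buffer && response && read(fd, buffer, 4096) > 0) {
        build_response(buffer, response);
        write(fd, response, response_size);
    }
    free(response);                          // free(NULL) is a no-op
    free(buffer);
}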

Using memory pool:

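// Thread-local buffer pool: one pair of buffers per worker thread
static __thread struct {
    char request_buf[4096];
    char response_buf[65536];
} buffers;

// build_response() and response_size are assumed to be defined elsewhere;
// response_size must not exceed sizeof buffers.response_buf.
void handle_request(int fd) {
    if (read(fd, buffers.request_buf, 4096) <= 0)  // Reuse buffer
        return;
    build_response(buffers.request_buf, buffers.response_buf);
    write(fd, buffers.response_buf, response_size);
    // No free needed
}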

Result

Before: 32,456 req/s
After:  41,234 req/s (+27%)

Optimization 4: Connection Pooling and Keep-Alive

Cost of each new connection:

TCP three-way handshake:  ~1 RTT
TLS handshake:            ~2 RTT (TLS 1.2) or 1 RTT (TLS 1.3)
Connection setup:         ~100μs

For short requests, this overhead can exceed the processing time itself.
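How much keep-alive helps depends on how many requests share one connection. The sketch below counts only handshake round trips (it ignores the ~100μs of local setup), and the RTT passed in is whatever you measure between client and server; the names are illustrative.

```c
#include <assert.h>

/* Enumerator value = handshake round trips added on top of TCP. */
enum tls_mode { TLS_NONE = 0, TLS_1_3 = 1, TLS_1_2 = 2 };

/* Milliseconds spent on handshakes before the first request can be sent. */
double setup_ms(double rtt_ms, enum tls_mode tls) {
    assert(rtt_ms >= 0.0);
    return rtt_ms * (1.0 + (double)tls);   /* 1 RTT for TCP plus TLS RTTs */
}

/* Handshake cost amortized over the requests that reuse the connection. */
double per_request_overhead_ms(double rtt_ms, enum tls_mode tls, int requests_per_conn) {
    assert(requests_per_conn > 0);
    return setup_ms(rtt_ms, tls) / requests_per_conn;
}
```

With an assumed 0.5 ms RTT and TLS 1.2, every new connection costs 1.5 ms before any work is done. Reusing the connection for 1,000 requests brings that down to about 1.5 μs per request.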

Enable HTTP Keep-Alive:

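# nginx.conf
http {
    keepalive_timeout 65;
    keepalive_requests 1000;  # Max 1000 requests per client connection

    upstream backend {
        server 127.0.0.1:8080;
        keepalive 128;  # Idle upstream connections cached per worker
    }

    server {
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;          # Upstream keepalive needs HTTP/1.1
            proxy_set_header Connection "";  # Do not forward "Connection: close"
        }
    }
}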

Result

Before: 41,234 req/s
After:  48,567 req/s (+18%)

Step 3: System-Level Tuning

TCP Tuning

# /etc/sysctl.conf

# Increase socket buffer
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Increase connection backlog (these are caps; nginx must also ask for it via "listen ... backlog=65535")
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535

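# Reuse TIME_WAIT sockets for new outgoing connections (e.g. nginx to upstream)
net.ipv4.tcp_tw_reuse = 1
# Shorten the FIN_WAIT_2 timeout (TIME_WAIT itself stays fixed at 60 s)
net.ipv4.tcp_fin_timeout = 15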

# Enable TCP Fast Open
net.ipv4.tcp_fastopen = 3
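After changing the buffer limits it is worth checking what a socket actually receives. The sketch below, with a hypothetical function name, requests a receive buffer and reads back the granted size. On Linux the kernel doubles the requested value for bookkeeping and caps the request at net.core.rmem_max. Note that setting SO_RCVBUF explicitly turns off receive-buffer autotuning for that socket, so this is a diagnostic rather than something to apply to every connection.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ask for a receive buffer of `requested` bytes on a fresh TCP socket and
 * return the size the kernel reports, or -1 on error. */
int effective_rcvbuf(int requested) {
    assert(requested > 0);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    int granted = -1;
    socklen_t len = sizeof granted;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof requested) != 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0)
        granted = -1;
    close(fd);
    return granted;
}
```

If effective_rcvbuf(16777216) reports far less than requested, the rmem_max setting has probably not taken effect (for example, sysctl -p was not run).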

File Descriptor Limits

# /etc/security/limits.conf
* soft nofile 1000000
* hard nofile 1000000

# Or in systemd service
[Service]
LimitNOFILE=1000000
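The limits above set the ceiling; a process can still start with a lower soft limit. A minimal sketch of raising the soft limit to the hard limit at startup follows. The function name is illustrative, and it assumes the hard limit was already raised by one of the settings above.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft RLIMIT_NOFILE to the hard limit.
 * Returns the new soft limit, or 0 on error. */
rlim_t raise_nofile_limit(void) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return 0;
    assert(lim.rlim_cur <= lim.rlim_max);
    lim.rlim_cur = lim.rlim_max;   /* no privilege needed up to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
        return 0;
    return lim.rlim_cur;
}
```

Logging the returned value at startup makes it obvious when a deployment silently falls back to a low default such as 1024.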

Final Results

Optimization Stage              Throughput      Improvement
────────────────────────────────────────────────────────────
Initial state                   10,937 req/s    baseline
+ sendfile                      18,234 req/s    +67%
+ io_uring                      32,456 req/s    +78%
+ memory pool                   41,234 req/s    +27%
+ keep-alive                    48,567 req/s    +18%
+ system tuning                 56,789 req/s    +17%
────────────────────────────────────────────────────────────
Total                           56,789 req/s    +419%
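The per-stage percentages do not add up to the total because each gain applies to the throughput left by the previous stage, so they multiply. A short sketch of that arithmetic, using the table's numbers and illustrative function names:

```c
#include <assert.h>
#include <stddef.h>

/* Throughput after each stage in req/s, from the table above. */
static const double STAGE_RPS[] = {10937, 18234, 32456, 41234, 48567, 56789};
static const size_t STAGE_COUNT = sizeof STAGE_RPS / sizeof STAGE_RPS[0];

/* Percentage gain going from `before` to `after`. */
double gain_percent(double before, double after) {
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

/* Compounded gain: the product of per-stage ratios, which equals last/first. */
double compound_gain_percent(const double *rps, size_t n) {
    assert(n >= 2);
    double ratio = 1.0;
    for (size_t i = 1; i < n; i++)
        ratio *= rps[i] / rps[i - 1];
    return (ratio - 1.0) * 100.0;
}
```

Summing the stage percentages gives 207, while compound_gain_percent(STAGE_RPS, STAGE_COUNT) gives about 419, matching the Total row.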

P99 latency: from 200+ms down to 8ms.

Exceeded target (50,000 req/s, P99 < 10ms).

Key Lessons

1. Bottleneck May Not Be Where You Think

Initial assumption: SSD too slow
Actual problem: syscall overhead, sync I/O, frequent malloc

Tools matter: perf, strace, bpftrace

2. Syscalls Are Expensive

One syscall costs roughly 100-1000 cycles
High-throughput systems must reduce syscalls:
- Batch processing (io_uring)
- Zero-copy (sendfile)
- Avoid unnecessary calls (keep fd open)
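The per-call cost is easy to measure on your own machine. The sketch below times a trivial syscall (getpid, issued through syscall() so libc cannot cache it) and reports the average in nanoseconds. The function name is illustrative, and the result varies with CPU, kernel version, and security mitigations.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock nanoseconds per getpid syscall over `iters` calls. */
double ns_per_syscall(long iters) {
    assert(iters > 0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);          /* raw syscall, enters the kernel each time */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Multiplying the result by the clock frequency in GHz converts it to cycles. This measures only the round trip into the kernel and back; calls such as read() or open() add their own work on top.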

3. Memory Allocation Is Hidden Cost

A single malloc/free call isn't slow
But at high request rates it causes:
- Lock contention (multi-threaded)
- Cache pollution
- Memory fragmentation

Solution: memory pool, arena allocator
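An arena allocator generalizes the fixed thread-local buffers shown earlier: allocations bump an offset, and everything is released at once when the request ends. The following is a minimal sketch with illustrative names, not the server's actual allocator. It assumes the caller supplies a 16-byte-aligned backing buffer and gives each thread its own arena.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed-capacity bump arena: allocate by advancing an offset. */
struct arena {
    unsigned char *base;
    size_t cap;
    size_t used;
};

void arena_init(struct arena *a, void *backing, size_t cap) {
    assert(a && backing);
    a->base = backing;
    a->cap = cap;
    a->used = 0;
}

/* Returns memory at a 16-byte-aligned offset from base, or NULL when full. */
void *arena_alloc(struct arena *a, size_t size) {
    size_t start = (a->used + 15u) & ~(size_t)15u;   /* round up to 16 */
    if (start > a->cap || size > a->cap - start)
        return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Call once per request: releases every allocation in O(1). */
void arena_reset(struct arena *a) {
    a->used = 0;
}
```

A request handler would call arena_alloc() wherever it used malloc() and arena_reset() once at the end, with no per-object free and no allocator lock.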

4. SSD Is Not Magic

SSD is fast, but:
- Each I/O has fixed overhead
- Queue depth matters
- Small I/O is inefficient
- Needs alignment

To fully utilize SSD performance:
- Async I/O
- High queue depth
- I/O coalescing
- Direct I/O (some scenarios)
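Why queue depth matters follows from Little's law: requests in flight equal throughput times latency. Applied to the iostat sample from Step 1 (8,234 reads/s at 0.12 ms await), a sketch of the arithmetic looks like this; the function names are illustrative.

```c
#include <assert.h>

/* Little's law: average requests in flight = arrival rate x time in system. */
double queue_depth(double iops, double latency_ms) {
    assert(iops >= 0.0 && latency_ms >= 0.0);
    return iops * (latency_ms / 1000.0);
}

/* IOPS ceiling when `depth` requests are kept in flight. */
double iops_at_depth(double depth, double latency_ms) {
    assert(latency_ms > 0.0);
    return depth / (latency_ms / 1000.0);
}
```

queue_depth(8234, 0.12) comes out just under 1: the server kept only about one read in flight, which is what blocking I/O produces. Reaching the rated 1,000,000 IOPS would take roughly 120 requests in flight, assuming latency stayed near 0.12 ms, which it will not do exactly under load.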

Summary

Diagnostic Flow

  1. Identify bottleneck type (CPU/IO/Network)
  2. Use perf to find where CPU time goes
  3. Use strace to analyze syscall patterns
  4. Use bpftrace to see latency distribution

Optimization Techniques

Problem                 Solution
────────────────────────────────────────────────────────────
Too many syscalls       sendfile, io_uring, batching
Sync I/O                io_uring, async I/O
Frequent malloc         Memory pool, arena allocator
Connection overhead     Keep-alive, connection pool
Low SSD efficiency      High queue depth, I/O coalescing

System Tuning

  • TCP buffer size
  • File descriptor limits
  • CPU affinity
  • NUMA awareness
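CPU affinity is listed here but not shown elsewhere in the chapter. As a hedged sketch, pinning a worker thread on Linux can look like the following, where the function name and the "nth allowed CPU" policy are assumptions for illustration. It picks from the CPUs the process is already allowed to use, so it also works inside containers with a restricted CPU set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to the nth CPU (0-based) it is currently allowed
 * to run on. Returns that CPU's index, or -1 on failure. */
int pin_to_allowed_cpu(int nth) {
    assert(nth >= 0);
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;
    int seen = 0;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        if (seen++ == nth) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof one, &one) == 0 ? cpu : -1;
        }
    }
    return -1;   /* fewer than nth + 1 CPUs available */
}
```

Worker i would call pin_to_allowed_cpu(i) at startup. On a NUMA machine, the next step is to choose CPUs on the same node as the NIC and the memory the worker uses, which this sketch does not attempt.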

Remember

"Fast" SSD + slow software = slow system
"Slow" HDD + good software = acceptable system

Software architecture determines whether hardware potential is realized