Chapter 30: Case Study: Web Server Optimization
Part VIII: Case Studies
"Premature optimization is the root of all evil. But premature pessimization is the root of all slowness." — Adapted from Donald Knuth
The Story of "Fast SSD" That Was Still Slow
Our API server handled static file requests. We had the latest NVMe SSD—rated at 7 GB/s read speed, 1 million IOPS.
But we measured an average response time of 50 ms and a peak throughput of only 2,000 req/s.
"The SSD is so fast, how can it be this slow?"
After a week of debugging, we found that the problem wasn't the SSD. It was how the software used it:
- Sync I/O: each request blocked until its I/O completed
- Small files: lots of 4 KB requests, so the limiting factor was I/O operations per second rather than bandwidth
- Syscall overhead: every read() is a separate syscall
- Context switches: a thread-per-request model
This chapter walks through analyzing a web server's performance using all the tools we've learned.
Scenario Setup
System Specs
Server:
- CPU: AMD EPYC 7543 (32 cores, 64 threads)
- RAM: 256 GB DDR4-3200
- Storage: Samsung PM9A3 NVMe SSD (7.68 TB)
  - Sequential Read: 6.9 GB/s
  - Random Read IOPS: 1,000,000 (4KB)
- Network: Mellanox ConnectX-6 (100 Gbps)
- OS: Ubuntu 22.04, Kernel 5.15
Application:
- Nginx + upstream API server
- Main workload: static files + JSON API
- Target: 50,000 req/s, P99 < 10ms
Initial State
# Benchmark with wrk
wrk -t12 -c400 -d30s http://server/api/users
Running 30s test @ http://server/api/users
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    45.23ms   67.89ms 523.12ms   87.65%
    Req/Sec   912.34    234.56     1.89k    72.34%
  328234 requests in 30.01s, 1.23GB read
Requests/sec:  10937.23
Transfer/sec:     42.01MB
Problems:
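- Throughput: 10,937 req/s (target: 50,000)
- P99 latency: estimated at over 200 ms (target: under 10 ms); wrk prints percentiles only when run with --latency
- Gap: about 4.6× short of the throughput target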
Step 1: Find the Bottleneck
CPU or I/O?
# Check CPU usage
mpstat -P ALL 1
# Result
CPU %usr %sys %iowait %idle
all 12.3 18.7 3.2 65.8
# CPU only ~31% used, lots of idle
# This is not CPU-bound
# Check I/O
iostat -x 1
Device r/s rkB/s await %util
nvme0n1 8234.00 32936.00 0.12 8.2%
# SSD only 8.2% utilized, not I/O-bound either
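Conclusion: neither the CPU nor the SSD is saturated, and the network is nowhere near its limit either: 42 MB/s is well under 1% of a 100 Gbps link (12.5 GB/s). The problem is in the "software layer."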
Use perf to Find Where CPU Time Goes
perf record -g -p $(pgrep -f "api_server") -- sleep 30
perf report
# Result
35.2% api_server libc.so.6 [.] __GI___poll
18.7% api_server [kernel] [k] system_call_fastpath
12.3% api_server libc.so.6 [.] malloc
8.9% api_server api_server [.] json_serialize
6.5% api_server libc.so.6 [.] __GI___read
...
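Findings:
- 35.2% of sampled CPU time is in poll(). perf samples on-CPU time only, so this is the cost of calling poll() very often, not time spent blocked waiting for events
- 18.7% is in the kernel's syscall entry path: too many system calls
- 12.3% is in malloc: frequent memory allocation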
Use strace to See Syscall Pattern
strace -c -p $(pgrep -f "api_server") -f
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 32.15    1.234567         2.1    587654           poll
 24.89    0.956789         1.8    531234           read
 18.76    0.721234         2.3    313567           write
 12.34    0.474123         3.2    148234           open
  8.21    0.315678         2.9    108876           close
  3.65    0.140234         1.5     93489           fstat
Each request approximately:
- 1 poll
- 1 open
- 1+ read
- 1+ write
- 1 close
- 1 fstat
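That is at least 6 syscalls per request. At the measured 10,937 req/s this is roughly 66,000 syscalls/s, and at the 50,000 req/s target it would be 300,000 syscalls/s.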
Step 2: Optimize One by One
Optimization 1: Reduce Syscalls (sendfile)
Original flow:
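Judging from the strace output in Step 1, the flow for a static file was: open the file, read() it into a user-space buffer, write() that buffer to the socket, then close. Every byte is copied twice (kernel to user, user to kernel) and every chunk costs two syscalls. sendfile() lets the kernel move the data from the page cache to the socket directly. The following is a minimal sketch of that idea, not the server's actual code: the function name send_file_zero_copy is hypothetical, and it assumes a blocking out_fd (a non-blocking socket would also need EAGAIN handling).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the whole file at `path` to `out_fd` (normally a connected socket)
 * without copying the data through a user-space buffer.
 * Returns the number of bytes sent, or -1 on error. */
ssize_t send_file_zero_copy(int out_fd, const char *path)
{
    assert(path != NULL);

    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel transfers the data and advances `offset`;
         * it may send less than requested, so loop until done. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            close(in_fd);
            return -1;
        }
        if (n == 0)                 /* file shrank while we were sending */
            break;
    }

    close(in_fd);
    return (ssize_t)offset;
}
```

A handler would call send_file_zero_copy(client_fd, path) after writing the HTTP headers. The results table at the end of the chapter credits this step with the move from 10,937 to 18,234 req/s (+67%).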
Optimization 2: io_uring Instead of epoll
Traditional epoll pattern:
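while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  // syscall
    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;                    // socket that became ready
        if (events[i].events & EPOLLIN) {
            ssize_t got = read(fd, buf, size);         // syscall
            if (got <= 0) { close(fd); continue; }     // peer closed or error
            process(buf);
            write(fd, response, len);                  // syscall
        }
    }
}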
Each I/O operation is a separate syscall.
Using io_uring:
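#include <liburing.h>

// Setup io_uring with a 256-entry submission queue
struct io_uring ring;
int ret = io_uring_queue_init(256, &ring, 0);   // returns -errno on failure
// Queue a batch of reads (assumes batch_size <= 256); nothing enters the kernel yet
struct io_uring_sqe *sqe;
for (int i = 0; i < batch_size; i++) {
    sqe = io_uring_get_sqe(&ring);              // NULL if the queue is full
    io_uring_prep_read(sqe, fds[i], bufs[i], sizes[i], 0);
    io_uring_sqe_set_data(sqe, bufs[i]);        // tag to identify the completion
}
// One syscall submits all queued I/O
io_uring_submit(&ring); // 1 syscall for N operations!
// Results arrive on the completion queue
struct io_uring_cqe *cqe;
for (int i = 0; i < batch_size; i++) {
    io_uring_wait_cqe(&ring, &cqe);             // blocks only if none is ready
    // cqe->res is the byte count, or -errno on failure
    io_uring_cqe_seen(&ring, cqe);
}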
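io_uring advantages:
- Batch submission: many operations per io_uring_submit() call, so fewer syscalls
- The submission and completion rings live in memory shared with the kernel, so requests and results are not copied across the user/kernel boundary
- Supports zero-copy send (IORING_OP_SEND_ZC), which needs kernel 6.0 or later and is therefore not available on the 5.15 kernel used here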
Result
Before: 18,234 req/s
After: 32,456 req/s (+78%)
Optimization 3: Memory Pool (Reduce malloc)
Original: malloc/free for each request:
void handle_request(int fd) {
    char *buffer = malloc(4096);              // malloc every time
    read(fd, buffer, 4096);
    char *response = malloc(response_size);   // malloc again
    build_response(buffer, response);
    write(fd, response, response_size);
    free(response);
    free(buffer);
}
Using memory pool:
// Thread-local buffers, allocated once per thread instead of once per request.
// Assumes each thread handles one request at a time.
static __thread struct {
    char request_buf[4096];
    char response_buf[65536];
} buffers;

void handle_request(int fd) {
    read(fd, buffers.request_buf, sizeof buffers.request_buf);  // Reuse buffer
    build_response(buffers.request_buf, buffers.response_buf);  // response must fit in 64 KB
    write(fd, buffers.response_buf, response_size);
    // No free needed
}
Result
Before: 32,456 req/s
After: 41,234 req/s (+27%)
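The thread-local buffers above cover one request per thread. Once several requests are in flight on the same thread, as with the io_uring batching in Optimization 2, each one needs its own buffer. One common approach is a fixed pool with a free list. The sketch below is an illustration under stated assumptions, not the server's code: it assumes one pool per thread (it is not thread-safe), and every name and size (buffer_pool, POOL_BUFFERS, POOL_BUF_SIZE) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BUFFERS  64     /* hypothetical pool size */
#define POOL_BUF_SIZE 4096   /* matches the 4096-byte request buffer above */

/* A free buffer stores the link to the next free buffer in its own bytes,
 * so the free list needs no extra memory. */
union pool_slot {
    union pool_slot *next;
    char data[POOL_BUF_SIZE];
};

struct buffer_pool {
    union pool_slot slots[POOL_BUFFERS];
    union pool_slot *free_list;
};

/* Chain every slot into the free list. Call once at startup. */
void pool_init(struct buffer_pool *p)
{
    assert(p != NULL);
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_BUFFERS; i++) {
        p->slots[i].next = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* Pop one buffer in O(1). Returns NULL when the pool is exhausted. */
char *pool_get(struct buffer_pool *p)
{
    union pool_slot *s = p->free_list;
    if (s == NULL)
        return NULL;
    p->free_list = s->next;
    return s->data;
}

/* Push a buffer obtained from pool_get() back onto the free list. */
void pool_put(struct buffer_pool *p, char *buf)
{
    union pool_slot *s = (union pool_slot *)buf;
    s->next = p->free_list;
    p->free_list = s;
}
```

Both pool_get() and pool_put() are a couple of pointer writes with no locking and no call into the allocator. The most recently released buffer is handed out first, so it is likely still warm in the CPU cache.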
Optimization 4: Connection Pooling and Keep-Alive
Cost of each new connection:
TCP three-way handshake: ~1 RTT
TLS handshake: ~2 RTT (TLS 1.2) or 1 RTT (TLS 1.3)
Connection setup: ~100μs
For short requests, this overhead can be longer than processing itself
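To put numbers on the list above, the sketch below computes the setup cost of one connection and the share of it that each request carries when the connection is reused. The round-trip time is an input you would measure yourself, and both function names are hypothetical.

```c
#include <assert.h>

/* Setup cost of one connection in microseconds: handshake round trips
 * plus local connection setup (the ~100 us figure from the list above). */
double connection_cost_us(double rtt_us, int handshake_rtts, double setup_us)
{
    assert(rtt_us >= 0.0 && handshake_rtts >= 0 && setup_us >= 0.0);
    return rtt_us * handshake_rtts + setup_us;
}

/* Setup cost charged to each request when one connection serves
 * `requests_per_conn` requests before it is closed. */
double amortized_cost_us(double conn_cost_us, long requests_per_conn)
{
    assert(requests_per_conn > 0);
    return conn_cost_us / (double)requests_per_conn;
}
```

With an assumed 500 µs RTT, TCP plus TLS 1.3 takes 2 round trips, so connection_cost_us(500, 2, 100) is 1,100 µs. Paid on every request, that dwarfs a handler that finishes in tens of microseconds. Spread over 1,000 requests on one connection, amortized_cost_us(1100, 1000) is about 1.1 µs per request.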
Enable HTTP Keep-Alive:
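# nginx.conf
http {
    keepalive_timeout 65;
    keepalive_requests 1000;  # Max 1000 requests per client connection
    upstream backend {
        server 127.0.0.1:8080;
        keepalive 128;        # Idle upstream connections kept open per worker
    }
    # Assumed proxy block for the /api/ routes
    server {
        location /api/ {
            proxy_pass http://backend;
            proxy_http_version 1.1;          # upstream keepalive requires HTTP/1.1
            proxy_set_header Connection "";  # and no "Connection: close" header
        }
    }
}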
Result
Before: 41,234 req/s
After: 48,567 req/s (+18%)
Step 3: System-Level Tuning
TCP Tuning
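# /etc/sysctl.conf
# Raise the ceiling for buffer sizes an application requests with setsockopt()
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# TCP autotuning range: min, default, max (bytes)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Increase connection backlog (the application's listen() backlog must be raised too)
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Reuse TIME_WAIT sockets for new outgoing connections (e.g. nginx to upstream)
net.ipv4.tcp_tw_reuse = 1
# Shorten how long orphaned sockets stay in FIN_WAIT_2 (does not change TIME_WAIT)
net.ipv4.tcp_fin_timeout = 15
# Enable TCP Fast Open for outgoing (1) and incoming (2) connections
net.ipv4.tcp_fastopen = 3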
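The two net.core limits matter only to applications that size their socket buffers explicitly; tcp_rmem and tcp_wmem drive the kernel's autotuning. One way to see what a socket actually gets after the sysctls are applied is to request a size and read it back. This is a sketch assuming Linux, and granted_rcvbuf is a hypothetical helper.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask for a receive buffer of `requested` bytes on a fresh TCP socket and
 * return what the kernel granted, or -1 on error. Linux caps the request
 * at net.core.rmem_max and then doubles it to leave room for bookkeeping. */
int granted_rcvbuf(int requested)
{
    assert(requested > 0);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int granted = -1;
    socklen_t len = sizeof granted;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof requested) != 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0)
        granted = -1;

    close(fd);
    return granted;
}
```

If granted_rcvbuf(8 * 1024 * 1024) comes back far below 16 MB, rmem_max was not raised. Setting SO_RCVBUF explicitly also turns off receive-buffer autotuning for that socket, so most servers are better off leaving it unset and relying on tcp_rmem.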
File Descriptor Limits
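# /etc/security/limits.conf (applied by PAM to login sessions; the * wildcard does not cover root)
* soft nofile 1000000
* hard nofile 1000000
# Services started by systemd do not read limits.conf; set the limit in the unit file instead
[Service]
LimitNOFILE=1000000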
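Whichever mechanism is used, it is worth confirming from inside the process which limit is actually in effect, and raising the soft limit to the hard limit if the two differ. A minimal sketch with getrlimit() and setrlimit(); raise_nofile_limit is a hypothetical helper, not part of the server.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Raise this process's soft RLIMIT_NOFILE to its hard limit.
 * Stores the resulting soft limit in *soft_out.
 * Returns 0 on success, -1 on error. */
int raise_nofile_limit(rlim_t *soft_out)
{
    assert(soft_out != NULL);

    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    if (rl.rlim_cur < rl.rlim_max) {
        /* An unprivileged process may raise its soft limit up to,
         * but not beyond, the hard limit. */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
    }

    *soft_out = rl.rlim_cur;
    return 0;
}
```

Calling this once at startup and logging the value makes a misconfigured limit visible long before the server starts failing with EMFILE under load.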
Final Results
Optimization Stage      Throughput      vs. previous stage
────────────────────────────────────────────────────────────
Initial state           10,937 req/s    baseline
+ sendfile              18,234 req/s    +67%
+ io_uring              32,456 req/s    +78%
+ memory pool           41,234 req/s    +27%
+ keep-alive            48,567 req/s    +18%
+ system tuning         56,789 req/s    +17%
────────────────────────────────────────────────────────────
Total                   56,789 req/s    +419% vs. baseline
P99 latency dropped from an estimated 200+ ms to 8 ms.
Both targets (50,000 req/s, P99 < 10 ms) were met.
Key Lessons
1. Bottleneck May Not Be Where You Think
Initial assumption: SSD too slow
Actual problem: syscall overhead, sync I/O, frequent malloc
Tools matter: perf, strace, bpftrace
2. Syscalls Are Expensive
One syscall costs roughly 100-1,000 cycles for the user/kernel transition alone, before any real work is done
High throughput systems must reduce syscalls:
- Batch processing (io_uring)
- Zero-copy (sendfile)
- Avoid unnecessary calls (keep fd open)
3. Memory Allocation Is Hidden Cost
malloc/free itself isn't slow
But causes:
- Lock contention (multi-threaded)
- Cache pollution
- Memory fragmentation
Solution: memory pool, arena allocator
4. SSD Is Not Magic
SSD is fast, but:
- Each I/O has fixed overhead
- Queue depth matters
- Small I/O is inefficient
- Needs alignment
To fully utilize SSD performance:
- Async I/O
- High queue depth
- I/O coalescing
- Direct I/O (some scenarios)
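Why queue depth matters can be shown with Little's law (requests in flight = throughput × latency). The sketch below applies it to the iostat figures from Step 1. It assumes latency stays constant as depth grows, which real drives only approximate, and the function names are hypothetical.

```c
#include <assert.h>

/* Little's law: average requests in flight = completion rate x latency. */
double avg_queue_depth(double iops, double latency_ms)
{
    assert(iops >= 0.0 && latency_ms >= 0.0);
    return iops * (latency_ms / 1000.0);
}

/* The same law solved for throughput at a given queue depth. */
double iops_at_depth(double queue_depth, double latency_ms)
{
    assert(queue_depth >= 0.0 && latency_ms > 0.0);
    return queue_depth / (latency_ms / 1000.0);
}
```

With the measured 8,234 reads/s at 0.12 ms await, avg_queue_depth(8234, 0.12) is about 0.99: the server kept roughly one read in flight at a time, which is what synchronous I/O produces. At the same latency, the drive's rated 1,000,000 IOPS would need about 120 requests in flight.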
Summary
Diagnostic Flow
- Identify bottleneck type (CPU/IO/Network)
- Use perf to find where CPU time goes
- Use strace to analyze syscall patterns
- Use bpftrace to see latency distribution
Optimization Techniques
| Problem | Solution |
|---|---|
| Too many syscalls | sendfile, io_uring, batching |
| Sync I/O | io_uring, async I/O |
| Frequent malloc | Memory pool, arena allocator |
| Connection overhead | Keep-alive, connection pool |
| Low SSD efficiency | High queue depth, I/O coalescing |
System Tuning
- TCP buffer size
- File descriptor limits
- CPU affinity
- NUMA awareness
Remember
"Fast" SSD + slow software = slow system
"Slow" HDD + good software = acceptable system
Software architecture determines whether hardware potential is realized