Chapter 13: Lock-Free Data Structures

Part IV: Advanced Topics

“Locks are the goto statements of concurrent programming.” — Maurice Herlihy

The 60% Problem

The logging system was spending 60% of its time waiting for locks. Not doing useful work—just waiting.

Eight cores, all trying to write log messages to a shared circular buffer. The implementation was simple: protect the buffer with a mutex. Under heavy load, with all cores logging simultaneously, the profiler showed a devastating pattern: 60% of CPU cycles wasted in mutex operations.

Throughput: 850,000 messages per second. On an 8-core system, that should have been much higher.

“Can we do better without locks?” my manager asked during the performance review.

That question led to a complete redesign. The simple mutex-based approach:

typedef struct {
    char buffer[LOG_SIZE];
    int head;
    int tail;
    pthread_mutex_t lock;
} log_buffer_t;

void log_write(log_buffer_t *log, const char *msg) {
    pthread_mutex_lock(&log->lock);
    // Write message to buffer
    int next = (log->tail + 1) % LOG_SIZE;
    strcpy(&log->buffer[log->tail], msg);
    log->tail = next;
    pthread_mutex_unlock(&log->lock);
}

Simple, correct, and… slow.

Under heavy load (all 8 cores logging simultaneously), the system spent 60% of its time waiting for locks:

$ perf stat -e cycles,instructions,cache-misses ./logger_mutex
  Performance counter stats:
    8,500,000,000 cycles
    2,100,000,000 instructions
       45,000,000 cache-misses
       
Lock contention: 60% of cycles spent in mutex operations
Throughput: 850,000 messages/second

“Can we do better without locks?” my manager asked.

I implemented a lock-free ring buffer using atomic operations. The results:

$ perf stat -e cycles,instructions,cache-misses ./logger_lockfree
  Performance counter stats:
    3,200,000,000 cycles
    2,400,000,000 instructions
       12,000,000 cache-misses
       
Lock contention: 0%
Throughput: 2,400,000 messages/second

2.8× faster with zero lock contention. But the code was much more complex.

This chapter explores when lock-free data structures are worth the complexity.

The Textbook Story

Lock-free data structures promise:

No blocking: Threads never wait for locks
Better scalability: Performance improves with more cores
No deadlocks: Can’t deadlock without locks
Progress guarantees: At least one thread always makes progress

The textbook pitch sounds perfect for multi-core systems.

The Reality Check

Lock-free programming is hard. Here’s what the textbooks don’t emphasize:

1. Memory Ordering Is Subtle

On modern CPUs, memory operations can be reordered. Consider this “simple” lock-free flag:

// Thread 1
data = 42;
ready = 1;  // Signal that data is ready

// Thread 2
if (ready) {
    use(data);  // Might see old value of data!
}

Problem: The CPU might reorder the writes in Thread 1, so Thread 2 sees ready = 1 before data = 42.

Solution: Memory barriers (fences):

// Thread 1
data = 42;
__atomic_store_n(&ready, 1, __ATOMIC_RELEASE);  // Release barrier

// Thread 2
if (__atomic_load_n(&ready, __ATOMIC_ACQUIRE)) {  // Acquire barrier
    use(data);  // Now guaranteed to see data = 42
}

2. ABA Problem

The classic lock-free stack has a subtle bug:

typedef struct node {
    int value;
    struct node *next;
} node_t;

node_t *top;  // Stack top

void push(int value) {
    node_t *new_node = malloc(sizeof(node_t));
    new_node->value = value;
    do {
        new_node->next = top;
    } while (!__atomic_compare_exchange_n(&top, &new_node->next, new_node,
                                          0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
}

The ABA problem:

Thread 1 reads top = A
Thread 2 pops A, pops B, pushes A back (top is A again)
Thread 1’s CAS succeeds (top is still A), but the stack is corrupted!

Solution: Use tagged pointers or hazard pointers (complex!).

3. Cache Line Ping-Pong

Lock-free doesn’t mean cache-friendly. Consider a lock-free counter:

atomic_int counter = 0;

// 8 threads all incrementing
__atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);

Every increment causes a cache line to bounce between cores:

Core 0: Read counter (cache miss)
Core 0: Increment, write back
Core 1: Read counter (cache miss - invalidated by Core 0)
Core 1: Increment, write back
Core 2: Read counter (cache miss - invalidated by Core 1)
...

Result: Worse than a mutex for high contention!

I measured this with our logging system:

8 cores, atomic counter:
  Cache misses: 45M (same as mutex!)
  Throughput: 900K ops/sec (barely better than mutex)

8 cores, per-core counters (no sharing):
  Cache misses: 2M
  Throughput: 6.5M ops/sec

Lesson: Avoid sharing atomic variables across cores when possible.

Lock-Free Ring Buffer: A Practical Example

Let me show you the lock-free ring buffer I implemented for our logging system.

The Design

Key insight: Separate read and write indices, use atomic operations only for index updates.

typedef struct {
    char buffer[LOG_SIZE];
    atomic_int head;  // Read index
    atomic_int tail;  // Write index
} lockfree_log_t;

bool log_write(lockfree_log_t *log, const char *msg, int len) {
    int current_tail, next_tail, current_head;

    do {
        current_tail = __atomic_load_n(&log->tail, __ATOMIC_ACQUIRE);
        next_tail = (current_tail + len) % LOG_SIZE;
        current_head = __atomic_load_n(&log->head, __ATOMIC_ACQUIRE);

        // Check if buffer is full
        if (next_tail == current_head) {
            return false;  // Buffer full
        }

    } while (!__atomic_compare_exchange_n(&log->tail, &current_tail, next_tail,
                                          0, __ATOMIC_RELEASE, __ATOMIC_ACQUIRE));

    // Now we own the range [current_tail, next_tail)
    memcpy(&log->buffer[current_tail], msg, len);

    return true;
}

Why This Works

CAS on tail: Only one thread can claim a range
No lock on data: After claiming range, write without contention
Memory ordering: ACQUIRE/RELEASE ensures visibility

The Benchmark

I compared three implementations:

Test: 8 cores, 10M log messages

Mutex-based:
  Cycles: 8.5B
  Cache misses: 45M
  Throughput: 850K msg/sec
  Lock contention: 60%

Spinlock-based:
  Cycles: 7.2B
  Cache misses: 52M (worse!)
  Throughput: 1.1M msg/sec
  Lock contention: 45%

Lock-free (atomic CAS):
  Cycles: 3.2B
  Cache misses: 12M
  Throughput: 2.4M msg/sec
  Lock contention: 0%

The lock-free version was 2.8× faster than mutex, 2.2× faster than spinlock.

Why:

No syscalls (mutex uses futex)
No spinning (spinlock wastes cycles)
Less cache coherence traffic (only indices are atomic)

When Lock-Free Makes Sense

After implementing several lock-free data structures, I learned when they’re worth the complexity.

1. High Contention, Short Critical Sections

If threads are constantly fighting for the same resource, and the work inside is quick (a few instructions), lock-free can win.

Example: Our logging system

High contention: 8 cores logging simultaneously
Short critical section: Just update an index
Result: 2.8× speedup

2. Real-Time Systems

In hard real-time systems, you can’t afford priority inversion (low-priority thread holds lock, blocks high-priority thread).

Lock-free structures provide wait-free or lock-free progress guarantees.

Example: Interrupt handlers

Can’t block in interrupt context
Lock-free queues allow interrupt → thread communication
Used in Linux kernel’s ring buffers

3. Read-Heavy Workloads

If reads vastly outnumber writes, RCU (Read-Copy-Update) can eliminate read-side locks entirely.

Example: Linux kernel’s RCU

Readers: No locks, no atomic operations, just memory barriers
Writers: Rare, use synchronization
Result: Millions of reads/second with zero contention

4. When NOT to Use Lock-Free

Don’t use lock-free when:

Low contention: If threads rarely conflict, a mutex is simpler and just as fast.

Test: 2 cores, low contention
  Mutex: 1.2M ops/sec
  Lock-free: 1.3M ops/sec (only 8% faster, not worth complexity)

Complex operations: If the critical section is large (many instructions), lock-free becomes extremely complex.

Debugging: Lock-free bugs are notoriously hard to reproduce and debug. Use locks unless you have a proven performance problem.

Real-World Example: Linux Kernel’s Per-CPU Variables

The Linux kernel uses a clever trick to avoid atomic operations: per-CPU variables.

Instead of one shared counter:

atomic_int global_counter;  // Shared, causes cache ping-pong

Use one counter per CPU:

DEFINE_PER_CPU(int, local_counter);  // One per CPU, no sharing

void increment_counter(void) {
    int cpu = smp_processor_id();
    per_cpu(local_counter, cpu)++;  // No atomic needed!
}

int read_total(void) {
    int total = 0;
    for_each_possible_cpu(cpu) {
        total += per_cpu(local_counter, cpu);
    }
    return total;
}

Why this works:

Each CPU has its own cache line
No cache coherence traffic
Reads are rare (only when you need the total)

I used this pattern in our logging system for statistics:

Per-CPU counters (messages logged per core):
  Cache misses: 0 (each core has its own cache line)
  Throughput: 8M increments/sec (8 cores × 1M each)

Shared atomic counter:
  Cache misses: 45M
  Throughput: 900K increments/sec

8.9× faster by avoiding sharing!

Memory Ordering on RISC-V

RISC-V has a relaxed memory model (RVWMO - RISC-V Weak Memory Ordering). This means you need explicit fences.

Fence Instructions

# Full fence (all memory operations)
fence rw, rw

# Acquire fence (load-load, load-store)
fence r, rw

# Release fence (load-store, store-store)
fence rw, w

C11 Atomics on RISC-V

GCC maps C11 atomics to RISC-V instructions:

// __ATOMIC_ACQUIRE
__atomic_load_n(&x, __ATOMIC_ACQUIRE);

Compiles to:

ld   a0, 0(a1)      # Load
fence r, rw         # Acquire fence

// __ATOMIC_RELEASE
__atomic_store_n(&x, 42, __ATOMIC_RELEASE);

Compiles to:

fence rw, w         # Release fence
sd   a0, 0(a1)      # Store

The Cost of Fences

Fences aren’t free. I measured the overhead:

Test: 1M atomic loads (RISC-V U74 @ 1.2 GHz)

Relaxed (no fence):
  Cycles: 1.2M (1 cycle per load)

Acquire (fence r, rw):
  Cycles: 8.5M (8.5 cycles per load)

Sequential (fence rw, rw):
  Cycles: 12.3M (12.3 cycles per load)

Lesson: Use the weakest memory ordering that’s correct. Don’t default to __ATOMIC_SEQ_CST.

Summary

The 60% problem was solved. The lock-free ring buffer increased throughput from 850,000 to 2.4 million messages per second—a 2.8× improvement. Lock contention dropped from 60% to zero. But the code became significantly more complex, requiring careful attention to memory ordering and the ABA problem.

Key insights:

Lock-free isn’t always faster. For low contention or complex critical sections, mutexes are simpler and just as fast.
Memory ordering is subtle. You need ACQUIRE/RELEASE barriers to ensure visibility across cores. RISC-V’s relaxed memory model makes this explicit.
Cache coherence matters. Atomic operations cause cache line ping-pong. Avoid sharing atomic variables across cores when possible.
The ABA problem is real. Lock-free stacks and queues need tagged pointers or hazard pointers to avoid corruption.
Per-CPU variables eliminate contention. If you can partition data by CPU, you avoid atomic operations entirely. This is often 5-10× faster than shared atomics.

The numbers from our logging system:

Mutex: 850K msg/sec, 60% lock contention
Lock-free: 2.4M msg/sec, 0% contention
Per-CPU stats: 8M increments/sec (vs 900K with shared atomic)

Lock-free data structures are a powerful tool, but use them only when profiling shows lock contention is a real bottleneck. The complexity cost is high.

Next chapter: String processing and cache-efficient algorithms for text manipulation.

Keyboard shortcuts

Data Structures in Practice