Chapter 27: LLM Performance Analysis

Part VII: AI/HPC

"The best way to predict the future is to invent it." — Alan Kay

The Mystery of the Stuttering Chatbot

Priya's team had deployed their LLM-powered customer service chatbot. The initial demo went great—responses were fast and coherent. But in production, users started complaining.

"Sometimes it takes forever to start responding," one support ticket read. "And then when it does start, the words come out in bursts—fast for a bit, then pause, then fast again."

Priya looked at the metrics dashboard. Average response latency: 2.3 seconds. That seemed fine. But the P99 was 8.7 seconds, and users were experiencing something the dashboard couldn't capture: the feeling of waiting.

She started instrumenting more carefully. The "slow start" issue was easy to identify: long prompts (users pasting entire documents) caused extended prefill times. But the "stuttering" was mysterious. The tokens-per-second metric looked stable.

After a week of investigation, she found the cause: garbage collection in the KV cache management code. Every 50 tokens or so, the system would pause to clean up old cache entries. Each pause was only 100ms, but users noticed it as unnatural hesitation.

"Traditional latency metrics don't capture this," she realized. "LLM inference isn't like a web request. Users experience it as a conversation, and conversations have rhythm."

This chapter explores the unique performance characteristics of LLM inference—characteristics that traditional benchmarking approaches fail to capture.

Why LLM Inference Is Different

LLM inference differs from traditional AI inference in fundamental ways that affect every aspect of performance analysis.

Autoregressive Generation: One Token at a Time

When you ask a vision model to classify an image, it produces the answer in one forward pass. But when you ask an LLM to write a paragraph, it generates that paragraph one token at a time—each token requiring a complete forward pass through the model.

Traditional AI:  Input → Model → Complete Output (one pass)

LLM Inference:   Input → Model → Token 1
                         Model → Token 2
                         Model → Token 3
                          ...
                         Model → Token N

A 100-token response requires 100 forward passes. This fundamentally changes the performance characteristics.
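The loop structure can be sketched in a few lines of Python. This is only an illustration: `toy_forward` is a hypothetical stand-in for a real model's forward pass (it just returns the previous token id plus one), and the point is simply that generating N tokens calls it N times.

```python
def toy_forward(tokens):
    """Hypothetical stand-in for a model forward pass: returns the next token id."""
    return (tokens[-1] + 1) % 1000


def generate(prompt_tokens, max_new_tokens):
    """Autoregressive decoding: every new token costs one more forward pass."""
    tokens = list(prompt_tokens)
    forward_passes = 0
    for _ in range(max_new_tokens):
        next_token = toy_forward(tokens)  # the model sees everything generated so far
        forward_passes += 1
        tokens.append(next_token)         # the output becomes part of the next input
    return tokens[len(prompt_tokens):], forward_passes


new_tokens, passes = generate([5, 6, 7], max_new_tokens=100)
print(passes)  # 100 forward passes for a 100-token response
```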

Two Distinct Phases with Different Bottlenecks

LLM inference has two phases with completely different performance profiles:

Prefill Phase (processing the input prompt):

Processes all input tokens in parallel
Compute-bound: GPU tensor cores are kept busy
Latency scales with prompt length
Generates the initial KV cache

Decode Phase (generating the output):

Generates one token at a time
Memory-bound: the GPU spends most of its time waiting on weight and KV cache reads
Latency is relatively constant per token
Reads and updates the KV cache

This split yields a key insight: TTFT (Time To First Token) is dominated by prefill and TPS (Tokens Per Second) by decode, so the two metrics are largely independent and have to be measured and tuned separately.

Why Decode Is Memory-Bound

During the decode phase, something counterintuitive happens: a 70-billion parameter model, running on hardware rated for roughly 1,000 to 2,000 TFLOPS (depending on precision and sparsity), achieves only a tiny fraction of that theoretical compute performance.

The reason is arithmetic intensity. To generate one token, you must:

Read 70 billion parameters from memory (~140 GB at FP16)
Read the KV cache (potentially tens of GB more)
Perform matrix operations for exactly 1 token
Write the new KV cache entry

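The computation is minimal compared to the data movement. Even with 3 TB/s memory bandwidth (roughly an H100), reading 140 GB takes ~47 ms. Treat that as an idealized per-token latency floor for single-user inference: 140 GB does not fit on a single 80 GB H100, so in practice the weights are sharded across GPUs or quantized.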
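The arithmetic behind that floor is simple enough to check directly. The sketch below assumes every weight is read exactly once per generated token and ignores KV cache reads and compute time, so it is a lower bound rather than a prediction; the function name is illustrative.

```python
def decode_latency_floor_ms(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on per-token decode latency if all weights are read once per token."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1000.0


floor_ms = decode_latency_floor_ms(
    num_params=70e9,             # 70B parameters
    bytes_per_param=2,           # FP16
    bandwidth_bytes_per_s=3e12,  # 3 TB/s, the figure used in the text
)
max_tps = 1000.0 / floor_ms      # best-case single-request tokens per second
print(f"{floor_ms:.1f} ms/token, at most {max_tps:.1f} tokens/s")
```

Under these assumptions a single request cannot exceed about 21 tokens per second, no matter how many TFLOPS the hardware offers.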

Core Performance Metrics

TTFT (Time To First Token)

TTFT = Time from the client sending a request to the first token arriving

Components:
  TTFT = Network latency + Queue time + Prefill time

Influencing factors:
  - Prompt length (primary)
  - Model size
  - Hardware performance
  - System load

Typical values (7B model, single GPU):
  Prompt 100 tokens:   50-100 ms
  Prompt 1000 tokens:  200-500 ms
  Prompt 4000 tokens:  500-2000 ms

TPS / Throughput

TPS = Tokens Per Second (tokens generated per second)

Two definitions:
  1. Single-request TPS: Generation speed for one request
  2. System TPS: Total throughput of entire system

Single-request TPS (7B model, single GPU):
  FP16:  30-50 tokens/sec
  INT8:  50-80 tokens/sec
  INT4:  80-120 tokens/sec

System TPS (depends on batch size):
  Batch 1:   30-50 tokens/sec
  Batch 32:  500-1000 tokens/sec
  Batch 128: 1500-3000 tokens/sec

TPOT (Time Per Output Token)

TPOT = Time to generate each output token
     = 1 / Single-request TPS

TPOT determines user-perceived "typing speed"

Typical values:
  Fast (good experience):    < 50 ms/token
  Acceptable:                50-100 ms/token
  Slow (poor experience):    > 100 ms/token

End-to-End Latency

Total Latency = TTFT + (N × TPOT)

Where N = number of output tokens

Example:
  TTFT = 100 ms
  TPOT = 30 ms
  N = 100 tokens

  Total = 100 + (100 × 30) = 3100 ms = 3.1 seconds
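The same model can be evaluated for different output lengths to see where the time goes. This is a minimal sketch of the formula above, using the example's TTFT and TPOT values; the function name is illustrative.

```python
def end_to_end_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """Latency model from the text: TTFT + N * TPOT."""
    return ttft_ms + output_tokens * tpot_ms


ttft_ms, tpot_ms = 100, 30
for n in (10, 100, 1000):
    total = end_to_end_latency_ms(ttft_ms, tpot_ms, n)
    decode_share = (total - ttft_ms) / total   # fraction of time spent decoding
    print(f"N={n:5d}  total={total:6d} ms  decode share={decode_share:.0%}")
```

For anything beyond very short answers, decode time dominates total latency, which is why TPOT matters so much for long responses.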

KV Cache

What Is KV Cache

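KV Cache is the most important optimization technique in LLM inference. It stores the key and value tensors already computed for earlier tokens, so each decode step only computes them for the newest token instead of reprocessing the whole sequence.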
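A toy count of the work saved makes the idea concrete. In the sketch below, `project_kv` is a hypothetical stand-in for the key/value projections of a single token; real models do this per layer and per attention head, but the counting argument is the same.

```python
def project_kv(token):
    """Hypothetical stand-in for computing one token's key and value."""
    return (token * 2, token * 3)


def decode_without_cache(tokens):
    """Recompute K/V for the whole sequence at every generation step."""
    projections = 0
    for step in range(1, len(tokens) + 1):
        _ = [project_kv(t) for t in tokens[:step]]
        projections += step
    return projections


def decode_with_cache(tokens):
    """Compute K/V once per token and keep the results in a cache."""
    cache = []
    projections = 0
    for t in tokens:
        cache.append(project_kv(t))  # only the newest token is projected
        projections += 1
    return projections, cache


tokens = list(range(100))
print(decode_without_cache(tokens))   # 5050 projections (1 + 2 + ... + 100)
print(decode_with_cache(tokens)[0])   # 100 projections
```

The cache trades memory for compute: the projections drop from quadratic to linear in sequence length, but every cached entry has to stay resident in GPU memory.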

vLLM and PagedAttention

PagedAttention Principle

PagedAttention, introduced by vLLM, addresses the KV Cache fragmentation problem:

Traditional approach:
  Pre-allocate maximum sequence length KV Cache for each request
  Causes significant waste

PagedAttention:
  Divide KV Cache into fixed-size "pages"
  Allocate pages on demand
  Similar to OS virtual memory

Effect:
  - KV Cache memory utilization rises from ~50% to ~95%
  - Can serve more concurrent requests
  - Supports longer sequences
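The on-demand allocation idea can be sketched with a toy allocator. This is not vLLM's implementation; the class and method names are hypothetical, and it only tracks which fixed-size blocks belong to which request.

```python
class PagedKVAllocator:
    """Toy page allocator: hands out fixed-size KV blocks only when needed."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> tokens stored so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # last block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("req-a")
print(len(alloc.block_tables["req-a"]))  # 2 blocks hold 20 tokens
```

A pre-allocating scheme with a 2048-token maximum would have reserved 128 such blocks for the same 20-token request; here the waste is limited to the unused tail of the last block.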

Using vLLM

# Install
pip install vllm

# Start API server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 1

# Use OpenAI-compatible API
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "Hello, world!",
        "max_tokens": 100
    }'

vLLM Performance Tuning

Key parameters:

--gpu-memory-utilization 0.9
  GPU memory usage ratio (default 0.9)

--max-num-seqs 256
  Maximum concurrent sequences

--max-num-batched-tokens 8192
  Maximum tokens per iteration

--block-size 16
  KV Cache page (block) size, in tokens

--swap-space 4
  CPU swap space per GPU (GiB)
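To see how these parameters interact, a back-of-the-envelope estimate of KV cache capacity can help. The sketch below is not vLLM's actual accounting: it ignores activations and runtime overhead, assumes a hypothetical 80 GB GPU, and uses the published Llama-2-7B shapes (32 layers, 32 KV heads, head dimension 128) with about 7B FP16 parameters. All function names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def estimate_kv_blocks(gpu_bytes, gpu_memory_utilization, weight_bytes,
                       block_size, bytes_per_token):
    """Rough count of KV blocks that fit; ignores activations and other overhead."""
    budget = gpu_bytes * gpu_memory_utilization - weight_bytes
    return max(0, int(budget // (block_size * bytes_per_token)))


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
blocks = estimate_kv_blocks(
    gpu_bytes=80e9,              # hypothetical 80 GB GPU (assumption)
    gpu_memory_utilization=0.9,  # --gpu-memory-utilization
    weight_bytes=7e9 * 2,        # about 7B parameters at FP16 (assumption)
    block_size=16,               # --block-size
    bytes_per_token=per_token,
)
print(f"{per_token / 2**20:.1f} MiB of KV cache per token")
print(f"{blocks} blocks, room for about {blocks * 16} cached tokens")
```

Under these assumptions the cache holds on the order of 100,000 tokens in total, which bounds how useful a large `--max-num-seqs` can be when individual sequences are long.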

Other LLM Inference Frameworks

TensorRT-LLM

NVIDIA's high-performance LLM inference framework:

Features:
  - Deeply optimized CUDA kernels
  - Supports Tensor Parallelism
  - Supports In-flight Batching
  - Integrates with Triton Inference Server

Performance:
  Often reported as 10-30% faster than vLLM, but results vary with model, hardware, and version

Text Generation Inference (TGI)

Hugging Face's inference framework:

Features:
  - Easy to use
  - Supports multiple models
  - Built-in Continuous Batching
  - Docker deployment friendly

llama.cpp

LLM inference for CPU and edge devices:

Features:
  - Pure C/C++ implementation
  - Supports quantization (GGUF format)
  - Runs on CPU, Apple Silicon
  - Low memory footprint

Performance Optimization Strategies

Batching Strategies

1. Static Batching
   - Fixed batch size
   - Wait for batch to fill or timeout
   - Simple but inefficient

2. Continuous Batching
   - Dynamically add/remove requests
   - Don't wait for batch to fill
   - Higher GPU utilization

3. In-flight Batching
   - NVIDIA's term (TensorRT-LLM) for continuous batching
   - New requests join the running batch between decode iterations
   - Maximizes throughput
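A small simulation shows why the scheduling policy matters. The sketch below makes simplifying assumptions: all requests are already queued, prefill is ignored, and every decode iteration costs one step regardless of how many slots are occupied. The function names are illustrative.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Finished requests free their slot at once; waiting requests fill it."""
    waiting = deque(output_lengths)
    running = []   # remaining output tokens per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                                    # one decode iteration for the batch
        running = [r - 1 for r in running if r > 1]   # drop requests that just finished
    return steps


lengths = [10, 200, 15, 20, 180, 12, 25, 30]
print(static_batching_steps(lengths, batch_size=4))      # 380 decode steps
print(continuous_batching_steps(lengths, batch_size=4))  # 200 decode steps
```

With static batching, short requests sit idle in their slots until the longest request in the batch finishes; continuous batching reuses those slots immediately, which is where the utilization gain comes from.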

Quantization

Quantization impact on LLM:

Precision  Model Size  Bytes Read per Token  Single-Request TPS
───────────────────────────────────────────────────────────────
FP16       1.0x        1.0x                  1.0x
INT8       0.5x        0.5x                  ~1.5-2x
INT4       0.25x       0.25x                 ~2-3x

Note:
  - Decode is memory-bound
  - Fewer bytes read per token translates directly into faster decode
  - Lower precision may degrade output quality
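Both sides of that trade-off show up in even the simplest scheme. The sketch below uses symmetric absmax INT8 quantization with a single scale for all values, assuming at least one weight is nonzero; production schemes typically use per-channel or per-group scales, so this is an illustration rather than any particular library's method.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


weights = [0.12, -0.5, 0.33, 0.01, -0.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)                                  # one byte per weight instead of two
print(f"max rounding error: {max_error:.5f}")
```

Each weight now needs half the bytes of FP16, but it comes back slightly different from the original value; that rounding error is the source of the quality risk.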

Speculative Decoding

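Principle:
  A small draft model guesses several tokens ahead
  The large model verifies all guesses in one forward pass
  Guesses that match are accepted together; the first mismatch is replaced by the large model's token

Effect:
  - Can improve TPS by 2-3x when most guesses are accepted
  - Doesn't affect output quality, because the large model checks every token
  - Requires an additional small model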
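The accept/reject logic can be sketched for the greedy case. Both "models" below are hypothetical toy functions: the target counts upward modulo 10, and the draft agrees with it except right after a 4. Sampling-based systems replace the equality check with a probabilistic acceptance rule, which is not shown here.

```python
def target_next(tokens):
    """Hypothetical large model (greedy): counts upward, wrapping at 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Hypothetical small model: agrees with the target except after a 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def speculative_step(tokens, k):
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    draft = []
    for _ in range(k):
        draft.append(draft_next(tokens + draft))
    accepted = []
    for guess in draft:
        # In a real system these checks share one batched target forward pass.
        expected = target_next(tokens + accepted)
        if guess != expected:
            accepted.append(expected)   # replace the first wrong guess and stop
            return accepted
        accepted.append(guess)
    accepted.append(target_next(tokens + accepted))  # bonus token if all guesses pass
    return accepted


def speculative_generate(prompt, num_tokens, k=3):
    tokens, target_passes = list(prompt), 0
    while len(tokens) - len(prompt) < num_tokens:
        tokens += speculative_step(tokens, k)
        target_passes += 1              # one verification pass per step
    return tokens[len(prompt):len(prompt) + num_tokens], target_passes


output, passes = speculative_generate([0], num_tokens=20)
print(output)
print(passes)  # 7 target passes instead of 20
```

The output is identical to what the target model would produce on its own; only the number of large-model passes changes, and the saving shrinks as the draft model's guesses get worse.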

Priya's Dashboard, Revisited

After her investigation, Priya rebuilt the monitoring dashboard for the chatbot. The new version tracked metrics that actually mattered for user experience:

TTFT distribution (not just average, but P50/P95/P99)
ITL (inter-token latency) histogram (to catch stuttering)
KV cache memory pressure (to predict when GC would trigger)
Prefill queue depth (to predict TTFT spikes)

She also added a "smoothness score"—a custom metric that penalized high ITL variance, even when average TPS looked fine.
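The chapter does not define Priya's score, so the sketch below is one hypothetical way to build such a metric: compute inter-token latencies from token arrival timestamps, then map their coefficient of variation into a score where 1.0 means perfectly even pacing. The two synthetic streams have the same average speed, but one has a single 130 ms gap, i.e. a 100 ms stall on top of 30 ms pacing.

```python
import statistics
from itertools import accumulate


def inter_token_latencies(timestamps_ms):
    """Gaps between consecutive token arrival times."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def smoothness_score(itls_ms):
    """Hypothetical score in (0, 1]: 1 / (1 + coefficient of variation of ITL)."""
    mean = statistics.mean(itls_ms)
    return 1.0 / (1.0 + statistics.pstdev(itls_ms) / mean)


# Two synthetic streams of 51 tokens, both averaging 32 ms per token.
steady_ts = list(accumulate([32] * 50, initial=0))
bursty_ts = list(accumulate([30] * 49 + [130], initial=0))

for name, ts in (("steady", steady_ts), ("bursty", bursty_ts)):
    itls = inter_token_latencies(ts)
    print(f"{name}: mean ITL={statistics.mean(itls)} ms, "
          f"max ITL={max(itls)} ms, smoothness={smoothness_score(itls):.2f}")
```

Both streams report the same mean ITL and therefore the same TPS, yet the bursty one scores about 0.70 against 1.00 for the steady one, which is the kind of signal an average-only dashboard hides.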

"The old dashboard said everything was fine," she told her manager. "The new one would have caught the stuttering issue on day one."

The lesson she learned applies beyond LLMs: the right metrics depend on the user experience you're trying to deliver. For a batch processing system, average throughput might be enough. For an interactive chatbot, you need to measure what users actually feel—and that means understanding the unique characteristics of your workload.

LLM inference isn't just "AI inference with more parameters." It's a fundamentally different workload with its own performance model, its own bottlenecks, and its own metrics. Master those, and you can build systems that don't just perform well on benchmarks—they feel fast to users.

Summary

LLM performance analysis has its own metrics and challenges:

Core Metrics

TTFT: First token latency
TPS: Generation speed
TPOT: Time per token
ITL: Inter-token latency

Key Technologies

KV Cache: Avoid redundant computation
PagedAttention: Solve memory fragmentation
Continuous Batching: Improve throughput
Quantization: Reduce memory requirements

Inference Frameworks

vLLM: PagedAttention, high throughput
TensorRT-LLM: NVIDIA optimized
TGI: Hugging Face, easy to use
llama.cpp: CPU/edge devices

Optimization Directions

Prefill: Compute optimization (Tensor Cores)
Decode: Memory optimization (quantization, KV Cache)
System: Batching strategies

Performance and Benchmarking