Chapter 27: LLM Performance Analysis
Part VII: AI/HPC
"The best way to predict the future is to invent it." — Alan Kay
The Mystery of the Stuttering Chatbot
Priya's team had deployed their LLM-powered customer service chatbot. The initial demo went great—responses were fast and coherent. But in production, users started complaining.
"Sometimes it takes forever to start responding," one support ticket read. "And then when it does start, the words come out in bursts—fast for a bit, then pause, then fast again."
Priya looked at the metrics dashboard. Average response latency: 2.3 seconds. That seemed fine. But the P99 was 8.7 seconds, and users were experiencing something the dashboard couldn't capture: the feeling of waiting.
She started instrumenting more carefully. The "slow start" issue was easy to identify: long prompts (users pasting entire documents) caused extended prefill times. But the "stuttering" was mysterious. The tokens-per-second metric looked stable.
After a week of investigation, she found the cause: garbage collection in the KV cache management code. Every 50 tokens or so, the system would pause to clean up old cache entries. Each pause was only 100ms, but users noticed it as unnatural hesitation.
"Traditional latency metrics don't capture this," she realized. "LLM inference isn't like a web request. Users experience it as a conversation, and conversations have rhythm."
This chapter explores the unique performance characteristics of LLM inference—characteristics that traditional benchmarking approaches fail to capture.
Why LLM Inference Is Different
LLM inference differs from traditional AI inference in fundamental ways that affect every aspect of performance analysis.
Autoregressive Generation: One Token at a Time
When you ask a vision model to classify an image, it produces the answer in one forward pass. But when you ask an LLM to write a paragraph, it generates that paragraph one token at a time—each token requiring a complete forward pass through the model.
Traditional AI: Input → Model → Complete Output (one pass)
LLM Inference:  Input → Model → Token 1
                        Model → Token 2
                        Model → Token 3
                        ...
                        Model → Token N
A 100-token response requires 100 forward passes. This fundamentally changes the performance characteristics.
Two Distinct Phases with Different Bottlenecks
LLM inference has two phases with completely different performance profiles:
Prefill Phase (processing the input prompt):
- Processes all input tokens in parallel
- Compute-intensive: GPU tensor cores are busy
- Latency scales with prompt length
- Generates the initial KV cache
Decode Phase (generating the output):
- Generates one token at a time
- Memory-intensive: waiting for data transfer
- Latency is relatively constant per token
- Reads and updates the KV cache
This split creates a key insight: TTFT (Time To First Token) and TPS (Tokens Per Second) are largely independent metrics because they're dominated by different phases.
Why Decode Is Memory-Bound
During the decode phase, something counterintuitive happens: a 70-billion parameter model, running on hardware capable of 2000 TFLOPS, achieves only a tiny fraction of its theoretical compute performance.
The reason is arithmetic intensity. To generate one token, you must:
- Read 70 billion parameters from memory (~140 GB at FP16)
- Read the KV cache (potentially tens of GB more)
- Perform matrix operations for exactly 1 token
- Write the new KV cache entry
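The computation is minimal compared to the data movement. Even at 3 TB/s of memory bandwidth (roughly an H100), reading 140 GB takes ~47 ms. That is the per-token latency floor for single-user inference, before counting KV cache reads. Note that 140 GB does not fit on a single 80 GB H100, so in practice the weights are sharded across several GPUs and the floor scales with their aggregate bandwidth.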
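The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope estimate, not a benchmark: the function names are hypothetical, the "2 FLOPs per parameter per token" rule is a common approximation rather than something stated in this chapter, and KV cache traffic is ignored.

```python
def decode_floor_ms(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on per-token decode time if every weight is read once per token."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s * 1e3


def compute_time_ms(num_params, peak_flops):
    """Time for the math alone, assuming ~2 FLOPs per parameter per token
    (one multiply and one add) and perfect use of peak compute."""
    return 2 * num_params / peak_flops * 1e3


# Figures from the text: 70B parameters, FP16 (2 bytes), 3 TB/s, 2000 TFLOPS.
memory_ms = decode_floor_ms(70e9, 2, 3e12)
math_ms = compute_time_ms(70e9, 2000e12)

print(f"memory-bound floor: {memory_ms:.1f} ms/token")
print(f"compute time:       {math_ms:.3f} ms/token")
print(f"ratio:              {memory_ms / math_ms:.0f}x")
```

Under these assumptions the memory read takes hundreds of times longer than the arithmetic, which is why the tensor cores sit mostly idle during single-user decode.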
Core Performance Metrics
TTFT (Time To First Token)
TTFT = Time from receiving request to outputting first token
Components:
TTFT = Network latency + Queue time + Prefill time
Influencing factors:
- Prompt length (primary)
- Model size
- Hardware performance
- System load
Typical values (7B model, single GPU):
Prompt 100 tokens: 50-100 ms
Prompt 1000 tokens: 200-500 ms
Prompt 4000 tokens: 500-2000 ms
TPS / Throughput
TPS = Tokens generated per second
Two definitions:
1. Single-request TPS: Generation speed for one request
2. System TPS: Total throughput of entire system
Single-request TPS (7B model, single GPU):
FP16: 30-50 tokens/sec
INT8: 50-80 tokens/sec
INT4: 80-120 tokens/sec
System TPS (depends on batch size):
Batch 1: 30-50 tokens/sec
Batch 32: 500-1000 tokens/sec
Batch 128: 1500-3000 tokens/sec
TPOT (Time Per Output Token)
TPOT = Time to generate each output token
     = 1 / Single-request TPS
TPOT determines user-perceived "typing speed"
Typical values:
Fast (good experience): < 50 ms/token
Acceptable: 50-100 ms/token
Slow (poor experience): > 100 ms/token
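To show how TTFT, TPOT, and single-request TPS relate, here is a minimal sketch that derives all three from token arrival timestamps. The function name and the returned dictionary keys are hypothetical; how the timestamps are collected (for example, from a streaming client) is left out.

```python
def stream_metrics(request_time_s, token_times_s):
    """Derive TTFT, TPOT, and single-request TPS from token arrival timestamps.

    request_time_s: when the request was sent (seconds).
    token_times_s:  arrival time of each output token (seconds, ascending).
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_time_s
    # Gaps between consecutive tokens; the first token is covered by TTFT.
    gaps = [later - earlier
            for earlier, later in zip(token_times_s, token_times_s[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    tps = 1.0 / tpot if tpot > 0 else float("inf")
    return {"ttft_s": ttft, "tpot_s": tpot, "tps": tps, "gaps_s": gaps}


metrics = stream_metrics(0.0, [0.10, 0.13, 0.16, 0.19])
print(metrics["ttft_s"], metrics["tpot_s"], metrics["tps"])
```

Keeping the individual gaps, not just their mean, matters later: a stable average TPOT can hide occasional long pauses.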
End-to-End Latency
Total Latency = TTFT + (N × TPOT)
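Where N = number of output tokens (strictly, only N − 1 tokens follow the first one, but the difference is negligible for long outputs)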
Example:
TTFT = 100 ms
TPOT = 30 ms
N = 100 tokens
Total = 100 + (100 × 30) = 3100 ms = 3.1 seconds
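The same formula as a small helper, useful for checking latency budgets. The function names are illustrative only.

```python
def total_latency_ms(ttft_ms, tpot_ms, num_output_tokens):
    """End-to-end latency using Total = TTFT + (N x TPOT)."""
    return ttft_ms + num_output_tokens * tpot_ms


def decode_share(ttft_ms, tpot_ms, num_output_tokens):
    """Fraction of the total time spent generating tokens after the first."""
    total = total_latency_ms(ttft_ms, tpot_ms, num_output_tokens)
    return (num_output_tokens * tpot_ms) / total


print(total_latency_ms(100, 30, 100))          # the worked example above
print(round(decode_share(100, 30, 100), 3))
```

For the worked example, decode accounts for almost all of the total, so shaving TPOT helps long responses far more than shaving TTFT.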
KV Cache
What Is KV Cache
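KV Cache is the most important optimization technique in LLM inference: it stores the key and value tensors computed for earlier tokens so that each decode step reuses them instead of recomputing them.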
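To make the saving concrete, the toy sketch below runs single-head attention over a short sequence twice: once recomputing keys and values for every earlier token at each step, and once appending to a cache. Everything here is illustrative: `project` is a hypothetical stand-in for the learned K and V projections, and real models use tensors, many heads, and many layers.

```python
import math


def project(x, scale):
    """Stand-in for a learned K or V projection (hypothetical: scales the vector)."""
    return [scale * xi for xi in x]


def attend(query, keys, values):
    """Single-head scaled dot-product attention over all visible positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(d)]


def decode_without_cache(tokens):
    """Recompute K and V for every earlier token at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        keys = [project(x, 0.5) for x in tokens[:t]]
        values = [project(x, 2.0) for x in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Compute K and V once per token and keep them in a growing cache."""
    outputs, projections = [], 0
    keys, values = [], []
    for x in tokens:
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(x, keys, values))
    return outputs, projections


toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_count = decode_without_cache(toy_tokens)
fast_out, fast_count = decode_with_cache(toy_tokens)
print(slow_count, fast_count, slow_out == fast_out)
```

The outputs match, but the uncached version performs a number of projections that grows quadratically with sequence length, while the cached version grows linearly. The price is memory: the cache must hold keys and values for every token of every active request.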
vLLM and PagedAttention
PagedAttention Principle
PagedAttention, introduced by vLLM, addresses the KV Cache fragmentation problem:
Traditional approach:
Pre-allocate maximum sequence length KV Cache for each request
Causes significant waste
PagedAttention:
Divide KV Cache into fixed-size "pages"
Allocate pages on demand
Similar to OS virtual memory
Effect:
- KV Cache memory utilization rises from ~50% to ~95%
- Can serve more concurrent requests
- Supports longer sequences
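The on-demand paging idea can be sketched with a toy allocator. This is not vLLM's implementation: the class and method names are hypothetical, and it only tracks which fixed-size blocks belong to which sequence, without storing any tensors.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence gets fixed-size blocks only as it grows."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def blocks_in_use(self):
        return sum(len(table) for table in self.block_tables.values())


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(17):
    allocator.append_token("request-a")
print(allocator.blocks_in_use())  # 17 tokens need 2 blocks of 16
```

A pre-allocating scheme would reserve the maximum sequence length for this request up front. Here, the only waste is the unused tail of the last block.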
Using vLLM
# Install
pip install vllm
# Start API server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 1
# Use OpenAI-compatible API
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "Hello, world!",
        "max_tokens": 100
    }'
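The same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call; the helper names are hypothetical, and `complete` assumes the server is running on localhost:8000 and returns the OpenAI-style `choices[0].text` field.

```python
import json
import urllib.request


def build_completion_request(prompt, max_tokens=100,
                             base_url="http://localhost:8000",
                             model="meta-llama/Llama-2-7b-hf"):
    """Build the POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Hello, world!")` would then return the generated continuation as a string.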
vLLM Performance Tuning
Key parameters:
--gpu-memory-utilization 0.9
    GPU memory usage ratio (default 0.9)
--max-num-seqs 256
    Maximum concurrent sequences
--max-num-batched-tokens 8192
    Maximum tokens per iteration
--block-size 16
    KV Cache page size
--swap-space 4
    CPU swap space (GB)
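To get a feel for how `--gpu-memory-utilization` and `--block-size` interact, the rough estimate below computes how many KV Cache blocks fit on a GPU. It is only a sketch under stated assumptions: vLLM determines the real number by profiling, activation memory and other overhead are ignored, the helper names are hypothetical, and the example uses the published Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128) with an assumed 80 GB GPU and 14 GB of FP16 weights.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV Cache bytes for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def estimate_kv_blocks(gpu_mem_bytes, gpu_memory_utilization, weight_bytes,
                       block_size, bytes_per_token):
    """Rough count of KV blocks that fit after the weights are loaded."""
    budget = gpu_mem_bytes * gpu_memory_utilization - weight_bytes
    if budget <= 0:
        return 0
    return int(budget // (block_size * bytes_per_token))


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
blocks = estimate_kv_blocks(
    gpu_mem_bytes=80e9,            # assumed 80 GB GPU
    gpu_memory_utilization=0.9,
    weight_bytes=14e9,             # ~7B parameters at FP16
    block_size=16,
    bytes_per_token=per_token,
)
print(per_token, blocks, blocks * 16)
```

Under these assumptions each token costs 512 KiB of cache, so the number of tokens that can be resident at once, across all concurrent sequences, is bounded well before `--max-num-seqs` becomes the limit for long prompts.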
Other LLM Inference Frameworks
TensorRT-LLM
NVIDIA's high-performance LLM inference framework:
Features:
- Deeply optimized CUDA kernels
- Supports Tensor Parallelism
- Supports In-flight Batching
- Integrates with Triton Inference Server
Performance:
Usually 10-30% faster than vLLM (scenario dependent)
Text Generation Inference (TGI)
Hugging Face's inference framework:
Features:
- Easy to use
- Supports multiple models
- Built-in Continuous Batching
- Docker deployment friendly
llama.cpp
LLM inference for CPU and edge devices:
Features:
- Pure C/C++ implementation
- Supports quantization (GGUF format)
- Runs on CPU, Apple Silicon
- Low memory footprint
Performance Optimization Strategies
Batching Strategies
1. Static Batching
- Fixed batch size
- Wait for batch to fill or timeout
- Simple but inefficient
2. Continuous Batching
- Dynamically add/remove requests
- Don't wait for batch to fill
- Higher GPU utilization
3. In-flight Batching
- TensorRT-LLM's name for continuous batching
- Add new requests during decode
- Maximize throughput
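The difference between static and continuous batching shows up in a toy simulation that counts decode iterations. This is a simplified model with hypothetical function names: every request needs a known number of output tokens (at least one), each iteration produces one token per active request, and prefill cost and timeouts are ignored.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes; short ones sit idle."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A finished request frees its slot immediately for the next waiting one."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; drop those that just finished.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lengths = [2, 8, 2, 8]
print(static_batching_steps(lengths, batch_size=2))
print(continuous_batching_steps(lengths, batch_size=2))
```

With mixed short and long responses, the static scheduler spends iterations on slots whose requests have already finished, while the continuous scheduler reuses those slots right away.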
Quantization
Quantization impact on LLM:
Precision   Model Size   Memory Traffic   TPS Improvement
──────────────────────────────────────────────────────────
FP16        1.0x         1.0x             1.0x
INT8        0.5x         0.5x             ~1.5-2x
INT4        0.25x        0.25x            ~2-3x
Note:
- Decode is memory-bound
- Reducing data directly improves performance
- But may affect output quality
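The trade-off in the note can be seen in the simplest possible scheme, symmetric per-tensor INT8 quantization. This sketch is for illustration only: the function names are hypothetical, and production frameworks typically use finer-grained scales (per channel or per group) and more elaborate methods.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now occupies one byte instead of two, which halves the data read per token. The cost is a rounding error of up to half the scale per weight, which is where the potential quality loss comes from.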
Speculative Decoding
Principle:
Use small model to "guess" multiple tokens
Verify with large model
If correct, accept multiple tokens at once
Effect:
- Can improve TPS by 2-3x
- Doesn't affect output quality when the large model's verification is exact
- Requires additional small model
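The guess-and-verify loop can be sketched with two toy "models" that are plain functions. This shows only the greedy variant under strong simplifications: `target_next` and `draft_next` are hypothetical stand-ins, and the per-position checks that a real system performs in one batched forward pass of the large model are written here as a loop.

```python
def target_next(seq):
    """Toy large model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy small model: agrees with the target except right after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def greedy_decode(prompt, num_tokens):
    """Baseline: one large-model pass per generated token."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(prompt, num_tokens, k=3):
    """Draft k tokens, keep the verified prefix, then add one target token."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < num_tokens:
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        target_passes += 1  # one verification pass covers all k positions
        accepted = []
        for token in proposal:
            expected = target_next(seq + accepted)
            if token != expected:
                accepted.append(expected)  # first mismatch: use the target's token
                break
            accepted.append(token)
        else:
            accepted.append(target_next(seq + accepted))  # all matched: bonus token
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + num_tokens], target_passes


tokens, passes = speculative_decode([0], 12, k=3)
print(tokens == greedy_decode([0], 12), passes)
```

The output is identical to plain greedy decoding with the large model, but it needs far fewer large-model passes. How much is saved depends on how often the draft model agrees with the target.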
Priya's Dashboard, Revisited
After her investigation, Priya rebuilt the monitoring dashboard for the chatbot. The new version tracked metrics that actually mattered for user experience:
- TTFT distribution (not just average, but P50/P95/P99)
- ITL (inter-token latency) histogram (to catch stuttering)
- KV cache memory pressure (to predict when GC would trigger)
- Prefill queue depth (to predict TTFT spikes)
She also added a "smoothness score"—a custom metric that penalized high ITL variance, even when average TPS looked fine.
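The chapter does not define Priya's smoothness score, so the sketch below is only one possible definition: it maps the coefficient of variation of the inter-token latencies to a value between 0 and 1. The function names and the synthetic data (a 100 ms pause added every 50 tokens, as in the opening story) are assumptions for illustration.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def smoothness_score(itls_ms):
    """1.0 for perfectly even token gaps, lower as the gaps vary more.

    Hypothetical definition: 1 / (1 + coefficient of variation).
    """
    mean = statistics.mean(itls_ms)
    if mean == 0:
        return 1.0
    return 1.0 / (1.0 + statistics.pstdev(itls_ms) / mean)


steady = [30.0] * 100                      # even 30 ms gaps
stutter = ([30.0] * 49 + [130.0]) * 2      # a 100 ms pause every 50 tokens

for name, itls in (("steady", steady), ("stutter", stutter)):
    print(name,
          "mean:", statistics.mean(itls),
          "p50:", percentile(itls, 50),
          "p99:", percentile(itls, 99),
          "smoothness:", round(smoothness_score(itls), 3))
```

The mean gap barely moves between the two traces, which is why the tokens-per-second metric looked stable. The P99 and the smoothness score both separate them clearly.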
"The old dashboard said everything was fine," she told her manager. "The new one would have caught the stuttering issue on day one."
The lesson she learned applies beyond LLMs: the right metrics depend on the user experience you're trying to deliver. For a batch processing system, average throughput might be enough. For an interactive chatbot, you need to measure what users actually feel—and that means understanding the unique characteristics of your workload.
LLM inference isn't just "AI inference with more parameters." It's a fundamentally different workload with its own performance model, its own bottlenecks, and its own metrics. Master those, and you can build systems that don't just perform well on benchmarks—they feel fast to users.
Summary
LLM performance analysis has its own metrics and challenges:
Core Metrics
- TTFT: First token latency
- TPS: Generation speed
- TPOT: Time per token
- ITL: Inter-token latency
Key Technologies
- KV Cache: Avoid redundant computation
- PagedAttention: Solve memory fragmentation
- Continuous Batching: Improve throughput
- Quantization: Reduce memory requirements
Inference Frameworks
- vLLM: PagedAttention, high throughput
- TensorRT-LLM: NVIDIA optimized
- TGI: Hugging Face, easy to use
- llama.cpp: CPU/edge devices
Optimization Directions
- Prefill: Compute optimization (Tensor Cores)
- Decode: Memory optimization (quantization, KV Cache)
- System: Batching strategies