KV Caching Explained: Why LLM Inference Is Memory-Bound
Most people assume that running a large language model is about doing math. Multiply matrices, apply activations, repeat. And during the first pass through your prompt, that’s roughly true. But once the model starts generating tokens one at a time, the bottleneck flips entirely.
LLM inference isn’t compute-bound. It’s memory-bound.
The culprit? The KV cache, a data structure that grows with every token the model has ever seen in the conversation. Understanding it is the key to understanding why inference is expensive, why long contexts cost so much VRAM, and why an entire ecosystem of optimization techniques exists to tame it.
What the KV Cache Actually Stores
Recall the core attention equation from the transformer architecture:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_{\text{head}}}}\right)V$$
Every time the model processes a token, it projects that token’s embedding into three vectors: a Query ($Q$), a Key ($K$), and a Value ($V$). The query asks “what should I attend to?”, the keys answer “here’s what each previous token offers,” and the values carry the actual information.
Here’s the critical insight: when generating token $t$, the model only needs the query for the new token. But it needs the keys and values for every previous token $1$ through $t-1$. Without caching, the model would have to re-project every previous token through the $W_K$ and $W_V$ weight matrices on every single generation step. That’s an enormous amount of redundant computation.
The KV cache solves this by storing the key and value projections for all previous tokens. Each new token’s $K$ and $V$ vectors are appended to the cache, and the full cache is read during the attention computation.
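In code, the cache mechanics look like this. The sketch below is a minimal single-head, batch-1 decode step in NumPy; it is illustrative only (real systems preallocate the cache on the GPU rather than growing it per step), and all names are my own:

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, cache_k, cache_v):
    """One decode step for a single attention head (illustrative sketch).

    x_t: embedding of the new token, shape (d_model,)
    cache_k / cache_v: keys and values for all previous tokens, shape (t-1, d_head)
    """
    q = x_t @ W_q                      # query: computed for the new token only
    k = x_t @ W_k                      # new key...
    v = x_t @ W_v                      # ...and new value
    # Append to the cache instead of re-projecting every previous token.
    cache_k = np.vstack([cache_k, k])  # shape (t, d_head)
    cache_v = np.vstack([cache_v, v])
    # Attention reads the ENTIRE cache: this read is the memory-bound part.
    scores = cache_k @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ cache_v            # shape (d_head,)
    return out, cache_k, cache_v
```

Note that each step's new compute is a few matrix-vector products, while the memory traffic (the full `cache_k`/`cache_v` read) grows with every token generated.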
The Shape of the Cache
For a single attention layer with head dimension $d_{\text{head}}$ and $n_{\text{heads}}$ attention heads, the KV cache stores:

$$2 \times n \times n_{\text{heads}} \times d_{\text{head}} \ \text{elements}$$

where $n$ is the current sequence length. The factor of 2 accounts for both the $K$ and $V$ tensors.
For the full model with $L$ layers:

$$2 \times L \times n \times n_{\text{heads}} \times d_{\text{head}}$$

Notice that $n_{\text{heads}} \times d_{\text{head}} = d_{\text{model}}$ (the hidden dimension), so this simplifies to:

$$2 \times L \times n \times d_{\text{model}} \ \text{elements}$$
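As a sanity check, the simplified formula translates directly into code. This counts bytes for a standard multi-head model in FP16 and ignores GQA (which reduces the count, as discussed below); the function name and example figures are my own:

```python
def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_elem=2):
    """KV cache size in bytes for a standard multi-head attention model.

    The leading 2 accounts for keys and values; FP16 is 2 bytes per element.
    """
    return 2 * n_layers * seq_len * d_model * bytes_per_elem
```

For a hypothetical 80-layer model with $d_{\text{model}} = 8192$, `kv_cache_bytes(80, 1, 8192)` works out to about 2.5 MB of cache per token of context.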
This grows linearly with sequence length and linearly with the hidden dimension (which itself grows as models scale up). For the largest models, with a hundred or so layers and hidden dimensions in the thousands, this adds up fast.
Prefill vs. Decode: Two Different Workloads
LLM inference isn’t one workload. It’s two fundamentally different ones sharing the same GPU.
Prefill Phase
When you submit a prompt, the model processes all input tokens in parallel. This is the prefill phase. The GPU gets to do what it’s best at: crunching through large matrix multiplications with high arithmetic intensity.
During prefill:
- All tokens are processed simultaneously
- GPU compute units are highly utilized
- The operation is compute-bound (limited by FLOPS)
- Time scales roughly linearly with prompt length
- The KV cache for the entire prompt is populated in one pass
Decode Phase
Once prefill is done, the model generates output tokens one at a time. Each step produces exactly one token, and that token’s key-value pair gets appended to the cache. This is the decode phase, and it behaves completely differently.
During decode:
- Only one token is processed per step
- The model must read the entire KV cache from GPU memory to compute attention
- The actual computation per step is tiny relative to the data movement
- The operation is memory-bandwidth-bound (limited by how fast you can shuttle bytes between HBM and the compute cores)
- Time scales linearly with the number of output tokens
The arithmetic intensity during decode is extremely low. For a single token, the model performs a handful of matrix-vector multiplications and one attention pass over the full cache. The ratio of compute to memory access drops dramatically compared to prefill.
This is why you’ll hear ML engineers talk about “roofline models” for LLM inference. The decode phase sits firmly on the memory-bandwidth roof, not the compute roof.
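That roofline claim can be made concrete with a rough estimate. The function below assumes roughly 2 FLOPs per parameter per token and that every weight plus the entire cache crosses the HBM bus once per decode step; both are standard back-of-envelope approximations, not measurements:

```python
def decode_arithmetic_intensity(n_params, kv_cache_bytes, bytes_per_param=2):
    """Rough FLOPs-per-byte for one batch-1 decode step (sketch, not exact).

    Assumptions: ~2 FLOPs per parameter per generated token, and every
    weight plus the whole KV cache is streamed from HBM once per step.
    """
    flops = 2 * n_params
    bytes_moved = n_params * bytes_per_param + kv_cache_bytes
    return flops / bytes_moved
```

For 70B parameters in FP16 with a 10 GB cache this gives roughly 1 FLOP per byte. Against an A100's roughly 312 TFLOPS of FP16 compute and about 2 TB/s of HBM bandwidth (a ridge point near 150 FLOP/byte), decode sits deep in the bandwidth-bound regime.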
Why This Matters Practically
This split has real consequences:
- Batching helps decode more than prefill. Serving multiple requests simultaneously amortizes the memory bandwidth cost of reading model weights, pushing decode closer to compute-bound territory.
- Long prompts are cheap per-token but expensive in total. Prefill is parallel, so a 4K prompt doesn’t take 4x longer than a 1K prompt. But decode over a 4K context means reading a much larger KV cache every step.
- Time-to-first-token (TTFT) vs. time-per-output-token (TPOT) are governed by different bottlenecks. TTFT is mostly prefill (compute), TPOT is mostly decode (memory).
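A crude latency model captures this split. Both functions are back-of-envelope sketches under the same approximations as above (about 2 FLOPs per parameter per token; decode streams all weights plus the cache once per step); the names and figures are illustrative assumptions, not measurements:

```python
def ttft_seconds(prompt_tokens, n_params, peak_flops):
    # Prefill is compute-bound: ~2 FLOPs per parameter per prompt token.
    return 2 * n_params * prompt_tokens / peak_flops

def tpot_seconds(n_params, kv_cache_bytes, hbm_bandwidth, bytes_per_param=2):
    # Decode is bandwidth-bound: stream all weights plus the cache each step.
    return (n_params * bytes_per_param + kv_cache_bytes) / hbm_bandwidth
```

With A100-like figures (312 TFLOPS FP16, 2 TB/s HBM), a 1,000-token prompt into a 70B-parameter model gives a TTFT around 0.45 s, and decoding against a 10 GB cache gives about 75 ms per output token.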
Back-of-Napkin: How Much VRAM Does the KV Cache Eat?
Let’s make this concrete with Llama 3 70B.
Model Specs
| Parameter | Value |
|---|---|
| Layers ($L$) | 80 |
| Hidden dimension ($d_{\text{model}}$) | 8192 |
| KV heads (GQA) | 8 |
| Head dimension ($d_{\text{head}}$) | 128 |
| Precision | FP16 (2 bytes per element) |
Llama 3 70B uses Grouped Query Attention (GQA), which means it uses fewer KV heads than query heads. Instead of 64 KV heads (matching the query heads), it uses only 8. This is itself a KV cache optimization: fewer KV heads means a proportionally smaller cache.
The Math
For GQA, the effective KV dimension per layer is:

$$n_{\text{kv}} \times d_{\text{head}} = 8 \times 128 = 1024$$
Total KV cache size:

$$2 \times L \times n \times (n_{\text{kv}} \times d_{\text{head}}) \times 2\ \text{bytes} = 2 \times 80 \times n \times 1024 \times 2\ \text{bytes} = 327{,}680 \times n\ \text{bytes}$$

That's 320 KB of cache per token of context.
At Different Context Lengths
| Context length ($n$) | KV cache size | Perspective |
|---|---|---|
| 2,048 | 0.625 GB | Comfortable on any modern GPU |
| 8,192 | 2.5 GB | Significant chunk of a 24 GB card |
| 32,768 | 10 GB | Nearly half of an A100 40 GB |
| 131,072 (128K) | 40 GB | Exceeds an entire A100 40 GB |
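These numbers can be reproduced in a few lines. The defaults mirror the Llama 3 70B specs in the table above; this is a sketch of the arithmetic, not a profiler:

```python
def gqa_kv_cache_gib(seq_len, n_layers=80, n_kv_heads=8, d_head=128,
                     bytes_per_elem=2):
    """KV cache size in GiB for a GQA model, Llama 3 70B-style defaults."""
    per_token_bytes = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem
    return seq_len * per_token_bytes / 2**30

for n in (2_048, 8_192, 32_768, 131_072):
    print(f"{n:>7} tokens -> {gqa_kv_cache_gib(n):7.3f} GiB")
```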
And remember: this is just the KV cache. The model weights for Llama 3 70B in FP16 take roughly 140 GB. So on a server with 8x A100 80 GB (640 GB total HBM), the model weights take 140 GB, and a single 128K context request’s KV cache takes 40 GB. With just a handful of concurrent users at long contexts, you’ve exhausted your entire memory budget.
Without GQA, It’s Worse
If Llama 3 70B used standard Multi-Head Attention (64 KV heads instead of 8), the KV cache would be 8x larger: 320 GB at 128K context. That’s why GQA exists. It’s not an optional optimization; it’s a necessity for long-context inference.
Why This Creates an Optimization Ecosystem
The KV cache problem has spawned an entire field of inference optimizations:
Memory-Level Optimizations
- Quantized KV cache: Store keys and values in INT8 or INT4 instead of FP16, cutting memory by 2-4x with minimal quality loss.
- Paged attention (vLLM): Manage KV cache memory like virtual memory pages instead of contiguous blocks, eliminating fragmentation.
- KV cache offloading: Spill inactive KV cache entries to CPU RAM or NVMe, fetching them back when needed.
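As an illustration of the first idea, here is a minimal sketch of per-token symmetric INT8 quantization of a KV tensor. It's a simplification (production serving stacks use more careful schemes, often per-channel or per-group), and the function names are my own:

```python
import numpy as np

def quantize_kv_int8(kv):
    """Per-token symmetric INT8 quantization of a KV tensor (sketch).

    kv: float array of shape (seq_len, d). Returns int8 codes plus a
    per-row scale; storage drops from 2 bytes/element (FP16) to ~1.
    """
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)   # avoid divide-by-zero rows
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    """Recover an approximate float tensor before the attention read."""
    return q.astype(np.float32) * scale
```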
Architectural Optimizations
- Grouped Query Attention (GQA): Share KV heads across multiple query heads. Llama 3, Gemma, and Mistral all use this.
- Multi-Query Attention (MQA): The extreme case of GQA where all query heads share a single KV head. Used in PaLM and Falcon.
- Sliding window attention: Only cache the most recent tokens instead of the full history. Mistral uses this with a window of $W = 4096$ tokens.
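The sliding-window idea can be sketched with a fixed-size buffer. `SlidingWindowKVCache` is a hypothetical name, and real implementations use ring buffers over preallocated GPU tensors rather than Python deques:

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps K/V entries only for the last `window` tokens (sketch).

    Memory is O(window) instead of O(seq_len); the trade-off is that
    tokens older than the window can no longer be attended to directly.
    """
    def __init__(self, window):
        self.k = deque(maxlen=window)
        self.v = deque(maxlen=window)

    def append(self, k_t, v_t):
        # deque(maxlen=...) evicts the oldest entry automatically.
        self.k.append(k_t)
        self.v.append(v_t)

    def __len__(self):
        return len(self.k)
```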
Compression Techniques
- KV cache eviction: Drop cache entries for tokens that are unlikely to be attended to again.
- Token merging: Combine cache entries for similar tokens to reduce the effective sequence length.
- Sparse attention patterns: Only attend to a subset of cached tokens, reducing both memory reads and cache size.
Each of these techniques trades off some combination of quality, latency, and complexity to manage the fundamental tension: attention needs to see everything, but memory is finite.
The Takeaway
The KV cache is one of those concepts that sits at the intersection of theory and systems engineering. Mathematically, it’s straightforward: store the key and value projections so you don’t recompute them. Practically, it’s the single largest constraint on how many users you can serve, how long their contexts can be, and how much your inference infrastructure costs.
If you’re building with LLMs, thinking about serving costs, or trying to understand why 128K context windows are so expensive, the KV cache is where to start. Every major optimization in modern LLM serving, from GQA to paged attention to quantized caches, exists because of this one data structure.
The GPU has plenty of compute. It’s waiting on memory.