
KV Caching Explained: Why LLM Inference Is Memory-Bound

AI & Technology · Siddhant Minocha · 8 min read

Most people assume that running a large language model is about doing math. Multiply matrices, apply activations, repeat. And during the first pass through your prompt, that’s roughly true. But once the model starts generating tokens one at a time, the bottleneck flips entirely.

LLM inference isn’t compute-bound. It’s memory-bound.

The culprit? The KV cache, a data structure that grows with every token the model has ever seen in the conversation. Understanding it is the key to understanding why inference is expensive, why long contexts cost so much VRAM, and why an entire ecosystem of optimization techniques exists to tame it.

What the KV Cache Actually Stores

Recall the core attention equation from the transformer architecture:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Every time the model processes a token, it projects that token’s embedding into three vectors: a Query ($Q$), a Key ($K$), and a Value ($V$). The query asks “what should I attend to?”, the keys answer “here’s what each previous token offers,” and the values carry the actual information.

Here’s the critical insight: when generating token $t_{n+1}$, the model only needs the query for the new token. But it needs the keys and values for every previous token $t_1$ through $t_n$. Without caching, the model would have to re-project every previous token through the $K$ and $V$ weight matrices on every single generation step. That’s an enormous amount of redundant computation.

The KV cache solves this by storing the key and value projections for all previous tokens. Each new token’s $K$ and $V$ vectors are appended to the cache, and the full cache is read during the attention computation.
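To make the mechanics concrete, here’s a minimal single-head decode step in NumPy. This is an illustrative sketch, not any real framework’s implementation: the function name, toy dimensions, and dict-of-lists cache are all ours.

```python
import numpy as np

d_model, d_k = 128, 64  # toy sizes; a single attention head, no batching

def decode_step(x_new, W_q, W_k, W_v, cache):
    """One generation step with KV caching.
    Only the new token is projected; K/V for earlier tokens come from the cache."""
    q = x_new @ W_q                      # (d_k,) query for the new token only
    cache["K"].append(x_new @ W_k)       # append new key: old tokens are never re-projected
    cache["V"].append(x_new @ W_v)       # append new value
    K = np.stack(cache["K"])             # (s, d_k): the FULL cache is read every step
    V = np.stack(cache["V"])
    scores = K @ q / np.sqrt(d_k)        # (s,) logits of the new query against all cached keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax over every cached position
    return w @ V                         # (d_k,) attention output for the new token

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))
cache = {"K": [], "V": []}               # grows by one K and one V per token
for _ in range(5):
    out = decode_step(rng.standard_normal(d_model), W_q, W_k, W_v, cache)
print(len(cache["K"]))                   # one cached key per token seen so far
```

Note that the per-step compute is a few matrix-vector products, but the cache read (`np.stack` over all positions) touches every token ever seen, which is exactly the memory-bound behavior discussed below.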

The Shape of the Cache

For a single attention layer with head dimension $d_k$ and $n_{\text{heads}}$ attention heads, the KV cache stores:

$$\text{Cache per layer} = 2 \times n_{\text{heads}} \times d_k \times s$$

where $s$ is the current sequence length. The factor of 2 accounts for both the $K$ and $V$ tensors.

For the full model with $L$ layers:

$$\text{Total KV cache} = 2 \times L \times n_{\text{heads}} \times d_k \times s$$

Notice that $n_{\text{heads}} \times d_k = d_{\text{model}}$ (the hidden dimension), so this simplifies to:

$$\text{Total KV cache} = 2 \times L \times d_{\text{model}} \times s$$

This count is in elements; multiply by the bytes per element of your precision to get memory. It grows linearly with sequence length $s$ and linearly with model dimension (which itself scales roughly with the square root of parameter count in practice). For the largest models, with dozens of layers and hidden dimensions in the thousands, this adds up fast.
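The formula is easy to turn into a quick calculator. The helper below is our own (the function name and example model are hypothetical), but it implements the element count above exactly:

```python
def kv_cache_elements(num_layers: int, d_model: int, seq_len: int) -> int:
    """Total KV cache elements: 2 (K and V) x layers x hidden dim x positions."""
    return 2 * num_layers * d_model * seq_len

# Example: a hypothetical 32-layer model with d_model = 4096 at 8K context,
# stored in FP16 (2 bytes per element):
elems = kv_cache_elements(32, 4096, 8192)
print(elems * 2 / 2**30)  # bytes -> GiB; works out to exactly 4.0 GiB
```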

Prefill vs. Decode: Two Different Workloads

LLM inference isn’t one workload. It’s two fundamentally different ones sharing the same GPU.

Prefill Phase

When you submit a prompt, the model processes all input tokens in parallel. This is the prefill phase. The GPU gets to do what it’s best at: crunching through large matrix multiplications with high arithmetic intensity.

During prefill:

  • All tokens are processed simultaneously
  • GPU compute units are highly utilized
  • The operation is compute-bound (limited by FLOPS)
  • Time scales roughly linearly with prompt length
  • The KV cache for the entire prompt is populated in one pass

Decode Phase

Once prefill is done, the model generates output tokens one at a time. Each step produces exactly one token, and that token’s key-value pair gets appended to the cache. This is the decode phase, and it behaves completely differently.

During decode:

  • Only one token is processed per step
  • The model must read the entire KV cache from GPU memory to compute attention
  • The actual computation per step is tiny relative to the data movement
  • The operation is memory-bandwidth-bound (limited by how fast you can shuttle bytes between HBM and the compute cores)
  • Time scales linearly with the number of output tokens

The arithmetic intensity during decode is extremely low. For a single token, the model performs a handful of matrix-vector multiplications and one attention pass over the full cache. The ratio of compute to memory access drops dramatically compared to prefill.

This is why you’ll hear ML engineers talk about “roofline models” for LLM inference. The decode phase sits firmly on the memory-bandwidth roof, not the compute roof.
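A rough back-of-envelope makes the point. Counting only the dominant terms of one decode step’s attention (the $QK^T$ scores and the weighted sum over $V$), the FLOPs-per-byte works out to about 1, regardless of context length. The hardware figures in the comment are published A100 specs (~312 TFLOPS FP16 Tensor Core, ~2 TB/s HBM bandwidth); the helper itself is our own simplification:

```python
def decode_attention_intensity(seq_len: int, d: int, bytes_per_elem: int = 2) -> float:
    """Rough FLOPs-per-byte for one decode step's attention over the KV cache.
    Counts only the dominant terms: QK^T scores plus the weighted sum over V."""
    flops = 2 * seq_len * d + 2 * seq_len * d        # scores + weighted value sum
    bytes_moved = 2 * seq_len * d * bytes_per_elem   # read K and V from HBM
    return flops / bytes_moved

# Intensity is ~1 FLOP/byte no matter how long the context gets -- far below
# the ~150 FLOPs/byte an A100 would need (312 TFLOPS / ~2 TB/s) to be
# compute-bound. Decode sits on the memory-bandwidth roof.
print(decode_attention_intensity(4096, 1024))   # → 1.0
```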

Why This Matters Practically

This split has real consequences:

  • Batching helps decode more than prefill. Serving multiple requests simultaneously amortizes the memory bandwidth cost of reading model weights, pushing decode closer to compute-bound territory.
  • Long prompts are cheap per-token but expensive in total. Prefill is parallel, so a 4K prompt doesn’t take 4x longer than a 1K prompt. But decode over a 4K context means reading a much larger KV cache every step.
  • Time-to-first-token (TTFT) vs. time-per-output-token (TPOT) are governed by different bottlenecks. TTFT is mostly prefill (compute), TPOT is mostly decode (memory).

Back-of-Napkin: How Much VRAM Does the KV Cache Eat?

Let’s make this concrete with Llama 3 70B.

Model Specs

| Parameter | Value |
| --- | --- |
| Layers ($L$) | 80 |
| Hidden dimension ($d_{\text{model}}$) | 8192 |
| KV heads (GQA) | 8 |
| Head dimension ($d_k$) | 128 |
| Precision | FP16 (2 bytes per element) |

Llama 3 70B uses Grouped Query Attention (GQA), which means it uses fewer KV heads than query heads. Instead of 64 KV heads (matching the query heads), it uses only 8. This is itself a KV cache optimization: fewer KV heads means a proportionally smaller cache.

The Math

For GQA, the effective KV dimension per layer is:

$$d_{\text{kv}} = n_{\text{kv\_heads}} \times d_k = 8 \times 128 = 1024$$

Total KV cache size:

$$\begin{aligned}
\text{KV cache (bytes)} &= 2 \times L \times d_{\text{kv}} \times s \times \text{bytes per element} \\
&= 2 \times 80 \times 1024 \times s \times 2 \\
&= 327{,}680 \times s \text{ bytes} \\
&\approx 0.000305 \times s \text{ GB}
\end{aligned}$$
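These numbers are easy to sanity-check in a few lines of Python. The helper name is ours; the inputs are the Llama 3 70B specs from the table above:

```python
def kv_cache_bytes(num_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes: 2 (K and V) x layers x KV width x positions."""
    return 2 * num_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 3 70B: 80 layers, 8 KV heads x 128 head dim, FP16
for s in (2_048, 8_192, 32_768, 131_072):
    gib = kv_cache_bytes(80, 8, 128, s) / 2**30
    print(f"{s:>7} tokens -> {gib:6.3f} GiB")
```

Swapping `n_kv_heads=8` for `64` (plain multi-head attention) multiplies every figure by 8, which is the comparison made below.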

At Different Context Lengths

| Context length ($s$) | KV cache size | Perspective |
| --- | --- | --- |
| 2,048 | 0.625 GB | Comfortable on any modern GPU |
| 8,192 | 2.5 GB | Significant chunk of a 24 GB card |
| 32,768 | 10 GB | Nearly half of an A100 40 GB |
| 131,072 (128K) | 40 GB | Exceeds an entire A100 40 GB |

And remember: this is just the KV cache. The model weights for Llama 3 70B in FP16 take roughly 140 GB. So on a server with 8x A100 80 GB (640 GB total HBM), the model weights take 140 GB, and a single 128K context request’s KV cache takes 40 GB. With just a handful of concurrent users at long contexts, you’ve exhausted your entire memory budget.

Without GQA, It’s Worse

If Llama 3 70B used standard Multi-Head Attention (64 KV heads instead of 8), the KV cache would be 8x larger: 320 GB at 128K context. That’s why GQA exists. It’s not an optional optimization; it’s a necessity for long-context inference.

Why This Creates an Optimization Ecosystem

The KV cache problem has spawned an entire field of inference optimizations:

Memory-Level Optimizations

  • Quantized KV cache: Store keys and values in INT8 or INT4 instead of FP16, cutting memory by 2-4x with minimal quality loss.
  • Paged attention (vLLM): Manage KV cache memory like virtual memory pages instead of contiguous blocks, eliminating fragmentation.
  • KV cache offloading: Spill inactive KV cache entries to CPU RAM or NVMe, fetching them back when needed.
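To give a feel for the first of these, here is a toy sketch of symmetric per-row INT8 quantization of cached keys. Real serving stacks use more sophisticated schemes (per-channel scales, calibration, fused kernels); everything here, including the function names, is an illustrative assumption:

```python
import numpy as np

def quantize_kv(x: np.ndarray):
    """Symmetric per-row INT8 quantization: one scale per cached vector."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)      # guard against all-zero rows
    q = np.round(x / scale).astype(np.int8)
    return q, scale.astype(np.float16)            # store compact scales alongside

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
k = rng.standard_normal((16, 128)).astype(np.float32)   # 16 cached key vectors
q, scale = quantize_kv(k)
k_hat = dequantize_kv(q, scale)
# INT8 halves FP16 storage (ignoring the small per-row scales),
# and the reconstruction error stays tiny relative to the values.
print(q.nbytes / k.astype(np.float16).nbytes)
```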

Architectural Optimizations

  • Grouped Query Attention (GQA): Share KV heads across multiple query heads. Llama 3, Gemma, and Mistral all use this.
  • Multi-Query Attention (MQA): The extreme case of GQA where all query heads share a single KV head. Used in PaLM and Falcon.
  • Sliding window attention: Only cache the most recent $w$ tokens instead of the full history. Mistral uses this with $w = 4096$.
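The sliding-window idea can be sketched in a few lines. A `deque` with `maxlen` stands in for the fixed-size rolling buffer real implementations use; the class name is ours and nothing here reflects Mistral’s actual code:

```python
from collections import deque

class SlidingWindowKVCache:
    """Keep K/V only for the most recent `window` tokens: a toy sketch of
    sliding window attention. Memory is bounded by the window, not by the
    total sequence length."""
    def __init__(self, window: int):
        self.keys = deque(maxlen=window)    # oldest entries fall off automatically
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = SlidingWindowKVCache(window=4096)
for t in range(10_000):                     # feed 10K tokens...
    cache.append(f"k{t}", f"v{t}")
print(len(cache))                           # ...but only the last 4096 are cached
```

The trade-off is that attention can no longer see tokens older than the window, which is exactly the quality-versus-memory tension the section above describes.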

Compression Techniques

  • KV cache eviction: Drop cache entries for tokens that are unlikely to be attended to again.
  • Token merging: Combine cache entries for similar tokens to reduce the effective sequence length.
  • Sparse attention patterns: Only attend to a subset of cached tokens, reducing both memory reads and cache size.

Each of these techniques trades off some combination of quality, latency, and complexity to manage the fundamental tension: attention needs to see everything, but memory is finite.

The Takeaway

The KV cache is one of those concepts that sits at the intersection of theory and systems engineering. Mathematically, it’s straightforward: store the key and value projections so you don’t recompute them. Practically, it’s the single largest constraint on how many users you can serve, how long their contexts can be, and how much your inference infrastructure costs.

If you’re building with LLMs, thinking about serving costs, or trying to understand why 128K context windows are so expensive, the KV cache is where to start. Every major optimization in modern LLM serving, from GQA to paged attention to quantized caches, exists because of this one data structure.

The GPU has plenty of compute. It’s waiting on memory.
