The Math Behind LLMs
Every time you prompt ChatGPT, Claude, or Gemini, billions of parameters execute the same handful of operations. Strip away the hype and you find something surprisingly compact: a few equations, repeated thousands of times.
This post walks through the core math that makes large language models work — not the full derivation, but enough that the next time someone says “attention mechanism,” you know exactly what’s happening under the hood.
The One Equation That Runs the World
At the heart of every transformer is scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
That’s it. This single line of linear algebra is the engine behind every LLM conversation, every code completion, every generated image caption. Let’s unpack it piece by piece.
Queries, Keys, and Values
The input to a transformer layer is a sequence of token embeddings — vectors that represent words (or sub-words). Each token gets projected into three vectors:
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information do I carry?”
These projections are just learned matrix multiplications. If your input embedding is $x$, then:

$$Q = xW_Q, \qquad K = xW_K, \qquad V = xW_V$$

where $W_Q$, $W_K$, $W_V$ are weight matrices the model learns during training.
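In code, the three projections are just matrix multiplies. A minimal numpy sketch with toy dimensions — the weight matrices here are random stand-ins for what training would actually learn:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 4, 8, 8          # toy sizes: 4 tokens, embedding dim 8
x = rng.normal(size=(n, d_model))  # token embeddings

# Learned projection matrices (random stand-ins here)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V
print(Q.shape, K.shape, V.shape)  # each token now has a query, key, and value
```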
The Dot Product: Measuring Relevance
The term $QK^\top$ computes a dot product between every query and every key. The dot product measures how “aligned” two vectors are — a higher value means the query and key are more relevant to each other.
For a sequence of length $n$, this produces an $n \times n$ matrix of attention scores. Each row tells one token how much to attend to every other token in the sequence.
Why Scale by $\sqrt{d_k}$?
Without scaling, the dot products grow large as the dimension $d_k$ increases: a dot product of two random vectors with independent, unit-variance components has variance $d_k$. Large values push softmax into regions where the gradients are vanishingly small — the model can’t learn.

Dividing by $\sqrt{d_k}$ keeps the variance of the dot products roughly at 1, regardless of dimension. It’s a simple normalization trick that turns out to be critical for stable training.
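You can see the effect numerically with a quick numpy experiment, sampling random vectors with unit-variance components:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# Dot products of random unit-variance vectors: variance grows with d_k
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
raw = (q * k).sum(axis=1)
scaled = raw / np.sqrt(d_k)

print(raw.var())     # close to d_k = 512
print(scaled.var())  # close to 1
```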
Softmax: Turning Scores into Weights
The softmax function converts raw scores into a probability distribution:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
After softmax, each row of the attention matrix sums to 1. The model is now making a weighted decision: “Given what I’m looking for (query), how much should I pull from each position (value)?”
The Final Multiply: Weighted Information
Multiplying the softmax weights by V produces the output — a weighted combination of value vectors. Tokens that scored high in the attention matrix contribute more to the output. Tokens that scored low are effectively ignored.
This is the mechanism that lets a model, when generating the word after “The capital of France is,” attend heavily to “France” even if it appeared 500 tokens earlier.
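Putting the pieces together, the whole equation fits in a few lines of numpy — a toy single-sequence sketch, not an optimized implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) relevance matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted combination of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 16): one output vector per token
```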
Multi-Head Attention: Parallel Perspectives
A single attention head can only capture one type of relationship. Multi-head attention runs the same operation in parallel with different learned projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O, \qquad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Each head learns a different “lens” — one might track syntactic relationships, another semantic similarity, another positional patterns. The outputs are concatenated and projected back to the model dimension.
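A toy version of the same idea, with random weights standing in for the learned per-head projections:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads, d_model, rng):
    d_head = d_model // heads
    outputs = []
    for _ in range(heads):
        # Each head gets its own projections (random stand-ins here)
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = x @ W_Q, x @ W_K, x @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(weights @ V)
    W_O = rng.normal(size=(d_model, d_model))
    # Concatenate the heads and project back to the model dimension
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 32))
print(multi_head_attention(x, heads=4, d_model=32, rng=rng).shape)  # (6, 32)
```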
GPT-3, for example, used 96 attention heads across 96 layers, and GPT-4-class models are believed to be at least that large. That’s thousands of different attention patterns, all learned from data.
Beyond Attention: The Full Transformer Block
Attention is the star, but a transformer block has two more critical pieces:
Feed-Forward Network
After attention, each token passes through a two-layer neural network:

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$
This is where the model stores and retrieves factual knowledge. Research suggests that the feed-forward layers act as key-value memories — specific neurons activate for specific concepts.
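A numpy sketch using the original ReLU formulation and the typical 4x hidden expansion, with random weights as stand-ins:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Expand to a wider hidden layer, apply ReLU, project back down
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                    # hidden layer is typically 4x wider
x = rng.normal(size=(3, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)  # (3, 16): same shape in as out
```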
Layer Normalization and Residual Connections
Every sub-layer (attention, FFN) is wrapped with:

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

The residual connection (the $x +$ term) lets gradients flow straight through the network — without it, a 96-layer model simply couldn’t train. Layer normalization keeps activations in a stable range.
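A minimal post-norm version in numpy, with the learnable scale and bias omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # learnable gain/bias omitted

def sublayer_wrap(x, sublayer):
    # Residual connection plus normalization, as in the original transformer
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = sublayer_wrap(x, lambda h: h * 0.5)  # trivial stand-in sublayer
print(out.mean(axis=-1))  # each row normalized to ~0 mean, unit variance
```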
Training: Predicting the Next Token
The entire training objective is deceptively simple. Given a sequence of tokens, predict the next one:

$$\mathcal{L} = -\sum_{t} \log P(x_t \mid x_1, \ldots, x_{t-1})$$
This is cross-entropy loss over the vocabulary. The model sees “The cat sat on the” and tries to maximize the probability of “mat.” Do this over trillions of tokens and the model learns grammar, facts, reasoning patterns, and code — all from this single objective.
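As a toy illustration of the loss, with made-up logits over a five-word vocabulary:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()        # shift for numerical stability
    return z - np.log(np.exp(z).sum())

vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([1.0, 0.5, 0.2, 0.1, 3.0])  # made-up model scores
target = vocab.index("mat")

# Cross-entropy: negative log-probability of the correct next token
loss = -log_softmax(logits)[target]
print(loss)  # low, because the model already favors "mat"
```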
Why Next-Token Prediction Works So Well
Predicting the next token in natural language is an incredibly rich learning signal. To predict well, the model must implicitly learn:
- Syntax: which words can follow which
- Semantics: what concepts are being discussed
- World knowledge: facts required for plausible continuations
- Reasoning: logical steps needed for coherent text
The compression hypothesis suggests that a sufficiently good predictor of natural language must develop a world model — and the evidence from modern LLMs supports this.
The Numbers That Make It Real
A 70-billion-parameter model at fp16 precision occupies ~140 GB of memory (two bytes per parameter). Each forward pass costs roughly two floating-point operations per parameter per token, which works out to about 140 billion FLOPs per token. At training time, you multiply that by roughly 3x for backpropagation.
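These figures follow from common rules of thumb — two bytes per fp16 parameter, and roughly 2 FLOPs per parameter per token for a forward pass:

```python
params = 70e9          # 70B parameters
bytes_per_param = 2    # fp16

memory_gb = params * bytes_per_param / 1e9
fwd_flops_per_token = 2 * params                  # ~2 FLOPs per parameter
train_flops_per_token = 3 * fwd_flops_per_token   # forward + backward

print(f"memory:   {memory_gb:.0f} GB")            # 140 GB
print(f"forward:  {fwd_flops_per_token:.1e} FLOPs/token")
print(f"training: {train_flops_per_token:.1e} FLOPs/token")
```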
The leading models train on 10-15 trillion tokens. At the scale of GPT-4, training costs are estimated at $50-100M in compute alone.
All of this — the investment, the infrastructure, the engineering — exists to make one equation work at scale:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
What This Means for Builders
Understanding the math matters because it shapes what LLMs can and can’t do:
- Context windows are quadratic: Attention is $O(n^2)$ in sequence length. That’s why 128K context costs so much more than 4K.
- Knowledge is parametric: Facts live in the FFN weights. The model can’t “look up” new information unless you put it in the prompt.
- Hallucinations are structural: The model produces the most likely next token, not the most truthful one. When the training data is sparse on a topic, the model interpolates — and sometimes gets it wrong.
- Fine-tuning works because of residuals: Residual connections mean you can adjust a pre-trained model without breaking everything. LoRA exploits this by adding small rank-decomposition matrices to specific layers.
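The quadratic point above can be seen directly: attention builds an $n \times n$ score matrix, so the work grows with the square of the context length.

```python
# Attention computes an n x n score matrix per head, per layer
for n in (4_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {n * n:.2e} attention scores")

ratio = 128_000**2 / 4_000**2
print(ratio)  # 1024.0: 128K context does ~1024x the attention work of 4K
```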
The Elegance of Simplicity
What strikes me most about transformer math is how little of it there is. The entire architecture can be written in about 300 lines of PyTorch. The attention equation fits on a sticky note.
The complexity isn’t in the math — it’s in the scale. The same operation, repeated across 96 layers, 96 heads, and trillions of training tokens, produces systems that write code, explain science, and carry conversations.
One equation. Repeated enough times. That’s what keeps the world moving.