The Math Behind LLMs
Every time you prompt ChatGPT, Claude, or Gemini, billions of parameters execute the same handful of operations. Strip away the hype and you find something surprisingly compact: a few equations, repeated thousands of times.
This post walks through the core math that makes large language models work — not the full derivation, but enough that the next time someone says “attention mechanism,” you know exactly what’s happening under the hood.
The One Equation That Runs the World
At the heart of every transformer is scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
That’s it. This single line of linear algebra is the engine behind every LLM conversation, every code completion, every generated image caption. Let’s unpack it piece by piece.
Queries, Keys, and Values
The input to a transformer layer is a sequence of token embeddings — vectors that represent words (or sub-words). Each token gets projected into three vectors:
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information do I carry?”
These projections are just learned matrix multiplications. If your input embedding is $x$, then:

$$Q = xW_Q, \qquad K = xW_K, \qquad V = xW_V$$

where $W_Q$, $W_K$, $W_V$ are weight matrices the model learns during training.
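In code, the three projections are just matrix multiplies. A minimal numpy sketch with toy dimensions — the weight matrices here are random stand-ins for what training would actually learn:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 4, 8, 8          # toy sizes: 4 tokens, embedding dim 8
x = rng.normal(size=(n, d_model))  # token embeddings

# Learned projection matrices (random stand-ins here)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V
print(Q.shape, K.shape, V.shape)  # each token now has a query, key, and value
```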
The Dot Product: Measuring Relevance
The term $QK^\top$ computes a dot product between every query and every key. The dot product measures how “aligned” two vectors are — a higher value means the query and key are more relevant to each other.
For a sequence of length $n$, this produces an $n \times n$ matrix of attention scores. Each row tells one token how much to attend to every other token in the sequence.
Why Scale by $\sqrt{d_k}$?
Without scaling, the dot products grow large as the dimension $d_k$ increases: a dot product of two random vectors with independent, unit-variance components has variance $d_k$. Large values push softmax into regions where the gradients are vanishingly small — the model can’t learn.

Dividing by $\sqrt{d_k}$ keeps the variance of the dot products roughly at 1, regardless of dimension. It’s a simple normalization trick that turns out to be critical for stable training.
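You can see the effect numerically with a quick numpy experiment, sampling random vectors with unit-variance components:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# Dot products of random unit-variance vectors: variance grows with d_k
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
raw = (q * k).sum(axis=1)
scaled = raw / np.sqrt(d_k)

print(raw.var())     # close to d_k = 512
print(scaled.var())  # close to 1
```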
Softmax: Turning Scores into Weights
The softmax function converts raw scores into a probability distribution:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
After softmax, each row of the attention matrix sums to 1. The model is now making a weighted decision: “Given what I’m looking for (query), how much should I pull from each position (value)?”
The Final Multiply: Weighted Information
Multiplying the softmax weights by V produces the output — a weighted combination of value vectors. Tokens that scored high in the attention matrix contribute more to the output. Tokens that scored low are effectively ignored.
This is the mechanism that lets a model, when generating the word after “The capital of France is,” attend heavily to “France” even if it appeared 500 tokens earlier.
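Putting the pieces together, the whole equation fits in a few lines of numpy — a toy single-sequence sketch, not an optimized implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) relevance matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted combination of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 16): one output vector per token
```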
Multi-Head Attention: Parallel Perspectives
A single attention head can only capture one type of relationship. Multi-head attention runs the same operation in parallel with different learned projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O, \qquad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Each head learns a different “lens” — one might track syntactic relationships, another semantic similarity, another positional patterns. The outputs are concatenated and projected back to the model dimension.
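A toy version of the same idea, with random weights standing in for the learned per-head projections:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads, d_model, rng):
    d_head = d_model // heads
    outputs = []
    for _ in range(heads):
        # Each head gets its own projections (random stand-ins here)
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = x @ W_Q, x @ W_K, x @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(weights @ V)
    W_O = rng.normal(size=(d_model, d_model))
    # Concatenate the heads and project back to the model dimension
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 32))
print(multi_head_attention(x, heads=4, d_model=32, rng=rng).shape)  # (6, 32)
```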
GPT-3, for example, used 96 attention heads across 96 layers, and GPT-4-class models are believed to be at least that large. That’s thousands of different attention patterns, all learned from data.
Beyond Attention: The Full Transformer Block
Attention is the star, but a transformer block has two more critical pieces:
Feed-Forward Network
After attention, each token passes through a two-layer neural network:

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$
This is where the model stores and retrieves factual knowledge. Research suggests that the feed-forward layers act as key-value memories — specific neurons activate for specific concepts.
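A numpy sketch using the original ReLU formulation and the typical 4x hidden expansion, with random weights as stand-ins:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Expand to a wider hidden layer, apply ReLU, project back down
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                    # hidden layer is typically 4x wider
x = rng.normal(size=(3, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)  # (3, 16): same shape in as out
```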
Layer Normalization and Residual Connections
Every sub-layer (attention, FFN) is wrapped with:

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

The residual connection (the $x +$ term) lets gradients flow straight through the network — without it, a 96-layer model simply couldn’t train. Layer normalization keeps activations in a stable range.
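A minimal post-norm version in numpy, with the learnable scale and bias omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # learnable gain/bias omitted

def sublayer_wrap(x, sublayer):
    # Residual connection plus normalization, as in the original transformer
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = sublayer_wrap(x, lambda h: h * 0.5)  # trivial stand-in sublayer
print(out.mean(axis=-1))  # each row normalized to ~0 mean, unit variance
```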
Training: Predicting the Next Token
The entire training objective is deceptively simple. Given a sequence of tokens, predict the next one:

$$\mathcal{L} = -\sum_{t} \log P(x_t \mid x_1, \ldots, x_{t-1})$$
This is cross-entropy loss over the vocabulary. The model sees “The cat sat on the” and tries to maximize the probability of “mat.” Do this over trillions of tokens and the model learns grammar, facts, reasoning patterns, and code — all from this single objective.
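As a toy illustration of the loss, with made-up logits over a five-word vocabulary:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()        # shift for numerical stability
    return z - np.log(np.exp(z).sum())

vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([1.0, 0.5, 0.2, 0.1, 3.0])  # made-up model scores
target = vocab.index("mat")

# Cross-entropy: negative log-probability of the correct next token
loss = -log_softmax(logits)[target]
print(loss)  # low, because the model already favors "mat"
```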
Why Next-Token Prediction Works So Well
Predicting the next token in natural language is an incredibly rich learning signal. To predict well, the model must implicitly learn:
- Syntax: which words can follow which
- Semantics: what concepts are being discussed
- World knowledge: facts required for plausible continuations
- Reasoning: logical steps needed for coherent text
The compression hypothesis suggests that a sufficiently good predictor of natural language must develop a world model — and the evidence from modern LLMs supports this.
The Numbers That Make It Real
A 70-billion-parameter model at fp16 precision occupies ~140 GB of memory (two bytes per parameter). Each forward pass costs roughly two floating-point operations per parameter per token, which works out to about 140 billion FLOPs per token. At training time, you multiply that by roughly 3x for backpropagation.
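These figures follow from common rules of thumb — two bytes per fp16 parameter, and roughly 2 FLOPs per parameter per token for a forward pass:

```python
params = 70e9          # 70B parameters
bytes_per_param = 2    # fp16

memory_gb = params * bytes_per_param / 1e9
fwd_flops_per_token = 2 * params                  # ~2 FLOPs per parameter
train_flops_per_token = 3 * fwd_flops_per_token   # forward + backward

print(f"memory:   {memory_gb:.0f} GB")            # 140 GB
print(f"forward:  {fwd_flops_per_token:.1e} FLOPs/token")
print(f"training: {train_flops_per_token:.1e} FLOPs/token")
```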
The leading models train on 10-15 trillion tokens. At the scale of GPT-4, training costs are estimated at $50-100M in compute alone.
All of this — the investment, the infrastructure, the engineering — exists to make one equation work at scale:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
What This Means for Builders
Understanding the math matters because it shapes what LLMs can and can’t do:
- Context windows are quadratic: Attention is $O(n^2)$ in sequence length. That’s why 128K context costs so much more than 4K.
- Knowledge is parametric: Facts live in the FFN weights. The model can’t “look up” new information unless you put it in the prompt.
- Hallucinations are structural: The model produces the most likely next token, not the most truthful one. When the training data is sparse on a topic, the model interpolates — and sometimes gets it wrong.
- Fine-tuning works because of residuals: Residual connections mean you can adjust a pre-trained model without breaking everything. LoRA exploits this by adding small rank-decomposition matrices to specific layers.
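The quadratic point above can be seen directly: attention builds an $n \times n$ score matrix, so the work grows with the square of the context length.

```python
# Attention computes an n x n score matrix per head, per layer
for n in (4_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {n * n:.2e} attention scores")

ratio = 128_000**2 / 4_000**2
print(ratio)  # 1024.0: 128K context does ~1024x the attention work of 4K
```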
The Elegance of Simplicity
What strikes me most about transformer math is how little of it there is. The entire architecture can be written in about 300 lines of PyTorch. The attention equation fits on a sticky note.
The complexity isn’t in the math — it’s in the scale. The same operation, repeated across 96 layers, 96 heads, and trillions of training tokens, produces systems that write code, explain science, and carry conversations.
One equation. Repeated enough times. That’s what keeps the world moving.