Protein Language Models: Treating Amino Acid Sequences Like Sentences

AI & Technology · Siddhant Minocha · 10 min read

Every large language model you’ve used works on a deceptively simple idea: take a sequence of tokens, learn the patterns, and predict what comes next. GPT does this with English. BERT does it with masked words. These models have reshaped how we write, code, and search.

Now the same idea is reshaping biology. Not with words and sentences, but with amino acids and proteins.

Protein language models (pLMs) treat the 20-letter amino acid alphabet exactly like a natural language. They train on millions of protein sequences, learn the grammar of molecular biology, and can now predict protein structures, classify mutations, and even generate entirely new proteins that don’t exist in nature.

This post breaks down how that works, why the parallels between language and biology run surprisingly deep, and where the field is heading.

DNA Is a Language. Literally.

Let’s start at the source code: DNA.

Your genome is written in a four-letter alphabet: A (adenine), C (cytosine), T (thymine), and G (guanine). These nucleotides pair up in a double helix (A with T, C with G) and stretch across roughly 3 billion base pairs in the human genome.

But four letters alone don’t do much. The magic happens when you read them in groups of three. Each triplet of nucleotides is called a codon, and each codon maps to one of 20 amino acids. There are 64 possible codons for 20 amino acids (plus 3 stop signals), so the genetic code is redundant. Multiple codons can encode the same amino acid. Think of it like synonyms: different spellings, same meaning.

The process follows what Francis Crick called the Central Dogma of molecular biology (1958):

$$\text{DNA} \xrightarrow{\text{transcription}} \text{mRNA} \xrightarrow{\text{translation}} \text{Protein}$$

DNA is transcribed into messenger RNA. RNA is translated by ribosomes into a chain of amino acids. That chain is a protein.

If DNA is the language:

  • Nucleotides (A, C, T, G) are the alphabet
  • Codons (triplets like ATG, GCA) are the words
  • Genes (instructions for one protein) are the sentences
  • The genome (all 20,000+ protein-coding genes) is the book
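Because each codon maps to exactly one amino acid, translation is just a table lookup read three letters at a time. Here's a minimal sketch using a small excerpt of the standard genetic code (the real table has all 64 entries); note the synonymous codons for Lysine and Alanine, the redundancy described above:

```python
# Excerpt of the standard codon table; the full table has 64 entries
# (61 amino-acid codons + 3 stop signals).
CODON_TABLE = {
    "ATG": "M",                                      # Methionine (start codon)
    "AAA": "K", "AAG": "K",                          # Lysine: two synonyms
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # Alanine: four synonyms
    "TGG": "W",                                      # Tryptophan: only one
    "TAA": "*", "TAG": "*", "TGA": "*",              # Stop signals
}

def translate(dna: str) -> str:
    """Read DNA three nucleotides at a time, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon terminates the chain
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCAAAATGGTAA"))  # -> MAKW
```

Swapping GCA for any of its three synonyms changes the DNA but not the protein, which is exactly the "different spellings, same meaning" redundancy in action.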

The 20-Letter Language of Proteins

Once translated, a protein is a linear chain of amino acids. There are exactly 20 standard amino acids, each represented by a single letter: A (Alanine), G (Glycine), W (Tryptophan), and so on. A typical protein is 300 to 500 amino acids long. Some exceed 30,000.

Here’s where the language analogy gets interesting.

In English, letters form words, words form sentences, and sentences carry meaning. In proteins:

  • Amino acids are the letters (20-letter alphabet vs. 26 in English)
  • Secondary structures like alpha-helices and beta-sheets are the words (recurring structural motifs)
  • Domains (functional units of a protein) are the sentences
  • Function (what the protein does: catalysis, signaling, transport) is the meaning

And just like natural language has grammar, proteins have biochemical rules. Hydrophobic residues cluster in the protein’s core. Charged residues prefer the surface. Cysteines form disulfide bonds at specific positions. Violate these rules and you get a misfolded protein, the biological equivalent of a grammatically broken sentence that no one can understand.

The deepest parallel is this: protein folding is translation. A linear sequence of amino acids (a “sentence”) folds into a specific 3D structure (the “meaning”). The same sequence always produces the same fold under normal conditions. The meaning is encoded in the sequence itself.

This is exactly the kind of pattern that language models are built to learn.

Enter Protein Language Models

If protein sequences are a language, can we train a language model on them?

Yes. And the approach is almost comically similar to what works for English.

Masked Language Modeling on Proteins

The most successful protein language models use masked language modeling (MLM), the same approach behind BERT. Take a protein sequence, randomly mask 15% of the amino acids, and train the model to predict the missing residues from context.

For BERT, the training sentence might look like:

“The cat sat on the [MASK]”

For a protein language model, it’s:

“M K T A Y [MASK] A K F E R Q H M D…”

The model learns which amino acids are likely at each position given the surrounding sequence. Over millions of sequences, it picks up the grammar: biochemical constraints, evolutionary conservation, structural preferences.
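The corruption step itself is simple to sketch. Here's a toy version of BERT-style masking applied to a protein sequence (real pipelines also replace some masked positions with random residues or leave them unchanged, which is omitted here):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter alphabet

def mask_sequence(seq: str, mask_rate: float = 0.15, seed: int = 0):
    """Mask ~15% of residues; return (masked input, {position: answer})."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = rng.sample(range(len(seq)), n_mask)
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the residue the model must recover
        tokens[pos] = "<mask>"
    return " ".join(tokens), targets

masked, targets = mask_sequence("MKTAYIAKFERQHMD")
print(masked)   # sequence with ~15% of residues replaced by <mask>
print(targets)  # {position: original residue} = the training labels
```

The model's loss is computed only at the masked positions: given everything around a `<mask>`, predict which of the 20 residues was hidden there.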

The training data? The UniRef and UniProt databases, which contain hundreds of millions of protein sequences cataloged from organisms across all domains of life. UniProtKB alone holds over 246 million sequences as of 2024.

What the Attention Mechanism Learns

Here’s where it gets remarkable. The transformer attention mechanism, trained only on sequences with no 3D structural data, learns to capture co-evolutionary relationships between amino acid positions.

Two amino acids that are far apart in the linear sequence but physically close in the folded 3D structure tend to co-evolve: when one mutates, the other compensates. Attention heads in protein language models learn exactly these patterns. Researchers have shown that you can extract contact maps (which amino acids are physically close in 3D) directly from the attention weights.

The model has never “seen” a protein structure. It learned structural information purely from the statistics of sequences. Language models do something similar: they learn syntax and semantics from text alone, without explicit grammar rules.
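In miniature, extracting contacts is post-processing of the attention matrix: symmetrize it (attention from i to j and from j to i both count as evidence), discard near-diagonal pairs that are trivially close in sequence, and keep the strongest long-range pairs. This is a pure-Python sketch on a made-up 6x6 attention map; real pipelines such as ESM's fit a small supervised layer over many attention heads:

```python
def contacts_from_attention(attn, min_sep=3, top_k=2):
    """Toy contact extraction: symmetrize an attention matrix and return
    the top-k position pairs at sequence separation >= min_sep."""
    n = len(attn)
    scored = []
    for i in range(n):
        for j in range(i + min_sep, n):
            # Symmetrize: average attention in both directions.
            scored.append(((attn[i][j] + attn[j][i]) / 2, (i, j)))
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:top_k]]

# Fake attention map with strong weight between positions 0<->5 and 1<->4,
# standing in for residues that co-evolve because they touch in 3D.
attn = [[0.1] * 6 for _ in range(6)]
attn[0][5] = attn[5][0] = 0.9
attn[1][4] = attn[4][1] = 0.7
print(contacts_from_attention(attn))  # -> [(0, 5), (1, 4)]
```

The `min_sep` cutoff matters: adjacent residues always attend to each other, so only long-range pairs carry real structural signal.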

The Key Models

ESM (Meta / EvolutionaryScale)

The ESM family is the most influential line of protein language models. Built by Meta’s FAIR team (now spun out as EvolutionaryScale):

  • ESM-2 (2022): Encoder-only transformer, BERT-style MLM. Sizes range from 8M to 15 billion parameters, trained on ~65 million unique sequences from UniRef. The largest protein language model at the time. Published in Science (2023).
  • ESMFold: A structure prediction head built on ESM-2 embeddings. Predicts 3D protein structure from a single sequence, no multiple sequence alignment (MSA) needed.
  • ESM-3 (2024): 98 billion parameters at the largest scale. Trained on 2.78 billion proteins and 771 billion tokens. Multimodal: reasons jointly over sequence, structure, and function. Key achievement: generated a novel fluorescent protein with only 58% sequence identity to its closest known fluorescent protein, a jump the authors estimate is equivalent to simulating 500 million years of evolution. Published in Science (2025).
  • ESM Cambrian (2024): Efficiency-focused. The 300M model matches ESM-2 650M performance. The 600M model rivals ESM-2 3B.

ProtTrans (TU Munich)

ProtTrans (2021) took the “just scale it” approach from NLP and applied it to proteins. Their flagship, ProtT5 (~3B parameters), was trained on the Big Fantastic Database (BFD), which contains ~2.1 billion metagenomic protein sequences totaling 393 billion amino acids. Training used 5,616 GPUs on the Summit supercomputer. ProtT5 uses the T5 encoder-decoder architecture, adapted for amino acid sequences.

ProGen (Salesforce)

While ESM and ProtTrans learn to understand proteins (encoder models), ProGen learns to write them. It’s an autoregressive model (GPT-style) trained on 280 million sequences across 19,000+ protein families.

The key innovation: ProGen takes control tags as conditioning. You specify the desired protein family, function, and organism, then generate a sequence that matches those properties. In 2023, Salesforce published results in Nature Biotechnology showing ProGen-designed lysozyme variants were experimentally validated as functional. AI-written proteins that actually work.
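Stripped of the transformer, autoregressive generation is a sampling loop: condition on a prefix (the control tags plus everything generated so far), sample the next residue from the model's distribution, append, repeat. This sketch uses a dummy uniform "model" and a made-up `[lysozyme]` tag purely to show the mechanics; ProGen's learned model would instead put its probability mass on residues consistent with the tags:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dummy_model(prefix: str) -> dict:
    """Stand-in for a trained model: next-residue distribution given the
    prefix. Uniform here; a real model conditions on tags and context."""
    return {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def generate(control_tag: str, length: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        probs = dummy_model(control_tag + seq)  # condition on tag + prefix
        residues, weights = zip(*probs.items())
        seq += rng.choices(residues, weights=weights)[0]
    return seq

print(generate("[lysozyme]", length=12))  # 12 residues, sampled one at a time
```

The structure is identical to GPT-style text generation; only the vocabulary (20 residues instead of ~50K subwords) and the conditioning signal differ.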

ProGen2 (2023) scaled to 6.4 billion parameters and was trained on over 1 billion sequences.

AlphaFold (DeepMind)

AlphaFold isn’t technically a language model. It uses a specialized “Evoformer” architecture with MSAs and structural modules. But it belongs in this story because it proved the same thesis: sequence patterns encode structure.

AlphaFold2 (2020) dominated CASP14 with a median GDT score of 92.4 out of 100, comparable to experimental methods like X-ray crystallography and far ahead of every other entrant. The AlphaFold Database now contains over 200 million predicted structures, covering nearly all of UniProt. Compare that to the ~206,000 experimental structures in the Protein Data Bank accumulated over 60+ years. Demis Hassabis and John Jumper received the 2024 Nobel Prize in Chemistry for this work.

AlphaFold3 (2024) expanded to all biomolecular complexes: proteins, DNA, RNA, small molecules, and ions. It uses a diffusion-based architecture, and its code was released in late 2024 for non-commercial use.

EvoDiff (Microsoft Research)

EvoDiff (2023) brings the diffusion model paradigm to protein generation. Instead of autoregressive generation, it uses discrete diffusion over sequence space (640M parameters, trained on 42M sequences from UniRef50). It can generate proteins with disordered regions, something structure-conditioned methods struggle with.

Evo (Arc Institute)

Evo 2 (2025) is a 40 billion parameter DNA foundation model with a 1 million token context length, trained on 9.3 trillion nucleotides from 128,000+ genomes. It works at the nucleotide level (A, C, T, G directly) rather than the amino acid level, and generalizes across DNA, RNA, and protein tasks.

The NLP-to-Biology Translation Table

The parallels are striking enough to lay them out side by side:

| NLP Concept | Protein Equivalent |
| --- | --- |
| Alphabet (26 letters) | 20 amino acids |
| Vocabulary (~50K tokens) | 20 amino acid tokens |
| Sentence | Protein sequence (300-500 residues) |
| Grammar | Biochemical rules (hydrophobic core, charge pairing) |
| Semantics (meaning) | Protein function (catalysis, binding, signaling) |
| Translation (language to language) | Protein folding (1D sequence to 3D structure) |
| BERT (masked prediction) | ESM-2 (masked amino acid prediction) |
| GPT (autoregressive generation) | ProGen (protein sequence generation) |
| Hallucination (nonsensical output) | Non-foldable or non-functional sequences |
| Context window (tokens) | Sequence length (residues) |
| Fine-tuning for tasks | Fine-tuning for structure/function prediction |

The biggest difference? Scale of vocabulary. English NLP models tokenize into ~50,000 subword tokens. Protein models work with just 20 amino acid characters (plus a few special tokens). But protein “sentences” can be thousands of residues long, and the grammar is governed by physics and chemistry rather than social convention.
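The vocabulary gap can be made concrete with a little information theory: the most a single token can carry is the entropy of a uniform distribution over the vocabulary, log2(V) bits. A quick check:

```python
import math

def max_bits_per_token(vocab_size: int) -> float:
    """Upper bound on information per token: entropy of a uniform
    distribution over a vocabulary of the given size."""
    return math.log2(vocab_size)

print(f"protein (20 residues):   {max_bits_per_token(20):.2f} bits/token")
print(f"English subwords (~50K): {max_bits_per_token(50_000):.2f} bits/token")
```

A subword token can carry roughly 3.6x more information than a single residue, which is part of why protein "sentences" run to hundreds or thousands of tokens where an English sentence needs a dozen.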

What This Enables

De Novo Protein Design

The most exciting application: designing proteins that don’t exist in nature. ESM-3 generated a novel bright fluorescent protein that is 58% identical to the nearest natural fluorescent protein. ProGen designed functional lysozymes validated in wet-lab experiments. The Protein Design Archive now catalogs over 1,500 structurally characterized de novo designed proteins as of March 2025.

AI-designed proteins already target therapeutic applications: binders for PD-L1, SARS-CoV-2 spike protein, PD-1, and CTLA-4.

Variant Effect Prediction

Every human carries thousands of missense mutations (single amino acid changes). Which ones cause disease? ESM-1b predicted the effects of all ~450 million possible missense variants across the human proteome. DeepMind’s AlphaMissense (2023) classified 89% of 71 million possible missense variants, published in Science.

This is like a language model deciding whether swapping one word in a sentence changes or destroys its meaning.
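The usual scoring recipe with masked models like ESM-1b makes this concrete: mask the mutated position, then compare the model's probability of the mutant residue against the wild-type one. The sketch below uses hand-made probabilities as a stand-in for real model output, purely to show the arithmetic:

```python
import math

def variant_score(probs: dict, wt: str, mut: str) -> float:
    """Masked-marginal score: log P(mutant) - log P(wild type) at the
    masked position. More negative = the model prefers the wild type,
    suggesting the mutation is more likely to be damaging."""
    return math.log(probs[mut]) - math.log(probs[wt])

# Fake model output for one masked position: the model is confident the
# wild-type L belongs here, so substitutions score negatively.
probs = {"L": 0.70, "I": 0.15, "V": 0.10, "P": 0.05}
print(variant_score(probs, wt="L", mut="I"))  # mild penalty
print(variant_score(probs, wt="L", mut="P"))  # larger penalty
```

Conservative swaps (L to I) land close to zero; substitutions the model finds implausible in context score far below, the sequence-level analogue of a word swap that breaks the sentence.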

Drug Discovery and Enzyme Engineering

Protein language models are compressing the drug development cycle. AlphaFold-guided enzyme design has produced functional enzymes with targeted activity profiles. In 2025, an autonomous enzyme engineering platform combining ML, LLMs, and lab automation achieved a 90-fold improvement in substrate preference for a target enzyme.

The Deeper Point

What makes this convergence so striking is that it wasn’t designed. Nobody sat down and said “let’s make biology work like GPT.” The similarity is emergent: evolution optimized protein sequences through mutation and selection over billions of years, creating statistical patterns that happen to be learnable by the same architectures we built for human language.

Amino acid sequences have co-evolutionary dependencies (like attention patterns). They have local and global structure (like syntax and discourse). They encode function in sequence (like meaning in text). The transformer architecture, designed for one type of sequence, turns out to be a general-purpose sequence understanding machine.

Biology was the first language. We just didn’t have the right model to read it until now.
