
AI Embedding Models Explained: Google, Perplexity & OpenAI Compared

AI & Technology · Siddhant Minocha · 13 min read

Two embedding model announcements landed within two weeks of each other. On March 10, Google released Gemini Embedding 2, their first natively multimodal embedding model. Twelve days earlier, on February 26, Perplexity released pplx-embed, an open-source family that beats Google's previous text-only model on benchmarks at a fraction of the price.

Both announcements are significant. But before we compare them, let’s establish what embedding models actually do and why the field is moving this fast.

What Are Vector Embeddings?

An embedding is a fixed-length array of numbers, called a vector, that encodes the meaning of some data. The key property: semantically similar inputs map to nearby vectors in high-dimensional space.

“Car” and “automobile” land close together. “Car” and “justice” are far apart. A query for “apple health benefits” can return articles about fruit even if they never use those exact words, because the underlying conceptual relationship is captured in the vector geometry.

This is fundamentally different from older approaches like TF-IDF, which counted word frequencies but had no way to represent meaning. Embeddings encode what something means, not just what words it uses.

The Vector Space

A typical embedding model outputs a vector of 1,024 to 3,072 floating-point numbers. Each dimension corresponds to some learned feature of the input (though these features are not human-interpretable). Together, they place the input at a specific coordinate in a very high-dimensional space.

Similarity is measured geometrically. The most common metric is cosine similarity: the angle between two vectors, irrespective of their magnitude.

\cos(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}

A score near +1 means nearly identical meaning. A score near 0 means unrelated. Most production systems normalize embeddings to unit vectors, in which case cosine similarity equals a simple dot product, making it faster to compute at scale.
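Numerically, this is a one-liner. A minimal NumPy sketch, with made-up three-dimensional vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; magnitude is ignored."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.3, 0.8, 0.5])
b = np.array([0.6, 1.6, 1.0])    # same direction as a, twice the length
c = np.array([0.9, -0.4, 0.1])   # nearly orthogonal to a

print(cosine_similarity(a, b))   # ~1.0: same direction, magnitude ignored
print(cosine_similarity(a, c))   # ~0: unrelated

# After unit-normalization, cosine similarity is just a dot product:
a_hat, c_hat = a / np.linalg.norm(a), c / np.linalg.norm(c)
assert abs(cosine_similarity(a, c) - np.dot(a_hat, c_hat)) < 1e-12
```

This is why production systems normalize once at indexing time: the similarity search then reduces to dot products, which vector databases can batch efficiently.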

How Embedding Models Are Trained

Modern embedding models go through two phases.

Phase 1: Language Model Pretraining

A base model is first trained on massive text corpora to develop broad language understanding. This can be an encoder-style model (like BERT, which reads text bidirectionally), a decoder model (like GPT, which predicts left-to-right), or an encoder-decoder model. The choice of architecture matters more than it used to, as we’ll see with Perplexity’s approach.

Phase 2: Contrastive Fine-Tuning

The pre-trained model is then fine-tuned using contrastive learning to produce a well-structured embedding space. The training signal comes from pairs of inputs:

  • A query and its relevant document (a positive pair)
  • A query and an unrelated document (a negative pair)

The model learns to produce high similarity for positive pairs and low similarity for negative ones. The dominant training technique is InfoNCE loss with in-batch negatives. Given a batch of B positive (query, passage) pairs, each query is contrasted against all other B-1 passages in the batch as negatives:

L = -\log\left(\frac{\exp(\text{sim}(q, p^+) / \tau)}{\sum_{i=1}^{B} \exp(\text{sim}(q, p_i) / \tau)}\right)

Where p^+ is the correct passage, p_i ranges over all B passages in the batch, and τ is a temperature hyperparameter controlling how sharply the model differentiates.

Hard negatives (passages topically related but not the correct answer) are added to make the problem harder and force the model to learn fine-grained distinctions. Larger batch sizes are also critical: more in-batch negatives give the model tougher problems to solve, producing better final representations.
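The loss above can be sketched in a few lines of NumPy, with small random matrices standing in for real query and passage embeddings:

```python
import numpy as np

def info_nce_loss(Q, P, tau=0.05):
    """InfoNCE with in-batch negatives.

    Q, P: (B, d) arrays of L2-normalized query / passage embeddings,
    where row i of P is the positive passage for row i of Q.
    Every other passage in the batch serves as a negative.
    """
    logits = (Q @ P.T) / tau                 # (B, B) similarity matrix
    # Row-wise log-softmax; the diagonal entries are the positive pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
norm = lambda X: X / np.linalg.norm(X, axis=1, keepdims=True)
Q = norm(rng.normal(size=(8, 64)))   # batch of 8 toy queries
P = norm(rng.normal(size=(8, 64)))   # 8 unrelated toy passages

print(info_nce_loss(Q, Q))  # ~0: each query sits exactly on its positive
print(info_nce_loss(Q, P))  # much larger: random passages carry no signal
```

Note how batch size B enters directly: each query is scored against B passages, so a larger batch means a harder softmax and a stronger training signal.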

Matryoshka Representation Learning (MRL)

MRL is a training technique that has become standard in top embedding models. The idea: instead of training with a single loss function over the full vector dimension, you simultaneously optimize multiple loss functions at different dimension checkpoints (e.g., 256, 512, 1024, 3072).

The result is nested embeddings: the first N dimensions of any embedding already contain the most semantically rich information. You can truncate a 3,072-dimensional vector to 768 dimensions and lose almost no retrieval accuracy, which can mean 75% smaller storage and faster similarity search.
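Consuming an MRL embedding is just slice-and-renormalize. A sketch with random stand-in data (a real MRL embedding front-loads its information into the early dimensions; random data only illustrates the mechanics and the storage math):

```python
import numpy as np

def truncate_mrl(v: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions of an MRL embedding and
    re-normalize, so similarity can still be a plain dot product."""
    t = v[:dims]
    return t / np.linalg.norm(t)

rng = np.random.default_rng(0)
full = rng.normal(size=3072).astype(np.float32)   # stand-in embedding
small = truncate_mrl(full, 768)

print(small.shape)                              # (768,)
print(round(float(np.linalg.norm(small)), 6))   # 1.0: unit length again
# Storage per vector at float32: 3072 dims -> 768 dims is a 75% cut
print(3072 * 4, "bytes ->", 768 * 4, "bytes")
```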

OpenAI introduced MRL to their embedding models in January 2024. Google’s Gemini Embedding 2 uses it as well.

Gemini Embedding 2: Google Goes Multimodal

Released on March 10, 2026, Gemini Embedding 2 (API model ID: gemini-embedding-2-preview) is Google’s most ambitious embedding model to date. Its defining feature is native multimodality: it maps text, images, video, audio, and documents into a single unified vector space.

What It Supports

| Modality  | Limit               |
| --------- | ------------------- |
| Text      | 8,192 tokens        |
| Images    | Up to 6 per request |
| Audio     | Up to 80 seconds    |
| Video     | Up to 128 seconds   |
| Documents | Up to 6 pages (PDF) |

The previous text-only model, gemini-embedding-001, had a 2,048-token text limit. Gemini Embedding 2 quadruples that for text and adds native handling of four additional modalities. Audio is processed natively, without converting to text first, which eliminates the information loss that comes from transcription. Multiple modalities can be passed in a single request, and the model captures the relationships between them in the same embedding space.

This enables cross-modal retrieval: a text query can retrieve the most relevant video clip, image, or audio segment from a single index, without separate models for each modality.

Flexible Dimensions via MRL

Gemini Embedding 2 uses MRL to support flexible output sizes from 128 to 3,072 dimensions, with recommended checkpoints at 768, 1,536, and 3,072.

The accuracy degradation from truncation is minimal:

| Dimensions | MTEB Score |
| ---------- | ---------- |
| 3,072      | 68.16      |
| 1,536      | 68.17      |
| 768        | 67.99      |

Cutting from 3,072 to 768 dimensions costs roughly 0.17 MTEB points while reducing storage by 75%.

Performance

On MTEB Multilingual v2, Gemini Embedding 2 scores 69.9, placing it at the top of the public leaderboard for API-accessible models at launch. For multimodal retrieval, it’s not close:

| Task          | Gemini Embedding 2 | Amazon Nova 2 | Voyage Multimodal 3.5 |
| ------------- | ------------------ | ------------- | --------------------- |
| Text-to-video | 68.8               | 60.3          | 55.2                  |
| Text-to-image | 93.4               | 84.0          | -                     |

No other model currently offers unified cross-modal retrieval at this scale.

Pricing

| Modality | Standard          | Batch             |
| -------- | ----------------- | ----------------- |
| Text     | $0.20 / 1M tokens | $0.10 / 1M tokens |
| Images   | $0.45 / 1M        | $0.225 / 1M       |
| Audio    | $6.50 / 1M        | $3.25 / 1M        |
| Video    | $12.00 / 1M       | $6.00 / 1M        |

For text-only workloads, Gemini Embedding 2 is 10x more expensive than OpenAI’s text-embedding-3-small ($0.02/1M). The premium is justified only when multimodal capabilities are needed, when multilingual retrieval performance is critical, or when cross-modal search is a product requirement.

pplx-embed: Perplexity’s Open-Source Challenger

Released on February 26, 2026, pplx-embed is a family of text embedding models from Perplexity AI. Unlike the other models in this post, they are MIT-licensed and available on Hugging Face for self-hosting. The API pricing is also the most competitive in the market.

Model Variants

| Model                      | Parameters | Dimensions | Context Window | Price / 1M tokens |
| -------------------------- | ---------- | ---------- | -------------- | ----------------- |
| pplx-embed-v1-0.6b         | 0.6B       | 1,024      | 32K tokens     | $0.004            |
| pplx-embed-v1-4b           | 4B         | 2,560      | 32K tokens     | $0.030            |
| pplx-embed-context-v1-0.6b | 0.6B       | 1,024      | 32K tokens     | $0.008            |
| pplx-embed-context-v1-4b   | 4B         | 2,560      | 32K tokens     | $0.050            |

The -context- variants are designed for contextual RAG: they embed document chunks with awareness of the surrounding document context, enabling better disambiguation when a chunk is ambiguous on its own.
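Mechanically, contextual embedding means the encoder sees more than the bare chunk. How pplx-embed-context packages context internally isn't documented here; the input-construction sketch below is purely illustrative of the idea:

```python
def chunks_with_context(doc_title: str, chunks: list[str]) -> list[str]:
    """Pair each chunk with surrounding context so an ambiguous chunk
    ("It rose 12% that quarter") stays interpretable on its own.
    This packaging format is illustrative, not pplx-embed's actual one."""
    out = []
    for i, chunk in enumerate(chunks):
        prev_chunk = chunks[i - 1] if i > 0 else ""
        out.append(f"Document: {doc_title}\nPrevious: {prev_chunk}\nChunk: {chunk}")
    return out

inputs = chunks_with_context(
    "Q3 earnings report",
    ["Revenue grew 12% year over year.", "Margins, however, compressed."],
)
print(inputs[1])  # the second chunk, now carrying title + preceding chunk
```

A context-aware model moves this disambiguation inside the encoder itself, so the index doesn't pay the token overhead of duplicated context.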

Architecture: Bidirectional Attention from a Decoder Base

This is where pplx-embed gets interesting. The models are built on Alibaba’s Qwen3 foundation model, a causal decoder (left-to-right attention). Perplexity converted it to use bidirectional attention through diffusion-based continued pretraining, where tokens are masked and predicted from both directions.

Why does this matter? Causal models see only left context for each token. Bidirectional attention allows full-sequence context at each position, which is better suited for representing the meaning of an entire input as a single vector. The challenge is that most frontier open-source models are causal decoders, not bidirectional encoders. Perplexity’s approach gets the best of both: they start with a strong, large-scale causal model and adapt it to be a better encoder.

Training data: approximately 250 billion tokens across 30 languages, split evenly between FineWebEdu (high-quality educational web text) and FineWeb2 (broad web corpus). Multi-stage contrastive fine-tuning follows.

Native Quantization

A key differentiator in pplx-embed is that quantization is baked into training, not added post-hoc.

The models are explicitly optimized to work at:

  • INT8: 4x storage reduction vs FP32
  • Binary: 32x storage reduction, with under 1.6% quality loss on the 4B model

This is achieved through a tanh-based mean pooling operation with straight-through gradient estimation during training. The similarity computation for INT8 embeddings must use cosine similarity rather than dot product, since the embeddings are not normalized.

For anyone building large-scale retrieval systems where vector storage is a significant cost, this is a meaningful practical advantage.
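The storage ratios are easy to verify with a naive post-hoc quantizer. This is an illustration of the arithmetic only, not the tanh-based training-time scheme described above:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 2560)).astype(np.float32)  # 4B-model-sized vectors

fp32_bytes = emb.nbytes                                 # baseline: 4 bytes/dim

# INT8: one signed byte per dimension (naive symmetric scaling)
scale = np.abs(emb).max()
int8 = np.clip(np.round(emb / scale * 127), -127, 127).astype(np.int8)

# Binary: one bit per dimension, packed 8 dimensions per byte
binary = np.packbits((emb > 0).astype(np.uint8), axis=1)

print(fp32_bytes // int8.nbytes)    # 4:  4x storage reduction
print(fp32_bytes // binary.nbytes)  # 32: 32x storage reduction
```

At binary precision, a billion 2,560-dimensional vectors fit in about 320 GB instead of 10 TB, which is the difference between one machine and a cluster.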

No Instruction Prefixes Required

Many competing models, including models from the E5 and GTE families, require task-specific prefixes like "query: " or "passage: " to achieve their published benchmark numbers. These prefixes provide an artificial 2-3% lift and can create indexing brittleness in production (using the wrong prefix at inference time hurts performance). pplx-embed achieves SOTA results without any instruction tuning or prefix requirements.

Performance

On MTEB Multilingual v2:

| Model                  | MTEB Score |
| ---------------------- | ---------- |
| pplx-embed-v1-4B       | 69.66%     |
| Qwen3-Embedding-4B     | 69.60%     |
| gemini-embedding-001   | 67.71%     |
| text-embedding-3-large | ~64.60%    |

On ConTEB (the contextual retrieval benchmark):

| Model                      | ConTEB Score |
| -------------------------- | ------------ |
| pplx-embed-context-v1-4B   | 81.96%       |
| Voyage voyage-context-3    | 79.45%       |
| Anthropic contextual model | 72.40%       |

The contextual variant is the current SOTA on ConTEB, ahead of Voyage’s dedicated contextual retrieval product.

32K Context Window

Most commercial embedding models cap text input at 8,192 tokens. pplx-embed’s 32K context window is four times larger, which matters for embedding entire research papers, legal documents, or long-form content without chunking.

OpenAI text-embedding-3: The Established Standard

Released in January 2024, OpenAI’s third-generation embedding models remain the most widely deployed in production, largely due to ecosystem familiarity, API reliability, and the extremely low cost of the small variant.

Specifications

| Model                  | Max Tokens | Default Dimensions | MTEB Score | Price / 1M tokens |
| ---------------------- | ---------- | ------------------ | ---------- | ----------------- |
| text-embedding-3-small | 8,191      | 1,536              | 62.3       | $0.020            |
| text-embedding-3-large | 8,191      | 3,072              | 64.6       | $0.130            |

Both use MRL for flexible dimension truncation. The large model’s embeddings can be truncated to as few as 256 dimensions while still outperforming the previous-generation text-embedding-ada-002 at its full 1,536 dimensions on MTEB.

Multilingual Improvement

The most dramatic improvement in the v3 models was multilingual performance. On the MIRACL multilingual retrieval benchmark:

| Model                  | MIRACL Score |
| ---------------------- | ------------ |
| text-embedding-ada-002 | 31.4         |
| text-embedding-3-large | 54.9         |

A 23.5-point jump. The models now handle 100+ languages well, compared to the previous generation’s English-centric design.

Why They’re Still Widely Used

Despite being outperformed on benchmarks by newer models, the OpenAI embedding models are still a reasonable default for most production use cases:

  • text-embedding-3-small at $0.02/1M tokens is the cheapest way to get broadly capable text embeddings from a managed API with strong reliability guarantees.
  • The OpenAI ecosystem means zero additional integration work for teams already using the OpenAI API.
  • For English-only, text-only RAG at scale, the performance gap vs. pplx-embed or Gemini Embedding 2 is rarely the bottleneck.

Head-to-Head Comparison

| Model                    | MTEB Score     | Max Tokens | Dimensions | Price / 1M tokens | Modalities                     | Open Source |
| ------------------------ | -------------- | ---------- | ---------- | ----------------- | ------------------------------ | ----------- |
| Gemini Embedding 2       | 68.2-69.9      | 8,192      | 128-3,072  | $0.20 (text)      | Text, Image, Audio, Video, PDF | No          |
| pplx-embed-v1-4B         | 69.7           | 32,000     | 2,560      | $0.03             | Text                           | Yes (MIT)   |
| pplx-embed-v1-0.6B       | ~68+           | 32,000     | 1,024      | $0.004            | Text                           | Yes (MIT)   |
| pplx-embed-context-v1-4B | 81.96 (ConTEB) | 32,000     | 2,560      | $0.05             | Text (contextual)              | Yes (MIT)   |
| text-embedding-3-large   | 64.6           | 8,191      | 256-3,072  | $0.13             | Text                           | No          |
| text-embedding-3-small   | 62.3           | 8,191      | 512-1,536  | $0.02             | Text                           | No          |

A note on comparing MTEB scores: the pplx-embed and Gemini Embedding 2 scores above are on MTEB Multilingual v2, while the OpenAI scores are on MTEB English. These are different benchmark variants. Direct comparison requires running all models on the same split.
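To make the prices concrete, a back-of-the-envelope calculation of what embedding a 100M-token corpus costs at the standard text rates quoted in this post:

```python
# Standard per-1M-token text prices, as quoted above
PRICE_PER_1M = {
    "gemini-embedding-2 (text)": 0.20,
    "pplx-embed-v1-4b": 0.03,
    "pplx-embed-v1-0.6b": 0.004,
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
}

corpus_tokens = 100_000_000  # one-time cost to embed a 100M-token corpus

for model, price in PRICE_PER_1M.items():
    cost = corpus_tokens / 1_000_000 * price
    print(f"{model:28s} ${cost:8.2f}")
```

At this scale the spread is $0.40 to $20.00 for a single full index, so for text-only workloads the model choice is driven by quality and re-indexing frequency far more than by raw API cost.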

Which Model Should You Use?

There is no universal answer, but here is how to think about it:

Use Gemini Embedding 2 if:

  • You need cross-modal search: text queries over video, image, or audio collections.
  • Multilingual retrieval quality is a top priority and cost is secondary.
  • You are already in the Google Cloud ecosystem and want a managed service.

Use pplx-embed-v1-4B if:

  • You need the best text-only retrieval at a managed price ($0.03/1M tokens, no infrastructure required).
  • You want the option to self-host under MIT license as your scale grows.
  • Your documents are long and you need a 32K context window.
  • You are building a contextual RAG pipeline (use the -context- variant).

Use pplx-embed-v1-0.6B if:

  • You need very high throughput at the lowest possible cost ($0.004/1M tokens).
  • You are willing to self-host for zero variable cost (MIT license).
  • Latency and index size matter more than marginal accuracy gains.

Use text-embedding-3-small if:

  • You are already on the OpenAI API and want the path of least resistance.
  • You are doing English-only RAG at scale and want the cheapest managed option.
  • You want battle-tested reliability from a vendor with strong SLAs.

Use text-embedding-3-large if:

  • You want the best OpenAI option for multilingual or high-accuracy use cases.
  • You need custom output dimensions via MRL and want a familiar API.

The Bigger Picture

What these two announcements signal is a shift happening in multiple directions simultaneously.

Google is betting that the future of retrieval is multimodal: your knowledge base will contain text, images, video, and audio, and users will want to search all of it with a single query. Gemini Embedding 2 is an early production artifact of that vision.

Perplexity is betting that the future of retrieval is open and contextually aware: models that don’t require fragile instruction prefixes, that you can self-host, that are optimized for the chunked document retrieval patterns that dominate production RAG.

OpenAI is not yet responding with a new generation of embeddings. Their current models are over two years old. Given that competitors are now scoring 5-7 MTEB points higher, a response seems likely this year.

Embeddings are the quiet infrastructure layer that makes most modern AI applications work. RAG, semantic search, recommendation systems, and multimodal retrieval all depend on getting this layer right. The models releasing now are meaningfully better than what was available even six months ago, and the pace is not slowing.

Enjoyed this article?

Follow me for more insights on AI, technology, and startups.
