Most teams pick an embedding model by copying a tutorial. Then they hit production and find that retrieval precision drops on their domain vocabulary, latency spikes under load, or multilingual queries return garbage results.
The model choice is architectural. Get it wrong and no amount of retrieval tuning will save you.
This is the full picture — how embeddings actually work at the math level, how embedding models are trained, and a complete comparison of every major model used in production across open source and Azure stacks.
What Is an Embedding?
An embedding is a fixed-length vector of floating-point numbers that encodes the meaning of a piece of text. Two texts with similar meaning produce vectors that are close together in high-dimensional space. Two texts with unrelated meaning produce vectors that are far apart.
The vector doesn't encode the words — it encodes the concept. This is why semantic search works across vocabulary gaps: "closing costs" and "cash at settlement" map to nearby vectors even though they share no words.
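To make "close together" concrete: most retrieval systems score that closeness with cosine similarity. A minimal numpy sketch, with made-up 4-dimensional vectors standing in for real embeddings:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means same direction (same meaning); values near 0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors for illustration only; real models output hundreds of dimensions
closing_costs = np.array([0.12, -0.40, 0.88, 0.31])
cash_at_settlement = np.array([0.10, -0.35, 0.91, 0.28])
weather_forecast = np.array([-0.75, 0.60, 0.05, -0.20])
print(cosine_similarity(closing_costs, cash_at_settlement))  # high: similar meaning
print(cosine_similarity(closing_costs, weather_forecast))    # low: unrelated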
How Embeddings Work: The Transformer Architecture
Every modern embedding model is built on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need." Understanding the transformer explains why embeddings work — and why they fail.
Step 1 — Tokenization
Text is broken into tokens — subword units, not whole words. "mortgage" becomes ["mort", "gage"]. "FHA203K" might become ["FH", "A", "203", "K"].
This is important: rare domain terms often tokenize into meaningless subword fragments. This is one reason domain-specific embedding models outperform general models on specialized vocabulary.
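You can check this for your own vocabulary in a few lines. A sketch assuming the Hugging Face transformers library and a BERT-style WordPiece tokenizer (exact splits vary by tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for term in ["mortgage", "FHA203K", "DSCR", "escrow"]:
    print(term, "->", tokenizer.tokenize(term))
# Rare domain terms typically split into several subword fragments,
# each of which carries little meaning on its own.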
Step 2 — Token Embeddings + Positional Encoding
Each token ID maps to a learnable embedding vector (the token embedding). A positional encoding is added to inject word order information — transformers have no inherent sense of sequence.
token_representation = token_embedding[id] + positional_encoding[position]
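In code this is just a table lookup plus an addition. A toy numpy sketch with made-up sizes and random weights, not any particular model's parameters:

import numpy as np

vocab_size, max_len, dim = 30_000, 512, 768
token_embedding = np.random.randn(vocab_size, dim)    # learned lookup table
positional_encoding = np.random.randn(max_len, dim)   # learned or sinusoidal

token_ids = [2131, 8774, 1012]  # hypothetical IDs for a 3-token input
token_representations = np.stack([
    token_embedding[tid] + positional_encoding[pos]
    for pos, tid in enumerate(token_ids)
])  # shape (3, 768): one vector per token, order-aware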
Step 3 — Self-Attention
This is the core of the transformer. Every token attends to every other token in the sequence, computing a weighted sum based on relevance (the attention score).
After self-attention, "costs" is no longer a generic token — it's a contextually-aware representation that knows it appears near "closing" and "settlement." Same word in a different sentence ("costs of living") produces a different vector.
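Under the hood this is scaled dot-product attention. A bare-bones single-head numpy sketch with random placeholder weights, just to show the mechanics:

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, dim) token representations from the previous step
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every token's relevance to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax of attention scores
    return weights @ V                               # each output is a weighted mix of all tokens

dim = 768
X = np.random.randn(5, dim)                          # 5 tokens of context
W_q, W_k, W_v = (np.random.randn(dim, dim) * 0.02 for _ in range(3))
contextual = self_attention(X, W_q, W_k, W_v)        # (5, 768) contextually-aware token vectors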
Step 4 — Stacking Layers
Transformers stack 12–24 (or more) attention layers. Each layer refines the representations. Early layers capture syntax. Later layers capture semantics and domain meaning.
Step 5 — Pooling to a Single Vector
After all transformer layers, you have one vector per token. To produce a single sentence embedding, you pool across tokens:
- [CLS] token pooling — use the first special token's vector (BERT-style)
- Mean pooling — average all token vectors (common in sentence-transformers)
- Max pooling — take the maximum value per dimension
Why mean pooling outperforms [CLS]: The [CLS] token was designed for classification tasks, not semantic similarity. Mean pooling captures signal from all tokens and consistently outperforms [CLS] for retrieval benchmarks.
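Mean pooling is only a few lines once you have per-token vectors. A sketch using a Hugging Face model directly, mirroring what sentence-transformers does internally (the attention mask keeps padding tokens out of the average):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

batch = tokenizer(["What are closing costs?"], padding=True, return_tensors="pt")
with torch.no_grad():
    token_vectors = model(**batch).last_hidden_state   # (batch, seq_len, dim)
mask = batch["attention_mask"].unsqueeze(-1)            # zero out padding positions
sentence_embedding = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)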
How Embedding Models Are Trained
Raw pretrained transformers (BERT, RoBERTa) produce token representations — not useful sentence embeddings. Embedding models require additional training on top of a pretrained base.
Stage 1 — Masked Language Model Pretraining (Base Model)
The base transformer is trained on massive text corpora using masked language modeling: randomly mask 15% of tokens, predict the masked tokens. This teaches the model language structure, grammar, and world knowledge.
This is what BERT, RoBERTa, and most base models do. This stage costs millions of dollars in compute.
Stage 2 — Contrastive Learning (Embedding Training)
The base model is fine-tuned on sentence pairs using contrastive loss (typically Multiple Negatives Ranking Loss or InfoNCE).
Training data: pairs of semantically similar texts — question-answer pairs, paraphrase pairs, NLI (natural language inference) datasets. The quality of this data is what separates good embedding models from great ones.
Hard negatives: pairs that are superficially similar but semantically different. These are the hardest to train on and the most impactful. Models trained with hard negatives generalize significantly better to domain vocabulary.
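Conceptually, the in-batch contrastive objective (InfoNCE / Multiple Negatives Ranking) treats every other positive in the batch as a negative for a given query. A toy PyTorch sketch of the idea, not the exact sentence-transformers implementation:

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: (batch, dim), already L2-normalized
    similarities = query_emb @ doc_emb.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(query_emb.shape[0])            # the matching doc sits on the diagonal
    # Cross-entropy pulls each query toward its own doc and pushes it away from every other doc in the batch
    return F.cross_entropy(similarities, labels)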
Stage 3 — Matryoshka Representation Learning (Optional, Modern Models)
Newer models (OpenAI's text-embedding-3 series, Nomic Embed) use Matryoshka Representation Learning (MRL) — a training technique that forces the first N dimensions of the embedding to be independently meaningful.
This means you can truncate a 3072-dim embedding to 256 dims and still get 90%+ of the retrieval performance. This dramatically reduces storage and compute costs without retraining.
Practical impact: use 256-dim embeddings for high-volume, cost-sensitive use cases. Use full dimensions where precision is critical.
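Truncation is a plain slice plus re-normalization. A minimal sketch, assuming the provider returned a full-dimension MRL-trained embedding:

import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int = 256) -> np.ndarray:
    # Keep the first `dims` dimensions, then re-normalize so cosine scores stay comparable
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)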
The Embedding Model Landscape
Dimensions, Cost, and Latency Comparison
| Model | Provider | Dims | Max Tokens | Multilingual | Latency (p50) | Cost per 1M tokens |
|---|---|---|---|---|---|---|
| text-embedding-3-small | Azure OpenAI | 512–1536 | 8,191 | Partial | ~30ms | $0.02 |
| text-embedding-3-large | Azure OpenAI | 256–3072 | 8,191 | Partial | ~45ms | $0.13 |
| text-embedding-ada-002 | Azure OpenAI | 1536 | 8,191 | Partial | ~35ms | $0.10 |
| BAAI/bge-large-en-v1.5 | HuggingFace OSS | 1024 | 512 | No (English only) | ~20ms* | Free (self-hosted) |
| BAAI/bge-m3 | HuggingFace OSS | 1024 | 8,192 | Yes (100+ langs) | ~35ms* | Free (self-hosted) |
| sentence-transformers/all-MiniLM-L6-v2 | HuggingFace OSS | 384 | 256 | No | ~5ms* | Free |
| intfloat/multilingual-e5-large | HuggingFace OSS | 1024 | 512 | Yes (94 langs) | ~25ms* | Free |
| Cohere embed-v3 | Cohere | 1024 | 512 | Yes (100+ langs) | ~40ms | $0.10 |
| Nomic embed-text-v1.5 | Nomic AI | 64–768 | 8,192 | Partial | ~15ms* | Free (OSS) |
| voyage-large-2 | Voyage AI | 1536 | 16,000 | Partial | ~50ms | $0.12 |
*Self-hosted latency varies by hardware. Values are GPU inference on A10G.
Open Source Models — Deep Dive
BAAI/bge-large-en-v1.5
The gold standard for English-only enterprise RAG. Trained by Beijing Academy of AI with carefully curated hard negatives. Consistently tops the MTEB (Massive Text Embedding Benchmark) leaderboard for English retrieval tasks.
Strengths: top retrieval performance on English text, 1024-dim, handles 512 tokens, free
Weaknesses: English only, 512 token limit cuts off long documents, requires GPU for production throughput
Best for: English-first enterprises with on-prem or air-gapped requirements, teams with GPU infra
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Prepend the BGE query instruction ("Represent this sentence for searching relevant passages: ") to queries only; documents are embedded as-is
query_embedding = model.encode("Represent this sentence for searching relevant passages: closing costs")
doc_embedding = model.encode("How much cash is needed at settlement?")
BAAI/bge-m3
The multilingual evolution of bge-large. Supports 100+ languages, 8,192 token context, and — crucially — supports three retrieval modes simultaneously: dense (vector), sparse (BM25-like), and multi-vector (ColBERT-style). This makes it the only open source model that can do hybrid search natively.
Strengths: multilingual (100+ langs), 8K context, native hybrid retrieval, strong MTEB scores
Weaknesses: large model (~570M parameters), slower than bge-large-en on English-only workloads
Best for: multilingual knowledge bases, when you want one model to replace your BM25 + vector search stack
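A usage sketch via BAAI's FlagEmbedding library; the API names below reflect its documented interface as I understand it, so verify them against the release you install:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
output = model.encode(
    ["What are closing costs?"],
    return_dense=True,          # dense vectors for ANN search
    return_sparse=True,         # lexical weights for BM25-style matching
    return_colbert_vecs=True,   # multi-vector (ColBERT-style) representations
)
dense_vecs = output["dense_vecs"]
lexical_weights = output["lexical_weights"]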
sentence-transformers/all-MiniLM-L6-v2
The "fast and cheap" option. 6-layer MiniLM (distilled from larger models), 384-dim, 256 token limit. Not competitive on precision benchmarks but runs on CPU with ~5ms latency.
Strengths: extremely fast, CPU-runnable, low memory footprint
Weaknesses: 256 token limit, lower precision, English-only
Best for: prototyping, high-volume low-stakes retrieval, resource-constrained environments
intfloat/multilingual-e5-large
Microsoft Research's multilingual embedding model. Strong cross-lingual retrieval — can match an English query to a French document. Important: requires query prefix "query: " and document prefix "passage: " during inference.
# Critical: prefix is required for correct behavior
query = "query: What are closing costs?"
doc = "passage: Les frais de clôture comprennent..." # French document
Strengths: 94 language support, cross-lingual (query in English, docs in Spanish works)
Weaknesses: requires prefixes (easy to forget, breaks retrieval silently), lower English precision than bge-large-en
Best for: multinational enterprises with mixed-language document corpora
Nomic embed-text-v1.5
Fully open source (weights + training code + data). Supports Matryoshka dimensions (64 to 768). 8,192 token context. Competitive with OpenAI ada-002 at zero cost.
Strengths: long context (8K), MRL dimensions, fully open (FOSS), strong performance/cost ratio
Weaknesses: newer model, less battle-tested in enterprise, smaller community
Best for: cost-sensitive teams who want long context without paying OpenAI rates
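A usage sketch with sentence-transformers; the trust_remote_code flag, the truncate_dim Matryoshka option, and the search_query/search_document prefixes are taken from the model card as I recall it, so confirm them before relying on this:

from sentence_transformers import SentenceTransformer

# truncate_dim exploits the Matryoshka property; drop it to keep the full 768 dims
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True, truncate_dim=256)
query_emb = model.encode("search_query: What are FHA loan limits?", normalize_embeddings=True)
doc_emb = model.encode("search_document: FHA loan limits in 2025...", normalize_embeddings=True)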
Azure / Managed Models — Deep Dive
text-embedding-3-small
OpenAI's cost-optimized embedding model with Matryoshka support. Default to 1536-dim or truncate to 512 for storage savings with minimal precision loss. Replaces ada-002 for most use cases at a fraction of the cost.
Cost: $0.02/1M tokens — 5x cheaper than ada-002
Strengths: fast, cheap, MRL support, Azure managed (no infra), integrates natively with Azure AI Search
Weaknesses: 8,191 token limit (documents get truncated), English-dominant (multilingual support is inconsistent)
Best for: high-volume English retrieval where cost matters, quick Azure AI Search integration
text-embedding-3-large
OpenAI's highest-precision embedding model. 3072-dim with MRL — truncate to 256 for cost savings or use full 3072 for maximum precision. Outperforms ada-002 on MTEB by a significant margin.
Cost: $0.13/1M tokens
Strengths: best-in-class precision on MTEB, MRL support (use 256-dim for cost savings), Azure enterprise SLA
Weaknesses: 6.5x more expensive than text-embedding-3-small, latency higher at full dimensions
Best for: regulated industries where retrieval precision is non-negotiable, MortgageIQ-style use cases where a wrong answer has compliance consequences
# Azure OpenAI — truncate to 256 dims for 90% precision at 12x lower storage cost
from openai import AzureOpenAI
client = AzureOpenAI(...)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What are FHA loan limits for 2025?",
    dimensions=256,  # Matryoshka truncation
)
text-embedding-ada-002
The previous generation model. Still widely deployed but outperformed by text-embedding-3-small on precision and cost. No Matryoshka support — fixed 1536-dim.
Recommendation: if you're starting a new project, use text-embedding-3-small or 3-large. If ada-002 is already in production, migration is low-risk but requires re-embedding your entire index.
Multilingual Support — Full Comparison
| Model | Languages | Cross-lingual? | Notes |
|---|---|---|---|
| bge-large-en-v1.5 | English only | No | Best English precision |
| bge-m3 | 100+ | Yes | Query EN → Doc FR works |
| multilingual-e5-large | 94 | Yes | Requires query/passage prefixes |
| all-MiniLM-L6-v2 | English dominant | No | Degrades badly on non-English |
| text-embedding-3-small | 50+ (inconsistent) | Partial | English-dominant training |
| text-embedding-3-large | 50+ (inconsistent) | Partial | Better than small, not purpose-built |
| Cohere embed-v3 | 100+ | Yes | Purpose-built multilingual, strong |
| voyage-large-2 | English dominant | No | English-focused |
Cross-lingual means a query in English can retrieve documents in French, Spanish, or German. This is different from multilingual (supports many languages but query and document must be in the same language).
For genuine cross-lingual retrieval — bge-m3 (open source) or Cohere embed-v3 (managed) are the production-grade choices.
Cost vs. Latency vs. Quality — Decision Matrix
When to Fine-Tune Your Embedding Model
General-purpose embedding models fail in three predictable scenarios:
1. Rare domain vocabulary — terms like "CONV30", "FHA203K", "DSCR", "mTLS" are split into meaningless subword tokens. The model assigns weak embeddings to tokens it rarely saw during training.
2. Domain-specific semantic relationships — in general text, "conventional" and "traditional" are near-synonyms. In mortgage, "conventional loan" is a specific product type. General models don't learn this distinction.
3. Retrieval on short queries — when users submit 2–4 word queries, general models have insufficient signal. A domain fine-tuned model learns to match short domain queries to the right document sections.
When to Fine-Tune (Triggers)
| Signal | Threshold | Action |
|---|---|---|
| MTEB retrieval score | Below 0.60 on your eval set | Fine-tune |
| Domain term miss rate | Above 5% | Fine-tune or add BM25 |
| User negative feedback rate | Above 15% | Evaluate + fine-tune |
| Unique domain terms in corpus | Above 500 rare terms | Consider fine-tuning |
How to Fine-Tune (Open Source)
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from datasets import Dataset

# Start from bge-large-en-v1.5
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Training triplets: (query, positive_doc, negative_doc)
train_dataset = Dataset.from_dict({
    "anchor": ["What are FHA loan limits?", ...],
    "positive": ["FHA loan limits in 2025 are $498,257...", ...],
    "negative": ["Conventional loan down payment requirements...", ...]
})

loss = losses.MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-large-mortgage",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
Minimum training data: 1,000 high-quality pairs. 10,000+ for meaningful gains. Use your existing user query logs as anchors and your document chunks as positives — this is the highest-signal training data you can have.
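One way to turn those logs into triplets; a generic sketch where the query and clicked_chunk field names are placeholders for whatever your logging schema actually captures (swap the random negatives for BM25-mined hard negatives once the pipeline works):

import random

def build_triplets(log_rows, all_chunks, negatives_per_query=1):
    # log_rows: iterable of dicts with hypothetical keys "query" and "clicked_chunk"
    triplets = {"anchor": [], "positive": [], "negative": []}
    for row in log_rows:
        for _ in range(negatives_per_query):
            negative = random.choice(all_chunks)
            if negative == row["clicked_chunk"]:
                continue  # skip accidental positives
            triplets["anchor"].append(row["query"])
            triplets["positive"].append(row["clicked_chunk"])
            triplets["negative"].append(negative)
    return triplets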
How to Fine-Tune (Azure)
Azure AI Foundry supports embedding fine-tuning for text-embedding-3-small and text-embedding-3-large via the fine-tuning API. Upload JSONL of triplets, trigger a fine-tune job, deploy the custom model endpoint.
Cost: ~$0.03/1K training tokens + standard inference on the custom endpoint.
Indexing Strategy by Model
Different models require different indexing configuration:
| Model | Index Dims | Distance Metric | Normalization Required? |
|---|---|---|---|
| bge-large-en-v1.5 | 1024 | Cosine | Yes (model outputs are not normalized) |
| bge-m3 | 1024 | Cosine | Yes |
| all-MiniLM-L6-v2 | 384 | Cosine | Yes |
| text-embedding-3-large | 256–3072 | Cosine | No (OpenAI normalizes) |
| text-embedding-3-small | 512–1536 | Cosine | No |
| multilingual-e5-large | 1024 | Cosine | Yes |
Always normalize embeddings before indexing for open source models. Many vector indexes compute cosine similarity as a dot product and assume unit-length vectors, so unnormalized vectors produce incorrect similarity scores. This is a silent failure — retrieval appears to work but rankings are wrong.
import numpy as np
def normalize(embedding):
return embedding / np.linalg.norm(embedding)
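With sentence-transformers you can also ask the library to do this at encode time via its normalize_embeddings flag, which avoids forgetting the manual step:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embedding = model.encode("What are closing costs?", normalize_embeddings=True)  # unit-length output, safe for cosine indexes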
In Azure AI Search, set vectorSearchAlgorithmConfiguration metric to cosine — normalization is handled internally.
What We Use in Production at MortgageIQ
Model: text-embedding-3-large at 512 dimensions (Matryoshka truncation)
Why 512 not 3072: At 512 dims, we retain ~93% of retrieval precision vs full 3072-dim, at 6x lower storage and 4x lower compute cost for the ANN index. For a knowledge base of ~50K mortgage guideline chunks, this matters.
Why not bge-large-en-v1.5: Azure AI Search's semantic ranker (cross-encoder reranking) is trained on OpenAI embedding spaces. Mixing bge embeddings with Azure's semantic ranker degrades reranking precision — the ranker expects OpenAI-scale vector distributions.
Why not text-embedding-3-small: Loan program codes and regulatory terms like "RESPA", "TRID", "QM loan" have weaker representation in the small model. In our eval set, 3-large outperforms 3-small by 11 points on domain-specific queries.
Multilingual: Not required — all MortgageIQ documents and queries are English. If we expand to Spanish-language loan officers, we'd evaluate bge-m3 or Cohere embed-v3 for cross-lingual retrieval.
Model Selection Guide
Key Takeaways
- Embeddings encode meaning, not words — the transformer's self-attention mechanism produces contextual representations where "closing costs" and "cash at settlement" map to nearby vectors regardless of vocabulary overlap.
- Embedding models are trained in two stages — pretrained base (masked language modeling) + contrastive fine-tuning on sentence pairs. The quality of hard negatives in Stage 2 is the single biggest driver of retrieval performance.
- Matryoshka models (text-embedding-3-large, Nomic) let you truncate dimensions at inference time — use 256-dim for cost savings, full dims where precision is non-negotiable.
- For multilingual/cross-lingual workloads, bge-m3 (open source) and Cohere embed-v3 (managed) are the production-grade choices — general models degrade badly on non-English queries.
- Fine-tune when your domain vocabulary is specialized — mortgage codes, medical terms, legal citations don't embed well in general models. A thousand domain-specific pairs is enough to start; 10,000+ delivers meaningful gains.
- Normalize open source model outputs before indexing — this is the most common silent failure in RAG deployments using HuggingFace models.
Coming Up in This Series
- Day 4: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
- Day 5: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance