Most teams pick an embedding model by copying a tutorial. Then they hit production and find that retrieval precision drops on their domain vocabulary, latency spikes under load, or multilingual queries return garbage results.
The model choice is architectural. Get it wrong and no amount of retrieval tuning will save you.
This is the full picture — how embeddings actually work at the math level, how embedding models are trained, and a complete comparison of every major model used in production across open source and Azure stacks.
What Is an Embedding?
An embedding is a fixed-length vector of floating-point numbers that encodes the meaning of a piece of text. Two texts with similar meaning produce vectors that are close together in high-dimensional space. Two texts with unrelated meaning produce vectors that are far apart.
The vector doesn't encode the words — it encodes the concept. This is why semantic search works across vocabulary gaps: "closing costs" and "cash at settlement" map to nearby vectors even though they share no words.
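To make "close together" concrete: most retrieval systems score that closeness with cosine similarity. A minimal numpy sketch, with made-up 4-dimensional vectors standing in for real embeddings:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means same direction (same meaning); values near 0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors for illustration only; real models output hundreds of dimensions
closing_costs = np.array([0.12, -0.40, 0.88, 0.31])
cash_at_settlement = np.array([0.10, -0.35, 0.91, 0.28])
weather_forecast = np.array([-0.75, 0.60, 0.05, -0.20])
print(cosine_similarity(closing_costs, cash_at_settlement))  # high: similar meaning
print(cosine_similarity(closing_costs, weather_forecast))    # low: unrelated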
How Embeddings Work: The Transformer Architecture
Every modern embedding model is built on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need." Understanding the transformer explains why embeddings work — and why they fail.
Step 1 — Tokenization
Text is broken into tokens — subword units, not whole words. "mortgage" becomes ["mort", "gage"]. "FHA203K" might become ["FH", "A", "203", "K"].
This is important: rare domain terms often tokenize into meaningless subword fragments. This is one reason domain-specific embedding models outperform general models on specialized vocabulary.
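You can check this for your own vocabulary in a few lines. A sketch assuming the Hugging Face transformers library and a BERT-style WordPiece tokenizer (exact splits vary by tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for term in ["mortgage", "FHA203K", "DSCR", "escrow"]:
    print(term, "->", tokenizer.tokenize(term))
# Rare domain terms typically split into several subword fragments,
# each of which carries little meaning on its own.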
Step 2 — Token Embeddings + Positional Encoding
Each token ID maps to a learnable embedding vector (the token embedding). A positional encoding is added to inject word order information — transformers have no inherent sense of sequence.
token_representation = token_embedding[id] + positional_encoding[position]
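In code this is just a table lookup plus an addition. A toy numpy sketch with made-up sizes and random weights, not any particular model's parameters:

import numpy as np

vocab_size, max_len, dim = 30_000, 512, 768
token_embedding = np.random.randn(vocab_size, dim)    # learned lookup table
positional_encoding = np.random.randn(max_len, dim)   # learned or sinusoidal

token_ids = [2131, 8774, 1012]  # hypothetical IDs for a 3-token input
token_representations = np.stack([
    token_embedding[tid] + positional_encoding[pos]
    for pos, tid in enumerate(token_ids)
])  # shape (3, 768): one vector per token, order-aware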
Step 3 — Self-Attention
This is the core of the transformer. Every token attends to every other token in the sequence, computing a weighted sum based on relevance (the attention score).
After self-attention, "costs" is no longer a generic token — it's a contextually-aware representation that knows it appears near "closing" and "settlement." Same word in a different sentence ("costs of living") produces a different vector.
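Under the hood this is scaled dot-product attention. A bare-bones single-head numpy sketch with random placeholder weights, just to show the mechanics:

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, dim) token representations from the previous step
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every token's relevance to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax of attention scores
    return weights @ V                               # each output is a weighted mix of all tokens

dim = 768
X = np.random.randn(5, dim)                          # 5 tokens of context
W_q, W_k, W_v = (np.random.randn(dim, dim) * 0.02 for _ in range(3))
contextual = self_attention(X, W_q, W_k, W_v)        # (5, 768) contextually-aware token vectors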
Step 4 — Stacking Layers
Transformers stack 12–24 (or more) attention layers. Each layer refines the representations. Early layers capture syntax. Later layers capture semantics and domain meaning.
Step 5 — Pooling to a Single Vector
After all transformer layers, you have one vector per token. To produce a single sentence embedding, you pool across tokens:
- [CLS] token pooling — use the first special token's vector (BERT-style)
- Mean pooling — average all token vectors (common in sentence-transformers)
- Max pooling — take the maximum value per dimension
Why mean pooling outperforms [CLS]: The [CLS] token was designed for classification tasks, not semantic similarity. Mean pooling captures signal from all tokens and consistently outperforms [CLS] for retrieval benchmarks.
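Mean pooling is only a few lines once you have per-token vectors. A sketch using a Hugging Face model directly, mirroring what sentence-transformers does internally (the attention mask keeps padding tokens out of the average):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

batch = tokenizer(["What are closing costs?"], padding=True, return_tensors="pt")
with torch.no_grad():
    token_vectors = model(**batch).last_hidden_state   # (batch, seq_len, dim)
mask = batch["attention_mask"].unsqueeze(-1)            # zero out padding positions
sentence_embedding = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)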
How Embedding Models Are Trained
Raw pretrained transformers (BERT, RoBERTa) produce token representations — not useful sentence embeddings. Embedding models require additional training on top of a pretrained base.
Stage 1 — Masked Language Model Pretraining (Base Model)
The base transformer is trained on massive text corpora using masked language modeling: randomly mask 15% of tokens, predict the masked tokens. This teaches the model language structure, grammar, and world knowledge.
This is what BERT, RoBERTa, and most base models do. This stage costs millions of dollars in compute.
Stage 2 — Contrastive Learning (Embedding Training)
The base model is fine-tuned on sentence pairs using contrastive loss (typically Multiple Negatives Ranking Loss or InfoNCE).
Training data: pairs of semantically similar texts — question-answer pairs, paraphrase pairs, NLI (natural language inference) datasets. The quality of this data is what separates good embedding models from great ones.
Hard negatives: pairs that are superficially similar but semantically different. These are the hardest to train on and the most impactful. Models trained with hard negatives generalize significantly better to domain vocabulary.
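Conceptually, the in-batch contrastive objective (InfoNCE / Multiple Negatives Ranking) treats every other positive in the batch as a negative for a given query. A toy PyTorch sketch of the idea, not the exact sentence-transformers implementation:

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: (batch, dim), already L2-normalized
    similarities = query_emb @ doc_emb.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(query_emb.shape[0])            # the matching doc sits on the diagonal
    # Cross-entropy pulls each query toward its own doc and pushes it away from every other doc in the batch
    return F.cross_entropy(similarities, labels)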
Stage 3 — Matryoshka Representation Learning (Optional, Modern Models)
Newer models (OpenAI's text-embedding-3 series, Nomic Embed) use Matryoshka Representation Learning (MRL) — a training technique that forces the first N dimensions of the embedding to be independently meaningful.
This means you can truncate a 3072-dim embedding to 256 dims and still get 90%+ of the retrieval performance. This dramatically reduces storage and compute costs without retraining.
Practical impact: use 256-dim embeddings for high-volume, cost-sensitive use cases. Use full dimensions where precision is critical.
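Truncation is a plain slice plus re-normalization. A minimal sketch, assuming the provider returned a full-dimension MRL-trained embedding:

import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int = 256) -> np.ndarray:
    # Keep the first `dims` dimensions, then re-normalize so cosine scores stay comparable
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)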
The Embedding Model Landscape
Dimensions, Cost, and Latency Comparison
| Model | Provider | Dims | Max Tokens | Multilingual | Latency (p50) | Cost per 1M tokens |
|---|---|---|---|---|---|---|
| text-embedding-3-small | Azure OpenAI | 512–1536 | 8,191 | Partial | ~30ms | $0.02 |
| text-embedding-3-large | Azure OpenAI | 256–3072 | 8,191 | Partial | ~45ms | $0.13 |
| text-embedding-ada-002 | Azure OpenAI | 1536 | 8,191 | Partial | ~35ms | $0.10 |
| BAAI/bge-large-en-v1.5 | HuggingFace OSS | 1024 | 512 | No (English only) | ~20ms* | Free (self-hosted) |
| BAAI/bge-m3 | HuggingFace OSS | 1024 | 8,192 | Yes (100+ langs) | ~35ms* | Free (self-hosted) |
| sentence-transformers/all-MiniLM-L6-v2 | HuggingFace OSS | 384 | 256 | No | ~5ms* | Free |
| intfloat/multilingual-e5-large | HuggingFace OSS | 1024 | 512 | Yes (94 langs) | ~25ms* | Free |
| Cohere embed-v3 | Cohere | 1024 | 512 | Yes (100+ langs) | ~40ms | $0.10 |
| Nomic embed-text-v1.5 | Nomic AI | 64–768 | 8,192 | Partial | ~15ms* | Free (OSS) |
| voyage-large-2 | Voyage AI | 1536 | 16,000 | Partial | ~50ms | $0.12 |
*Self-hosted latency varies by hardware. Values are GPU inference on A10G.
Open Source Models — Deep Dive
BAAI/bge-large-en-v1.5
The gold standard for English-only enterprise RAG. Trained by Beijing Academy of AI with carefully curated hard negatives. Consistently tops the MTEB (Massive Text Embedding Benchmark) leaderboard for English retrieval tasks.
Strengths: top retrieval performance on English text, 1024-dim, handles 512 tokens, free
Weaknesses: English only, 512 token limit cuts off long documents, requires GPU for production throughput
Best for: English-first enterprises with on-prem or air-gapped requirements, teams with GPU infra
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Prepend the BGE query instruction ("Represent this sentence for searching relevant passages: ") to queries only; documents are embedded as-is
query_embedding = model.encode("Represent this sentence for searching relevant passages: closing costs")
doc_embedding = model.encode("How much cash is needed at settlement?")
BAAI/bge-m3
The multilingual evolution of bge-large. Supports 100+ languages, 8,192 token context, and — crucially — supports three retrieval modes simultaneously: dense (vector), sparse (BM25-like), and multi-vector (ColBERT-style). This makes it the only open source model that can do hybrid search natively.
Strengths: multilingual (100+ langs), 8K context, native hybrid retrieval, strong MTEB scores
Weaknesses: large model (~570M parameters), slower than bge-large-en on English-only workloads
Best for: multilingual knowledge bases, when you want one model to replace your BM25 + vector search stack
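A usage sketch via BAAI's FlagEmbedding library; the API names below reflect its documented interface as I understand it, so verify them against the release you install:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
output = model.encode(
    ["What are closing costs?"],
    return_dense=True,          # dense vectors for ANN search
    return_sparse=True,         # lexical weights for BM25-style matching
    return_colbert_vecs=True,   # multi-vector (ColBERT-style) representations
)
dense_vecs = output["dense_vecs"]
lexical_weights = output["lexical_weights"]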
sentence-transformers/all-MiniLM-L6-v2
The "fast and cheap" option. 6-layer MiniLM (distilled from larger models), 384-dim, 256 token limit. Not competitive on precision benchmarks but runs on CPU with ~5ms latency.
Strengths: extremely fast, CPU-runnable, low memory footprint
Weaknesses: 256 token limit, lower precision, English-only
Best for: prototyping, high-volume low-stakes retrieval, resource-constrained environments
intfloat/multilingual-e5-large
Microsoft Research's multilingual embedding model. Strong cross-lingual retrieval — can match an English query to a French document. Important: requires query prefix "query: " and document prefix "passage: " during inference.
# Critical: prefix is required for correct behavior
query = "query: What are closing costs?"
doc = "passage: Les frais de clôture comprennent..." # French document
Strengths: 94 language support, cross-lingual (query in English, docs in Spanish works)
Weaknesses: requires prefixes (easy to forget, breaks retrieval silently), lower English precision than bge-large-en
Best for: multinational enterprises with mixed-language document corpora
Nomic embed-text-v1.5
Fully open source (weights + training code + data). Supports Matryoshka dimensions (64 to 768). 8,192 token context. Competitive with OpenAI ada-002 at zero cost.
Strengths: long context (8K), MRL dimensions, fully open (FOSS), strong performance/cost ratio
Weaknesses: newer model, less battle-tested in enterprise, smaller community
Best for: cost-sensitive teams who want long context without paying OpenAI rates
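A usage sketch with sentence-transformers; the trust_remote_code flag, the truncate_dim Matryoshka option, and the search_query/search_document prefixes are taken from the model card as I recall it, so confirm them before relying on this:

from sentence_transformers import SentenceTransformer

# truncate_dim exploits the Matryoshka property; drop it to keep the full 768 dims
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True, truncate_dim=256)
query_emb = model.encode("search_query: What are FHA loan limits?", normalize_embeddings=True)
doc_emb = model.encode("search_document: FHA loan limits in 2025...", normalize_embeddings=True)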
Azure / Managed Models — Deep Dive
text-embedding-3-small
OpenAI's cost-optimized embedding model with Matryoshka support. Default to 1536-dim or truncate to 512 for storage savings with minimal precision loss. Replaces ada-002 for most use cases at a fraction of the cost.
Cost: $0.02/1M tokens — 5x cheaper than ada-002
Strengths: fast, cheap, MRL support, Azure managed (no infra), integrates natively with Azure AI Search
Weaknesses: 8,191 token limit (documents get truncated), English-dominant (multilingual support is inconsistent)
Best for: high-volume English retrieval where cost matters, quick Azure AI Search integration
text-embedding-3-large
OpenAI's highest-precision embedding model. 3072-dim with MRL — truncate to 256 for cost savings or use full 3072 for maximum precision. Outperforms ada-002 on MTEB by a significant margin.
Cost: $0.13/1M tokens
Strengths: best-in-class precision on MTEB, MRL support (use 256-dim for cost savings), Azure enterprise SLA
Weaknesses: 6.5x more expensive than text-embedding-3-small, latency higher at full dimensions
Best for: regulated industries where retrieval precision is non-negotiable, MortgageIQ-style use cases where a wrong answer has compliance consequences
# Azure OpenAI — truncate to 256 dims for 90% precision at 12x lower storage cost
from openai import AzureOpenAI
client = AzureOpenAI(...)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What are FHA loan limits for 2025?",
    dimensions=256,  # Matryoshka truncation
)
text-embedding-ada-002
The previous generation model. Still widely deployed but outperformed by text-embedding-3-small on precision and cost. No Matryoshka support — fixed 1536-dim.
Recommendation: if you're starting a new project, use text-embedding-3-small or 3-large. If ada-002 is already in production, migration is low-risk but requires re-embedding your entire index.
Multilingual Support — Full Comparison
| Model | Languages | Cross-lingual? | Notes |
|---|---|---|---|
| bge-large-en-v1.5 | English only | No | Best English precision |
| bge-m3 | 100+ | Yes | Query EN → Doc FR works |
| multilingual-e5-large | 94 | Yes | Requires query/passage prefixes |
| all-MiniLM-L6-v2 | English dominant | No | Degrades badly on non-English |
| text-embedding-3-small | 50+ (inconsistent) | Partial | English-dominant training |
| text-embedding-3-large | 50+ (inconsistent) | Partial | Better than small, not purpose-built |
| Cohere embed-v3 | 100+ | Yes | Purpose-built multilingual, strong |
| voyage-large-2 | English dominant | No | English-focused |
Cross-lingual means a query in English can retrieve documents in French, Spanish, or German. This is different from multilingual (supports many languages but query and document must be in the same language).
For genuine cross-lingual retrieval — bge-m3 (open source) or Cohere embed-v3 (managed) are the production-grade choices.
Cost vs. Latency vs. Quality — Decision Matrix
When to Fine-Tune Your Embedding Model
General-purpose embedding models fail in three predictable scenarios:
1. Rare domain vocabulary — terms like "CONV30", "FHA203K", "DSCR", "mTLS" are split into meaningless subword tokens. The model assigns weak embeddings to tokens it rarely saw during training.
2. Domain-specific semantic relationships — in general text, "conventional" and "traditional" are near-synonyms. In mortgage, "conventional loan" is a specific product type. General models don't learn this distinction.
3. Retrieval on short queries — when users submit 2–4 word queries, general models have insufficient signal. A domain fine-tuned model learns to match short domain queries to the right document sections.
When to Fine-Tune (Triggers)
| Signal | Threshold | Action |
|---|---|---|
| MTEB retrieval score | Below 0.60 on your eval set | Fine-tune |
| Domain term miss rate | Above 5% | Fine-tune or add BM25 |
| User negative feedback rate | Above 15% | Evaluate + fine-tune |
| Unique domain terms in corpus | Above 500 rare terms | Consider fine-tuning |
How to Fine-Tune (Open Source)
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from datasets import Dataset

# Start from bge-large-en-v1.5
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Training triplets: (query, positive_doc, negative_doc)
train_dataset = Dataset.from_dict({
    "anchor": ["What are FHA loan limits?", ...],
    "positive": ["FHA loan limits in 2025 are $498,257...", ...],
    "negative": ["Conventional loan down payment requirements...", ...]
})

loss = losses.MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-large-mortgage",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
Minimum training data: 1,000 high-quality pairs. 10,000+ for meaningful gains. Use your existing user query logs as anchors and your document chunks as positives — this is the highest-signal training data you can have.
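One way to turn those logs into triplets; a generic sketch where the query and clicked_chunk field names are placeholders for whatever your logging schema actually captures (swap the random negatives for BM25-mined hard negatives once the pipeline works):

import random

def build_triplets(log_rows, all_chunks, negatives_per_query=1):
    # log_rows: iterable of dicts with hypothetical keys "query" and "clicked_chunk"
    triplets = {"anchor": [], "positive": [], "negative": []}
    for row in log_rows:
        for _ in range(negatives_per_query):
            negative = random.choice(all_chunks)
            if negative == row["clicked_chunk"]:
                continue  # skip accidental positives
            triplets["anchor"].append(row["query"])
            triplets["positive"].append(row["clicked_chunk"])
            triplets["negative"].append(negative)
    return triplets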
How to Fine-Tune (Azure)
Azure AI Foundry supports embedding fine-tuning for text-embedding-3-small and text-embedding-3-large via the fine-tuning API. Upload JSONL of triplets, trigger a fine-tune job, deploy the custom model endpoint.
Cost: ~$0.03/1K training tokens + standard inference on the custom endpoint.
Indexing Strategy by Model
Different models require different indexing configuration:
| Model | Index Dims | Distance Metric | Normalization Required? |
|---|---|---|---|
| bge-large-en-v1.5 | 1024 | Cosine | Yes (model outputs are not normalized) |
| bge-m3 | 1024 | Cosine | Yes |
| all-MiniLM-L6-v2 | 384 | Cosine | Yes |
| text-embedding-3-large | 256–3072 | Cosine | No (OpenAI normalizes) |
| text-embedding-3-small | 512–1536 | Cosine | No |
| multilingual-e5-large | 1024 | Cosine | Yes |
Always normalize embeddings before indexing for open source models. Many vector indexes compute cosine similarity as a dot product and assume unit-length vectors, so unnormalized vectors produce incorrect similarity scores. This is a silent failure — retrieval appears to work but rankings are wrong.
import numpy as np
def normalize(embedding):
return embedding / np.linalg.norm(embedding)
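With sentence-transformers you can also ask the library to do this at encode time via its normalize_embeddings flag, which avoids forgetting the manual step:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embedding = model.encode("What are closing costs?", normalize_embeddings=True)  # unit-length output, safe for cosine indexes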
In Azure AI Search, set vectorSearchAlgorithmConfiguration metric to cosine — normalization is handled internally.
What We Use in Production at MortgageIQ
Model: text-embedding-3-large at 512 dimensions (Matryoshka truncation)
Why 512 not 3072: At 512 dims, we retain ~93% of retrieval precision vs full 3072-dim, at 6x lower storage and 4x lower compute cost for the ANN index. For a knowledge base of ~50K mortgage guideline chunks, this matters.
Why not bge-large-en-v1.5: Azure AI Search's semantic ranker (cross-encoder reranking) is trained on OpenAI embedding spaces. Mixing bge embeddings with Azure's semantic ranker degrades reranking precision — the ranker expects OpenAI-scale vector distributions.
Why not text-embedding-3-small: Loan program codes and regulatory terms like "RESPA", "TRID", "QM loan" have weaker representation in the small model. In our eval set, 3-large outperforms 3-small by 11 points on domain-specific queries.
Multilingual: Not required — all MortgageIQ documents and queries are English. If we expand to Spanish-language loan officers, we'd evaluate bge-m3 or Cohere embed-v3 for cross-lingual retrieval.
Model Selection Guide
Key Takeaways
- Embeddings encode meaning, not words — the transformer's self-attention mechanism produces contextual representations where "closing costs" and "cash at settlement" map to nearby vectors regardless of vocabulary overlap.
- Embedding models are trained in two stages — pretrained base (masked language modeling) + contrastive fine-tuning on sentence pairs. The quality of hard negatives in Stage 2 is the single biggest driver of retrieval performance.
- Matryoshka models (text-embedding-3-large, Nomic) let you truncate dimensions at inference time — use 256-dim for cost savings, full dims where precision is non-negotiable.
- For multilingual/cross-lingual workloads, bge-m3 (open source) and Cohere embed-v3 (managed) are the production-grade choices — general models degrade badly on non-English queries.
- Fine-tune when your domain vocabulary is specialized — mortgage codes, medical terms, legal citations don't embed well in general models. A thousand domain-specific pairs is enough to start; 10,000+ delivers meaningful gains.
- Normalize open source model outputs before indexing — this is the most common silent failure in RAG deployments using HuggingFace models.
Coming Up in This Series
- Day 4: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
- Day 5: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance