Your LLM doesn't know about your company. It doesn't know what happened last Tuesday. And it can't tell you where it got its answer from.
RAG — Retrieval-Augmented Generation — was supposed to fix this. It does. But only if the retrieval layer is built correctly. Most aren't.
This is Day 1 of a series on RAG architecture. We start with the problem before we touch the solution — because the teams that build bad RAG systems always skip this step.
The Three Hard Limits of Standard LLMs
Every language model has three structural constraints that no amount of prompt engineering can overcome.
1. Training Cutoff
GPT-4o, Claude 3.5, Gemini 1.5 — every major model was trained on a snapshot of the internet. That snapshot has a cutoff date. Anything published after that date doesn't exist to the model.
Ask GPT-4o about a mortgage rate change from two months ago. Ask it about your company's Q1 earnings. Ask it about a regulatory update from last quarter. You'll get one of two outcomes: confident hallucination or an honest "I don't know."
Neither is acceptable in an enterprise system.
2. No Access to Private Data
LLMs are trained on public internet data. They have never seen your internal documentation, your customer contracts, your engineering runbooks, your compliance policies, or your product specs.
When your users ask questions that require that private knowledge, the model is forced to improvise. In a regulated industry — mortgage, healthcare, financial services — improvised answers aren't just unhelpful. They're a liability.
3. No Source Traceability
Even when a standard LLM gives the right answer, it cannot tell you where it got it from. There's no citation. No document reference. No audit trail.
For consumer apps, this is an annoyance. For regulated industries, it's a blocker. Your compliance team needs to know what document justified a loan recommendation. "The model said so" is not a defensible answer.
What RAG Does
RAG doesn't retrain the model. It changes what the model sees at inference time.
Instead of relying on the model's baked-in knowledge, you retrieve the most relevant documents from your knowledge base and inject them into the model's context window — alongside the user's question.
The model now has:
- Current information — whatever you've indexed, regardless of training cutoff
- Private data — your internal documents, not public training data
- Source traceability — every chunk that influenced the answer can be cited
This sounds simple. The implementation is not. The complexity lives entirely in the retrieval layer — and specifically in how you find the right chunks.
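To make the shape concrete, here is a minimal sketch of the inference-time flow. The retriever is a stand-in for the retrieval layer discussed next, and the OpenAI client is just one possible backend, not a requirement:

```python
# Minimal RAG inference-time flow (a sketch; `retrieve` is a stand-in for the
# retrieval layer covered below, and the OpenAI client is one possible backend).
from typing import Callable
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, retrieve: Callable[[str, int], list[dict]]) -> str:
    chunks = retrieve(question, 5)  # hybrid search + reranking, covered below
    # Keep the source of each chunk so the answer can cite it.
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    prompt = (
        "Answer using only the context below and cite sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```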
How Retrieval Works: The Core Problem
You have 50,000 documents in your knowledge base. A user asks a question. You need to identify the 5–10 document chunks most likely to contain the answer — in under 200ms.
There are two fundamentally different ways to do this. Each has different failure modes.
Sparse Retrieval — BM25
Sparse retrieval treats text as a bag of words. It scores documents by how often query terms appear (term frequency), weighted by how rare those terms are across the corpus (inverse document frequency). If the words in the query appear in the document, the document scores highly.
The canonical algorithm is BM25 — the engine behind Elasticsearch, Azure AI Search's keyword mode, and most traditional search systems built in the last 20 years.
Where sparse retrieval wins:
- Exact term matching — product codes, names, IDs, technical jargon
- Rare terms with high information density
- Speed — BM25 runs on inverted indexes; it's fast
Where sparse retrieval fails:
- Vocabulary mismatch — "cash upfront" vs "closing costs"
- Synonyms — "attorney" vs "lawyer" vs "counsel"
- Paraphrase — same concept, different words
- Cross-language queries
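A minimal sketch using the rank_bm25 package (an assumption for illustration; Elasticsearch or Azure AI Search would back this in production) shows both the exact-match strength and the vocabulary-mismatch failure:

```python
# BM25 over a toy corpus, using the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

docs = [
    "Loan program AZ-1234-B requires a 620 minimum credit score",
    "Closing costs typically run 2-5% of the loan amount",
]
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

# Exact-term query: the rare ID "az-1234-b" scores strongly against doc 0.
print(bm25.get_scores("az-1234-b credit score".split()))

# Vocabulary mismatch: "cash upfront" shares no terms with doc 1,
# so the document about closing costs scores near zero.
print(bm25.get_scores("how much cash upfront do i need".split()))
```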
Dense Retrieval — Embeddings
Dense retrieval converts text into numerical vectors using an embedding model. Semantically similar text maps to nearby points in high-dimensional vector space. Similarity is measured by vector proximity (typically cosine similarity), not keyword overlap.
Where dense retrieval wins:
- Semantic similarity — finds "closing costs" when asked about "cash upfront"
- Paraphrase matching
- Cross-domain analogies
- Long-tail queries where exact terms are unpredictable
Where dense retrieval fails:
- Exact term matching — a vector search for "AZ-1234-B" may miss the document if the embedding model smooths over rare tokens
- Domain-specific terminology not well-represented in the embedding model's training data
- Short queries — insufficient signal for meaningful vector representation
- Cold start — embedding models need fine-tuning for specialized domains to perform well
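A minimal sketch with sentence-transformers shows the flip side of the BM25 example above: no shared vocabulary, but high semantic similarity. The small all-MiniLM-L6-v2 model here is an assumption for illustration; the stack table below lists stronger production choices.

```python
# Dense retrieval sketch: embed documents and query, compare by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Closing costs typically run 2-5% of the loan amount",
    "An attorney must review the purchase agreement",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

# The vocabulary mismatch that defeats BM25: no shared terms with doc 0,
# but the embeddings should place the query close to it.
query_vec = model.encode("how much cash upfront do i need", normalize_embeddings=True)
print(util.cos_sim(query_vec, doc_vecs))  # expect doc 0 to score above doc 1
```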
The Tradeoff Table
| Property | Sparse (BM25) | Dense (Embeddings) |
|---|---|---|
| Keyword match | Excellent | Poor |
| Semantic match | Poor | Excellent |
| Rare terms / IDs | Excellent | Unreliable |
| Cross-language | No | Yes (multilingual models) |
| Speed | Very fast | Fast (ANN index) |
| Explainability | High (term scores) | Low (black box) |
| Cold start | None | Needs embedding model |
Neither retriever is universally better. Production systems need both.
Hybrid Search — The Production Standard
Hybrid search runs both retrievers in parallel and merges their ranked result lists using Reciprocal Rank Fusion (RRF).
RRF formula:
RRF_score(doc) = Σ_i 1 / (rank_i + k)
Where the sum runs over the retrievers that returned the document, rank_i is the document's position in retriever i's ranked list, and k is a smoothing constant (typically 60). A document ranked #1 by sparse and #3 by dense scores higher than one ranked #2 by either alone.
Why RRF works: It doesn't require score normalization across retrievers (BM25 scores and cosine similarities are not directly comparable). It's rank-based, not score-based, which makes it robust to score distribution differences.
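The fusion step itself is small. A minimal RRF implementation over two ranked lists of document IDs looks like this:

```python
# Reciprocal Rank Fusion over ranked lists of document IDs (a minimal sketch).
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

sparse = ["doc_7", "doc_2", "doc_9"]   # BM25 ranking
dense = ["doc_2", "doc_5", "doc_7"]    # vector ranking
print(rrf_fuse([sparse, dense]))       # doc_2 and doc_7 rise to the top
```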
Azure AI Search implements hybrid search natively with a single API call. In my MortgageIQ build, switching from pure vector search to hybrid retrieval reduced retrieval miss rate on exact loan program codes (like "CONV30", "FHA203K") from ~18% to under 2% — with zero regression on semantic queries.
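For reference, a hybrid query in Azure AI Search looks roughly like the sketch below. Endpoint, index, and field names are placeholders, and the exact classes depend on your azure-search-documents SDK version:

```python
# Hybrid query against Azure AI Search: keyword (BM25) + vector in one call.
# Endpoint, key, index, and field names are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="<index-name>",
    credential=AzureKeyCredential("<api-key>"),
)

def hybrid_search(query_text: str, query_vector: list[float], top: int = 20):
    # Passing both search_text and vector_queries triggers hybrid retrieval;
    # Azure AI Search fuses the two rankings with RRF on the service side.
    return client.search(
        search_text=query_text,
        vector_queries=[
            VectorizedQuery(vector=query_vector, k_nearest_neighbors=50, fields="contentVector")
        ],
        top=top,
    )
```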
Reranking — The Precision Layer
Hybrid search gives you a merged list of the top-K candidates. It does not give you the right ordering within that list. Retrieval optimizes for recall — getting the right documents into the candidate set. Reranking optimizes for precision — promoting the most relevant documents to the top.
Bi-Encoder vs Cross-Encoder
Bi-encoders (what embedding models are) encode the query and document independently. Fast, parallelizable, scales to millions of documents. But they compare query and document in isolation — the model never "sees" both together.
Cross-encoders take the query and candidate document as a concatenated input. The model processes them jointly, allowing full attention across both. This captures nuances — negation, conditionality, specificity — that bi-encoders miss.
The tradeoff is speed. A cross-encoder requires a full model forward pass for every query-document pair, so cost scales linearly with the number of candidates. Running a cross-encoder against 50,000 documents is not feasible at query time. The architecture is always: fast retrieval to a candidate set (20–100 docs), then cross-encoder reranking on the candidates.
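A minimal reranking sketch, using the cross-encoder/ms-marco-MiniLM-L-6-v2 model from the open source stack table below (the toy query and candidates are assumptions):

```python
# Cross-encoder reranking of a small candidate set (sketch with toy data).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how much cash upfront do i need at closing"
candidates = [
    "Closing costs typically run 2-5% of the loan amount",
    "An attorney must review the purchase agreement",
    "Loan program AZ-1234-B requires a 620 minimum credit score",
]

# Each (query, document) pair is scored jointly, with full attention across both.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
```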
Relevance Score Threshold
Don't pass all reranked chunks to the LLM. Set a threshold — typically 0.7+ on a 0–1 normalized scale. Chunks below the threshold are discarded.
This matters because low-relevance chunks don't help — they dilute the context and can actively mislead the model. If no chunks exceed the threshold, surface that to the user: "I don't have reliable information on this topic in my knowledge base."
This is a forcing function for honesty. It's also how you build a system that's auditable in regulated environments.
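A sketch of the threshold filter, assuming reranker scores have been normalized to a 0–1 scale (raw cross-encoder logits would need a sigmoid or similar first):

```python
# Relevance threshold before generation (sketch; 0.7 cutoff assumes 0-1 scores).
NO_ANSWER = "I don't have reliable information on this topic in my knowledge base."

def filter_chunks(reranked: list[tuple[str, float]], threshold: float = 0.7):
    kept = [(chunk, score) for chunk, score in reranked if score >= threshold]
    if not kept:
        # Surface the gap honestly instead of generating from weak context.
        return [], NO_ANSWER
    return kept, None
```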
The Full RAG Pipeline — Day 1
Putting the pieces together, the Day 1 pipeline is: user query → hybrid retrieval (BM25 + vector search, fused with RRF) → cross-encoder reranking → relevance threshold filter → LLM generation with source citations.
Implementing RAG: Open Source vs Azure
Open Source Stack
| Component | Tool | Notes |
|---|---|---|
| Embedding model | sentence-transformers (BAAI/bge-large) | Free, self-hosted, strong on English |
| Vector index | Qdrant / Weaviate / Chroma | Qdrant best for production scale |
| Keyword search | Elasticsearch / OpenSearch | BM25 built-in |
| Hybrid fusion | Custom RRF code | ~20 lines of Python |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 | HuggingFace, runs on CPU |
| LLM | Ollama (Llama 3 / Mistral) or OpenAI API | Your choice |
| Orchestration | LangChain / LlamaIndex | LlamaIndex better for RAG pipelines |
Best for: teams with ML engineers, cost-sensitive workloads, air-gapped/on-prem requirements, or when you need full control over the embedding model.
Azure Stack
| Component | Tool | Notes |
|---|---|---|
| Embedding model | Azure OpenAI text-embedding-3-large | Managed, no infra |
| Vector + keyword index | Azure AI Search | Hybrid search + RRF in one service |
| Reranker | Azure AI Search semantic ranker | Cross-encoder reranking as a flag |
| LLM | Azure OpenAI GPT-4o | Same API, enterprise SLA |
| Orchestration | Semantic Kernel / Azure AI Foundry | Native .NET + Python SDKs |
| Observability | Azure Monitor + App Insights | Integrated, no setup |
Best for: enterprises already on Azure, regulated industries needing data residency and compliance (SOC 2, HIPAA), teams without dedicated ML infra, .NET shops.
When to Choose What
| Scenario | Go Open Source | Go Azure |
|---|---|---|
| Startup / prototype | ✓ Lower cost | — |
| Air-gapped / on-prem | ✓ Only option | — |
| Fine-tuned domain model needed | ✓ Full control | — |
| Enterprise + regulated (fintech, healthcare) | — | ✓ Compliance built-in |
| .NET / C# primary stack | — | ✓ Semantic Kernel native |
| No ML team | — | ✓ Managed everything |
| Multi-region + SLA required | — | ✓ Azure infra |
What I've Seen Fail in Production
Using pure vector search and calling it RAG. Vector search alone misses exact term queries. Every production knowledge base has product codes, named entities, and specific identifiers. You need BM25.
Skipping reranking to save latency. Retrieval recall without reranking precision means your LLM gets noisy context. The 100–150ms a cross-encoder costs is worth it for every use case where accuracy matters.
No threshold filtering. Passing all retrieved chunks to the LLM regardless of relevance score produces worse answers than fewer, higher-quality chunks. More context is not always better context.
Ignoring the vocabulary gap. If your domain has specialized terminology not well-represented in general embedding models (mortgage instruments, medical codes, legal citations), a general-purpose embedding model underperforms. Fine-tune, or use a domain-specific model.
Testing only with queries written by the team that built the system. Your team writes questions using document vocabulary. Your users don't. Test with real user queries from day one.
Key Numbers to Know
| Component | Typical Latency | Scales to |
|---|---|---|
| BM25 keyword search | 5–20ms | Billions of docs |
| Vector ANN search | 20–50ms | Hundreds of millions |
| Hybrid RRF fusion | <5ms | — |
| Cross-encoder reranking (top 20) | 80–150ms | — |
| Total retrieval pipeline | 100–200ms | — |
| LLM generation (GPT-4o, 500 tokens) | 800–1500ms | — |
Retrieval is fast. The bottleneck is always the LLM. Invest in retrieval quality over retrieval speed.
Key Takeaways
- RAG solves three hard LLM limits — training cutoff, no private data access, and no source traceability — by injecting retrieved context at inference time, not by retraining the model.
- Sparse (BM25) and dense (embedding) retrieval have opposite failure modes — use hybrid search with RRF to get keyword precision and semantic coverage in a single pipeline.
- Retrieval is about recall; reranking is about precision — a cross-encoder reranker over your top-20 candidates consistently outperforms any single retriever alone.
- Set a relevance score threshold before passing chunks to the LLM — low-relevance context degrades answer quality and breaks auditability in regulated systems.
Coming Up in This Series
- Day 2: Chunking Strategy — fixed-size, semantic, recursive, and document-aware chunking and when each breaks
- Day 3: Embedding Models — general vs. domain-specific, fine-tuning triggers, Azure vs. open source
- Day 4: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
- Day 5: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance