Your LLM doesn't know about your company. It doesn't know what happened last Tuesday. And it can't tell you where it got its answer from.
RAG — Retrieval-Augmented Generation — was supposed to fix this. It does. But only if the retrieval layer is built correctly. Most aren't.
This is Day 1 of a series on RAG architecture. We start with the problem before we touch the solution — because the teams that build bad RAG systems always skip this step.
The Three Hard Limits of Standard LLMs
Every language model has three structural constraints that no amount of prompt engineering can overcome.
1. Training Cutoff
GPT-4o, Claude 3.5, Gemini 1.5 — every major model was trained on a snapshot of the internet. That snapshot has a cutoff date. Anything published after that date doesn't exist to the model.
Ask GPT-4o about a mortgage rate change from two months ago. Ask it about your company's Q1 earnings. Ask it about a regulatory update from last quarter. You'll get one of two outcomes: confident hallucination or an honest "I don't know."
Neither is acceptable in an enterprise system.
2. No Access to Private Data
LLMs are trained on public internet data. They have never seen your internal documentation, your customer contracts, your engineering runbooks, your compliance policies, or your product specs.
When your users ask questions that require that private knowledge, the model is forced to improvise. In a regulated industry — mortgage, healthcare, financial services — improvised answers aren't just unhelpful. They're a liability.
3. No Source Traceability
Even when a standard LLM gives the right answer, it cannot tell you where it got it from. There's no citation. No document reference. No audit trail.
For consumer apps, this is an annoyance. For regulated industries, it's a blocker. Your compliance team needs to know what document justified a loan recommendation. "The model said so" is not a defensible answer.
What RAG Does
RAG doesn't retrain the model. It changes what the model sees at inference time.
Instead of relying on the model's baked-in knowledge, you retrieve the most relevant documents from your knowledge base and inject them into the model's context window — alongside the user's question.
The model now has:
- Current information — whatever you've indexed, regardless of training cutoff
- Private data — your internal documents, not public training data
- Source traceability — every chunk that influenced the answer can be cited
This sounds simple. The implementation is not. The complexity lives entirely in the retrieval layer — and specifically in how you find the right chunks.
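To make the shape concrete, here is a minimal sketch of the inference-time flow. The retriever is a stand-in for the retrieval layer discussed next, and the OpenAI client is just one possible backend, not a requirement:

```python
# Minimal RAG inference-time flow (a sketch; `retrieve` is a stand-in for the
# retrieval layer covered below, and the OpenAI client is one possible backend).
from typing import Callable
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, retrieve: Callable[[str, int], list[dict]]) -> str:
    chunks = retrieve(question, 5)  # hybrid search + reranking, covered below
    # Keep the source of each chunk so the answer can cite it.
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    prompt = (
        "Answer using only the context below and cite sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```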
How Retrieval Works: The Core Problem
You have 50,000 documents in your knowledge base. A user asks a question. You need to identify the 5–10 document chunks most likely to contain the answer — in under 200ms.
There are two fundamentally different ways to do this. Each has different failure modes.
Sparse Retrieval — BM25
Sparse retrieval treats text as a bag of words. It scores documents by how often query terms appear (term frequency), weighted by how rare those terms are across the corpus (inverse document frequency). If the words in the query appear in the document, the document scores highly.
The canonical algorithm is BM25 — the engine behind Elasticsearch, Azure AI Search's keyword mode, and most traditional search systems built in the last 20 years.
Where sparse retrieval wins:
- Exact term matching — product codes, names, IDs, technical jargon
- Rare terms with high information density
- Speed — BM25 runs on inverted indexes; it's fast
Where sparse retrieval fails:
- Vocabulary mismatch — "cash upfront" vs "closing costs"
- Synonyms — "attorney" vs "lawyer" vs "counsel"
- Paraphrase — same concept, different words
- Cross-language queries
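A minimal sketch using the rank_bm25 package (an assumption for illustration; Elasticsearch or Azure AI Search would back this in production) shows both the exact-match strength and the vocabulary-mismatch failure:

```python
# BM25 over a toy corpus, using the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

docs = [
    "Loan program AZ-1234-B requires a 620 minimum credit score",
    "Closing costs typically run 2-5% of the loan amount",
]
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

# Exact-term query: the rare ID "az-1234-b" scores strongly against doc 0.
print(bm25.get_scores("az-1234-b credit score".split()))

# Vocabulary mismatch: "cash upfront" shares no terms with doc 1,
# so the document about closing costs scores near zero.
print(bm25.get_scores("how much cash upfront do i need".split()))
```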
Dense Retrieval — Embeddings
Dense retrieval converts text into numerical vectors using an embedding model. Semantically similar text maps to nearby points in high-dimensional vector space. Similarity is measured by vector proximity (typically cosine similarity), not keyword overlap.
Where dense retrieval wins:
- Semantic similarity — finds "closing costs" when asked about "cash upfront"
- Paraphrase matching
- Cross-domain analogies
- Long-tail queries where exact terms are unpredictable
Where dense retrieval fails:
- Exact term matching — a vector search for "AZ-1234-B" may miss the document if the embedding model smooths over rare tokens
- Domain-specific terminology not well-represented in the embedding model's training data
- Short queries — insufficient signal for meaningful vector representation
- Cold start — embedding models need fine-tuning for specialized domains to perform well
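A minimal sketch with sentence-transformers shows the flip side of the BM25 example above: no shared vocabulary, but high semantic similarity. The small all-MiniLM-L6-v2 model here is an assumption for illustration; the stack table below lists stronger production choices.

```python
# Dense retrieval sketch: embed documents and query, compare by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Closing costs typically run 2-5% of the loan amount",
    "An attorney must review the purchase agreement",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

# The vocabulary mismatch that defeats BM25: no shared terms with doc 0,
# but the embeddings should place the query close to it.
query_vec = model.encode("how much cash upfront do i need", normalize_embeddings=True)
print(util.cos_sim(query_vec, doc_vecs))  # expect doc 0 to score above doc 1
```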
The Tradeoff Table
| Property | Sparse (BM25) | Dense (Embeddings) |
|---|---|---|
| Keyword match | Excellent | Poor |
| Semantic match | Poor | Excellent |
| Rare terms / IDs | Excellent | Unreliable |
| Cross-language | No | Yes (multilingual models) |
| Speed | Very fast | Fast (ANN index) |
| Explainability | High (term scores) | Low (black box) |
| Cold start | None | Needs embedding model |
Neither retriever is universally better. Production systems need both.
Hybrid Search — The Production Standard
Hybrid search runs both retrievers in parallel and merges their ranked result lists using Reciprocal Rank Fusion (RRF).
RRF formula:
RRF_score(doc) = Σ_i 1 / (rank_i + k)
Where the sum runs over the retrievers that returned the document, rank_i is the document's position in retriever i's ranked list, and k is a smoothing constant (typically 60). A document ranked #1 by sparse and #3 by dense scores higher than one ranked #2 by either alone.
Why RRF works: It doesn't require score normalization across retrievers (BM25 scores and cosine similarities are not directly comparable). It's rank-based, not score-based, which makes it robust to score distribution differences.
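The fusion step itself is small. A minimal RRF implementation over two ranked lists of document IDs looks like this:

```python
# Reciprocal Rank Fusion over ranked lists of document IDs (a minimal sketch).
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

sparse = ["doc_7", "doc_2", "doc_9"]   # BM25 ranking
dense = ["doc_2", "doc_5", "doc_7"]    # vector ranking
print(rrf_fuse([sparse, dense]))       # doc_2 and doc_7 rise to the top
```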
Azure AI Search implements hybrid search natively with a single API call. In my MortgageIQ build, switching from pure vector search to hybrid retrieval reduced retrieval miss rate on exact loan program codes (like "CONV30", "FHA203K") from ~18% to under 2% — with zero regression on semantic queries.
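For reference, a hybrid query in Azure AI Search looks roughly like the sketch below. Endpoint, index, and field names are placeholders, and the exact classes depend on your azure-search-documents SDK version:

```python
# Hybrid query against Azure AI Search: keyword (BM25) + vector in one call.
# Endpoint, key, index, and field names are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="<index-name>",
    credential=AzureKeyCredential("<api-key>"),
)

def hybrid_search(query_text: str, query_vector: list[float], top: int = 20):
    # Passing both search_text and vector_queries triggers hybrid retrieval;
    # Azure AI Search fuses the two rankings with RRF on the service side.
    return client.search(
        search_text=query_text,
        vector_queries=[
            VectorizedQuery(vector=query_vector, k_nearest_neighbors=50, fields="contentVector")
        ],
        top=top,
    )
```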
Reranking — The Precision Layer
Hybrid search gives you a merged list of the top-K candidates. It does not give you the right ordering within that list. Retrieval optimizes for recall — getting the right documents into the candidate set. Reranking optimizes for precision — promoting the most relevant documents to the top.
Bi-Encoder vs Cross-Encoder
Bi-encoders (what embedding models are) encode the query and document independently. Fast, parallelizable, scales to millions of documents. But they compare query and document in isolation — the model never "sees" both together.
Cross-encoders take the query and candidate document as a concatenated input. The model processes them jointly, allowing full attention across both. This captures nuances — negation, conditionality, specificity — that bi-encoders miss.
The tradeoff is speed. A cross-encoder requires a full model forward pass for every query-document pair, so cost scales linearly with the number of candidates. Running a cross-encoder against 50,000 documents is not feasible at query time. The architecture is always: fast retrieval to a candidate set (20–100 docs), then cross-encoder reranking on the candidates.
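A minimal reranking sketch, using the cross-encoder/ms-marco-MiniLM-L-6-v2 model from the open source stack table below (the toy query and candidates are assumptions):

```python
# Cross-encoder reranking of a small candidate set (sketch with toy data).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how much cash upfront do i need at closing"
candidates = [
    "Closing costs typically run 2-5% of the loan amount",
    "An attorney must review the purchase agreement",
    "Loan program AZ-1234-B requires a 620 minimum credit score",
]

# Each (query, document) pair is scored jointly, with full attention across both.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
```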
Relevance Score Threshold
Don't pass all reranked chunks to the LLM. Set a threshold — typically 0.7+ on a 0–1 normalized scale. Chunks below the threshold are discarded.
This matters because low-relevance chunks don't help — they dilute the context and can actively mislead the model. If no chunks exceed the threshold, surface that to the user: "I don't have reliable information on this topic in my knowledge base."
This is a forcing function for honesty. It's also how you build a system that's auditable in regulated environments.
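A sketch of the threshold filter, assuming reranker scores have been normalized to a 0–1 scale (raw cross-encoder logits would need a sigmoid or similar first):

```python
# Relevance threshold before generation (sketch; 0.7 cutoff assumes 0-1 scores).
NO_ANSWER = "I don't have reliable information on this topic in my knowledge base."

def filter_chunks(reranked: list[tuple[str, float]], threshold: float = 0.7):
    kept = [(chunk, score) for chunk, score in reranked if score >= threshold]
    if not kept:
        # Surface the gap honestly instead of generating from weak context.
        return [], NO_ANSWER
    return kept, None
```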
The Full RAG Pipeline — Day 1
Putting the pieces together, the Day 1 pipeline is: user query → hybrid retrieval (BM25 + vector search, fused with RRF) → cross-encoder reranking → relevance threshold filter → LLM generation with source citations.
Implementing RAG: Open Source vs Azure
Open Source Stack
| Component | Tool | Notes |
|---|---|---|
| Embedding model | sentence-transformers (BAAI/bge-large) | Free, self-hosted, strong on English |
| Vector index | Qdrant / Weaviate / Chroma | Qdrant best for production scale |
| Keyword search | Elasticsearch / OpenSearch | BM25 built-in |
| Hybrid fusion | Custom RRF code | ~20 lines of Python |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 | HuggingFace, runs on CPU |
| LLM | Ollama (Llama 3 / Mistral) or OpenAI API | Your choice |
| Orchestration | LangChain / LlamaIndex | LlamaIndex better for RAG pipelines |
Best for: teams with ML engineers, cost-sensitive workloads, air-gapped/on-prem requirements, or when you need full control over the embedding model.
Azure Stack
| Component | Tool | Notes |
|---|---|---|
| Embedding model | Azure OpenAI text-embedding-3-large | Managed, no infra |
| Vector + keyword index | Azure AI Search | Hybrid search + RRF in one service |
| Reranker | Azure AI Search semantic ranker | Cross-encoder reranking as a flag |
| LLM | Azure OpenAI GPT-4o | Same API, enterprise SLA |
| Orchestration | Semantic Kernel / Azure AI Foundry | Native .NET + Python SDKs |
| Observability | Azure Monitor + App Insights | Integrated, no setup |
Best for: enterprises already on Azure, regulated industries needing data residency and compliance (SOC 2, HIPAA), teams without dedicated ML infra, .NET shops.
When to Choose What
| Scenario | Go Open Source | Go Azure |
|---|---|---|
| Startup / prototype | ✓ Lower cost | — |
| Air-gapped / on-prem | ✓ Only option | — |
| Fine-tuned domain model needed | ✓ Full control | — |
| Enterprise + regulated (fintech, healthcare) | — | ✓ Compliance built-in |
| .NET / C# primary stack | — | ✓ Semantic Kernel native |
| No ML team | — | ✓ Managed everything |
| Multi-region + SLA required | — | ✓ Azure infra |
What I've Seen Fail in Production
Using pure vector search and calling it RAG. Vector search alone misses exact term queries. Every production knowledge base has product codes, named entities, and specific identifiers. You need BM25.
Skipping reranking to save latency. Retrieval recall without reranking precision means your LLM gets noisy context. The 100–150ms a cross-encoder costs is worth it for every use case where accuracy matters.
No threshold filtering. Passing all retrieved chunks to the LLM regardless of relevance score produces worse answers than fewer, higher-quality chunks. More context is not always better context.
Ignoring the vocabulary gap. If your domain has specialized terminology not well-represented in general embedding models (mortgage instruments, medical codes, legal citations), a general-purpose embedding model underperforms. Fine-tune, or use a domain-specific model.
Testing only with queries written by the team that built the system. Your team writes questions using document vocabulary. Your users don't. Test with real user queries from day one.
Key Numbers to Know
| Component | Typical Latency | Scales to |
|---|---|---|
| BM25 keyword search | 5–20ms | Billions of docs |
| Vector ANN search | 20–50ms | Hundreds of millions |
| Hybrid RRF fusion | <5ms | — |
| Cross-encoder reranking (top 20) | 80–150ms | — |
| Total retrieval pipeline | 100–200ms | — |
| LLM generation (GPT-4o, 500 tokens) | 800–1500ms | — |
Retrieval is fast. The bottleneck is always the LLM. Invest in retrieval quality over retrieval speed.
Key Takeaways
- RAG solves three hard LLM limits — training cutoff, no private data access, and no source traceability — by injecting retrieved context at inference time, not by retraining the model.
- Sparse (BM25) and dense (embedding) retrieval have opposite failure modes — use hybrid search with RRF to get keyword precision and semantic coverage in a single pipeline.
- Retrieval is about recall; reranking is about precision — a cross-encoder reranker over your top-20 candidates consistently outperforms any single retriever alone.
- Set a relevance score threshold before passing chunks to the LLM — low-relevance context degrades answer quality and breaks auditability in regulated systems.
Coming Up in This Series
- Day 2: Chunking Strategy — fixed-size, semantic, recursive, and document-aware chunking and when each breaks
- Day 3: Embedding Models — general vs. domain-specific, fine-tuning triggers, Azure vs. open source
- Day 4: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
- Day 5: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance