ai-ml · April 19, 2026 · rag · retrieval · embeddings · hybrid-search · reranking · llm · azure-openai

Why Your LLM Doesn't Know What Happened Last Tuesday

Standard LLMs have hard limits: training cutoffs, no private data, no source traceability. RAG fixes all three — but only if your retrieval layer is built right.

Your LLM doesn't know about your company. It doesn't know what happened last Tuesday. And it can't tell you where it got its answer from.

RAG — Retrieval-Augmented Generation — was supposed to fix this. It does. But only if the retrieval layer is built correctly. Most aren't.

This is Day 1 of a series on RAG architecture. We start with the problem before we touch the solution — because the teams that build bad RAG systems always skip this step.


The Three Hard Limits of Standard LLMs

Every language model has three structural constraints that no amount of prompt engineering can overcome.

1. Training Cutoff

GPT-4o, Claude 3.5, Gemini 1.5 — every major model was trained on a snapshot of the internet. That snapshot has a cutoff date. Anything published after that date doesn't exist to the model.

Ask GPT-4o about a mortgage rate change from two months ago. Ask it about your company's Q1 earnings. Ask it about a regulatory update from last quarter. You'll get one of two outcomes: confident hallucination or an honest "I don't know."

Neither is acceptable in an enterprise system.

2. No Access to Private Data

LLMs are trained on public internet data. They have never seen your internal documentation, your customer contracts, your engineering runbooks, your compliance policies, or your product specs.

When your users ask questions that require that private knowledge, the model is forced to improvise. In a regulated industry — mortgage, healthcare, financial services — improvised answers aren't just unhelpful. They're a liability.

3. No Source Traceability

Even when a standard LLM gives the right answer, it cannot tell you where it got it from. There's no citation. No document reference. No audit trail.

For consumer apps, this is an annoyance. For regulated industries, it's a blocker. Your compliance team needs to know what document justified a loan recommendation. "The model said so" is not a defensible answer.


What RAG Does

RAG doesn't retrain the model. It changes what the model sees at inference time.

Instead of relying on the model's baked-in knowledge, you retrieve the most relevant documents from your knowledge base and inject them into the model's context window — alongside the user's question.

The model now has:

  • Current information — whatever you've indexed, regardless of training cutoff
  • Private data — your internal documents, not public training data
  • Source traceability — every chunk that influenced the answer can be cited

This sounds simple. The implementation is not. The complexity lives entirely in the retrieval layer — and specifically in how you find the right chunks.


How Retrieval Works: The Core Problem

You have 50,000 documents in your knowledge base. A user asks a question. You need to identify the 5–10 document chunks most likely to contain the answer — in under 200ms.

There are two fundamentally different ways to do this. Each has different failure modes.

Sparse Retrieval — BM25

Sparse retrieval treats text as a bag of words. It counts term frequency and inverse document frequency. If the words in the query appear in the document, the document scores highly.

The canonical algorithm is BM25 — the engine behind Elasticsearch, Azure AI Search's keyword mode, and every traditional search system built in the last 20 years.

Where sparse retrieval wins:

  • Exact term matching — product codes, names, IDs, technical jargon
  • Rare terms with high information density
  • Speed — BM25 runs on inverted indexes; it's fast

Where sparse retrieval fails:

  • Vocabulary mismatch — "cash upfront" vs "closing costs"
  • Synonyms — "attorney" vs "lawyer" vs "counsel"
  • Paraphrase — same concept, different words
  • Cross-language queries

Dense Retrieval — Embeddings

Dense retrieval converts text into numerical vectors using an embedding model. Semantically similar text maps to nearby points in high-dimensional vector space. Similarity is measured by cosine distance, not keyword overlap.

Where dense retrieval wins:

  • Semantic similarity — finds "closing costs" when asked about "cash upfront"
  • Paraphrase matching
  • Cross-domain analogies
  • Long-tail queries where exact terms are unpredictable

Where dense retrieval fails:

  • Exact term matching — a vector search for "AZ-1234-B" may miss the document if the embedding model smooths over rare tokens
  • Domain-specific terminology not well-represented in the embedding model's training data
  • Short queries — insufficient signal for meaningful vector representation
  • Cold start — embedding models need fine-tuning for specialized domains to perform well
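"Nearby points in vector space" just means high cosine similarity. A minimal sketch — the four-dimensional vectors below are invented toy values standing in for real embeddings, which have hundreds to thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, chosen so the paraphrase lands near the query
# and the unrelated text does not.
query      = [0.8, 0.1, 0.3, 0.0]   # "cash upfront"
paraphrase = [0.7, 0.2, 0.4, 0.1]   # "closing costs"
unrelated  = [0.0, 0.9, 0.0, 0.8]   # "adjustable rate cap"

print(cosine_similarity(query, paraphrase))  # high (~0.97)
print(cosine_similarity(query, unrelated))   # low (~0.09)
```

An embedding model does the hard part — producing vectors where paraphrases actually land close together; the similarity math itself is this simple.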

The Tradeoff Table

| Property | Sparse (BM25) | Dense (Embeddings) |
|---|---|---|
| Keyword match | Excellent | Poor |
| Semantic match | Poor | Excellent |
| Rare terms / IDs | Excellent | Unreliable |
| Cross-language | No | Yes (multilingual models) |
| Speed | Very fast | Fast (ANN index) |
| Explainability | High (term scores) | Low (black box) |
| Cold start | None | Needs embedding model |

Neither retriever is universally better. Production systems need both.


Hybrid Search — The Production Standard

Hybrid search runs both retrievers in parallel and merges their ranked result lists using Reciprocal Rank Fusion (RRF).

RRF formula:

RRF_score(doc) = Σ 1 / (rank_i + k)

Where rank_i is the document's position in each ranked list and k is a smoothing constant (typically 60). A document ranked #1 by sparse and #3 by dense scores higher than one ranked #2 by either alone.

Why RRF works: It doesn't require score normalization across retrievers (BM25 scores and cosine similarities are not directly comparable). It's rank-based, not score-based, which makes it robust to score distribution differences.
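The fusion step is genuinely small. A minimal sketch, with invented doc IDs:

```python
def rrf_fuse(ranked_lists, k=60):
    """Merge ranked result lists with Reciprocal Rank Fusion.

    ranked_lists: lists of doc IDs, best-first.
    Returns doc IDs sorted by fused score, best-first.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d2", "d3"]   # BM25 ranking
dense  = ["d4", "d5", "d1"]   # vector ranking
print(rrf_fuse([sparse, dense]))  # d1 first: it appears in both lists
```

Note that no retriever scores enter the computation — only ranks — which is exactly why BM25 scores and cosine similarities never need to be normalized against each other.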

Azure AI Search implements hybrid search natively with a single API call. In my MortgageIQ build, switching from pure vector search to hybrid retrieval reduced retrieval miss rate on exact loan program codes (like "CONV30", "FHA203K") from ~18% to under 2% — with zero regression on semantic queries.


Reranking — The Precision Layer

Hybrid search gives you a merged list of the top-K candidates. It does not give you the right ordering within that list. Retrieval optimizes for recall — getting the right documents into the candidate set. Reranking optimizes for precision — promoting the most relevant documents to the top.

Bi-Encoder vs Cross-Encoder

Bi-encoders (what embedding models are) encode the query and document independently. Fast, parallelizable, scales to millions of documents. But they compare query and document in isolation — the model never "sees" both together.

Cross-encoders take the query and candidate document as a concatenated input. The model processes them jointly, allowing full attention across both. This captures nuances — negation, conditionality, specificity — that bi-encoders miss.

The tradeoff is speed. A cross-encoder has nothing to precompute: it needs a full model forward pass per (query, document) pair at query time, so cost grows linearly with the number of candidates. Running a cross-encoder against 50,000 documents is not feasible at query time. The architecture is always: fast retrieval to a candidate set (20–100 docs), then cross-encoder reranking on the candidates.

Relevance Score Threshold

Don't pass all reranked chunks to the LLM. Set a threshold — typically 0.7+ on a 0–1 normalized scale. Chunks below the threshold are discarded.

This matters because low-relevance chunks don't help — they dilute the context and can actively mislead the model. If no chunks exceed the threshold, surface that to the user: "I don't have reliable information on this topic in my knowledge base."

This is a forcing function for honesty. It's also how you build a system that's auditable in regulated environments.
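A minimal sketch of the threshold gate, with invented chunks and scores:

```python
def filter_chunks(reranked, threshold=0.7):
    """Keep (chunk, score) pairs whose normalized rerank score clears the threshold."""
    return [(c, s) for c, s in reranked if s >= threshold]

reranked = [
    ("FHA loans require a minimum 3.5% down payment ...", 0.91),
    ("Conventional loans drop PMI at 20% equity ...", 0.74),
    ("Office holiday schedule for 2026 ...", 0.22),  # noise, below 0.7
]
kept = filter_chunks(reranked)
if kept:
    context = "\n\n".join(c for c, _ in kept)   # goes into the LLM prompt
else:
    context = None  # surface: "I don't have reliable information on this topic."
print(len(kept))  # 2
```

The empty-result branch is the honesty path: a `None` context should produce a refusal, never a forced answer.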


The Full RAG Pipeline — Day 1
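Putting the pieces together, the Day 1 query path is: hybrid retrieval → RRF fusion → cross-encoder rerank → threshold gate → generation with citations. A minimal end-to-end sketch — the four callables are stand-ins for whatever stack backs each stage (BM25 engine, vector index, cross-encoder, LLM), and their names and signatures are illustrative:

```python
def rag_answer(query, sparse_search, dense_search, rerank, generate,
               top_k=50, final_k=5, threshold=0.7, rrf_k=60):
    """One RAG query, end to end. Returns (answer, cited_chunk_ids)."""
    # 1. Hybrid retrieval: both retrievers return best-first chunk IDs.
    rankings = [sparse_search(query, top_k), dense_search(query, top_k)]
    # 2. Reciprocal Rank Fusion merges the two ranked lists (rank-based).
    fused = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (rank + rrf_k)
    candidates = sorted(fused, key=fused.get, reverse=True)[:top_k]
    # 3. Cross-encoder rerank: joint (query, chunk) scoring, 0-1 normalized.
    scored = rerank(query, candidates)            # [(chunk_id, score), ...]
    kept = [cid for cid, s in sorted(scored, key=lambda x: -x[1])
            if s >= threshold][:final_k]
    # 4. Nothing cleared the threshold -> honest refusal, not a forced answer.
    if not kept:
        return "I don't have reliable information on this topic.", []
    # 5. Generate; the kept chunk IDs double as the audit trail / citations.
    return generate(query, kept), kept
```

Every answer comes back paired with the chunk IDs that produced it — that pairing is what makes the system traceable.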


Implementing RAG: Open Source vs Azure

Open Source Stack

| Component | Tool | Notes |
|---|---|---|
| Embedding model | sentence-transformers (BAAI/bge-large) | Free, self-hosted, strong on English |
| Vector index | Qdrant / Weaviate / Chroma | Qdrant best for production scale |
| Keyword search | Elasticsearch / OpenSearch | BM25 built-in |
| Hybrid fusion | Custom RRF code | ~20 lines of Python |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 | HuggingFace, runs on CPU |
| LLM | Ollama (Llama 3 / Mistral) or OpenAI API | Your choice |
| Orchestration | LangChain / LlamaIndex | LlamaIndex better for RAG pipelines |

Best for: teams with ML engineers, cost-sensitive workloads, air-gapped/on-prem requirements, or when you need full control over the embedding model.


Azure Stack

| Component | Tool | Notes |
|---|---|---|
| Embedding model | Azure OpenAI text-embedding-3-large | Managed, no infra |
| Vector + keyword index | Azure AI Search | Hybrid search + RRF in one service |
| Reranker | Azure AI Search semantic ranker | Cross-encoder reranking as a flag |
| LLM | Azure OpenAI GPT-4o | Same API, enterprise SLA |
| Orchestration | Semantic Kernel / Azure AI Foundry | Native .NET + Python SDKs |
| Observability | Azure Monitor + App Insights | Integrated, no setup |

Best for: enterprises already on Azure, regulated industries needing data residency and compliance (SOC 2, HIPAA), teams without dedicated ML infra, .NET shops.


When to Choose What

| Scenario | Go Open Source | Go Azure |
|---|---|---|
| Startup / prototype | ✓ Lower cost | |
| Air-gapped / on-prem | ✓ Only option | |
| Fine-tuned domain model needed | ✓ Full control | |
| Enterprise + regulated (fintech, healthcare) | | ✓ Compliance built-in |
| .NET / C# primary stack | | ✓ Semantic Kernel native |
| No ML team | | ✓ Managed everything |
| Multi-region + SLA required | | ✓ Azure infra |

What I've Seen Fail in Production

Using pure vector search and calling it RAG. Vector search alone misses exact term queries. Every production knowledge base has product codes, named entities, and specific identifiers. You need BM25.

Skipping reranking to save latency. Retrieval recall without reranking precision means your LLM gets noisy context. The 100–150ms a cross-encoder costs is worth it for every use case where accuracy matters.

No threshold filtering. Passing all retrieved chunks to the LLM regardless of relevance score produces worse answers than fewer, higher-quality chunks. More context is not always better context.

Ignoring the vocabulary gap. If your domain has specialized terminology not well-represented in general embedding models (mortgage instruments, medical codes, legal citations), a general-purpose embedding model underperforms. Fine-tune, or use a domain-specific model.

Testing only with queries written by the team that built the system. Your team writes questions using document vocabulary. Your users don't. Test with real user queries from day one.


Key Numbers to Know

| Component | Typical Latency | Scales to |
|---|---|---|
| BM25 keyword search | 5–20ms | Billions of docs |
| Vector ANN search | 20–50ms | Hundreds of millions |
| Hybrid RRF fusion | <5ms | — |
| Cross-encoder reranking (top 20) | 80–150ms | — |
| Total retrieval pipeline | 100–200ms | — |
| LLM generation (GPT-4o, 500 tokens) | 800–1500ms | — |

Retrieval is fast. The bottleneck is always the LLM. Invest in retrieval quality over retrieval speed.


Key Takeaways

  • RAG solves three hard LLM limits — training cutoff, no private data access, and no source traceability — by injecting retrieved context at inference time, not by retraining the model.
  • Sparse (BM25) and dense (embedding) retrieval have opposite failure modes — use hybrid search with RRF to get keyword precision and semantic coverage in a single pipeline.
  • Retrieval is about recall; reranking is about precision — a cross-encoder reranker over your top-20 candidates consistently outperforms any single retriever alone.
  • Set a relevance score threshold before passing chunks to the LLM — low-relevance context degrades answer quality and breaks auditability in regulated systems.

Coming Up in This Series

  • Day 2: Chunking Strategy — fixed-size, semantic, recursive, and document-aware chunking and when each breaks
  • Day 3: Embedding Models — general vs. domain-specific, fine-tuning triggers, Azure vs. open source
  • Day 4: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
  • Day 5: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance