ai-ml · March 22, 2026
Tags: rag, azure, openai, retrieval, chunking, embeddings, grounding

RAG Is Not the Hard Part. Retrieval Is.

Most RAG failures are retrieval failures — the model receives bad context and generates a bad answer. Here's how to diagnose and fix the retrieval layer.

Eighty percent of RAG failures are retrieval failures. The model generates a confident, fluent, wrong answer — not because GPT-4o hallucinated, but because the retrieval layer returned the wrong chunks, or no chunks at all. Fixing the prompt doesn't help. You have to fix what goes into it.

I built MortgageIQ — a domain-grounded loan assistant on Azure OpenAI — and spent more time on the retrieval pipeline than on everything else combined. Here's what I learned.


What "Retrieval Failure" Actually Looks Like

Ask a RAG system "How much cash do I need upfront to buy a house?" and watch what happens.

If the knowledge base has a document called closing-costs.md with a section titled "Closing Cost Breakdown," a keyword-based retriever returns nothing — because the words "cash" and "upfront" don't appear in the document. The model gets no context. It answers from training data. The answer might be directionally right. It might not be. Either way, you have no traceability.

This is the retrieval miss pattern. It's invisible in demos — because demo questions are written to match the document vocabulary — and common in production, where users ask questions in their own words.

The gap between "cash upfront" and "closing costs" is a vocabulary gap. The fix isn't a better prompt. It's a retrieval layer that understands semantic similarity.


The Seven Levels of RAG Architecture

RAG is not one architecture. It's a progression. Each level adds capability to address a specific failure mode in the level before it.

| Level | What it solves | What it doesn't |
| --- | --- | --- |
| 1 Naive | Nothing — prototype only | Context overflow, relevance, cost |
| 2 Basic | Relevance via embedding similarity | Vocabulary mismatch |
| 3 Hybrid | Vocabulary mismatch (BM25 + vector) | Vague multi-turn queries |
| 4 Query Understanding | Multi-turn, vague queries | Precision of final chunks |
| 5 Reranking | Precision — best 3 from top 20 | Cost, latency |
| 6 Production | Cost, guardrails, observability | Complex multi-source questions |
| 7 Agentic | Multi-source, multi-hop reasoning | Latency, determinism |

Most enterprise teams build Level 1 and call it production. Most production systems need Level 3 or 4. The right level depends on your failure mode.


Level 1: Naive RAG — Why It Fails

The simplest pattern: load all documents into the prompt on every query.

This works for prototyping with 3–5 short documents. It fails in production because:

  • Context window fills up — at 5 documents of 2,000 tokens each, you've consumed 10K tokens before the user asks anything
  • No relevance filtering — the model receives everything, relevant or not; quality degrades with noise
  • Cost scales with document count, not query complexity

Use Level 1 to prove the model can answer from given context. Use nothing else from it.
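For concreteness, a minimal sketch of the Level 1 pattern in Python; the docs directory and prompt shape are illustrative, and the LLM call itself is elided:

```python
from pathlib import Path

def build_naive_prompt(question: str, docs_dir: str = "docs") -> str:
    """Level 1: stuff every document into the prompt, on every query."""
    # Every query pays for every document, relevant or not.
    context = "\n\n".join(p.read_text() for p in sorted(Path(docs_dir).glob("*.md")))
    return f"Answer from the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```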


Level 2: Basic RAG — The Offline/Online Split

The foundational pattern that everything builds on: separate indexing from retrieval.

The offline pipeline runs once (or on document change). Documents are chunked, embedded, and stored. The online path embeds the query using the same model, finds the closest chunk vectors, and assembles the prompt from those chunks only.
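As a minimal sketch of that split, assuming a hypothetical `embed()` wrapper around whatever embedding model you use, and an in-memory index:

```python
import numpy as np

# Offline: runs once, or when documents change.
def build_index(chunks: list[str], embed) -> np.ndarray:
    vectors = np.array([embed(c) for c in chunks])
    # Normalize once so cosine similarity reduces to a dot product.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Online: runs on every query, with the same embedding model.
def retrieve(query: str, chunks: list[str], index: np.ndarray, embed, k: int = 3) -> list[str]:
    q = np.asarray(embed(query), dtype=float)
    q /= np.linalg.norm(q)
    scores = index @ q                  # cosine similarity per chunk
    top = np.argsort(scores)[::-1][:k]  # best k chunks, highest first
    return [chunks[i] for i in top]
```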

The chunking decision matters more than the embedding model. I've seen teams spend weeks evaluating text-embedding-3-small vs text-embedding-3-large while their chunking strategy was splitting paragraphs mid-sentence. A bad chunking strategy degrades every query. A better embedding model helps 10% of queries.

Chunking: The Decisions Nobody Talks About

Fixed-size chunking (e.g., 512 tokens with 50-token overlap) is the default in every tutorial. It's wrong for most domain-specific content.

In MortgageIQ, the knowledge base is structured with markdown section headers. Each ## header marks a concept boundary — Credit Score Requirements, Down Payment Requirements, Debt-to-Income Ratio. Chunking at section boundaries produces 28 chunks across 5 files, each covering one complete concept.

Fixed-size chunking on the same documents would split "The minimum down payment for an FHA loan with a 580 credit score is 3.5%" across two chunks — one ending at "580" and one starting with "credit score is 3.5%." Both chunks score lower than the complete section for any FHA down payment query.
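Section-boundary chunking is only a few lines; a sketch for markdown sources like MortgageIQ's, where the split keeps each ## header attached to its body, since the header itself carries retrieval signal:

```python
import re

def chunk_by_section(markdown: str) -> list[str]:
    """Split at ## headers so each chunk covers one complete concept."""
    # Zero-width split: each section starts at a line beginning with "## ".
    sections = re.split(r"(?m)^(?=## )", markdown)
    return [s.strip() for s in sections if s.strip()]
```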

Chunking strategies by document type:

| Document type | Best strategy | Why |
| --- | --- | --- |
| Markdown with headers | Section boundaries at ## | Each section is a complete concept |
| PDFs, long-form docs | Semantic boundary + 200-token overlap | Paragraph-level coherence |
| FAQ documents | Question + answer as one chunk | The answer is only meaningful with its question |
| Tables, structured data | Row or row-group as chunk | Schema context must travel with values |
| Code | Function or class boundaries | Snippets without surrounding context lose meaning |

Level 2 failure mode: vector-only retrieval misses exact-term queries. "FHA credit score 580" retrieves the right chunk semantically. "580" alone may not — it's too short for embedding to carry meaning.


Level 3: Hybrid RAG — BM25 + Vector Search

The fix for vocabulary mismatch: run both keyword search and vector search, merge the ranked results.

Reciprocal Rank Fusion (RRF) merges the two ranked lists: a chunk that ranks high in both keyword and vector gets surfaced reliably. A chunk that ranks high in only one still appears, but lower.
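RRF itself is short enough to show; a sketch over two ranked lists of chunk IDs, where k = 60 is the constant from the original RRF paper:

```python
def rrf_merge(keyword_ranked: list[str], vector_ranked: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(chunk) = sum of 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for ranked in (keyword_ranked, vector_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # High in both lists beats high in one; high in one still surfaces.
    return sorted(scores, key=scores.get, reverse=True)
```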

What hybrid retrieval fixes:

| Query | Keyword only | Vector only | Hybrid |
| --- | --- | --- | --- |
| "FHA credit score 580" | ✅ exact match | ✅ semantic match | ✅ |
| "cash upfront for a house" | ❌ no overlap | ✅ "closing costs" is semantically close | ✅ |
| "section 32 of RESPA" | ✅ exact term | ❌ too specific for embedding | ✅ |
| "how much income do I need" | ❌ "income" != "DTI" | ✅ debt-to-income is semantically close | ✅ |

Azure AI Search implements hybrid retrieval natively — one resource, one API call, no custom fusion code. This is Level 3 for MortgageIQ Phase 4B: swapping LocalFileRetriever for AzureSearchRetriever is a one-line dependency injection change.


Level 4: Query Understanding — What You Need for Multi-Turn Chat

Most RAG tutorials assume every query is self-contained. Real conversations aren't.

User: "What are the FHA credit score requirements?" User: "What about conventional loans?" ← this query retrieves nothing useful alone

"What about conventional loans?" has no retrieval signal without the conversation history. A query understanding step rewrites it: "What are the conventional loan credit score requirements?" — before retrieval runs.

Query reformulation techniques:

| Technique | Use case | Tradeoff |
| --- | --- | --- |
| Standalone query | Follow-up questions in chat | Adds one LLM call per turn |
| HyDE (hypothetical answer) | Vague or abstract queries | Higher recall; higher cost |
| Multi-query expansion | Ambiguous questions | Best recall; 2–3x retrieval calls |
| Step-back prompting | Highly specific questions | Better context; can over-generalize |

For a loan copilot, standalone query reformulation is the right default. Most follow-up questions are clarifications ("what about FHA?" after asking about conventional), not abstract queries.


Level 5: Reranking — Precision Where It Matters

Initial retrieval is optimized for recall. You want the right chunk somewhere in the top 20. Reranking is optimized for precision. You want the best 3 chunks at the top.

Why two stages?

A bi-encoder (used in vector search) embeds the query and document separately and compares vectors. It's fast because document vectors are pre-computed. It's approximate because context is lost when encoding independently.

A cross-encoder sees the query and document together: "Given this question AND this chunk, how relevant is this chunk to this question?" It's more accurate — and 10–100x slower than vector search. Running it against every chunk in a 100K-document corpus is impractical. Running it against 20 candidates is ~50ms.
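The two-stage shape is easy to see with an open-source cross-encoder; a sketch using the sentence-transformers library, where the model name is one common choice rather than a recommendation:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Stage 2: score (query, chunk) pairs jointly, keep only the best few."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]
```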

When reranking is non-optional: financial, legal, and medical domains where returning the third-best chunk instead of the best causes a wrong answer with real consequences. For MortgageIQ Advanced, Azure AI Search's built-in semantic ranker provides this without adding a separate service.


Level 6: Production RAG — The Operational Layer

Everything above is about retrieval quality. Production requires four more things: guardrails, caching, observability, and evaluation.

The four production additions:

Semantic cache — A near-duplicate of a previous query should return the cached answer, not re-run retrieval + LLM. At 5,000 queries/day with a 30% cache hit rate, that's 1,500 Azure OpenAI calls saved daily. At GPT-4o pricing, that's meaningful at scale. (A sketch follows at the end of this section.)

Guardrails — Input: reject off-topic queries (live rates, underwriter decisions) before they consume tokens. Output: verify the answer is grounded in the retrieved context, strip any PII before returning.

Token budget — Hard cap on retrieved context before prompt assembly. GPT-4o's 128K context window is large, but stuffing it increases latency and cost without proportional quality gain. MortgageIQ enforces 2,000 tokens — covers 3 substantial chunks with budget to spare.

Evaluation pipeline — The only way to know if retrieval quality improved after a change is to measure it. RAGAS scores three things: faithfulness (is the answer supported by the retrieved context?), answer relevance (does the answer address the question?), and context relevance (are the retrieved chunks actually about the question?). Run this on a golden dataset of 50–100 representative questions after every significant change.
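To make the semantic cache concrete, a minimal in-memory sketch, again assuming a hypothetical `embed()` wrapper; the 0.95 threshold is illustrative and needs tuning against real query pairs:

```python
import numpy as np

class SemanticCache:
    """Serve near-duplicate queries from cache instead of re-running RAG."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed, self.threshold = embed, threshold
        self.vectors: list[np.ndarray] = []
        self.answers: list[str] = []

    def _unit(self, text: str) -> np.ndarray:
        v = np.asarray(self.embed(text), dtype=float)
        return v / np.linalg.norm(v)

    def lookup(self, query: str) -> str | None:
        if not self.vectors:
            return None
        sims = np.stack(self.vectors) @ self._unit(query)
        best = int(np.argmax(sims))
        # Above the threshold, treat it as a question already answered.
        return self.answers[best] if sims[best] >= self.threshold else None

    def store(self, query: str, answer: str) -> None:
        self.vectors.append(self._unit(query))
        self.answers.append(answer)
```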


Level 7: Agentic RAG — When Static Retrieval Isn't Enough

Some questions can't be answered by a single retrieval pass. "Can I afford a $450K home if I make $95K/year?" requires: retrieve DTI limits, calculate a monthly payment, check that payment against DTI, and synthesize a decision.
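The calculation step in that chain is ordinary amortization math; a sketch, where the 7% rate, 10% down payment, and 43% DTI cap are illustrative assumptions rather than retrieved values:

```python
def can_afford(price: float, annual_income: float, rate: float = 0.07,
               down_pct: float = 0.10, years: int = 30, dti_cap: float = 0.43) -> bool:
    """Amortized monthly payment checked against a debt-to-income cap."""
    principal = price * (1 - down_pct)
    r, n = rate / 12, years * 12
    payment = principal * r * (1 + r) ** n / ((1 + r) ** n - 1)
    # $450K at these assumptions: ~$2,694/mo against a ~$3,404 cap on $95K/yr.
    return payment / (annual_income / 12) <= dti_cap
```

A real answer would also fold taxes, insurance, and the borrower's existing debts into the DTI, which is exactly why the agent needs live loan state, not just policy documents.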

An agent decides whether to retrieve, which source to query, and when it has enough information.

The tradeoffs are real: multiple LLM + retrieval round trips, non-deterministic execution paths, harder to trace. Agentic RAG is the right answer when single-pass retrieval provably fails, not as a default.

The loan servicing side of MortgageIQ is an agentic RAG system. Loan servicing questions often require retrieving policy rules, reading current loan state from Cosmos DB, and checking MSP status — three data sources in one response. Single-pass retrieval can't do that.


What I've Seen Fail

1. Evaluating the model, not the retrieval. Teams spend hours prompt-engineering around bad retrieval. The tell: "retrieval-miss" tag firing on questions that should be in scope. Fix: instrument your retrieval hit rate before touching the prompt.

2. Chunking by token count, not by concept. The symptom: answers are partly right. The cause: the relevant content was split across two chunks, neither of which scored high enough alone. Fix: chunk at semantic boundaries (section headers, paragraph ends) with modest overlap.

3. No token budget. Stuffing every retrieved chunk into the prompt without a cap. The model's attention dilutes across irrelevant content. Latency and cost increase. Fix: enforce a hard token cap before prompt assembly; prefer fewer, higher-quality chunks.

4. Treating retrieval miss as a model failure. "The model said it didn't know" is a retrieval miss, not a hallucination. The difference matters: a hallucination requires prompt/guardrail changes; a retrieval miss requires knowledge base expansion or retrieval strategy upgrade. Distinguish them with tags.

5. Skipping evaluation. Retrieval "feels better" after switching to hybrid search, but without a golden dataset and RAGAS scores, you can't quantify the improvement or detect regression after the next change.


Where MortgageIQ Is and Where It's Going

| Phase | RAG Level | Key capability |
| --- | --- | --- |
| Phase 4A (current) | Level 2 — keyword-only | Section chunking · Token budget · Retrieval tags |
| Phase 4B (next) | Level 3 — Hybrid | Azure AI Search · BM25 + vector · Fixes vocabulary gap |
| Advanced | Level 5–6 | Reranking · Semantic cache · RAGAS eval pipeline |

The architecture decisions made in Phase 4A — IRetrievalService abstraction, sources[] in the API response, response tags for observability — carry forward unchanged. Each upgrade replaces the retrieval backend, not the system.
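The project names this seam IRetrievalService; a Python sketch of the shape, not the source:

```python
from typing import Protocol

class RetrievalService(Protocol):
    """The seam each phase swaps behind: callers never name a backend."""
    def retrieve(self, query: str, k: int) -> list[str]: ...

# A local file retriever and an Azure AI Search retriever both satisfy
# this shape; the composition root decides which one gets injected.
```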


The One Rule

Build the simplest retrieval layer that fails in a measurable way, then fix that specific failure.

Level 1 → 2: measure context overflow. Level 2 → 3: measure vocabulary gap (retrieval miss rate on synonym queries). Level 3 → 4: measure multi-turn retrieval quality. Level 4 → 5: measure answer precision on domain-specific factual queries. Level 5 → 6: measure cost and latency at production volume.
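The Level 2 → 3 trigger is simple enough to sketch: maintain a hand-built golden set of (query, expected chunk ID) pairs and measure hit rate at k, where `retrieve` is assumed to return chunk IDs:

```python
def retrieval_hit_rate(golden: list[tuple[str, str]], retrieve, k: int = 5) -> float:
    """Fraction of golden queries whose expected chunk shows up in the top k."""
    hits = sum(expected in retrieve(query, k=k) for query, expected in golden)
    return hits / len(golden)

# Run it on synonym-heavy queries ("cash upfront", "money down") before and
# after the hybrid upgrade; the delta is the vocabulary gap you closed.
```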

Every upgrade has a measurable trigger. If you can't measure the failure the next level fixes, you don't need that level yet.


MortgageIQ source code: github.com/shivojha/azure-ai-loan-copilot
Project page: MortgageIQ — Azure AI Loan Copilot