April 19, 2026 · ai-ml · Tags: rag, retrieval, llama-index, semantic-kernel, azure-ai-search, embeddings, hybrid-search, reranking, enterprise-ai

Every RAG Pattern Explained — and Which One to Run in Production

From naive RAG to auto-merging and hierarchical retrieval — every major pattern mapped to real open source and Azure tooling, plus what we run in production at MortgageIQ.

There are 13 RAG patterns. Most teams pick one, apply it everywhere, and wonder why production fails.

The pattern you choose determines your retrieval precision, your latency budget, your infrastructure complexity, and whether your system can explain its answers to a compliance officer. Getting this wrong isn't a tuning problem — it's an architecture problem.

Here's the full map, how each one is implemented in open source and Azure, and what we run in production.


The Pattern Landscape

RAG patterns fall into four categories. Start with the category, then pick the pattern.


Foundational Patterns

Naive RAG

Retrieve → stuff into context → generate. No reranking, no query transformation, no threshold filtering.

Works in demos. Breaks in production when users ask questions in their own words instead of the document's vocabulary.

Don't ship this.

Advanced RAG

The baseline for any production system. Adds query rewriting, hybrid search (BM25 + vector), cross-encoder reranking, and relevance score thresholding on top of naive RAG.

Day 1 of this series covers Advanced RAG in depth. If you haven't implemented this yet, start here before anything else.


Query Patterns

These patterns change what gets sent to the retriever — not the retriever itself.

HyDE — Hypothetical Document Embeddings

Problem: Short or vague queries don't embed well. "FHA limits" as a query vector has weak signal. The embedding model doesn't have enough context to place it accurately in vector space.

Solution: Ask the LLM to generate a hypothetical answer first, embed that, and use the hypothetical answer's vector to retrieve real documents.

When to use: queries are consistently short or domain vocabulary is specialized enough that the query vector alone has poor precision.

Watch out for: HyDE hallucinations — if the LLM generates a wrong hypothetical, you retrieve wrong documents confidently. Always validate with a threshold filter on the final results.
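
A minimal sketch of the HyDE flow, with `generate`, `embed`, and `search` as hypothetical stand-ins for your LLM, embedding model, and vector store (none of these names come from a real library):

```python
def hyde_retrieve(query, generate, embed, search, top_k=5):
    """HyDE: embed a hypothetical answer instead of the raw query.

    `generate`, `embed`, and `search` are placeholders for your LLM,
    embedding model, and vector store.
    """
    # 1. Ask the LLM to draft a plausible (possibly wrong) answer.
    hypothetical = generate(
        f"Write a short passage that answers: {query}"
    )
    # 2. Embed the hypothetical answer; it carries far more signal
    #    than a terse query like "FHA limits".
    vector = embed(hypothetical)
    # 3. Retrieve *real* documents near the hypothetical's vector.
    return search(vector, top_k=top_k)
```

LlamaIndex ships a `HyDEQueryTransform` that wraps essentially this flow, if you'd rather not hand-roll it.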


Query Decomposition

Problem: Multi-hop questions ("Compare FHA and conventional loan down payment requirements for first-time buyers") require information from multiple documents. A single vector query returns a mixed bag.

Solution: Use the LLM to decompose the question into atomic sub-queries, retrieve independently for each, merge results, then synthesize.

When to use: your users ask analytical or comparative questions, not lookup questions.

Cost: 2–4x more LLM calls and retrieval passes. Set a hard limit on sub-query count (3–5 max).
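
A sketch of the decompose-retrieve-merge loop with the sub-query cap enforced. `generate` and `retrieve` are hypothetical stand-ins for your LLM and retriever, and the one-sub-question-per-line prompt format is an assumption:

```python
MAX_SUBQUERIES = 4  # hard cap: each sub-query is another LLM + retrieval pass

def decompose_and_retrieve(question, generate, retrieve):
    """Decompose a multi-hop question, retrieve per sub-query, merge.

    `generate` and `retrieve` are placeholders for your LLM and retriever.
    `retrieve` is assumed to return (doc_id, text) pairs.
    """
    # 1. LLM splits the question into atomic sub-questions, one per line.
    raw = generate(
        "Break this question into independent sub-questions, "
        f"one per line:\n{question}"
    )
    sub_queries = [q.strip() for q in raw.splitlines() if q.strip()]
    sub_queries = sub_queries[:MAX_SUBQUERIES]  # enforce the cap

    # 2. Retrieve independently per sub-query; dedupe by document id.
    merged, seen = [], set()
    for sq in sub_queries:
        for doc_id, text in retrieve(sq):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((doc_id, text))
    return sub_queries, merged
```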


Step-Back Prompting

Problem: Specific questions return specific chunks that miss the broader context needed to answer well.

Solution: Generate a broader "step-back" question first ("what are the general rules governing FHA loan eligibility?"), retrieve on that for context, then answer the specific question grounded in that context.

When to use: regulatory and policy domains where specific rules only make sense in the context of broader frameworks.
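
The step-back flow in sketch form, with `generate` and `retrieve` again as hypothetical stand-ins rather than real library calls:

```python
def step_back_answer(question, generate, retrieve):
    """Step-back prompting: retrieve broad framing context first,
    then answer the specific question grounded in both.

    `generate` and `retrieve` are placeholders for your LLM and retriever.
    """
    # 1. Abstract the specific question into a broader one.
    broad = generate(
        "Rewrite this as a more general question about the "
        f"underlying rules or principles:\n{question}"
    )
    # 2. Retrieve on BOTH: broad context frames the specific chunks.
    context = retrieve(broad) + retrieve(question)
    # 3. Answer the original question grounded in the combined context.
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```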


Retrieval Patterns

These patterns change the index structure and how chunks are stored and returned.

Parent-Child Retrieval

Problem: Small chunks retrieve with precision but lose context. Large chunks retain context but match poorly. You can't win with a single chunk size.

Solution: Index small child chunks for retrieval precision. Store large parent chunks in a document store. When a child chunk is retrieved, return its parent to the LLM.


Auto-Merging Retrieval

Problem: Parent-child retrieval always returns the parent, even when only one child matched. This can introduce irrelevant context from the parent's other sections.

Solution: Retrieve child chunks, then check: if ≥50% of a parent's children appear in the result set, merge up to the parent. Otherwise, keep the individual children.

Best for: long-form documents — contracts, manuals, regulatory filings — where individual chunks lose meaning without surrounding context.


Hierarchical Retrieval

Problem: At 10K+ documents, flat chunk search returns relevant chunks scattered across too many unrelated documents, adding noise.

Solution: Two-stage retrieval. First retrieve at the document/section level (coarse). Then retrieve chunks only within those top documents (fine).

When to use: knowledge bases with 10K+ documents. Below that, flat chunk search is simpler and fast enough.


Sentence Window Retrieval

Index at sentence granularity for maximum precision. When a sentence is retrieved, expand to a surrounding window (±2–3 sentences) before passing to the LLM.

Best for: dense documents where exact sentence-level matching matters (legal, medical, compliance).
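
The window expansion itself is a few lines. A minimal sketch, assuming you keep an ordered sentence list per document:

```python
def expand_window(sentences, hit_index, window=2):
    """Sentence-window expansion: index single sentences for precise
    matching, but hand the LLM the hit plus its +/- `window` neighbors.

    `sentences` is the ordered sentence list for one document;
    `hit_index` is the position of the retrieved sentence.
    """
    start = max(0, hit_index - window)                 # clamp at doc start
    end = min(len(sentences), hit_index + window + 1)  # clamp at doc end
    return " ".join(sentences[start:end])
```

LlamaIndex implements this pattern natively via `SentenceWindowNodeParser` plus `MetadataReplacementPostProcessor`, which store the window in node metadata and swap it in after retrieval.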


Multi-Vector Retrieval

Store multiple representations per document in the index:

  • The full chunk text embedding
  • A generated summary embedding
  • Embeddings of hypothetical questions the document answers

Retrieve across all representations, deduplicate, rerank.

Best for: heterogeneous document types where a single embedding strategy misses different query patterns.


Architecture Patterns

These patterns change the overall system behavior — not just retrieval.

Agentic RAG

The LLM decides whether to retrieve, what to retrieve, and when to stop. Uses tool-calling. The retriever is a tool, not a fixed pipeline step.

When to use: multi-step reasoning, calculations alongside retrieval, or when different questions require fundamentally different retrieval strategies.

Watch out for: infinite loops and runaway tool calls. Always set a max iteration limit (typically 5–10).
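
A minimal agent loop with the iteration cap in place. The `("call", tool, arg)` / `("answer", text)` action protocol and the `llm_step` callable are illustrative assumptions, not a real agent API:

```python
MAX_ITERATIONS = 8  # guard against runaway tool-calling loops

def agentic_rag(question, llm_step, tools):
    """Minimal agent loop: the LLM decides each step whether to call
    a tool or answer. `llm_step` is a placeholder that returns either
    ("call", tool_name, arg) or ("answer", text).
    """
    observations = []
    for _ in range(MAX_ITERATIONS):
        action = llm_step(question, observations)
        if action[0] == "answer":
            return action[1]
        _, tool_name, arg = action
        # The retriever is just one tool in this dict, not a fixed step.
        observations.append(tools[tool_name](arg))
    # Hard stop: a controlled failure beats an infinite loop.
    return "Unable to answer within the iteration budget."
```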


Corrective RAG (CRAG)

After retrieval, a grader LLM evaluates chunk relevance. If the retrieved chunks score below a threshold, trigger a web search fallback before generating.

When to use: knowledge base coverage is incomplete or your domain changes faster than your indexing cadence.
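
The grade-then-fallback logic in sketch form. `retrieve`, `grade`, and `web_search` are hypothetical stand-ins, and the 0.6 cutoff is illustrative:

```python
RELEVANCE_THRESHOLD = 0.6  # illustrative cutoff; tune per corpus

def corrective_retrieve(query, retrieve, grade, web_search):
    """CRAG sketch: a grader scores each retrieved chunk; if nothing
    clears the threshold, fall back to web search before generating.

    `retrieve`, `grade`, and `web_search` are placeholders for your
    retriever, grader LLM, and search tool.
    """
    chunks = retrieve(query)
    graded = [(c, grade(query, c)) for c in chunks]
    relevant = [c for c, score in graded if score >= RELEVANCE_THRESHOLD]
    if relevant:
        return relevant, "knowledge_base"
    # Coverage gap detected: supplement from the web instead.
    return web_search(query), "web_fallback"
```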


Graph RAG

Builds a knowledge graph over documents — entities, relationships, communities. Retrieves by traversing the graph rather than by vector similarity.

When to use: highly interconnected domains where relationships between entities matter as much as the entities themselves — regulatory networks, medical ontologies, organizational hierarchies.

Cost: expensive to build and maintain the graph. Use only when vector retrieval demonstrably fails on relationship queries.


Open Source vs Azure — Implementation Map

Open Source Stack

  • Orchestration + node parsers: LlamaIndex
  • Vector store: Qdrant (production) / Chroma (dev)
  • Keyword search: Elasticsearch / OpenSearch
  • Hybrid fusion: Custom RRF (LlamaIndex built-in)
  • Reranker: HuggingFace cross-encoder/ms-marco-MiniLM-L-6-v2
  • Doc store (parent nodes): MongoDB / Redis
  • LLM: Ollama (Llama 3 / Mistral) or OpenAI API
  • Embedding model: BAAI/bge-large-en-v1.5 (sentence-transformers)
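
The RRF fusion named above is small enough to show in full. Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) across the BM25 and vector result lists, with k = 60 as the conventional constant:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked result lists (e.g. BM25 + vector) with RRF.

    score(doc) = sum over lists of 1 / (k + rank), rank starting at 1.
    Documents appearing high in multiple lists float to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return doc ids ordered by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```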

Azure Stack

  • Orchestration: Semantic Kernel / Azure AI Foundry
  • Vector + keyword + reranker: Azure AI Search (all three in one service)
  • Doc store (parent nodes): Azure Cosmos DB (key-value lookup)
  • LLM: Azure OpenAI GPT-4o
  • Embedding model: Azure OpenAI text-embedding-3-large
  • Observability: Azure Monitor + Application Insights

Pattern-by-Pattern: Open Source vs Azure

Parent-Child

Open Source:

from llama_index.core import StorageContext
from llama_index.core.node_parser import HierarchicalNodeParser
from llama_index.core.retrievers import AutoMergingRetriever  # used in the next section
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.vector_stores.qdrant import QdrantVectorStore

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # grandparent → parent → child
)
nodes = parser.get_nodes_from_documents(documents)
# Leaf chunks → Qdrant. All nodes → MongoDB doc store.
storage_context = StorageContext.from_defaults(
    docstore=MongoDocumentStore.from_uri(mongo_uri),
    vector_store=QdrantVectorStore(collection_name="chunks", client=client)
)

Azure: one child-chunk AI Search index + Cosmos DB for parent lookup:

Index (child-chunks): chunk_id, parent_id, content, embedding [searchable]
Cosmos container (parents): parent_id, full_content, metadata [point-read only]

Flow: Search child index → extract parent_ids → point-read Cosmos by parent_id

Auto-Merging

Open Source:

retriever = AutoMergingRetriever(
    vector_retriever,       # searches child chunks in Qdrant
    storage_context,        # must include docstore with all node levels
    simple_ratio_thresh=0.5 # merge if ≥50% of siblings retrieved
)

Azure (custom logic in Semantic Kernel plugin):

// Inside an async iterator (IAsyncEnumerable) in the plugin
var response = await _searchClient.SearchAsync<Chunk>(
    query, new SearchOptions { Size = 20 });

var childChunks = new List<Chunk>();
await foreach (var result in response.Value.GetResultsAsync())
    childChunks.Add(result.Document);

foreach (var group in childChunks.GroupBy(c => c.ParentId)) {
    var totalSiblings = _parentMap[group.Key].ChildCount;
    if ((double)group.Count() / totalSiblings >= 0.5)
        yield return await _cosmos.GetParentAsync(group.Key); // merge up
    else
        foreach (var chunk in group) yield return chunk;      // keep children as-is
}

Hierarchical (Two-Stage)

Open Source:

# LlamaIndex RecursiveRetriever:
# summary index routes into per-document sub-indexes in Qdrant
from llama_index.core.retrievers import RecursiveRetriever, VectorIndexRetriever

summary_retriever = VectorIndexRetriever(summary_index, similarity_top_k=3)
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": summary_retriever, **doc_retrievers},
    node_dict=all_nodes
)

Azure (native filter on AI Search):

Step 1: Search "doc-summaries" index → get top 3 doc IDs
Step 2: Search "chunks" index with $filter=doc_id in ['id1','id2','id3']
        (pre-filter runs before vector search — fast)
Step 3: Semantic ranker re-scores Step 2 results
Step 4: Pass to GPT-4o

When to Choose What

  • Startup / prototype → Open Source (lower cost)
  • Air-gapped / on-prem → Open Source (only option)
  • Fine-tuned embedding model → Open Source (full control)
  • ML team available → Open Source (LlamaIndex native)
  • Enterprise + regulated (fintech, healthcare) → Azure (compliance built-in)
  • .NET / C# primary stack → Azure (Semantic Kernel native)
  • No ML infra team → Azure (managed everything)
  • Multi-region SLA → Azure (Azure infra)
  • Need hybrid search out-of-box → Azure (AI Search native)

What We Run in Production at MortgageIQ

MortgageIQ is a domain-grounded loan assistant on Azure OpenAI. Here's the actual stack:

Pattern in use: Parent-Child + Auto-Merging + Hybrid Search + Semantic Reranking

Why this combination:

  • Loan guidelines use precise codes ("CONV30", "FHA203K") — BM25 catches these, vector search misses them
  • Regulatory sections only make sense in context — parent-child prevents the LLM from seeing orphaned chunks
  • Auto-merging fires when a query touches multiple subsections of the same guideline — the LLM gets the full section, not fragments
  • Semantic ranker threshold at 0.70 — below that, we surface "no reliable information" rather than hallucinate. In a regulated lending context, a wrong answer is worse than no answer.
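
The threshold gate in that last bullet reduces to a few lines. A sketch, with `retrieve_and_rerank` and `generate` as hypothetical stand-ins for the retrieval pipeline and LLM:

```python
RERANKER_THRESHOLD = 0.70  # below this, refuse rather than guess

def gated_answer(query, retrieve_and_rerank, generate):
    """Threshold gate: if no reranked chunk clears the bar, return a
    controlled refusal instead of generating from weak context.

    `retrieve_and_rerank` (returns (chunk, score) pairs) and
    `generate` are placeholders for your pipeline and LLM.
    """
    scored = retrieve_and_rerank(query)
    grounded = [c for c, score in scored if score >= RERANKER_THRESHOLD]
    if not grounded:
        # Logged with query + scores: this refusal IS the audit trail.
        return "No reliable information found for this question."
    return generate(query, grounded)
```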

Numbers:

  • Retrieval miss rate on loan program codes: dropped from ~18% (pure vector) to under 2% (hybrid)
  • Context window utilization: 35% reduction after auto-merging replaced multi-chunk fragmentation
  • Compliance audit: every answer includes chunk source, document version, and retrieval score — full traceability

Which Pattern for Which Problem

  • Short/vague queries → HyDE
  • Multi-hop analytical questions → Query Decomposition
  • Chunks lose meaning in isolation → Parent-Child
  • Context fragmentation at high recall → Auto-Merging
  • 10K+ document knowledge base → Hierarchical
  • Precise sentence-level matching → Sentence Window
  • Multi-step reasoning + tool use → Agentic RAG
  • Incomplete knowledge base coverage → Corrective RAG (CRAG)
  • Entity relationship queries → Graph RAG
  • Everything else → Advanced RAG (hybrid + reranking)

Key Takeaways

  • No single pattern works for all queries — production systems combine two or three patterns based on query type and document structure.
  • Auto-merging and hierarchical solve the same root problem — chunk boundary artifacts — from opposite directions: bottom-up merge vs top-down filter.
  • Azure AI Search handles hybrid + reranking natively; parent-child and auto-merging require custom logic in Semantic Kernel regardless of cloud.
  • Open source gives you more control over the retrieval layer; Azure gives you compliance, SLA, and zero infra overhead — most enterprise teams land on Azure services + LlamaIndex/Semantic Kernel orchestration.
  • In regulated industries, a threshold filter that returns "I don't know" is a feature, not a limitation — it's your audit trail and your liability shield.

Coming Up in This Series

  • Day 3: Embedding Models — general vs domain-specific, fine-tuning triggers, Azure vs open source
  • Day 4: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
  • Day 5: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance