There are 13 RAG patterns. Most teams pick one, apply it everywhere, and wonder why production fails.
The pattern you choose determines your retrieval precision, your latency budget, your infrastructure complexity, and whether your system can explain its answers to a compliance officer. Getting this wrong isn't a tuning problem — it's an architecture problem.
Here's the full map, how each one is implemented in open source and Azure, and what we run in production.
The Pattern Landscape
RAG patterns fall into four categories. Start with the category, then pick the pattern.
Foundational Patterns
Naive RAG
Retrieve → stuff into context → generate. No reranking, no query transformation, no threshold filtering.
Works in demos. Breaks in production when users ask questions in their own words instead of the document's vocabulary.
Don't ship this.
Advanced RAG
The baseline for any production system. Adds query rewriting, hybrid search (BM25 + vector), cross-encoder reranking, and relevance score thresholding on top of naive RAG.
Day 1 of this series covers Advanced RAG in depth. If you haven't implemented this yet, start here before anything else.
Query Patterns
These patterns change what gets sent to the retriever — not the retriever itself.
HyDE — Hypothetical Document Embeddings
Problem: Short or vague queries don't embed well. "FHA limits" as a query vector has weak signal. The embedding model doesn't have enough context to place it accurately in vector space.
Solution: Ask the LLM to generate a hypothetical answer first, embed that, and use the hypothetical answer's vector to retrieve real documents.
When to use: queries are consistently short or domain vocabulary is specialized enough that the query vector alone has poor precision.
Watch out for: HyDE hallucinations — if the LLM generates a wrong hypothetical, you retrieve wrong documents confidently. Always validate with a threshold filter on the final results.
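A minimal sketch using LlamaIndex's built-in HyDE transform (assumes an existing `index` and a configured LLM; `include_original=True` keeps the raw query as an extra retrieval signal):

```python
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# The LLM writes a hypothetical answer; its embedding drives retrieval instead of the raw query
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(index.as_query_engine(), query_transform=hyde)

response = query_engine.query("FHA limits")  # short query → expanded hypothetical answer
```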
Query Decomposition
Problem: Multi-hop questions ("Compare FHA and conventional loan down payment requirements for first-time buyers") require information from multiple documents. A single vector query returns a mixed bag.
Solution: Use the LLM to decompose the question into atomic sub-queries, retrieve independently for each, merge results, then synthesize.
When to use: your users ask analytical or comparative questions, not lookup questions.
Cost: 2–4x more LLM calls and retrieval passes. Set a hard limit on sub-query count (3–5 max).
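One way to implement this is LlamaIndex's SubQuestionQueryEngine; a minimal sketch (assumes an existing `index` and a configured LLM; the tool name and description are illustrative, and the 3–5 sub-query cap would be enforced in the question-generator prompt):

```python
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Each tool wraps a retrievable source; the LLM splits the question into sub-queries per tool
tools = [
    QueryEngineTool(
        query_engine=index.as_query_engine(),
        metadata=ToolMetadata(
            name="loan_guidelines",
            description="FHA and conventional loan guideline documents",
        ),
    )
]

sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_question_engine.query(
    "Compare FHA and conventional loan down payment requirements for first-time buyers"
)
```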
Step-Back Prompting
Problem: Specific questions return specific chunks that miss the broader context needed to answer well.
Solution: Generate a broader "step-back" question first ("what are the general rules governing FHA loan eligibility?"), retrieve on that for context, then answer the specific question grounded in that context.
When to use: regulatory and policy domains where specific rules only make sense in the context of broader frameworks.
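A library-agnostic sketch of the two-pass flow; `llm`, `retriever`, and `synthesize` are assumed pieces of your existing pipeline, and the prompt wording is illustrative:

```python
# Hypothetical sketch: `llm`, `retriever`, and `synthesize` come from your own pipeline
STEP_BACK_PROMPT = (
    "Given the specific question below, write one broader question about the "
    "general rules or concepts needed to answer it.\n\nQuestion: {question}"
)

def step_back_answer(question: str) -> str:
    # 1. Ask the LLM for the broader "step-back" question
    broad_question = llm.complete(STEP_BACK_PROMPT.format(question=question)).text
    # 2. Retrieve context for both the broad framework and the specific question
    context = retriever.retrieve(broad_question) + retriever.retrieve(question)
    # 3. Answer the specific question grounded in the combined context
    return synthesize(question=question, context=context)
```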
Retrieval Patterns
These patterns change the index structure and how chunks are stored and returned.
Parent-Child Retrieval
Problem: Small chunks retrieve with precision but lose context. Large chunks retain context but match poorly. You can't win with a single chunk size.
Solution: Index small child chunks for retrieval precision. Store large parent chunks in a document store. When a child chunk is retrieved, return its parent to the LLM.
Auto-Merging Retrieval
Problem: Parent-child retrieval always returns the parent, even when only one child matched. This can introduce irrelevant context from the parent's other sections.
Solution: Retrieve child chunks, then check: if ≥50% of a parent's children appear in the result set, merge up to the parent. Otherwise, keep the individual children.
Best for: long-form documents — contracts, manuals, regulatory filings — where individual chunks lose meaning without surrounding context.
Hierarchical Retrieval
Problem: At 10K+ documents, flat chunk search returns relevant chunks scattered across too many unrelated documents, adding noise.
Solution: Two-stage retrieval. First retrieve at the document/section level (coarse). Then retrieve chunks only within those top documents (fine).
When to use: knowledge bases with 10K+ documents. Below that, flat chunk search is simpler and fast enough.
Sentence Window Retrieval
Index at sentence granularity for maximum precision. When a sentence is retrieved, expand to a surrounding window (±2–3 sentences) before passing to the LLM.
Best for: dense documents where exact sentence-level matching matters (legal, medical, compliance).
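A minimal LlamaIndex sketch (assumes `documents` is loaded and an LLM plus embedding model are configured via Settings):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Index single sentences; each node carries a ±3-sentence window in its metadata
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# At query time, swap each matched sentence for its surrounding window before the LLM sees it
query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
```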
Multi-Vector Retrieval
Store multiple representations per document in the index:
- The full chunk text embedding
- A generated summary embedding
- Embeddings of hypothetical questions the document answers
Retrieve across all representations, deduplicate, rerank.
Best for: heterogeneous document types where a single embedding strategy misses different query patterns.
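A library-agnostic sketch of the indexing and collapse steps; `chunks`, `summarize`, `generate_questions`, `retriever`, `docstore`, and `query` are assumed helpers in your pipeline:

```python
# Hypothetical sketch: every representation points back to the same parent chunk via metadata
from llama_index.core.schema import TextNode

multi_nodes = []
for doc_id, chunk_text in chunks.items():
    representations = [chunk_text, summarize(chunk_text), *generate_questions(chunk_text)]
    for text in representations:
        multi_nodes.append(TextNode(text=text, metadata={"parent_id": doc_id}))

# Retrieve across all representations, then collapse hits onto their parents
hits = retriever.retrieve(query)
parent_ids = list(dict.fromkeys(n.node.metadata["parent_id"] for n in hits))  # dedupe, keep order
contexts = [docstore.get_document(pid) for pid in parent_ids]  # full chunks, ready for reranking
```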
Architecture Patterns
These patterns change the overall system behavior — not just retrieval.
Agentic RAG
The LLM decides whether to retrieve, what to retrieve, and when to stop. Uses tool-calling. The retriever is a tool, not a fixed pipeline step.
When to use: multi-step reasoning, calculations alongside retrieval, or when different questions require fundamentally different retrieval strategies.
Watch out for: infinite loops and runaway tool calls. Always set a max iteration limit (typically 5–10).
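A minimal sketch with a LlamaIndex ReAct agent, where the retriever is exposed as a tool (assumes an existing `index` and `llm`; the tool name, description, and iteration cap are illustrative):

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata

retrieval_tool = QueryEngineTool(
    query_engine=index.as_query_engine(),
    metadata=ToolMetadata(name="guideline_search", description="Search loan guideline documents"),
)

# The agent decides when (and whether) to call the retriever; max_iterations caps runaway loops
agent = ReActAgent.from_tools([retrieval_tool], llm=llm, max_iterations=8, verbose=True)
response = agent.chat("What down payment does a first-time buyer need for an FHA loan?")
```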
Corrective RAG (CRAG)
After retrieval, a grader LLM evaluates chunk relevance. If the retrieved chunks score below a threshold, trigger a web search fallback before generating.
When to use: knowledge base coverage is incomplete or your domain changes faster than your indexing cadence.
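A library-agnostic sketch of the grade-then-fallback loop; `llm`, `retriever` (assumed to return plain-text chunks), and `web_search` are assumed helpers, and the grading prompt is illustrative:

```python
# Hypothetical sketch: `llm`, `retriever`, and `web_search` come from your own pipeline
GRADER_PROMPT = (
    "Question: {question}\nChunk: {chunk}\n"
    "Is this chunk relevant to the question? Answer only 'yes' or 'no'."
)

def corrective_retrieve(question: str, min_relevant: int = 2) -> list[str]:
    chunks = retriever.retrieve(question)  # assumed to return plain-text chunks
    relevant = [
        c for c in chunks
        if "yes" in llm.complete(GRADER_PROMPT.format(question=question, chunk=c)).text.lower()
    ]
    # Too little usable context in the knowledge base: fall back to web search before generating
    if len(relevant) < min_relevant:
        relevant += web_search(question)
    return relevant
```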
Graph RAG
Builds a knowledge graph over documents — entities, relationships, communities. Retrieves by traversing the graph rather than by vector similarity.
When to use: highly interconnected domains where relationships between entities matter as much as the entities themselves — regulatory networks, medical ontologies, organizational hierarchies.
Cost: expensive to build and maintain the graph. Use only when vector retrieval demonstrably fails on relationship queries.
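One way to prototype this is LlamaIndex's KnowledgeGraphIndex; a minimal sketch (assumes `documents` plus a configured LLM and embedding model, since triplet extraction is LLM-driven):

```python
from llama_index.core import KnowledgeGraphIndex, StorageContext
from llama_index.core.graph_stores import SimpleGraphStore

graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=3,   # cap extraction cost per chunk
    include_embeddings=True,    # allow hybrid graph + vector lookups
)

# Retrieval traverses extracted triplets (entities and relationships) instead of raw chunks
query_engine = kg_index.as_query_engine(include_text=True, response_mode="tree_summarize")
response = query_engine.query("Which FHA programs reference the 203(k) rehabilitation rules?")
```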
Open Source vs Azure — Implementation Map
Open Source Stack
| Component | Tool |
|---|---|
| Orchestration + node parsers | LlamaIndex |
| Vector store | Qdrant (production) / Chroma (dev) |
| Keyword search | Elasticsearch / OpenSearch |
| Hybrid fusion | Reciprocal Rank Fusion (custom, or LlamaIndex's built-in QueryFusionRetriever) |
| Reranker | HuggingFace cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Doc store (parent nodes) | MongoDB / Redis |
| LLM | Ollama (Llama 3 / Mistral) or OpenAI API |
| Embedding model | BAAI/bge-large-en-v1.5 (sentence-transformers) |
Azure Stack
| Component | Tool |
|---|---|
| Orchestration | Semantic Kernel / Azure AI Foundry |
| Vector + keyword + reranker | Azure AI Search (all three in one service) |
| Doc store (parent nodes) | Azure Cosmos DB (key-value lookup) |
| LLM | Azure OpenAI GPT-4o |
| Embedding model | Azure OpenAI text-embedding-3-large |
| Observability | Azure Monitor + Application Insights |
Pattern-by-Pattern: Open Source vs Azure
Parent-Child
Open Source:
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.vector_stores.qdrant import QdrantVectorStore

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # grandparent → parent → child
)
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Leaf chunks → Qdrant. All nodes (parents + children) → MongoDB doc store.
storage_context = StorageContext.from_defaults(
    docstore=MongoDocumentStore.from_uri(mongo_uri),
    vector_store=QdrantVectorStore(client=client, collection_name="chunks"),
)
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)  # only children are embedded
Azure: Two AI Search indexes + Cosmos DB lookup:
Index 1 (child-chunks): chunk_id, parent_id, content, embedding [searchable]
Index 2 (parent-chunks): parent_id, full_content, metadata [lookup only]
Flow: Search child index → extract parent_ids → point-read Cosmos by parent_id
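For reference, a hypothetical Python version of that flow (the Azure examples below use C#/Semantic Kernel; the endpoint env vars, index names, and field names here are illustrative assumptions):

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.cosmos import CosmosClient

search = SearchClient(os.environ["SEARCH_ENDPOINT"], "child-chunks",
                      AzureKeyCredential(os.environ["SEARCH_KEY"]))
cosmos = CosmosClient(os.environ["COSMOS_ENDPOINT"], os.environ["COSMOS_KEY"])
parents = cosmos.get_database_client("rag").get_container_client("parent-chunks")

# 1. Search the child index  2. collect parent_ids  3. point-read parents from Cosmos
results = search.search(search_text=query, top=10)
parent_ids = list(dict.fromkeys(r["parent_id"] for r in results))
parent_chunks = [parents.read_item(item=pid, partition_key=pid) for pid in parent_ids]
```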
Auto-Merging
Open Source:
vector_retriever = index.as_retriever(similarity_top_k=12)  # searches child chunks in Qdrant
retriever = AutoMergingRetriever(
    vector_retriever,
    storage_context,          # must include the docstore with all node levels
    simple_ratio_thresh=0.5   # merge if ≥50% of a parent's children are retrieved
)
Azure (custom logic in Semantic Kernel plugin):
var response = await _searchClient.SearchAsync<Chunk>(query, new SearchOptions { Size = 20 });
var childChunks = new List<Chunk>();
await foreach (var result in response.Value.GetResultsAsync())
    childChunks.Add(result.Document);

foreach (var group in childChunks.GroupBy(c => c.ParentId)) {
    var totalSiblings = _parentMap[group.Key].ChildCount;
    if ((double)group.Count() / totalSiblings >= 0.5)
        yield return await _cosmos.GetParentAsync(group.Key);   // ≥50% of siblings hit → merge up
    else
        foreach (var chunk in group) yield return chunk;        // otherwise keep the children
}
Hierarchical (Two-Stage)
Open Source:
# LlamaIndex RecursiveRetriever
# Summary index → per-document sub-indexes in Qdrant
from llama_index.core.retrievers import RecursiveRetriever

summary_retriever = summary_index.as_retriever(similarity_top_k=3)
recursive_retriever = RecursiveRetriever(
    root_id="vector",
    retriever_dict={"vector": summary_retriever, **doc_retrievers},
    node_dict=all_nodes
)
Azure (native filter on AI Search):
Step 1: Search "doc-summaries" index → get top 3 doc IDs
Step 2: Search "chunks" index with $filter=search.in(doc_id, 'id1,id2,id3')
(pre-filter runs before vector search — fast)
Step 3: Semantic ranker re-scores Step 2 results
Step 4: Pass to GPT-4o
When to Choose What
| Scenario | Open Source | Azure |
|---|---|---|
| Startup / prototype | ✓ Lower cost | — |
| Air-gapped / on-prem | ✓ Only option | — |
| Fine-tuned embedding model | ✓ Full control | — |
| ML team available | ✓ LlamaIndex native | — |
| Enterprise + regulated (fintech, healthcare) | — | ✓ Compliance built-in |
| .NET / C# primary stack | — | ✓ Semantic Kernel native |
| No ML infra team | — | ✓ Managed everything |
| Multi-region SLA | — | ✓ Azure infra |
| Need hybrid search out-of-box | — | ✓ AI Search native |
What We Run in Production at MortgageIQ
MortgageIQ is a domain-grounded loan assistant on Azure OpenAI. Here's the actual stack:
Pattern in use: Parent-Child + Auto-Merging + Hybrid Search + Semantic Reranking
Why this combination:
- Loan guidelines use precise codes ("CONV30", "FHA203K") — BM25 catches these, vector search misses them
- Regulatory sections only make sense in context — parent-child prevents the LLM from seeing orphaned chunks
- Auto-merging fires when a query touches multiple subsections of the same guideline — the LLM gets the full section, not fragments
- Semantic ranker threshold at 0.70 — below that, we surface "no reliable information" rather than hallucinate. In a regulated lending context, a wrong answer is worse than no answer.
Numbers:
- Retrieval miss rate on loan program codes: dropped from ~18% (pure vector) to under 2% (hybrid)
- Context window utilization: 35% reduction after auto-merging replaced multi-chunk fragmentation
- Compliance audit: every answer includes chunk source, document version, and retrieval score — full traceability
Which Pattern for Which Problem
| Problem | Pattern |
|---|---|
| Short/vague queries | HyDE |
| Multi-hop analytical questions | Query Decomposition |
| Chunks lose meaning in isolation | Parent-Child |
| Context fragmentation at high recall | Auto-Merging |
| 10K+ document knowledge base | Hierarchical |
| Precise sentence-level matching | Sentence Window |
| Multi-step reasoning + tool use | Agentic RAG |
| Incomplete knowledge base coverage | Corrective RAG (CRAG) |
| Entity relationship queries | Graph RAG |
| Everything else | Advanced RAG (hybrid + reranking) |
Key Takeaways
- No single pattern works for all queries — production systems combine two or three patterns based on query type and document structure.
- Auto-merging and hierarchical solve the same root problem — chunk boundary artifacts — from opposite directions: bottom-up merge vs top-down filter.
- Azure AI Search handles hybrid + reranking natively; parent-child and auto-merging still require custom logic in your orchestration layer (a Semantic Kernel plugin), even on Azure.
- Open source gives you more control over the retrieval layer; Azure gives you compliance, SLA, and zero infra overhead — most enterprise teams land on Azure services + LlamaIndex/Semantic Kernel orchestration.
- In regulated industries, a threshold filter that returns "I don't know" is a feature, not a limitation — it's your audit trail and your liability shield.
Coming Up in This Series
- Day 3: Embedding Models — general vs domain-specific, fine-tuning triggers, Azure vs open source
- Day 4: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
- Day 5: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance