April 19, 2026 · ai-ml · Tags: rag, retrieval, llama-index, semantic-kernel, azure-ai-search, embeddings, hybrid-search, reranking, enterprise-ai

Every RAG Pattern Explained — and Which One to Run in Production

From naive RAG to auto-merging and hierarchical retrieval — every major pattern mapped to real open source and Azure tooling, plus what we run in production at MortgageIQ.

There are 13 RAG patterns. Most teams pick one, apply it everywhere, and wonder why production fails.

The pattern you choose determines your retrieval precision, your latency budget, your infrastructure complexity, and whether your system can explain its answers to a compliance officer. Getting this wrong isn't a tuning problem — it's an architecture problem.

Here's the full map, how each one is implemented in open source and Azure, and what we run in production.


The Pattern Landscape

RAG patterns fall into four categories. Start with the category, then pick the pattern.


Foundational Patterns

Naive RAG

Retrieve → stuff into context → generate. No reranking, no query transformation, no threshold filtering.

Works in demos. Breaks in production when users ask questions in their own words instead of the document's vocabulary.

Don't ship this.

Advanced RAG

The baseline for any production system. Adds query rewriting, hybrid search (BM25 + vector), cross-encoder reranking, and relevance score thresholding on top of naive RAG.

Day 1 of this series covers Advanced RAG in depth. If you haven't implemented this yet, start here before anything else.


Query Patterns

These patterns change what gets sent to the retriever — not the retriever itself.

HyDE — Hypothetical Document Embeddings

Problem: Short or vague queries don't embed well. "FHA limits" as a query vector has weak signal. The embedding model doesn't have enough context to place it accurately in vector space.

Solution: Ask the LLM to generate a hypothetical answer first, embed that, and use the hypothetical answer's vector to retrieve real documents.

When to use: queries are consistently short or domain vocabulary is specialized enough that the query vector alone has poor precision.

Watch out for: HyDE hallucinations — if the LLM generates a wrong hypothetical, you retrieve wrong documents confidently. Always validate with a threshold filter on the final results.
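
A minimal sketch of the HyDE flow, with `generate`, `embed`, and `search` as hypothetical stand-ins for your LLM, embedding model, and vector store (none of these names come from a real library):

```python
def hyde_retrieve(query, generate, embed, search, top_k=5):
    """HyDE: embed a hypothetical answer instead of the raw query.

    `generate`, `embed`, and `search` are placeholders for your LLM,
    embedding model, and vector store.
    """
    # 1. Ask the LLM to draft a plausible (possibly wrong) answer.
    hypothetical = generate(
        f"Write a short passage that answers: {query}"
    )
    # 2. Embed the hypothetical answer; it carries far more signal
    #    than a terse query like "FHA limits".
    vector = embed(hypothetical)
    # 3. Retrieve *real* documents near the hypothetical's vector.
    return search(vector, top_k=top_k)
```

LlamaIndex ships a `HyDEQueryTransform` that wraps essentially this flow, if you'd rather not hand-roll it.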


Query Decomposition

Problem: Multi-hop questions ("Compare FHA and conventional loan down payment requirements for first-time buyers") require information from multiple documents. A single vector query returns a mixed bag.

Solution: Use the LLM to decompose the question into atomic sub-queries, retrieve independently for each, merge results, then synthesize.

When to use: your users ask analytical or comparative questions, not lookup questions.

Cost: 2–4x more LLM calls and retrieval passes. Set a hard limit on sub-query count (3–5 max).
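
A sketch of the decompose-retrieve-merge loop with the sub-query cap enforced. `generate` and `retrieve` are hypothetical stand-ins for your LLM and retriever, and the one-sub-question-per-line prompt format is an assumption:

```python
MAX_SUBQUERIES = 4  # hard cap: each sub-query is another LLM + retrieval pass

def decompose_and_retrieve(question, generate, retrieve):
    """Decompose a multi-hop question, retrieve per sub-query, merge.

    `generate` and `retrieve` are placeholders for your LLM and retriever.
    `retrieve` is assumed to return (doc_id, text) pairs.
    """
    # 1. LLM splits the question into atomic sub-questions, one per line.
    raw = generate(
        "Break this question into independent sub-questions, "
        f"one per line:\n{question}"
    )
    sub_queries = [q.strip() for q in raw.splitlines() if q.strip()]
    sub_queries = sub_queries[:MAX_SUBQUERIES]  # enforce the cap

    # 2. Retrieve independently per sub-query; dedupe by document id.
    merged, seen = [], set()
    for sq in sub_queries:
        for doc_id, text in retrieve(sq):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((doc_id, text))
    return sub_queries, merged
```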


Step-Back Prompting

Problem: Specific questions return specific chunks that miss the broader context needed to answer well.

Solution: Generate a broader "step-back" question first ("what are the general rules governing FHA loan eligibility?"), retrieve on that for context, then answer the specific question grounded in that context.

When to use: regulatory and policy domains where specific rules only make sense in the context of broader frameworks.
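
The step-back flow in sketch form, with `generate` and `retrieve` again as hypothetical stand-ins rather than real library calls:

```python
def step_back_answer(question, generate, retrieve):
    """Step-back prompting: retrieve broad framing context first,
    then answer the specific question grounded in both.

    `generate` and `retrieve` are placeholders for your LLM and retriever.
    """
    # 1. Abstract the specific question into a broader one.
    broad = generate(
        "Rewrite this as a more general question about the "
        f"underlying rules or principles:\n{question}"
    )
    # 2. Retrieve on BOTH: broad context frames the specific chunks.
    context = retrieve(broad) + retrieve(question)
    # 3. Answer the original question grounded in the combined context.
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```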


Retrieval Patterns

These patterns change the index structure and how chunks are stored and returned.

Parent-Child Retrieval

Problem: Small chunks retrieve with precision but lose context. Large chunks retain context but match poorly. You can't win with a single chunk size.

Solution: Index small child chunks for retrieval precision. Store large parent chunks in a document store. When a child chunk is retrieved, return its parent to the LLM.


Auto-Merging Retrieval

Problem: Parent-child retrieval always returns the parent, even when only one child matched. This can introduce irrelevant context from the parent's other sections.

Solution: Retrieve child chunks, then check: if ≥50% of a parent's children appear in the result set, merge up to the parent. Otherwise, keep the individual children.

Best for: long-form documents — contracts, manuals, regulatory filings — where individual chunks lose meaning without surrounding context.


Hierarchical Retrieval

Problem: At 10K+ documents, flat chunk search returns relevant chunks scattered across too many unrelated documents, adding noise.

Solution: Two-stage retrieval. First retrieve at the document/section level (coarse). Then retrieve chunks only within those top documents (fine).

When to use: knowledge bases with 10K+ documents. Below that, flat chunk search is simpler and fast enough.


Sentence Window Retrieval

Index at sentence granularity for maximum precision. When a sentence is retrieved, expand to a surrounding window (±2–3 sentences) before passing to the LLM.

Best for: dense documents where exact sentence-level matching matters (legal, medical, compliance).
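
The window expansion itself is a few lines. A minimal sketch, assuming you keep an ordered sentence list per document:

```python
def expand_window(sentences, hit_index, window=2):
    """Sentence-window expansion: index single sentences for precise
    matching, but hand the LLM the hit plus its +/- `window` neighbors.

    `sentences` is the ordered sentence list for one document;
    `hit_index` is the position of the retrieved sentence.
    """
    start = max(0, hit_index - window)                 # clamp at doc start
    end = min(len(sentences), hit_index + window + 1)  # clamp at doc end
    return " ".join(sentences[start:end])
```

LlamaIndex implements this pattern natively via `SentenceWindowNodeParser` plus `MetadataReplacementPostProcessor`, which store the window in node metadata and swap it in after retrieval.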


Multi-Vector Retrieval

Store multiple representations per document in the index:

  • The full chunk text embedding
  • A generated summary embedding
  • Embeddings of hypothetical questions the document answers

Retrieve across all representations, deduplicate, rerank.

Best for: heterogeneous document types where a single embedding strategy misses different query patterns.


Architecture Patterns

These patterns change the overall system behavior — not just retrieval.

Agentic RAG

The LLM decides whether to retrieve, what to retrieve, and when to stop. Uses tool-calling. The retriever is a tool, not a fixed pipeline step.

When to use: multi-step reasoning, calculations alongside retrieval, or when different questions require fundamentally different retrieval strategies.

Watch out for: infinite loops and runaway tool calls. Always set a max iteration limit (typically 5–10).
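
A minimal agent loop with the iteration cap in place. The `("call", tool, arg)` / `("answer", text)` action protocol and the `llm_step` callable are illustrative assumptions, not a real agent API:

```python
MAX_ITERATIONS = 8  # guard against runaway tool-calling loops

def agentic_rag(question, llm_step, tools):
    """Minimal agent loop: the LLM decides each step whether to call
    a tool or answer. `llm_step` is a placeholder that returns either
    ("call", tool_name, arg) or ("answer", text).
    """
    observations = []
    for _ in range(MAX_ITERATIONS):
        action = llm_step(question, observations)
        if action[0] == "answer":
            return action[1]
        _, tool_name, arg = action
        # The retriever is just one tool in this dict, not a fixed step.
        observations.append(tools[tool_name](arg))
    # Hard stop: a controlled failure beats an infinite loop.
    return "Unable to answer within the iteration budget."
```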


Corrective RAG (CRAG)

After retrieval, a grader LLM evaluates chunk relevance. If the retrieved chunks score below a threshold, trigger a web search fallback before generating.

When to use: knowledge base coverage is incomplete or your domain changes faster than your indexing cadence.
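
The grade-then-fallback logic in sketch form. `retrieve`, `grade`, and `web_search` are hypothetical stand-ins, and the 0.6 cutoff is illustrative:

```python
RELEVANCE_THRESHOLD = 0.6  # illustrative cutoff; tune per corpus

def corrective_retrieve(query, retrieve, grade, web_search):
    """CRAG sketch: a grader scores each retrieved chunk; if nothing
    clears the threshold, fall back to web search before generating.

    `retrieve`, `grade`, and `web_search` are placeholders for your
    retriever, grader LLM, and search tool.
    """
    chunks = retrieve(query)
    graded = [(c, grade(query, c)) for c in chunks]
    relevant = [c for c, score in graded if score >= RELEVANCE_THRESHOLD]
    if relevant:
        return relevant, "knowledge_base"
    # Coverage gap detected: supplement from the web instead.
    return web_search(query), "web_fallback"
```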


Graph RAG

Builds a knowledge graph over documents — entities, relationships, communities. Retrieves by traversing the graph rather than by vector similarity.

When to use: highly interconnected domains where relationships between entities matter as much as the entities themselves — regulatory networks, medical ontologies, organizational hierarchies.

Cost: expensive to build and maintain the graph. Use only when vector retrieval demonstrably fails on relationship queries.


Open Source vs Azure — Implementation Map

Open Source Stack

  • Orchestration + node parsers: LlamaIndex
  • Vector store: Qdrant (production) / Chroma (dev)
  • Keyword search: Elasticsearch / OpenSearch
  • Hybrid fusion: Custom RRF (LlamaIndex built-in)
  • Reranker: HuggingFace cross-encoder/ms-marco-MiniLM-L-6-v2
  • Doc store (parent nodes): MongoDB / Redis
  • LLM: Ollama (Llama 3 / Mistral) or OpenAI API
  • Embedding model: BAAI/bge-large-en-v1.5 (sentence-transformers)
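
The RRF fusion named above is small enough to show in full. Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) across the BM25 and vector result lists, with k = 60 as the conventional constant:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked result lists (e.g. BM25 + vector) with RRF.

    score(doc) = sum over lists of 1 / (k + rank), rank starting at 1.
    Documents appearing high in multiple lists float to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return doc ids ordered by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```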

Azure Stack

  • Orchestration: Semantic Kernel / Azure AI Foundry
  • Vector + keyword + reranker: Azure AI Search (all three in one service)
  • Doc store (parent nodes): Azure Cosmos DB (key-value lookup)
  • LLM: Azure OpenAI GPT-4o
  • Embedding model: Azure OpenAI text-embedding-3-large
  • Observability: Azure Monitor + Application Insights

Pattern-by-Pattern: Open Source vs Azure

Parent-Child

Open Source:

from llama_index.core import StorageContext
from llama_index.core.node_parser import HierarchicalNodeParser
from llama_index.core.retrievers import AutoMergingRetriever  # used in the next section
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.vector_stores.qdrant import QdrantVectorStore

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # grandparent → parent → child
)
nodes = parser.get_nodes_from_documents(documents)
# Leaf chunks → Qdrant. All nodes → MongoDB doc store.
storage_context = StorageContext.from_defaults(
    docstore=MongoDocumentStore.from_uri(mongo_uri),
    vector_store=QdrantVectorStore(collection_name="chunks", client=client)
)

Azure: one child-chunk AI Search index + Cosmos DB for parent lookup:

Index (child-chunks): chunk_id, parent_id, content, embedding [searchable]
Cosmos container (parents): parent_id, full_content, metadata [point-read only]

Flow: Search child index → extract parent_ids → point-read Cosmos by parent_id

Auto-Merging

Open Source:

retriever = AutoMergingRetriever(
    vector_retriever,       # searches child chunks in Qdrant
    storage_context,        # must include docstore with all node levels
    simple_ratio_thresh=0.5 # merge if ≥50% of siblings retrieved
)

Azure (custom logic in Semantic Kernel plugin):

// Inside an async iterator (IAsyncEnumerable) in the plugin
var response = await _searchClient.SearchAsync<Chunk>(
    query, new SearchOptions { Size = 20 });

var childChunks = new List<Chunk>();
await foreach (var result in response.Value.GetResultsAsync())
    childChunks.Add(result.Document);

foreach (var group in childChunks.GroupBy(c => c.ParentId)) {
    var totalSiblings = _parentMap[group.Key].ChildCount;
    if ((double)group.Count() / totalSiblings >= 0.5)
        yield return await _cosmos.GetParentAsync(group.Key); // merge up
    else
        foreach (var chunk in group) yield return chunk;      // keep children as-is
}

Hierarchical (Two-Stage)

Open Source:

# LlamaIndex RecursiveRetriever:
# summary index routes into per-document sub-indexes in Qdrant
from llama_index.core.retrievers import RecursiveRetriever, VectorIndexRetriever

summary_retriever = VectorIndexRetriever(summary_index, similarity_top_k=3)
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": summary_retriever, **doc_retrievers},
    node_dict=all_nodes
)

Azure (native filter on AI Search):

Step 1: Search "doc-summaries" index → get top 3 doc IDs
Step 2: Search "chunks" index with $filter=doc_id in ['id1','id2','id3']
        (pre-filter runs before vector search — fast)
Step 3: Semantic ranker re-scores Step 2 results
Step 4: Pass to GPT-4o

When to Choose What

  • Startup / prototype → Open Source (lower cost)
  • Air-gapped / on-prem → Open Source (only option)
  • Fine-tuned embedding model → Open Source (full control)
  • ML team available → Open Source (LlamaIndex native)
  • Enterprise + regulated (fintech, healthcare) → Azure (compliance built-in)
  • .NET / C# primary stack → Azure (Semantic Kernel native)
  • No ML infra team → Azure (managed everything)
  • Multi-region SLA → Azure (Azure infra)
  • Need hybrid search out-of-box → Azure (AI Search native)

What We Run in Production at MortgageIQ

MortgageIQ is a domain-grounded loan assistant on Azure OpenAI. Here's the actual stack:

Pattern in use: Parent-Child + Auto-Merging + Hybrid Search + Semantic Reranking

Why this combination:

  • Loan guidelines use precise codes ("CONV30", "FHA203K") — BM25 catches these, vector search misses them
  • Regulatory sections only make sense in context — parent-child prevents the LLM from seeing orphaned chunks
  • Auto-merging fires when a query touches multiple subsections of the same guideline — the LLM gets the full section, not fragments
  • Semantic ranker threshold at 0.70 — below that, we surface "no reliable information" rather than hallucinate. In a regulated lending context, a wrong answer is worse than no answer.
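
The threshold gate in that last bullet reduces to a few lines. A sketch, with `retrieve_and_rerank` and `generate` as hypothetical stand-ins for the retrieval pipeline and LLM:

```python
RERANKER_THRESHOLD = 0.70  # below this, refuse rather than guess

def gated_answer(query, retrieve_and_rerank, generate):
    """Threshold gate: if no reranked chunk clears the bar, return a
    controlled refusal instead of generating from weak context.

    `retrieve_and_rerank` (returns (chunk, score) pairs) and
    `generate` are placeholders for your pipeline and LLM.
    """
    scored = retrieve_and_rerank(query)
    grounded = [c for c, score in scored if score >= RERANKER_THRESHOLD]
    if not grounded:
        # Logged with query + scores: this refusal IS the audit trail.
        return "No reliable information found for this question."
    return generate(query, grounded)
```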

Numbers:

  • Retrieval miss rate on loan program codes: dropped from ~18% (pure vector) to under 2% (hybrid)
  • Context window utilization: 35% reduction after auto-merging replaced multi-chunk fragmentation
  • Compliance audit: every answer includes chunk source, document version, and retrieval score — full traceability

Which Pattern for Which Problem

  • Short/vague queries → HyDE
  • Multi-hop analytical questions → Query Decomposition
  • Chunks lose meaning in isolation → Parent-Child
  • Context fragmentation at high recall → Auto-Merging
  • 10K+ document knowledge base → Hierarchical
  • Precise sentence-level matching → Sentence Window
  • Multi-step reasoning + tool use → Agentic RAG
  • Incomplete knowledge base coverage → Corrective RAG (CRAG)
  • Entity relationship queries → Graph RAG
  • Everything else → Advanced RAG (hybrid + reranking)

Key Takeaways

  • No single pattern works for all queries — production systems combine two or three patterns based on query type and document structure.
  • Auto-merging and hierarchical solve the same root problem — chunk boundary artifacts — from opposite directions: bottom-up merge vs top-down filter.
  • Azure AI Search handles hybrid + reranking natively; parent-child and auto-merging require custom logic in Semantic Kernel regardless of cloud.
  • Open source gives you more control over the retrieval layer; Azure gives you compliance, SLA, and zero infra overhead — most enterprise teams land on Azure services + LlamaIndex/Semantic Kernel orchestration.
  • In regulated industries, a threshold filter that returns "I don't know" is a feature, not a limitation — it's your audit trail and your liability shield.

Coming Up in This Series

  • Day 3: Embedding Models — general vs domain-specific, fine-tuning triggers, Azure vs open source
  • Day 4: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
  • Day 5: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance