ai-ml · April 19, 2026 · rag · chunking · llama-index · azure-ai-search · embeddings · retrieval · enterprise-ai

Chunking Is the Most Underestimated Decision in RAG — Here's How to Get It Right

Wrong chunk size breaks retrieval before a single query runs. A complete guide to every chunking strategy — fixed-size, recursive, semantic, document-aware, late chunking — with open source and Azure implementations, and what we run in production at MortgageIQ.

Your RAG system's retrieval quality is decided before a single query runs. It's decided when you chunk.

Chunk too small: you retrieve precise fragments that lack the context the LLM needs to answer correctly. Chunk too large: you retrieve entire sections where the relevant sentence is buried in noise — and you blow your context window on irrelevant text.

Most teams pick fixed-size chunking, set chunk_size=512, and never revisit it. Then they spend weeks tuning prompts and rerankers trying to fix a problem that was baked in at index time.

This is a complete guide to every chunking strategy used in production — how each one works, when it breaks, how it's implemented in open source and Azure stacks, and what we actually run at MortgageIQ.


Why Chunking Is Architectural, Not Operational

A retrieval system is only as good as what it can retrieve. The retriever finds chunks — not documents. The LLM sees chunks — not documents. Every downstream quality metric — precision, recall, faithfulness, answer relevance — is bounded by the quality of your chunks.

The core tension: embedding models produce better vectors for shorter, focused text. LLMs produce better answers with more context. Chunking is the negotiation between these two constraints.


The Chunk Size Tradeoff

Before any strategy, understand the fundamental tradeoff:

[Figure: as chunk size increases, retrieval precision (blue) falls while answer context quality (orange) rises.]

Small chunks (64–256 tokens): high embedding precision, poor LLM context. Large chunks (1024–2048 tokens): poor embedding precision, rich LLM context. The sweet spot for most English enterprise text: 256–512 tokens for retrieval, with a parent-child strategy to restore context.

Token vs character count: 1 token ≈ 4 characters in English. A 512-token chunk ≈ 2,048 characters ≈ 350–400 words. Chunk size parameters in most frameworks accept tokens, not characters.
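
To sanity-check that conversion on your own corpus, count tokens with the same tokenizer your embedding model uses. A minimal sketch with tiktoken (cl100k_base is the tokenizer used by the text-embedding-3 models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for text-embedding-3-*

text = "FHA loans require a minimum 3.5% down payment for credit scores of 580 or above."
print(len(enc.encode(text)), "tokens /", len(text), "characters")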


Strategy 1 — Fixed-Size Chunking

Split text into chunks of exactly N tokens, with optional overlap. The simplest strategy. The most commonly misused.

Overlap: a sliding window that includes the last N tokens of the previous chunk in the next chunk. Prevents information loss at boundaries — a sentence split across a boundary appears in both chunks.

Where it works:

  • Uniform documents — transcripts, logs, plain prose without structure
  • Rapid prototyping and baseline benchmarks
  • When you have no information about document structure

Where it breaks:

  • Structured documents — splits mid-table, mid-list, mid-code block
  • Documents with section boundaries — a chunk contains the end of one section and the start of the next, embedding a mixed semantic signal
  • Short sections — a 100-token section gets merged with unrelated content from the next section

The overlap myth: overlap improves boundary coverage but inflates index size and embedding cost. With 512-token chunks, a 50-token overlap shrinks the stride between chunk starts to 462 tokens and adds roughly 11% cost; a 200-token overlap shrinks it to 312 and adds roughly 64%. Always measure whether overlap actually improves your retrieval metrics before increasing it.
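
The arithmetic as a sketch: a new chunk starts every chunk_size − overlap tokens, so total embedded tokens grow by the ratio chunk_size / (chunk_size − overlap):

def overlap_overhead(chunk_size: int, overlap: int) -> float:
    # Sliding window: stride between chunk starts is (chunk_size - overlap),
    # so the corpus is embedded chunk_size / (chunk_size - overlap) times over.
    return chunk_size / (chunk_size - overlap) - 1.0

print(f"{overlap_overhead(512, 50):.0%}")   # ~11% extra embedding cost
print(f"{overlap_overhead(512, 200):.0%}")  # ~64% extra embedding cost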

Open Source Implementation

from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separator=" "  # split on whitespace, not mid-word
)
nodes = splitter.get_nodes_from_documents(documents)

Azure Implementation

Azure AI Search indexers support fixed-size chunking via the Text Split skill in the AI enrichment pipeline:

{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "textSplitMode": "pages",
  "maximumPageLength": 512,
  "pageOverlapLength": 50,
  "unit": "azureOpenAITokens",
  "inputs": [{ "name": "text", "source": "/document/content" }],
  "outputs": [{ "name": "textItems", "targetName": "chunks" }]
}

Azure-specific: the azureOpenAITokens unit uses the same tokenizer as text-embedding-3-large/small, ensuring chunk sizes are accurate for the embedding model you're using.


Strategy 2 — Recursive Character Splitting

Splits on a hierarchy of separators — paragraphs first, then lines, then words, then characters — until chunks are within the size limit. LangChain's default. More structure-aware than fixed-size.

Where it works: plain prose documents with consistent paragraph structure — blog posts, reports, policy documents.

Where it breaks: technical documentation with mixed content (prose + code + tables). The separator hierarchy doesn't understand document semantics — it only sees characters.

Open Source Implementation

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]  # hierarchy: paragraph → line → word → character
)
chunks = splitter.split_text(document_text)

Azure Implementation

Not natively supported as a single skill. Implement in a Custom Web API skill that wraps your own recursive splitter, called from the AI enrichment pipeline.
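
A minimal sketch of the skill definition (the uri and field names are placeholders for your own chunking service):

{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "uri": "https://your-chunker.azurewebsites.net/api/split",
  "httpMethod": "POST",
  "timeout": "PT230S",
  "context": "/document",
  "inputs": [{ "name": "text", "source": "/document/content" }],
  "outputs": [{ "name": "chunks", "targetName": "chunks" }]
}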


Strategy 3 — Document-Aware Chunking

Parses document structure before chunking — headings, sections, tables, lists, code blocks — and respects those boundaries. Each chunk belongs to exactly one semantic section.

Critical: tables and code blocks must never be split mid-row or mid-block. Document-aware chunking is the only strategy that guarantees this.

Where it works: structured documents — Markdown, HTML, Word/PDF with heading styles, API documentation, runbooks, regulatory filings with numbered sections.

Where it breaks: scanned PDFs or poorly formatted documents with no structure. PDFs exported from PowerPoint. Documents where headings are styled visually rather than structurally.

Open Source Implementation — Markdown

from llama_index.core.node_parser import MarkdownNodeParser

# Splits on heading boundaries — h1, h2, h3
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# Each node includes section metadata
# node.metadata = {"header_path": "FHA Loans > Income Verification"}

Open Source Implementation — HTML

from langchain.text_splitter import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[
        ("h1", "section"),
        ("h2", "subsection"),
        ("h3", "topic")
    ]
)
chunks = splitter.split_text(html_content)
# Metadata propagated: chunk.metadata["section"] = "FHA Loan Requirements"

Open Source Implementation — Code

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Language-aware: splits on class → function → block boundaries
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=0  # no overlap — code context doesn't benefit from overlap
)

Azure Implementation

Azure AI Search's Document Layout skill (preview) uses Azure AI Document Intelligence to extract structure from PDFs, Word, and HTML — including tables as structured objects:

{
  "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
  "context": "/document",
  "outputMode": "oneToMany",
  "markdownHeaderDepth": "h3",
  "inputs": [{ "name": "file_data", "source": "/document/file_data" }],
  "outputs": [{ "name": "markdown_document", "targetName": "markdownDocument" }]
}

Under the hood, Document Intelligence extracts tables as structured objects with row/column coordinates intact, not flattened text, and the skill emits document structure as Markdown sections. This is a significant advantage over open source PDF parsers for documents with complex tables (regulatory filings, loan guidelines, pricing matrices).


Strategy 4 — Semantic Chunking

Instead of splitting on token count or document structure, split where the meaning changes. Compute embedding similarity between consecutive sentences. When similarity drops below a threshold, that's a chunk boundary.

Where it works: long documents with clear topic shifts — research papers, long-form guides, compliance documents that cover multiple distinct regulations.

Where it breaks:

  • Slow — requires embedding every sentence during indexing (2–3x embedding cost vs fixed-size)
  • Inconsistent chunk sizes — some chunks are 2 sentences, others are 20. LLMs perform better with consistent context windows.
  • Threshold sensitivity — too high: over-splits, too low: merges unrelated content

The threshold problem: there's no universal threshold. 0.5 cosine similarity works for general text. Domain-specific text may require calibration. Always plot your similarity distribution before setting a threshold.
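
A quick way to inspect that distribution (a sketch; embed is a stand-in for your embedding client, and sentence splitting is assumed to have happened upstream):

import numpy as np

def similarity_profile(sentences: list[str], embed) -> np.ndarray:
    # embed: hypothetical helper, list[str] -> np.ndarray of shape (n, dim)
    vecs = embed(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    # cosine similarity between each consecutive sentence pair
    return np.sum(vecs[:-1] * vecs[1:], axis=1)

sims = similarity_profile(sentences, embed)
print(np.percentile(sims, [5, 25, 50, 75, 95]))  # breakpoints live in the low tail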

Open Source Implementation

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

splitter = SemanticSplitterNodeParser(
    buffer_size=1,               # sentences to compare on each side
    breakpoint_percentile_threshold=95,  # split at top 5% similarity drops
    embed_model=embed_model
)
nodes = splitter.get_nodes_from_documents(documents)

Azure Implementation

No native semantic chunking skill in Azure AI Search. Implement as a Custom Web API skill that calls your embedding model and computes similarity breakpoints, then returns chunk boundaries to the pipeline.

Cost consideration: semantic chunking on 50K documents × an average of 100 sentences = 5M embedding calls just for indexing. At roughly 25 tokens per sentence that is ~125M tokens, about $16 at $0.13/1M tokens for text-embedding-3-large, plus the indexing latency and rate-limit pressure of 5M API calls. Use semantic chunking selectively.


Strategy 5 — Sentence Window Retrieval

Index at the sentence level. Store a surrounding window of sentences as metadata. When a sentence is retrieved, return the window to the LLM — not just the sentence.

Where it works: dense technical documents — legal, medical, regulatory — where exact sentence matching matters but the LLM needs surrounding sentences to interpret the answer.

Where it breaks: documents where adjacent sentences are unrelated (Q&A formats, lists). Window expansion adds noise instead of context.

Open Source Implementation

from llama_index.core.node_parser import SentenceWindowNodeParser

parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,              # 3 sentences before and after
    window_metadata_key="window",
    original_text_metadata_key="original_sentence"
)

# At query time — replace retrieved sentence with window
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

Strategy 6 — Late Chunking (Emerging)

Traditional chunking embeds chunks independently — each chunk's embedding has no awareness of the surrounding document. Late chunking inverts this: embed the entire document first (producing token-level embeddings with full document context), then pool token embeddings into chunk vectors.

Why it matters: pronouns and references resolve correctly. "It requires a 580 credit score" embeds with knowledge that "it" refers to an FHA loan — because the model saw the full document before pooling.

Current limitation: requires a long-context embedding model (bge-m3 at 8,192 tokens, Nomic embed-text-v1.5). Not yet supported natively in LlamaIndex or Azure AI Search — requires custom implementation. This is an emerging production pattern, not yet mainstream.
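
The mechanics in a minimal sketch, assuming bge-m3 via Hugging Face transformers and naive fixed-span chunk boundaries:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")

def late_chunk(text: str, chunk_tokens: int = 256) -> list[torch.Tensor]:
    # 1. Encode the whole document once: every token embedding sees full context.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    # 2. Only after encoding, pool contiguous token spans into chunk vectors.
    return [
        token_embs[i : i + chunk_tokens].mean(dim=0)
        for i in range(0, token_embs.shape[0], chunk_tokens)
    ]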


Strategy 7 — Hierarchical / Agentic Chunking

Multiple chunk sizes per document, stored in a tree. Retrieval starts at the coarse level (document summaries or large chunks) and drills down to fine-grained chunks within the top results. Covered in depth in the RAG Patterns post — applies here as a chunking strategy.
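
A minimal LlamaIndex sketch of the pattern (chunk sizes and top_k are illustrative):

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# Parse into a three-level tree; only leaf nodes get embedded.
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# The docstore keeps all levels so parents can be looked up at query time.
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)

index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# When enough sibling leaves are retrieved, they are swapped for their parent chunk.
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=12),
    storage_context
)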


The Document Type Playbook

Different document types require different strategies; this is what enterprise RAG actually looks like in practice.

Tables — The Special Case

Tables are the most common chunking failure in enterprise RAG. Splitting a table mid-row produces chunks that are semantically meaningless — a row without its header tells the LLM nothing.

Rule: always keep the header row in every table chunk. If a table is large, repeat the header on each chunk.

❌ Wrong:
Chunk 3: | $498,257 | Standard | 2025 |
         | $766,550 | High-cost | 2025 |

✓ Right:
Chunk 3: | Loan Limit | Area Type | Year |  ← header repeated
         | $498,257   | Standard  | 2025 |
         | $766,550   | High-cost | 2025 |
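
A minimal sketch of header-repeating chunking for a Markdown table:

def chunk_markdown_table(table: str, rows_per_chunk: int = 20) -> list[str]:
    lines = [line for line in table.splitlines() if line.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]
    # Repeat header + separator row at the top of every chunk of rows
    return [
        "\n".join([header, separator] + rows[i : i + rows_per_chunk])
        for i in range(0, len(rows), rows_per_chunk)
    ]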

In Azure, the Document Intelligence layout model extracts tables as structured JSON, with row/column coordinates for each cell. Your indexing pipeline can reconstruct well-formed table chunks from this structure.


Open Source vs Azure — Full Comparison

Open Source Stack (LlamaIndex)

| Strategy | Parser / Class | Notes |
|---|---|---|
| Fixed-size | TokenTextSplitter | Token-accurate, configurable separators |
| Recursive | RecursiveCharacterTextSplitter (LangChain) | Character-based, hierarchy of separators |
| Document-aware (Markdown) | MarkdownNodeParser | Splits on heading boundaries |
| Document-aware (HTML) | HTMLNodeParser | Preserves tag structure |
| Document-aware (Code) | CodeSplitter | Language-aware (Python, JS, Go, etc.) |
| Semantic | SemanticSplitterNodeParser | Embedding-based boundary detection |
| Sentence window | SentenceWindowNodeParser | Index sentence, return window |
| Hierarchical | HierarchicalNodeParser | Multi-level tree (128/512/2048) |
| Auto-merging | AutoMergingRetriever | Works with HierarchicalNodeParser |
| PDF extraction | unstructured.io + pymupdf | Open source PDF parsing |

LlamaIndex pipeline — multiple strategies per document type:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import (
    MarkdownNodeParser,
    HierarchicalNodeParser,
    CodeSplitter
)
from llama_index.core.ingestion import IngestionPipeline

# Route documents to appropriate parser by file type
md_pipeline = IngestionPipeline(transformations=[
    MarkdownNodeParser(),
    # add metadata: section path, doc source, version
])

code_pipeline = IngestionPipeline(transformations=[
    CodeSplitter(language="python", chunk_lines=40, chunk_lines_overlap=5)
])

hierarchical_pipeline = IngestionPipeline(transformations=[
    HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
])

Azure Stack

| Strategy | Azure Component | Notes |
|---|---|---|
| Fixed-size | Text Split skill | Native, token-accurate for OpenAI models |
| Document-aware (PDF/Word) | Document Layout skill (AI Document Intelligence) | Table extraction as JSON, heading detection |
| Document-aware (HTML) | Text Split skill + custom preprocessing | Native HTML stripping, then split |
| Custom strategies | Custom Web API skill | Any Python chunking logic callable from pipeline |
| OCR → text | OCR skill → Text Split skill | For scanned PDFs |
| Multi-language | Language Detection skill → route to correct splitter | Auto-detect language, apply language-specific splitting |

Azure AI Search indexer pipeline — document-type routing:

{
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.ConditionalSkill",
      "context": "/document",
      "inputs": [
        { "name": "condition", "source": "= $(/document/metadata_storage_file_extension) == '.pdf'" },
        { "name": "whenTrue", "source": "/document/content" }
      ],
      "outputs": [{ "name": "output", "targetName": "pdf_content" }]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "context": "/document/pdf_content",
      "textSplitMode": "pages",
      "maximumPageLength": 512,
      "pageOverlapLength": 50,
      "unit": "azureOpenAITokens",
      "inputs": [{ "name": "text", "source": "/document/pdf_content" }],
      "outputs": [{ "name": "textItems", "targetName": "chunks" }]
    }
  ]
}

Azure-specific advantages:

  • Document Intelligence handles complex PDF layouts (multi-column, embedded tables, forms) that open source parsers miss
  • Language Detection skill enables automatic language-routing before splitting — critical for multilingual document corpora
  • Integrated pipeline: extract → chunk → embed → index in one managed job with retry, error handling, and monitoring via Azure Monitor

Metadata Is Half the Chunk

Every chunk must carry metadata. Without metadata, retrieval is a black box — you can't filter, you can't cite sources, and you can't debug wrong answers.

Minimum metadata per chunk:

{
    "chunk_id": "doc_001_chunk_042",
    "doc_id": "fha-guidelines-2025",
    "doc_title": "FHA Single Family Handbook",
    "doc_version": "2025-Q1",
    "section": "Income Verification Requirements",
    "section_path": "Eligibility > Income > Verification",
    "page_number": 47,
    "chunk_index": 42,
    "total_chunks": 318,
    "char_count": 1847,
    "token_count": 461,
    "language": "en",
    "indexed_at": "2026-04-19T10:23:00Z"
}

Why this matters in enterprise systems:

  • doc_version — when guidelines update, you can invalidate and re-index only changed documents, not the entire corpus
  • section_path — enables hierarchical filtering ("only search within Income Verification sections")
  • page_number — citation back to source document for compliance audit
  • language — enables language-filtered retrieval for multilingual corpora

In Azure AI Search, metadata fields become searchable/filterable index fields — you can filter $filter=doc_version eq '2025-Q1' before the vector search runs. This dramatically reduces the search space and improves precision.
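
In the Python SDK this looks roughly like the following (endpoint, key, index, and field names are placeholders):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

client = SearchClient(
    endpoint="https://your-service.search.windows.net",
    index_name="chunks-index",
    credential=AzureKeyCredential("<admin-key>")
)

results = client.search(
    search_text=None,  # pure vector search; pass a query string for hybrid
    vector_queries=[VectorizedQuery(
        vector=query_embedding,       # your embedded query vector
        k_nearest_neighbors=5,
        fields="content_vector"
    )],
    filter="doc_version eq '2025-Q1'"  # metadata filter narrows the search space
)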


Evaluating Your Chunking Strategy

You can't tune what you don't measure. These are the signals that tell you your chunking is wrong:

| Signal | What it means | Fix |
|---|---|---|
| Low context recall (RAGAS) | Right document retrieved, but answer not in the chunks | Chunk too small — increase size or add parent-child |
| Low answer relevance | LLM answers from wrong topic within a chunk | Chunk too large — mixed topics per chunk |
| High faithfulness, low completeness | Answer is correct but incomplete | Adjacent chunks split a complete answer — add overlap or adjust boundaries |
| Exact term miss rate > 5% | Loan codes, IDs not retrieved | Add BM25 alongside vector search |
| Irrelevant citations in answers | Noisy chunks contaminate context | Chunk too large or semantic boundaries not respected |

Practical eval setup:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall, context_precision, faithfulness

# Build a test set: 50–100 questions with known ground-truth answers.
# "answer" and "contexts" come from running each question through your pipeline.
test_questions = [
    {"question": "What is the FHA down payment requirement?",
     "ground_truth": "FHA loans require a minimum 3.5% down payment...",
     "answer": generated_answer,      # your RAG pipeline's answer
     "contexts": retrieved_chunks}    # list[str] of retrieved chunk texts
]

results = evaluate(
    dataset=Dataset.from_list(test_questions),
    metrics=[context_recall, context_precision, faithfulness],
    llm=your_llm,
    embeddings=your_embed_model
)
# context_recall < 0.7 → chunking problem
# context_precision < 0.7 → retrieval / reranking problem

Run a chunking A/B test before committing to a strategy (a minimal loop sketch follows these steps):

  1. Index the same corpus with two strategies
  2. Run the same 50 test queries against both indexes
  3. Compare context_recall and context_precision
  4. Pick the winner — then tune overlap and size within that strategy
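
A minimal loop for the comparison; build_index and run_eval are hypothetical wrappers around your ingestion pipeline and the RAGAS evaluation above:

for strategy in ("fixed_512", "document_aware"):
    index = build_index(corpus, strategy=strategy)   # hypothetical wrapper
    scores = run_eval(index, test_questions)         # hypothetical wrapper
    print(strategy, scores["context_recall"], scores["context_precision"])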

What We Run in Production at MortgageIQ

MortgageIQ has four distinct document types. Each uses a different chunking strategy.

What this means in practice:

  • Guidelines: parent-child with Azure Doc Intelligence. Azure handles the complex table extraction (DTI limit tables, loan limit matrices). Parent chunks (2048 tokens, full sections) live in Cosmos DB. Child chunks (512 tokens) in AI Search. Auto-merge fires when a query touches multiple subsections of the same guideline.

  • Regulatory filings: sentence window with document-aware splitting. RESPA and TRID rules are dense, cross-referential text. Sentence-level indexing with ±2 sentence window gives the LLM enough regulatory context without noise.

  • FAQs: Q+A pairs kept together. Splitting a question from its answer is the most common FAQ chunking mistake. Each chunk = one Q+A pair, tagged with the question as metadata for hybrid search (see the sketch after this list).

  • Rate sheets: row-level with header repetition. Rate sheets update daily — lightweight chunks with full version metadata enable targeted re-indexing of changed rows only, not full re-index.
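
A minimal sketch of the Q+A pairing (faq_pairs is a hypothetical list of (question, answer) tuples from your FAQ parser):

from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text=f"Q: {q}\nA: {a}",  # keep each question with its answer in one chunk
        metadata={"question": q, "doc_type": "faq"}  # question doubles as a hybrid-search field
    )
    for q, a in faq_pairs  # hypothetical: (question, answer) tuples
]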

Results after switching from fixed-size (512) to document-type routing:

  • Context recall on guideline queries: 0.61 → 0.84
  • Table-based query accuracy: 0.43 → 0.91 (Azure Doc Intelligence table extraction was the decisive change)
  • Re-indexing time for daily rate sheet updates: 4 minutes → 22 seconds (row-level chunks vs full document re-index)

Key Takeaways

  • Chunking is decided once at index time and affects every query forever — treat it as architecture, not configuration. Wrong chunk boundaries compound into every downstream retrieval failure.
  • No single strategy works for all document types — production RAG systems route documents to different chunking strategies based on file type and structure.
  • Tables are the most common chunking failure — always preserve header rows and keep rows with their table. Azure Document Intelligence handles this better than any open source PDF parser.
  • Metadata is half the chunk — without doc_version, section_path, and page_number, you can't filter, cite, debug, or audit. Build metadata from the start.
  • Measure before you tune — run a 50-question eval with context_recall and context_precision before and after changing your strategy. Gut feel is not a chunking metric.

Part 2 — Beyond Documents

This post covers chunking strategies for document-based sources: PDFs, Markdown, HTML, Word, and code. But enterprise knowledge doesn't live only in documents.

Part 2: Your Knowledge Isn't in PDFs — How to Index Every Enterprise Data Source into RAG covers chunking and embedding for SQL databases, SharePoint, Outlook email threads, Microsoft Teams conversations, REST APIs, JIRA tickets, Git repositories, and scanned images — including the ingestion architecture that ties all sources into a unified pipeline.


Coming Up in This Series

  • Day 5: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test in production
  • Day 6: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance