import Details from ‘@theme/Details’;

Chunking Strategies#

The most underrated step in RAG. Better chunking → better retrieval → better answers. No amount of fancy reranking rescues bad chunks.


Why Chunking Matters#

Your embedding model has a token limit (typically 512–8192 tokens). You can’t embed an entire PDF at once. But chunks that are too small lose context; chunks too large dilute the signal.

The goal: chunks that are semantically coherent and self-contained enough to answer a question on their own.

graph TD
  A[Raw Document] --> B{Chunking Strategy}
  B --> C[Too Small: loses context]
  B --> D[Too Large: dilutes signal]
  B --> E[Just Right: coherent + retrievable]

The Five Strategies#

1. Fixed-Size Chunking (Baseline)#

Split every N characters or tokens, with M tokens of overlap.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # characters per chunk
    chunk_overlap=50,     # overlap between consecutive chunks
    separators=["\n\n", "\n", ". ", " ", ""]  # try these in order
)

text = open("document.txt").read()
chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
print(chunks[0])

Pros: Simple, fast, works everywhere.
Cons: Splits mid-sentence, mid-thought. Loses paragraph/section structure.

Best for: Quick prototyping, homogeneous plain-text corpora.


2. Sentence / Paragraph Chunking#

Respect natural text boundaries — sentences and paragraphs.

import nltk
nltk.download('punkt')

from langchain.text_splitter import NLTKTextSplitter

splitter = NLTKTextSplitter(chunk_size=500)
chunks = splitter.split_text(text)

Or with spaCy for smarter sentence detection:

import spacy
nlp = spacy.load("en_core_web_sm")

def sentence_chunks(text, max_tokens=200):
    doc = nlp(text)
    chunks, current, count = [], [], 0
    for sent in doc.sents:
        tokens = len(sent)
        if count + tokens > max_tokens and current:
            chunks.append(" ".join(s.text for s in current))
            current, count = [], 0
        current.append(sent)
        count += tokens
    if current:
        chunks.append(" ".join(s.text for s in current))
    return chunks

3. Header-Based (Markdown / Structured Documents)#

Split on Markdown headers to preserve section structure.

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
docs = splitter.split_text(markdown_text)

# Each doc carries metadata about which section it belongs to
for doc in docs[:3]:
    print(doc.metadata)  # {'h1': 'Introduction', 'h2': 'Background'}
    print(doc.page_content[:100])
    print("---")

Best for: Documentation sites, course notes, textbooks with clear structure.


4. Parent-Child Chunking#

Store large “parent” chunks for context, index small “child” chunks for retrieval. When a child chunk matches, return the parent.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Parent = big chunks (for LLM context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# Child = small chunks (for embedding + retrieval)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Index documents
from langchain.document_loaders import TextLoader
loader = TextLoader("document.txt")
docs = loader.load()
retriever.add_documents(docs)

# Retrieve — small chunk matches, but returns big parent
results = retriever.invoke("What is chunking?")
print(f"Returned {len(results)} parent chunks")
print(f"Parent chunk length: {len(results[0].page_content)} chars")

Best for: Long documents where you need precision in retrieval but rich context for generation.


5. Semantic Chunking#

Group sentences by semantic similarity — split when the meaning shifts.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Uses embedding distance between consecutive sentences to decide split points
splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",   # or "standard_deviation"
    breakpoint_threshold_amount=95,           # split when > 95th percentile shift
)

chunks = splitter.split_text(text)
print(f"Semantic chunks: {len(chunks)}")

Best for: Heterogeneous documents, articles that shift topics mid-page.
Cons: Slower (requires embedding every sentence), higher API cost.


Token-Based Chunking (Production-Grade)#

Always chunk by tokens, not characters — your embedding model counts tokens.

import tiktoken

def token_chunker(text: str, model: str = "text-embedding-3-small",
                  max_tokens: int = 512, overlap: int = 50):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        start += max_tokens - overlap
    return chunks

chunks = token_chunker(text, max_tokens=512, overlap=50)
print(f"Token-based chunks: {len(chunks)}")

Strategy Decision Guide#

Document TypeBest Strategy
Plain text, articlesRecursive character + overlap
Markdown / docs siteHeader-based splitter
PDFs with structureHeader-based after parsing
Legal / academic papersSemantic chunking
Long books with chaptersParent-child
Code filesLanguage-specific splitters

Common Mistakes#

MistakeFix
Too-small chunks (< 100 tokens)Missing context, can’t answer questions alone
No overlapAnswers split across chunk boundaries get lost
Ignoring metadataAlways store source, page number, section header
Chunking before cleaningRemove headers/footers/page numbers first

Adding Metadata to Chunks#

Always attach metadata — you’ll need it for citations later.

from langchain.schema import Document

def chunk_with_metadata(filepath: str, chunks: list[str]) -> list[Document]:
    docs = []
    for i, chunk in enumerate(chunks):
        docs.append(Document(
            page_content=chunk,
            metadata={
                "source": filepath,
                "chunk_id": i,
                "total_chunks": len(chunks),
            }
        ))
    return docs

Quick Reference#

# Install everything you need
uv add langchain langchain-openai langchain-experimental tiktoken nltk spacy

# The go-to splitter for most use cases
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # tokens (or characters if you don't use tiktoken)
    chunk_overlap=64,
)

Further Reading#