Document Chunking for RAG (Recursive, Semantic, Sentence-level)
Chunking is the process of splitting documents into fragments for indexing in a vector database. Chunk size and boundaries critically affect RAG quality: fragments that are too small lose context, while fragments that are too large dilute search accuracy and can exceed the model's context window.
Chunking Strategies
Fixed-size chunking — the simplest strategy and usually the worst-performing:
```python
def fixed_size_chunk(text: str, chunk_size: int = 500,
                     overlap: int = 50) -> list[str]:
    tokens = text.split()  # Simplified whitespace tokenization
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = ' '.join(tokens[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
```
Problem: cuts sentences and paragraphs in the middle.
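To see the failure mode concretely, here is a self-contained check (it repeats the splitter above with small, illustrative parameters) where the second chunk starts and ends mid-sentence:

```python
def fixed_size_chunk(text: str, chunk_size: int = 500,
                     overlap: int = 50) -> list[str]:
    tokens = text.split()
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunks.append(' '.join(tokens[i:i + chunk_size]))
    return chunks

text = ("Vector databases store embeddings for fast similarity search. "
        "Chunk boundaries decide what each embedding can represent.")
# Tiny chunk_size chosen only to make the cut visible
chunks = fixed_size_chunk(text, chunk_size=8, overlap=2)
# chunks[1] == "similarity search. Chunk boundaries decide what each embedding"
```

The middle chunk mixes the tail of one sentence with a truncated piece of the next, so its embedding represents neither statement well.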
Recursive character text splitter (LangChain) — splits by hierarchy of delimiters:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # ~250 words
    chunk_overlap=200,  # ~50-word overlap
    separators=[
        "\n\n",  # Paragraphs (highest priority)
        "\n",    # Lines
        ". ",    # Sentences
        ", ",    # Clauses
        " ",     # Words (last resort)
        "",      # Characters
    ],
)

chunks = splitter.create_documents(
    texts=[document_text],
    metadatas=[{"source": "document.pdf", "page": 1}],
)
```
Semantic chunking — splitting by semantic boundaries:
```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticChunker:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2',
                 threshold: float = 0.7):
        self.model = SentenceTransformer(model_name)
        self.threshold = threshold

    def chunk(self, text: str) -> list[str]:
        # Split into sentences
        sentences = self._split_into_sentences(text)
        if len(sentences) < 2:
            return [text]
        # Sentence embeddings
        embeddings = self.model.encode(sentences)
        # Find semantic breaks between adjacent sentences
        chunks = []
        current_chunk = [sentences[0]]
        for i in range(1, len(sentences)):
            # Cosine similarity of adjacent sentences
            sim = np.dot(embeddings[i], embeddings[i - 1]) / (
                np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i - 1])
            )
            if sim < self.threshold:
                # Semantic break: start a new chunk
                chunks.append(' '.join(current_chunk))
                current_chunk = []
            current_chunk.append(sentences[i])
        if current_chunk:
            chunks.append(' '.join(current_chunk))
        # Merge chunks that are too small
        return self._merge_small_chunks(chunks, min_words=50)

    def _split_into_sentences(self, text: str) -> list[str]:
        # Naive punctuation-based splitter; use nltk/spacy in production
        return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    def _merge_small_chunks(self, chunks: list[str], min_words: int) -> list[str]:
        # Glue each undersized chunk onto its predecessor
        merged: list[str] = []
        for chunk in chunks:
            if merged and len(merged[-1].split()) < min_words:
                merged[-1] = merged[-1] + ' ' + chunk
            else:
                merged.append(chunk)
        return merged
```
Document structure-aware chunking — preserving document hierarchy:
```python
import re

class StructureAwareChunker:
    def chunk_markdown(self, text: str, max_chunk_tokens: int = 300) -> list[dict]:
        """Split text while respecting Markdown headers (levels 1-3)."""
        # The capturing group keeps the matched headers in the output
        sections = re.split(r'\n(#{1,3}\s+.+)', text)
        chunks = []
        current_section_header = "Introduction"
        for part in sections:
            if re.match(r'#{1,3}\s+', part):
                current_section_header = part.strip()
            else:
                # Split the section into sub-chunks if it is large
                for sub_chunk in self._split_section(part, max_chunk_tokens):
                    if sub_chunk.strip():
                        chunks.append({
                            'text': sub_chunk,
                            'section': current_section_header,
                            # Breadcrumb for source attribution
                            'breadcrumb': current_section_header,
                        })
        return chunks

    def _split_section(self, text: str, max_tokens: int) -> list[str]:
        # Approximate tokens by whitespace-separated words
        words = text.split()
        return [' '.join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
```
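The header detection hinges on a subtlety of `re.split`: a capturing group in the pattern makes the matched separators (the headers) appear in the result list instead of being discarded. A quick standalone check:

```python
import re

md = "Intro text.\n## Setup\nInstall things.\n### Details\nMore text."
parts = re.split(r'\n(#{1,3}\s+.+)', md)
# parts alternates between body text and the captured headers:
# ['Intro text.', '## Setup', '\nInstall things.', '### Details', '\nMore text.']
```

Without the parentheses, the headers would be consumed as delimiters and the section labels would be lost.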
Optimal Chunking Parameters by Content Type
| Document Type | Strategy | Chunk size | Overlap |
|---|---|---|---|
| Technical documentation | Structural | 500-1000 | 100-200 |
| Scientific papers | Semantic | 800-1500 | 150-300 |
| FAQ / Q&A | By questions | 100-300 | 0 |
| Code | By functions | Variable | 0 |
| News/blogs | Recursive | 400-800 | 80-150 |
| Chats | By sessions | 300-700 | 50 |
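The table above can be encoded as a simple lookup for a chunking pipeline. The keys, strategy names, and fallback choice here are illustrative, not a library API:

```python
# (min, max) ranges mirror the recommendations in the table
CHUNKING_PRESETS = {
    'technical_docs': {'strategy': 'structural',  'chunk_size': (500, 1000), 'overlap': (100, 200)},
    'scientific':     {'strategy': 'semantic',    'chunk_size': (800, 1500), 'overlap': (150, 300)},
    'faq':            {'strategy': 'by_question', 'chunk_size': (100, 300),  'overlap': (0, 0)},
    'code':           {'strategy': 'by_function', 'chunk_size': None,        'overlap': (0, 0)},
    'news':           {'strategy': 'recursive',   'chunk_size': (400, 800),  'overlap': (80, 150)},
    'chat':           {'strategy': 'by_session',  'chunk_size': (300, 700),  'overlap': (50, 50)},
}

def preset_for(doc_type: str) -> dict:
    # Fall back to the recursive splitter for unknown document types
    return CHUNKING_PRESETS.get(doc_type, CHUNKING_PRESETS['news'])
```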
Chunk Metadata and Parent-Child Indexing
Small-to-big retrieval — index small chunks for precise search, but pass their larger parent chunks to the LLM as context:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

class ParentChildIndexer:
    def index(self, document: str) -> list[dict]:
        # Parent chunks (large, for context)
        parent_splitter = RecursiveCharacterTextSplitter(
            chunk_size=2000, chunk_overlap=200
        )
        # Child chunks (small, for search)
        child_splitter = RecursiveCharacterTextSplitter(
            chunk_size=300, chunk_overlap=50
        )
        parents = parent_splitter.split_text(document)
        all_chunks = []
        for p_idx, parent in enumerate(parents):
            for child in child_splitter.split_text(parent):
                all_chunks.append({
                    'child_text': child,    # For embedding and search
                    'parent_text': parent,  # For passing to the LLM
                    'parent_idx': p_idx,
                })
        return all_chunks
```
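The retrieval side of such an index scores the child texts but returns the parent texts. A minimal sketch; word-overlap scoring stands in for embedding similarity here purely to keep the example dependency-free:

```python
def retrieve_parents(query: str, chunks: list[dict], top_k: int = 2) -> list[str]:
    q = set(query.lower().split())

    def score(chunk: dict) -> int:
        # Stand-in for cosine similarity over child embeddings
        return len(q & set(chunk['child_text'].lower().split()))

    ranked = sorted(chunks, key=score, reverse=True)
    # Deduplicate parents while preserving rank order
    seen, parents = set(), []
    for c in ranked:
        if c['parent_idx'] not in seen:
            seen.add(c['parent_idx'])
            parents.append(c['parent_text'])
        if len(parents) == top_k:
            break
    return parents

index = [
    {'child_text': 'vector search uses embeddings', 'parent_text': 'P0 full text', 'parent_idx': 0},
    {'child_text': 'chunk overlap preserves context', 'parent_text': 'P1 full text', 'parent_idx': 1},
]
```

The deduplication step matters: several children of the same parent often match a query, and passing the same parent to the LLM twice wastes context budget.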
Choosing the right chunking strategy can improve retrieval relevance by 15-30% compared to a naive fixed-size approach.