Document Chunking for RAG (Recursive, Semantic, Sentence-level)

Chunking is the process of splitting documents into fragments for indexing in a vector database. Chunk size and boundaries critically affect RAG quality: fragments that are too small lose context, while fragments that are too large dilute search accuracy and can exceed the model's context window.
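The context-window constraint is easy to sanity-check up front. A minimal sketch, assuming the common ~4-characters-per-token heuristic for English (use a real tokenizer such as tiktoken for exact counts); the function names here are illustrative:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    A heuristic, not a tokenizer."""
    return max(1, len(text) // 4)

def fits_context(chunks: list[str], context_window: int = 8192,
                 reserved: int = 1024) -> bool:
    """Check whether retrieved chunks fit the model's context window,
    reserving room for the prompt template and the answer."""
    total = sum(estimate_tokens(c) for c in chunks)
    return total <= context_window - reserved
```
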

Chunking Strategies

Fixed-size chunking — the simplest approach, and the weakest:

def fixed_size_chunk(text: str, chunk_size: int = 500,
                     overlap: int = 50) -> list[str]:
    tokens = text.split()  # Simplified
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = ' '.join(tokens[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

Problem: cuts sentences and paragraphs in the middle.
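A cheap remedy, and the sentence-level strategy from the title: keep the fixed word budget but pack whole sentences, so no chunk ever starts or ends mid-sentence. A minimal sketch with a naive regex sentence splitter (nltk or spaCy give more robust boundaries):

```python
import re

def sentence_chunk(text: str, max_words: int = 120) -> list[str]:
    """Pack whole sentences into chunks of at most max_words words."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        # Flush the current chunk if adding this sentence would overflow
        if current and count + words > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        chunks.append(' '.join(current))
    return chunks
```

A single sentence longer than `max_words` still becomes its own oversized chunk; in practice such outliers are rare enough to handle separately.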

Recursive character text splitter (LangChain) — splits by hierarchy of delimiters:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # characters (roughly 150-250 words)
    chunk_overlap=200,     # ~20% overlap between adjacent chunks
    separators=[
        "\n\n",  # Paragraphs (priority)
        "\n",    # Lines
        ". ",    # Sentences
        ", ",    # Parts of sentences
        " ",     # Words (last resort)
        ""       # Characters
    ]
)

chunks = splitter.create_documents(
    texts=[document_text],
    metadatas=[{"source": "document.pdf", "page": 1}]
)
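The core idea needs no library at all: try each separator in priority order, greedily pack pieces up to the size limit, and recurse on any piece that is still too large. A simplified, dependency-free illustration (the real LangChain splitter additionally handles overlap and length functions):

```python
def recursive_split(text: str, chunk_size: int = 1000,
                    separators: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split hierarchically: prefer paragraph breaks, then lines,
    sentences, words; hard-cut by characters as a last resort."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for piece in text.split(sep):
                candidate = current + sep + piece if current else piece
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(piece) > chunk_size:
                        # Piece itself too large: recurse with finer separators
                        chunks.extend(recursive_split(piece, chunk_size, separators))
                        current = ""
                    else:
                        current = piece
            if current:
                chunks.append(current)
            return chunks
    # No separator present: hard cut by characters
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```
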

Semantic chunking — splitting by semantic boundaries:

from sentence_transformers import SentenceTransformer
import numpy as np
import re

class SemanticChunker:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2',
                 threshold: float = 0.7):
        self.model = SentenceTransformer(model_name)
        self.threshold = threshold

    def chunk(self, text: str) -> list[str]:
        # Split into sentences
        sentences = self._split_into_sentences(text)
        if len(sentences) < 2:
            return [text]

        # Sentence embeddings
        embeddings = self.model.encode(sentences)

        # Find semantic breaks between adjacent sentences
        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            # Cosine similarity of adjacent sentences
            sim = np.dot(embeddings[i], embeddings[i - 1]) / (
                np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i - 1])
            )

            if sim < self.threshold:
                # Semantic break: start a new chunk
                chunks.append(' '.join(current_chunk))
                current_chunk = []

            current_chunk.append(sentences[i])

        if current_chunk:
            chunks.append(' '.join(current_chunk))

        # Merge chunks that came out too small
        return self._merge_small_chunks(chunks, min_words=50)

    def _split_into_sentences(self, text: str) -> list[str]:
        # Naive regex splitter; use nltk or spaCy for robust boundaries
        return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    def _merge_small_chunks(self, chunks: list[str],
                            min_words: int = 50) -> list[str]:
        # Attach undersized chunks to their predecessor
        merged = []
        for chunk in chunks:
            if merged and len(chunk.split()) < min_words:
                merged[-1] += ' ' + chunk
            else:
                merged.append(chunk)
        return merged

Document structure-aware chunking — preserving document hierarchy:

import re

class StructureAwareChunker:
    def chunk_markdown(self, text: str, max_chunk_tokens: int = 300) -> list[dict]:
        """Split text while respecting Markdown headers (#, ##, ###)"""
        sections = re.split(r'\n(#{1,3}\s+.+)', text)
        chunks = []
        current_section_header = "Introduction"

        for part in sections:
            if re.match(r'#{1,3}\s+', part):
                current_section_header = part.strip()
            else:
                # Split large sections into sub-chunks
                sub_chunks = self._split_section(part, max_chunk_tokens)
                for sub_chunk in sub_chunks:
                    if sub_chunk.strip():
                        chunks.append({
                            'text': sub_chunk,
                            'section': current_section_header,
                            # Breadcrumb for source attribution
                            'breadcrumb': current_section_header
                        })

        return chunks

    def _split_section(self, text: str, max_tokens: int) -> list[str]:
        # Word count as a rough proxy for token count
        words = text.split()
        return [' '.join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]

Optimal Chunking Parameters by Content Type

Document type            Strategy       Chunk size (tokens)   Overlap (tokens)
Technical documentation  Structural     500-1000              100-200
Scientific papers        Semantic       800-1500              150-300
FAQ / Q&A                By questions   100-300               0
Code                     By functions   Variable              0
News/blogs               Recursive      400-800               80-150
Chats                    By sessions    300-700               50
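In a pipeline, the table above is easy to encode as a lookup. A sketch with illustrative names and the defaults transcribed from the table (none of this is a standard API):

```python
from dataclasses import dataclass

@dataclass
class ChunkingConfig:
    strategy: str
    chunk_size: tuple[int, int]   # (min, max) tokens
    overlap: tuple[int, int]      # (min, max) tokens

# Values transcribed from the table above
CHUNKING_PRESETS = {
    "technical_docs":   ChunkingConfig("structural",  (500, 1000), (100, 200)),
    "scientific_paper": ChunkingConfig("semantic",    (800, 1500), (150, 300)),
    "faq":              ChunkingConfig("by_question", (100, 300),  (0, 0)),
    "code":             ChunkingConfig("by_function", (0, 0),      (0, 0)),
    "news_blog":        ChunkingConfig("recursive",   (400, 800),  (80, 150)),
    "chat":             ChunkingConfig("by_session",  (300, 700),  (50, 50)),
}

def config_for(doc_type: str) -> ChunkingConfig:
    """Fall back to the recursive preset for unknown document types."""
    return CHUNKING_PRESETS.get(doc_type, CHUNKING_PRESETS["news_blog"])
```
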

Chunk Metadata and Parent-Child Indexing

Small-to-big retrieval — index small chunks for accurate search, but pass large parent chunks to context:

from langchain.text_splitter import RecursiveCharacterTextSplitter

class ParentChildIndexer:
    def index(self, document: str) -> list[dict]:
        # Parent chunks (large, for context)
        parent_splitter = RecursiveCharacterTextSplitter(
            chunk_size=2000, chunk_overlap=200
        )
        # Child chunks (small, for search)
        child_splitter = RecursiveCharacterTextSplitter(
            chunk_size=300, chunk_overlap=50
        )
        parents = parent_splitter.split_text(document)

        all_chunks = []
        for p_idx, parent in enumerate(parents):
            for child in child_splitter.split_text(parent):
                all_chunks.append({
                    'child_text': child,       # For embedding and search
                    'parent_text': parent,     # For passing to LLM
                    'parent_idx': p_idx
                })

        return all_chunks
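At query time the matching runs over children, but the context is assembled from their parents, deduplicated so one strong parent hit by several children does not crowd out the rest. A sketch over the index structure above, with word-overlap scoring standing in for real embedding similarity:

```python
def retrieve_parents(query: str, indexed: list[dict], top_k: int = 2) -> list[str]:
    """Small-to-big retrieval: rank small child chunks, return their
    deduplicated parent chunks for the LLM context."""
    q_words = set(query.lower().split())
    # Placeholder scoring: word overlap instead of embedding similarity
    scored = sorted(
        indexed,
        key=lambda c: len(q_words & set(c['child_text'].lower().split())),
        reverse=True,
    )
    parents, seen = [], set()
    for chunk in scored:
        if chunk['parent_idx'] not in seen:
            seen.add(chunk['parent_idx'])
            parents.append(chunk['parent_text'])
        if len(parents) == top_k:
            break
    return parents
```
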

A well-chosen chunking strategy improves retrieval relevance by 15-30% compared to the naive fixed-size approach.