LlamaIndex Integration for RAG Systems
LlamaIndex (formerly GPT Index) is a framework specialized for RAG: document loading, indexing, and querying. Where LangChain aims to be a universal LLM toolkit, LlamaIndex concentrates on data ingestion, advanced retrieval, and query understanding, and it stands out for its rich ecosystem of loaders (150+ data sources) and advanced indexing strategies.
Basic RAG with LlamaIndex
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Global settings configuration
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)

# Load documents
documents = SimpleDirectoryReader("./data", recursive=True).load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What is the warranty period for the equipment?")
print(response)

# Access sources
for node in response.source_nodes:
    print(f"Score: {node.score:.3f}, Source: {node.metadata.get('file_name')}")
```
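The `chunk_size` and `chunk_overlap` settings above control how documents are cut into retrievable nodes. A minimal sliding-window sketch of the idea (illustrative only; LlamaIndex's `SentenceSplitter` additionally respects sentence boundaries):

```python
def split_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(10)]
chunks = split_with_overlap(tokens, chunk_size=4, overlap=2)
# Each chunk shares `overlap` tokens with its neighbor, so context
# at chunk boundaries is not lost during retrieval.
```

The overlap is why a fact split across two sentences still lands intact in at least one chunk.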
Vector Store Integration
```python
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext
import qdrant_client

# Connect to Qdrant
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Index into Qdrant
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True,
)

# Reload an existing index without re-embedding the documents
index = VectorStoreIndex.from_vector_store(vector_store)
```
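The `similarity_top_k` parameter used in the query engines controls how many nearest nodes the vector store returns for a query. Stripped of the Qdrant client, the core operation is cosine-similarity ranking, shown here as a pure-Python illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], store: list, k: int = 2) -> list[str]:
    """store: list of (doc_id, embedding). Return the k most similar doc_ids."""
    scored = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

store = [("warranty", [1.0, 0.1]), ("pricing", [0.0, 1.0]), ("terms", [0.9, 0.3])]
result = top_k([1.0, 0.0], store, k=2)
# "warranty" and "terms" point in nearly the same direction as the query vector
```

Production stores like Qdrant replace the linear scan with an approximate nearest-neighbor index, but the ranking semantics are the same.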
SubQuestionQueryEngine: Breaking Down Complex Questions
```python
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

# Create tools from different sources (financial_index and contracts_index
# are assumed to be pre-built VectorStoreIndex instances)
financial_tool = QueryEngineTool.from_defaults(
    query_engine=financial_index.as_query_engine(),
    name="financial_data",
    description="Company financial metrics for 2023–2025",
)
contracts_tool = QueryEngineTool.from_defaults(
    query_engine=contracts_index.as_query_engine(),
    name="contracts",
    description="Contracts with suppliers and clients",
)

# The SubQuestion engine automatically breaks complex queries into sub-questions
engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[financial_tool, contracts_tool],
    use_async=True,
)
response = engine.query(
    "Compare Q1 2025 revenue with budget and check for overdue payments in contracts"
)
# Here the engine generates two sub-questions (one per tool) and merges the results
```
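Under the hood, the engine asks the LLM to decompose the compound query into one sub-question per relevant tool, runs them (concurrently when `use_async=True`), and synthesizes a final answer. A toy sketch of that control flow, with the LLM decomposition and synthesis steps stubbed out as plain functions:

```python
def decompose(query: str, tools: dict) -> dict:
    """Stand-in for the LLM step: map each tool to a sub-question.
    The real engine prompts the LLM with the tool descriptions."""
    return {name: f"Regarding {desc}: answer the part of '{query}' you cover"
            for name, desc in tools.items()}

def sub_question_query(query: str, tools: dict, engines: dict) -> str:
    sub_questions = decompose(query, tools)
    answers = [engines[name](sq) for name, sq in sub_questions.items()]
    return " | ".join(answers)  # stand-in for LLM answer synthesis

tools = {"financial_data": "financial metrics", "contracts": "supplier contracts"}
engines = {"financial_data": lambda q: "revenue above budget",
           "contracts": lambda q: "two overdue payments"}
result = sub_question_query("Compare revenue with budget and check overdue payments",
                            tools, engines)
```

The real engine gets its leverage from the tool `description` fields, which is why writing them precisely matters more than it first appears.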
RouterQueryEngine: Index-Based Routing
```python
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# summary_index and vector_index are assumed to be pre-built indexes
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        QueryEngineTool.from_defaults(
            query_engine=summary_index.as_query_engine(response_mode="tree_summarize"),
            description="For summary questions about the document as a whole",
        ),
        QueryEngineTool.from_defaults(
            query_engine=vector_index.as_query_engine(),
            description="For searching specific facts and details",
        ),
    ],
)
```
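The selector shows the LLM each tool's description and asks it to pick exactly one; the chosen engine then answers the query alone. A keyword-based stand-in for the LLM selector, just to make the routing logic concrete:

```python
def select_engine(query: str, tools: list) -> int:
    """tools: list of (description, engine) pairs. Stand-in for LLMSingleSelector:
    the real selector prompts an LLM with the tool descriptions."""
    summary_words = ("summary", "summarize", "overall", "whole")
    if any(word in query.lower() for word in summary_words):
        return 0  # route to the tree-summarize engine
    return 1      # route to the vector-search engine

tools = [("summary questions", lambda q: "summary answer"),
         ("specific facts", lambda q: "fact answer")]
idx = select_engine("Summarize the document", tools)
answer = tools[idx][1]("Summarize the document")
```

Unlike the SubQuestion engine, the router runs only one downstream engine per query, so it adds a single extra LLM call rather than one per tool.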
IngestionPipeline: Advanced Preprocessing
```python
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor, QuestionsAnsweredExtractor

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(nodes=3),  # Adds the document title to each chunk's metadata
        QuestionsAnsweredExtractor(questions=5),  # Adds questions each chunk answers, improving question-to-chunk matching
        OpenAIEmbedding(model="text-embedding-3-small"),
    ],
    vector_store=vector_store,
    cache=IngestionCache(),  # Skips re-processing of already-processed documents
)

# Run inside an async context; use pipeline.run(...) for the synchronous variant
nodes = await pipeline.arun(documents=documents, show_progress=True)
```
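The cache is what makes automatic re-indexing cheap: each (input, transformation) pair is keyed by a hash, so re-running the pipeline over unchanged documents skips the expensive extractor and embedding calls. The idea in miniature (helper names here are hypothetical, not LlamaIndex APIs):

```python
import hashlib

cache: dict[str, str] = {}
calls = {"count": 0}

def expensive_transform(text: str) -> str:
    calls["count"] += 1            # stands in for an LLM extractor or embedding call
    return text.upper()

def cached_transform(text: str, transform_name: str) -> str:
    """Hash the input together with the transform identity; reuse prior results."""
    key = hashlib.sha256(f"{transform_name}:{text}".encode()).hexdigest()
    if key not in cache:
        cache[key] = expensive_transform(text)
    return cache[key]

cached_transform("policy text", "upper")
cached_transform("policy text", "upper")   # cache hit: no second expensive call
```

Because the transform identity is part of the key, changing a pipeline parameter (say, `chunk_size`) correctly invalidates the cached results.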
Retrieval-Augmented Fine-Tuning: Embedding Adaptation
```python
from llama_index.finetuning import (
    EmbeddingAdapterFinetuneEngine,
    generate_qa_embedding_pairs,
)

# Generate (question, context) training pairs from document nodes
train_dataset = generate_qa_embedding_pairs(
    nodes=train_nodes,  # train_nodes: a held-out subset of parsed nodes
    llm=OpenAI(model="gpt-4o-mini"),
)

# Fine-tune an adapter on top of the frozen base embedding model
finetune_engine = EmbeddingAdapterFinetuneEngine(
    train_dataset,
    OpenAIEmbedding(),
    model_output_path="embed_adapter",
    epochs=4,
    batch_size=8,
)
finetune_engine.finetune()

# Apply the trained adapter
from llama_index.embeddings.adapter import AdapterEmbeddingModel

adapted_embed = AdapterEmbeddingModel(
    OpenAIEmbedding(),
    adapter_path="embed_adapter",
)
```
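An embedding adapter is a small learned transform, typically a linear layer, applied on top of the frozen base model's query embeddings to nudge them toward the domain's document embeddings. Applying an already-trained linear adapter, in pure Python (the weight values below are illustrative, not trained):

```python
def apply_adapter(embedding: list[float], weights: list[list[float]]) -> list[float]:
    """Linear adapter: adapted[i] = sum_j weights[i][j] * embedding[j].
    The weight matrix is what adapter fine-tuning learns."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]

base_query_emb = [0.5, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]   # untrained adapter: leaves embeddings unchanged
adapted = apply_adapter(base_query_emb, identity)

trained = [[0.8, 0.2], [0.1, 1.1]]    # illustrative trained weights
adapted2 = apply_adapter(base_query_emb, trained)
```

Because only this small matrix is trained, the approach works even with closed API models like OpenAI's, whose weights cannot be touched.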
Practical Case Study: Corporate Knowledge Base for an Insurance Company
Initial Situation: 15,000 pages of documents (policies, insurance rules, regulatory instructions, internal procedures). Operators spent 8–12 minutes searching for answers to customer questions.
LlamaIndex Architecture:
- Sources: 4 document types in separate Qdrant indexes
- RouterQueryEngine: routing by question type
- SubQuestionQueryEngine: for questions spanning multiple types
- IngestionPipeline: automatic re-indexing when documents update
- Metadata filtering: by insurance type, document date, regional regulator
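Metadata filtering narrows retrieval to nodes whose metadata matches the request before similarity ranking runs. A sketch of the predicate applied in the insurance setup (field names and values are illustrative):

```python
nodes = [
    {"text": "casco rule 1", "meta": {"insurance_type": "auto", "year": 2024}},
    {"text": "property rule", "meta": {"insurance_type": "property", "year": 2025}},
    {"text": "casco rule 2", "meta": {"insurance_type": "auto", "year": 2023}},
]

def filter_nodes(nodes: list, **required) -> list:
    """Keep only nodes whose metadata matches every required key=value pair."""
    return [n for n in nodes
            if all(n["meta"].get(k) == v for k, v in required.items())]

auto_nodes = filter_nodes(nodes, insurance_type="auto")
# Similarity search then runs only over the filtered subset
```

Filtering before ranking is what eliminated references to outdated policy editions: superseded documents simply never enter the candidate set.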
Results:
- Average operator response time: 10 min → 1.5 min
- Answer accuracy (expert evaluation): 91%
- Incorrect references to outdated policy editions: ~8% → 0.4%
- Document coverage: 73% (operators previously unaware of many documents)
LlamaIndex vs LangChain for RAG
| Aspect | LlamaIndex | LangChain |
|---|---|---|
| Specialization | RAG, document QA | Universal LLM applications |
| Data loaders | 150+ native | Through community |
| Advanced retrieval | SubQuestion, Router built-in | Requires customization |
| Agent capabilities | Available (LlamaAgents) | More mature (LangGraph) |
| Ecosystem | LlamaHub | LangChain Hub |
Timeline
- Basic RAG on LlamaIndex: 3–5 days
- Multi-source RAG with RouterQueryEngine: 1–2 weeks
- IngestionPipeline with auto-updates: 1 week
- Embedding fine-tuning for domain: 2–3 weeks