# RAG System Quality Evaluation (RAGAS, Precision, Recall, Faithfulness)
Without systematic quality evaluation, a RAG system is a black box. RAGAS (RAG Assessment) is the most popular framework for automatic evaluation: it uses an LLM as a judge, so no manual answer annotation is required.
## RAGAS Metrics
| Metric | What it measures | Range |
|---|---|---|
| Context Precision | Fraction of retrieved context actually needed for the answer | 0–1 |
| Context Recall | Fraction of necessary context that was retrieved | 0–1 |
| Faithfulness | Answer correspondence to retrieved context (no hallucinations) | 0–1 |
| Answer Relevancy | How relevant the answer is to the question | 0–1 |
| Answer Correctness | Factual correctness of the answer (requires ground truth) | 0–1 |
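The intuition behind the two retrieval metrics can be shown with a toy computation. The sketch below is a simplified, deterministic illustration of Context Precision (rank-weighted precision over retrieved chunks) and Context Recall (fraction of ground-truth claims the context supports); the real RAGAS metrics derive relevance and claim-support judgments from an LLM judge rather than from boolean labels:

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision: rewards ranking relevant chunks early."""
    if not any(relevance):
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at each relevant position
    return score / hits

def context_recall(claims_supported: int, claims_total: int) -> float:
    """Fraction of ground-truth claims attributable to the retrieved context."""
    return claims_supported / claims_total if claims_total else 0.0

# Relevant chunks at ranks 1 and 3, an irrelevant one at rank 2:
print(round(context_precision([True, False, True]), 3))  # 0.833
# 4 of 5 ground-truth claims are supported by the context:
print(context_recall(4, 5))  # 0.8
```

Note the rank weighting: the same two relevant chunks score lower if the irrelevant chunk comes first, which is why reranking moves this metric.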
## RAGAS Installation and Basic Usage
```python
# pip install ragas datasets langchain-openai
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
    answer_correctness,  # requires ground truth
)
from datasets import Dataset

# Prepare the evaluation dataset
eval_data = {
    "question": [
        "What is the contract validity period?",
        "Who is responsible for delivery delays?",
    ],
    "answer": [
        "The contract is valid until December 31, 2025.",
        "The supplier is responsible for delays exceeding 5 business days.",
    ],
    "contexts": [
        ["2.1. This Agreement becomes effective from the date of signing and is valid until 12/31/2025..."],
        ["4.3. In case of delivery delay exceeding 5 business days, the Supplier..."],
    ],
    # Column name is "ground_truth" (one reference answer per sample) in ragas 0.1+
    "ground_truth": [
        "The contract is valid until December 31, 2025.",
        "The Supplier is responsible for delays exceeding 5 business days.",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Evaluation with an LLM judge and embeddings
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

results = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)
print(results)
# {'context_precision': 0.88, 'context_recall': 0.82, 'faithfulness': 0.94, 'answer_relevancy': 0.91}
```
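Aggregate scores tell you whether the system regressed; per-sample scores tell you where. RAGAS results can typically be converted to a per-row table with `results.to_pandas()` for this kind of inspection. A minimal plain-Python stand-in for that triage step (the sample data and the helper are illustrative, not part of the RAGAS API):

```python
# Illustrative per-sample scores, standing in for `results.to_pandas()` rows
samples = [
    {"question": "What is the contract validity period?", "faithfulness": 0.97},
    {"question": "Who is responsible for delivery delays?", "faithfulness": 0.55},
]

def flag_low_faithfulness(rows, threshold=0.85):
    """Return the questions whose faithfulness score falls below the threshold."""
    return [r["question"] for r in rows if r["faithfulness"] < threshold]

print(flag_low_faithfulness(samples))
# ['Who is responsible for delivery delays?']
```

Reviewing the flagged samples by hand is usually the fastest way to decide whether the failure is in retrieval or in generation.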
## Automated Test Set Generation
RAGAS can also generate an evaluation test set directly from your documents:
```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

# Generate questions of varying complexity
testset = generator.generate_with_langchain_docs(
    documents=your_documents,
    test_size=100,
    distributions={
        simple: 0.5,         # simple single-document questions
        reasoning: 0.3,      # questions requiring reasoning
        multi_context: 0.2,  # questions spanning multiple documents
    },
)
testset.to_pandas().to_csv("evaluation_testset.csv", index=False)
```
## Metric Interpretation
- **Context Precision < 0.7**: the system retrieves too much irrelevant context. Solutions: improve reranking, add metadata filtering, reduce `top_k`.
- **Context Recall < 0.7**: the system fails to find the needed documents. Solutions: improve chunking, try hybrid search, fine-tune the embedding model.
- **Faithfulness < 0.8**: the model is hallucinating, inventing information unsupported by the context. Solutions: improve the system prompt, add an explicit "answer only based on the context" instruction, use a lower temperature.
- **Answer Relevancy < 0.8**: answers drift off-topic. Solutions: improve the prompt, add examples of the desired format.
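These thresholds and remedies can be encoded as a small triage helper that runs after each evaluation. The thresholds mirror the list above; the helper itself is a hypothetical sketch, not part of RAGAS:

```python
THRESHOLDS = {
    "context_precision": 0.7,
    "context_recall": 0.7,
    "faithfulness": 0.8,
    "answer_relevancy": 0.8,
}

REMEDIES = {
    "context_precision": "improve reranking, add metadata filtering, reduce top_k",
    "context_recall": "improve chunking, try hybrid search, fine-tune embeddings",
    "faithfulness": "tighten the system prompt, lower the temperature",
    "answer_relevancy": "improve the prompt, add format examples",
}

def triage(scores: dict[str, float]) -> dict[str, str]:
    """Map each failing metric to its suggested remediation."""
    return {m: REMEDIES[m] for m, t in THRESHOLDS.items() if scores.get(m, 1.0) < t}

flagged = triage({"context_precision": 0.61, "faithfulness": 0.74, "context_recall": 0.90})
print(sorted(flagged))  # ['context_precision', 'faithfulness']
```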
## Practical Case: Iterating on RAGAS Metrics
Initial state (basic RAG, GPT-4o-mini, ChromaDB):
| Metric | v1 |
|---|---|
| Context Precision | 0.61 |
| Context Recall | 0.68 |
| Faithfulness | 0.74 |
| Answer Relevancy | 0.79 |
**Iteration 1:** added hybrid search (sparse + dense).
- Context Recall: 0.68 → 0.81 (+19% relative)

**Iteration 2:** added Contextual Compression plus a reranker.
- Context Precision: 0.61 → 0.84 (+38% relative)
- Faithfulness: 0.74 → 0.91 (+23% relative)

**Iteration 3:** refined the system prompt with explicit hallucination-prevention instructions.
- Faithfulness: 0.91 → 0.95
- Answer Relevancy: 0.79 → 0.88
Final state (v4):
| Metric | v4 |
|---|---|
| Context Precision | 0.84 |
| Context Recall | 0.81 |
| Faithfulness | 0.95 |
| Answer Relevancy | 0.88 |
## Continuous Evaluation in CI/CD
```python
import pytest

@pytest.fixture(scope="session")
def rag_evaluation_results():
    """Run the RAGAS evaluation once per test session.

    Assumes `evaluation_dataset` is prepared as shown above.
    """
    return evaluate(evaluation_dataset, metrics=[faithfulness, context_recall])

def test_faithfulness_above_threshold(rag_evaluation_results):
    assert rag_evaluation_results["faithfulness"] >= 0.85, \
        f"Faithfulness {rag_evaluation_results['faithfulness']:.2f} below threshold 0.85"

def test_context_recall_above_threshold(rag_evaluation_results):
    assert rag_evaluation_results["context_recall"] >= 0.75
```
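Absolute thresholds catch outright failures; a complementary guard is a regression gate that compares current scores against the last accepted baseline. A minimal sketch of such a check (the baseline numbers are illustrative, taken from the v4 table above; the helper is hypothetical):

```python
# Last accepted scores (illustrative, from the v4 results above)
BASELINE = {"faithfulness": 0.95, "context_recall": 0.81}

def regressions(current: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Return the metrics that dropped more than `tolerance` below the baseline."""
    return [m for m, b in baseline.items() if current.get(m, 0.0) < b - tolerance]

print(regressions({"faithfulness": 0.90, "context_recall": 0.81}, BASELINE))
# ['faithfulness']
```

The tolerance absorbs run-to-run noise from the LLM judge, which matters because RAGAS scores are not perfectly deterministic.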
## Timeline
- RAGAS pipeline setup: 2–3 days
- Test set generation: 1–2 days
- Baseline evaluation: 1 day
- Improvement iterations: 2–4 weeks
- Total: 3–6 weeks