RAG System Quality Evaluation (RAGAS, Precision, Recall, Faithfulness)


Without systematic quality evaluation, a RAG system is a black box. RAGAS (Retrieval-Augmented Generation Assessment) is the most popular framework for automatic evaluation: it uses an LLM as a judge instead of manual answer annotation, although some metrics (Context Recall, Answer Correctness) still require ground-truth answers.

RAGAS Metrics

Metric              What it measures                                                 Range
Context Precision   Fraction of retrieved context actually needed for the answer     0–1
Context Recall      Fraction of the necessary context that was retrieved             0–1
Faithfulness        Degree to which the answer is grounded in the retrieved context  0–1
Answer Relevancy    How relevant the answer is to the question                       0–1
Answer Correctness  Factual correctness of the answer (requires ground truth)        0–1
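
As a rough intuition for the first two metrics (RAGAS itself computes them with LLM-judged statements, not simple set overlap), precision and recall over retrieved chunks can be sketched like this; the chunk IDs are made up for illustration:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for chunk in retrieved if chunk in relevant_set)
    return hits / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of the relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    retrieved_set = set(retrieved)
    found = sum(1 for chunk in relevant if chunk in retrieved_set)
    return found / len(relevant)

retrieved = ["clause_2_1", "clause_4_3", "clause_9_9"]  # what the retriever returned
relevant  = ["clause_2_1", "clause_4_3"]                # what the answer actually needs

print(context_precision(retrieved, relevant))  # 2 of 3 retrieved chunks are relevant
print(context_recall(retrieved, relevant))     # both needed chunks were found
```

High recall with low precision means the retriever finds the right chunks but buries them in noise; the reverse means it is precise but misses evidence.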

RAGAS Installation and Basic Usage

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
    answer_correctness,
)
from datasets import Dataset

# Prepare dataset for evaluation
eval_data = {
    "question": [
        "What is the contract validity period?",
        "Who is responsible for delivery delays?",
    ],
    "answer": [
        "The contract is valid until December 31, 2025.",
        "The supplier is responsible for delays exceeding 5 business days.",
    ],
    "contexts": [
        ["2.1. This Agreement becomes effective from the date of signing and is valid until 12/31/2025..."],
        ["4.3. In case of delivery delay exceeding 5 business days, the Supplier..."],
    ],
    # Recent RAGAS versions expect a singular "ground_truth" string column
    # (needed by context_recall and answer_correctness)
    "ground_truth": [
        "The contract is valid until December 31, 2025.",
        "The Supplier is responsible for delays exceeding 5 business days.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Evaluation
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

results = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)

print(results)
# {'context_precision': 0.88, 'context_recall': 0.82, 'faithfulness': 0.94, 'answer_relevancy': 0.91}
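
Aggregate scores hide per-question variance. `results.to_pandas()` returns one row per sample; the sketch below uses a plain list of dicts in place of that dataframe to show the idea of flagging weak samples (the threshold value is illustrative):

```python
# Stand-in for results.to_pandas().to_dict("records")
per_sample = [
    {"question": "What is the contract validity period?",   "faithfulness": 0.98},
    {"question": "Who is responsible for delivery delays?", "faithfulness": 0.55},
]

THRESHOLD = 0.85  # illustrative cut-off, tune per project
weak = [row["question"] for row in per_sample if row["faithfulness"] < THRESHOLD]

for question in weak:
    print("Review retrieval/prompt for:", question)
```

Inspecting the worst questions individually is usually more actionable than staring at the averages.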

Automated Test Set: Testset Generation

RAGAS can generate test sets from your documents:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

# Generate tests of varying complexity
testset = generator.generate_with_langchain_docs(
    documents=your_documents,  # list of LangChain Document objects
    test_size=100,
    distributions={
        simple: 0.5,          # Simple single-document questions
        reasoning: 0.3,       # Reasoning-based questions
        multi_context: 0.2,   # Multi-document questions
    }
)

testset.to_pandas().to_csv("evaluation_testset.csv", index=False)
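
The distribution weights must sum to 1.0, and test_size is split across question types accordingly; a quick sanity check in plain Python (independent of RAGAS):

```python
# Mirror of the distributions passed to generate_with_langchain_docs
distributions = {"simple": 0.5, "reasoning": 0.3, "multi_context": 0.2}

# Weights that don't sum to 1 will skew (or break) generation
assert abs(sum(distributions.values()) - 1.0) < 1e-9, "weights must sum to 1"

test_size = 100
counts = {kind: round(weight * test_size) for kind, weight in distributions.items()}
print(counts)  # expected split of the 100 generated questions
```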

Metric Interpretation

Context Precision < 0.7: System retrieves too much irrelevant context. Solutions: improve reranking, add metadata filtering, reduce top_k.

Context Recall < 0.7: System fails to find needed documents. Solutions: improve chunking, try hybrid search, fine-tune embedding model.

Faithfulness < 0.8: The model is hallucinating, i.e. inventing information unsupported by the context. Solutions: improve the system prompt, add an explicit "answer only based on the context" instruction, use a lower temperature.

Answer Relevancy < 0.8: Answers are off-topic. Solutions: improve prompt, add desired format examples.
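
These rules of thumb can be collected into a small triage helper; the thresholds and hints mirror the guidance above, but the function itself is illustrative, not part of RAGAS:

```python
THRESHOLDS = {
    "context_precision": (0.7, "improve reranking, add metadata filtering, reduce top_k"),
    "context_recall":    (0.7, "improve chunking, try hybrid search, fine-tune embeddings"),
    "faithfulness":      (0.8, "tighten system prompt, lower temperature"),
    "answer_relevancy":  (0.8, "improve prompt, add desired-format examples"),
}

def diagnose(scores):
    """Return remediation hints for every metric below its threshold."""
    return {
        metric: hint
        for metric, (threshold, hint) in THRESHOLDS.items()
        if scores.get(metric, 1.0) < threshold
    }

# The v1 baseline from the case below trips every rule
baseline = {"context_precision": 0.61, "context_recall": 0.68,
            "faithfulness": 0.74, "answer_relevancy": 0.79}
for metric, hint in diagnose(baseline).items():
    print(f"{metric}: {hint}")
```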

Practical Case: Iterations by RAGAS Metrics

Initial state (basic RAG, GPT-4o-mini, ChromaDB):

Metric v1
Context Precision 0.61
Context Recall 0.68
Faithfulness 0.74
Answer Relevancy 0.79

Iteration 1: Added hybrid search (sparse + dense).

  • Context Recall: 0.68 → 0.81 (+19%)

Iteration 2: Added Contextual Compression + reranker.

  • Context Precision: 0.61 → 0.84 (+38%)
  • Faithfulness: 0.74 → 0.91 (+23%)

Iteration 3: Refined system prompt with explicit hallucination prevention.

  • Faithfulness: 0.91 → 0.95
  • Answer Relevancy: 0.79 → 0.88

Final state (v4):

Metric v4
Context Precision 0.84
Context Recall 0.81
Faithfulness 0.95
Answer Relevancy 0.88
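
The percentage gains quoted in the iterations are relative improvements over the previous value; a quick arithmetic check:

```python
def relative_gain(before, after):
    """Relative improvement in percent, rounded to the nearest integer."""
    return round((after - before) / before * 100)

print(relative_gain(0.68, 0.81))  # Context Recall, iteration 1
print(relative_gain(0.61, 0.84))  # Context Precision, iteration 2
print(relative_gain(0.74, 0.91))  # Faithfulness, iteration 2
```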

Continuous Evaluation in CI/CD

import pytest

from ragas import evaluate
from ragas.metrics import faithfulness, context_recall

@pytest.fixture(scope="session")
def rag_evaluation_results():
    """Run RAGAS evaluation on the prepared test set"""
    # evaluation_dataset: the Dataset built earlier (question/answer/contexts/ground_truth)
    return evaluate(evaluation_dataset, metrics=[faithfulness, context_recall])

def test_faithfulness_above_threshold(rag_evaluation_results):
    assert rag_evaluation_results["faithfulness"] >= 0.85, \
        f"Faithfulness {rag_evaluation_results['faithfulness']:.2f} below threshold 0.85"

def test_context_recall_above_threshold(rag_evaluation_results):
    assert rag_evaluation_results["context_recall"] >= 0.75

Timeline

  • RAGAS pipeline setup: 2–3 days
  • Test set generation: 1–2 days
  • Baseline evaluation: 1 day
  • Improvement iterations: 2–4 weeks
  • Total: 3–6 weeks