Braintrust Integration for LLM Quality Evaluation

Complexity: simple. Estimated time: from 4 hours to 2 days.

Braintrust is a platform for evaluation and CI/CD testing of LLM applications. It allows you to create sets of test cases, run them automatically when prompts or models change, and monitor for regressions.
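
A set of test cases can be as simple as a list of question/answer pairs kept in a file next to the code. The sketch below is one possible way to load such pairs and shape them into records with `input` and `expected` keys. The JSONL layout, the `question` and `answer` field names, and both helper names are assumptions made for illustration; they are not part of Braintrust.

```python
import json
from pathlib import Path


def load_test_cases(path):
    """Read one JSON object per line and return (question, answer) pairs.

    The "question" and "answer" field names are hypothetical.
    """
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append((record["question"], record["answer"]))
    return pairs


def to_eval_records(pairs):
    """Convert (question, answer) pairs into dicts with "input" and "expected" keys."""
    return [{"input": q, "expected": a} for q, a in pairs]
```

Keeping the cases in a versioned file means a change to the test set shows up in code review alongside the prompt change it accompanies.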

Installation and first experiment

pip install braintrust openai

import braintrust
from braintrust import Eval

braintrust.login(api_key="...")

# Define the scoring functions
def accuracy_scorer(output: str, expected: str) -> float:
    """Simple scorer based on exact match (case- and whitespace-insensitive)."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def llm_judge_scorer(input: str, output: str) -> float:
    """LLM-as-judge scorer for subjective tasks."""
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Rate this response quality from 0 to 1.
Query: {input}
Response: {output}
Return only a decimal number."""
        }]
    )
    return float(response.choices[0].message.content.strip())

# Run the experiment (test_dataset and call_customer_support_bot are assumed to be defined elsewhere)
Eval(
    "customer-support-bot",  # Project name in Braintrust
    data=lambda: [
        {"input": q, "expected": a}
        for q, a in test_dataset
    ],
    task=lambda input: call_customer_support_bot(input),
    scores=[accuracy_scorer, llm_judge_scorer],
    experiment_name="prompt-v3-gpt4o"
)
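
The `float(...)` call at the end of `llm_judge_scorer` raises `ValueError` whenever the judge model replies with anything other than a bare number, for example "Score: 0.8". A minimal sketch of a more forgiving parser follows. The helper name `parse_judge_score`, the fallback value, and the clamping to the range 0 to 1 are assumptions, not Braintrust or OpenAI behavior.

```python
import re


def parse_judge_score(text, default=0.0):
    """Extract the first number from a judge reply and clamp it to [0, 1].

    Returns `default` when the reply contains no number at all.
    """
    match = re.search(r"-?(?:\d+(?:\.\d*)?|\.\d+)", text)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))
```

Inside the scorer, the last line would then become `return parse_judge_score(response.choices[0].message.content)`. Whether an unparseable reply should count as 0.0 or be surfaced as an error is a design choice that depends on how noisy the judge is.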

Integration into CI/CD

# GitHub Actions
- name: Run LLM Evaluation
  run: |
    pip install braintrust openai
    python eval/run_evals.py
  env:
    BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

# Automatic comparison against the baseline
- name: Check for regressions
  run: |
    braintrust eval --project customer-support \
      --compare-to baseline \
      --fail-on-regression 0.05  # Fail if the score dropped by 5% or more

Braintrust automatically compares the results of the current experiment with the previous one, highlighting regressions (examples where the new prompt is worse) and improvements. This makes it especially valuable for teams with fast prompt development cycles.
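
The idea behind a regression gate can be illustrated in plain Python, independent of the platform. The sketch below compares the mean score per scorer between two runs and reports any scorer whose mean fell by at least a threshold. The function name, the input shape (scorer name mapped to a list of per-example scores), and the use of an absolute drop in the mean are assumptions chosen for illustration; this is not how Braintrust computes its comparison.

```python
from statistics import mean


def find_regressions(baseline, current, threshold=0.05):
    """Return {scorer_name: drop} for scorers whose mean score fell by at least `threshold`.

    `baseline` and `current` map scorer names to lists of per-example scores.
    Scorers missing from `current`, or with no scores on either side, are skipped.
    """
    regressions = {}
    for name, base_scores in baseline.items():
        if name not in current or not base_scores or not current[name]:
            continue
        drop = mean(base_scores) - mean(current[name])
        if drop >= threshold:
            regressions[name] = drop
    return regressions
```

In a CI job, a script could call `find_regressions` on scores it has collected and exit with a non-zero status when the result is non-empty, which fails the pipeline step.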