Braintrust Integration for LLM Quality Evaluation

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Services we offer

Showing 1 of 1All 1566 services

Braintrust Integration for LLM Quality Evaluation

Simple

from 4 hours to 2 days

Frequently Asked Questions

AI Development Areas

Discuss your AI project

Free consultation — we'll show you how AI can solve your challenge

Get a quote

We'll estimate the budget and timeline for your AI project

AI Solution Development Stages

Latest works

Development of a web application for FEEDME
1196
Development of an online store for the company FURNORO
1119
B2B Advance company logo design
586
Development of a web application for Enviok
853
AIDER company logo development
783
CRM development for Chasseurs
900

Show more works

Braintrust Integration for LLM Quality Assessment

Braintrust is a platform for evaluation and CI/CD testing of LLM applications. It allows you to create sets of test cases, run them automatically when prompts or models change, and monitor for regressions.

Installation and first experiment

pip install braintrust

import braintrust
from braintrust import Eval

braintrust.login(api_key="...")

# Определение оценочной функции
def accuracy_scorer(output: str, expected: str) -> float:
    """Простой scorer на основе точного совпадения"""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def llm_judge_scorer(input: str, output: str) -> float:
    """LLM-as-judge для субъективных задач"""
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Rate this response quality from 0 to 1.
Query: {input}
Response: {output}
Return only a decimal number."""
        }]
    )
    return float(response.choices[0].message.content.strip())

# Запуск эксперимента
Eval(
    "customer-support-bot",  # Имя проекта в Braintrust
    data=lambda: [
        {"input": q, "expected": a}
        for q, a in test_dataset
    ],
    task=lambda input: call_customer_support_bot(input),
    scores=[accuracy_scorer, llm_judge_scorer],
    experiment_name="prompt-v3-gpt4o"
)

Integration into CI/CD

# GitHub Actions
- name: Run LLM Evaluation
  run: |
    pip install braintrust
    python eval/run_evals.py
  env:
    BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

# Автоматическое сравнение с baseline
- name: Check for regressions
  run: |
    braintrust eval --project customer-support \
      --compare-to baseline \
      --fail-on-regression 0.05  # Fail если score упал на 5%+

Braintrust automatically compares the results of the current experiment with the previous one, highlighting regressions (examples where the new prompt is worse) and improvements. This makes it especially valuable for teams with fast prompt development cycles.