
Evals

Evaluate agent performance with built-in and custom scorers

Evals Overview

Evals provide systematic evaluation of agent performance.

Why Evals?

  • Measure agent quality
  • Catch regressions
  • Compare versions
  • Optimize prompts

Basic Usage

import { runEvals } from '@mastra/evals';

const results = await runEvals({
  agent: myAgent,
  dataset: testDataset,
  scorers: [relevance, coherence, accuracy],
});

Results

interface EvalResult {
  scorer: string;
  score: number;
  passed: boolean;
  details: Record<string, unknown>;
}

// results = [
//   { scorer: 'relevance', score: 0.95, passed: true, details: {...} },
//   { scorer: 'coherence', score: 0.88, passed: true, details: {...} },
// ]
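Since each run yields an array of these results, it is easy to aggregate them into a summary for dashboards or CI logs. A minimal sketch in plain TypeScript (the `summarize` helper is illustrative, not part of `@mastra/evals`):

```typescript
interface EvalResult {
  scorer: string;
  score: number;
  passed: boolean;
  details: Record<string, unknown>;
}

// Aggregate a batch of results into a pass rate and mean score.
function summarize(results: EvalResult[]) {
  const total = results.length;
  const passedCount = results.filter((r) => r.passed).length;
  const meanScore =
    total === 0 ? 0 : results.reduce((sum, r) => sum + r.score, 0) / total;
  return {
    total,
    passedCount,
    passRate: total === 0 ? 0 : passedCount / total,
    meanScore,
  };
}
```

The `total === 0` guards keep an empty result set from producing `NaN`.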

Integration

// In your CI pipeline
const results = await runEvals(config);

if (results.some(r => !r.passed)) {
  process.exit(1);
}

Built-in Scorers

Pre-built evaluation metrics.

Available Scorers

Answer Relevancy

Measures how relevant the answer is to the question.

import { answerRelevancy } from '@mastra/evals';

const score = await answerRelevancy({
  question: 'What is AI?',
  answer: 'AI stands for Artificial Intelligence...',
});

Faithfulness

Measures factual accuracy against provided context.

import { faithfulness } from '@mastra/evals';

const score = await faithfulness({
  context: 'The sky is blue.',
  answer: 'The sky appears blue due to Rayleigh scattering.',
});

Context Relevance

Measures how relevant the retrieved context is to the question.

import { contextRelevance } from '@mastra/evals';

const score = await contextRelevance({
  question: 'What is photosynthesis?',
  context: 'Photosynthesis is the process by which plants...',
});

Hallucination

Detects potential hallucinations in responses.

import { hallucination } from '@mastra/evals';

const score = await hallucination({
  context: 'The earth orbits the sun.',
  answer: 'The earth orbits the sun once per year.',
});

Toxicity

Measures toxicity levels in responses.

import { toxicity } from '@mastra/evals';

const score = await toxicity({
  text: 'This is a normal response.',
});

Combining Scorers

const results = await runEvals({
  agent,
  dataset,
  scorers: [
    answerRelevancy,
    faithfulness,
    contextRelevance,
    toxicity,
  ],
});

Custom Scorers

Build your own evaluation metrics.

Creating a Scorer

import { createScorer } from '@mastra/evals';

const myScorer = createScorer({
  name: 'my-custom-scorer',
  description: 'Measures custom quality metric',
  evaluate: async ({ input, output, expected }) => {
    // Custom evaluation logic; calculateScore is a placeholder for your own metric
    const score = calculateScore(output);
    return {
      score,
      passed: score >= 0.8,
      details: { reason: 'Custom evaluation' },
    };
  },
});

Using Custom Scorers

const results = await runEvals({
  agent: myAgent,
  dataset: testDataset,
  scorers: [myScorer, answerRelevancy],
});

Async Scorers

const apiScorer = createScorer({
  name: 'api-check',
  evaluate: async ({ output }) => {
    // callExternalAPI is a placeholder for any remote scoring service
    const result = await callExternalAPI(output);
    return {
      score: result.score,
      passed: result.score > 0.5,
    };
  },
});

Best Practices

  • Return consistent scores (0-1)
  • Provide meaningful details
  • Handle edge cases
  • Make scorers deterministic
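The practices above can be combined in one small scorer. The sketch below uses keyword overlap as a stand-in metric (purely illustrative, not a Mastra API): it is deterministic, always returns a score in [0, 1], handles the empty-input edge case, and reports meaningful details.

```typescript
// Deterministic example scorer: keyword overlap between the expected
// answer and the actual output, normalized to the 0-1 range.
function keywordOverlapScore(output: string, expected: string) {
  const tokenize = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const expectedTokens = tokenize(expected);
  // Edge case: nothing to match against — fail cleanly instead of returning NaN.
  if (expectedTokens.size === 0) {
    return { score: 0, passed: false, details: { reason: 'empty expected text' } };
  }
  const outputTokens = tokenize(output);
  let hits = 0;
  for (const t of expectedTokens) if (outputTokens.has(t)) hits++;
  const score = hits / expectedTokens.size; // always in [0, 1]
  return {
    score,
    passed: score >= 0.8,
    details: { hits, needed: expectedTokens.size },
  };
}
```

The same return shape (`score`, `passed`, `details`) is what the `createScorer` examples above expect from an `evaluate` function.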

Running in CI

Integrate evals into your CI/CD pipeline.

GitHub Actions

name: Evals
on: [push]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install
      - run: npm run test:evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

CI Script

// scripts/run-evals.ts
// Assumes your project exports `mastra`, `datasets`, and `scorers`.
import { runEvals } from '@mastra/evals';

async function main() {
  const results = await runEvals({
    agent: mastra.getAgent('myAgent'),
    dataset: datasets.qualityTests,
    scorers: [
      scorers.answerRelevancy,
      scorers.faithfulness,
      scorers.toxicity,
    ],
  });

  // Fail the CI job if any scorer fails
  const failed = results.filter(r => !r.passed);
  if (failed.length > 0) {
    console.error('Eval failures:', failed);
    process.exit(1);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

Threshold Configuration

const results = await runEvals({
  agent,
  dataset,
  scorers: [answerRelevancy],
  thresholds: {
    answerRelevancy: 0.85, // Minimum score
  },
});

Reporting

// Generate report
const report = await generateEvalReport(results);
console.log(report.summary);
console.log(report.details);

Eval Datasets

Test datasets for evaluation.

Overview

Datasets contain test cases with inputs and expected outputs.

Creating a Dataset

import { Dataset } from '@mastra/evals';

const testDataset = new Dataset({
  name: 'customer-support-qa',
  items: [
    {
      id: 'q1',
      input: { question: 'How do I reset my password?' },
      expected: 'Instructions for password reset',
    },
    {
      id: 'q2',
      input: { question: 'What is your refund policy?' },
      expected: 'Refund policy information',
    },
  ],
});

Loading from JSON

const dataset = Dataset.fromJSON('./test-data/qa-tests.json');

Dataset Format

{
  "name": "test-dataset",
  "items": [
    {
      "id": "test-1",
      "input": { "text": "Hello" },
      "expected": "Hi there!"
    }
  ]
}
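If you need to validate this format yourself (for instance before committing a dataset file), a hedged sketch of a parser is below. The types and the `parseDataset` helper are illustrative, not the Mastra API — `Dataset.fromJSON` handles this for you.

```typescript
interface DatasetItem {
  id: string;
  input: Record<string, unknown>;
  expected: string;
}

interface DatasetFile {
  name: string;
  items: DatasetItem[];
}

// Parse raw JSON text into the dataset shape, with basic validation.
function parseDataset(json: string): DatasetFile {
  const data = JSON.parse(json) as DatasetFile;
  if (typeof data.name !== 'string' || !Array.isArray(data.items)) {
    throw new Error('Invalid dataset: expected { name, items[] }');
  }
  for (const item of data.items) {
    if (typeof item.id !== 'string') {
      throw new Error('Invalid dataset item: missing string id');
    }
  }
  return data;
}
```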

Using Datasets

const results = await runEvals({
  agent: myAgent,
  dataset: testDataset,
  scorers: [myScorer],
});

Running Experiments

Compare agent versions with experiments.

Creating Experiments

import { Experiment } from '@mastra/evals';

const experiment = new Experiment({
  name: 'prompt-v1-vs-v2',
  agent: myAgent,
  dataset: testDataset,
  variants: [
    { name: 'v1', config: { prompt: 'v1 prompt' } },
    { name: 'v2', config: { prompt: 'v2 prompt' } },
  ],
});

Running an Experiment

const results = await experiment.run();

console.log(results.v1.score); // 0.85
console.log(results.v2.score); // 0.92

Statistical Significance

import { analyzeSignificance } from '@mastra/evals';

const analysis = analyzeSignificance(results.v1, results.v2);
console.log(analysis.significant); // true
console.log(analysis.pValue); // 0.023
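The internals of `analyzeSignificance` are not shown here, but one common way to compute such a p-value is a permutation test on per-item scores. A self-contained sketch of that idea (seeded for determinism; this is an assumption about the approach, not the Mastra implementation):

```typescript
// Two-sided permutation test for the difference in mean scores
// between two variants. A seeded Park-Miller PRNG keeps runs deterministic.
function permutationPValue(a: number[], b: number[], iterations = 2000): number {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const observed = Math.abs(mean(a) - mean(b));
  const pooled = [...a, ...b];
  let seed = 42;
  const rand = () => {
    seed = (seed * 48271) % 2147483647; // MINSTD generator
    return seed / 2147483647;
  };
  let extreme = 0;
  for (let i = 0; i < iterations; i++) {
    // Fisher-Yates shuffle of the pooled scores, then re-split into groups.
    const shuffled = [...pooled];
    for (let j = shuffled.length - 1; j > 0; j--) {
      const k = Math.floor(rand() * (j + 1));
      [shuffled[j], shuffled[k]] = [shuffled[k], shuffled[j]];
    }
    const permA = shuffled.slice(0, a.length);
    const permB = shuffled.slice(a.length);
    // Count permutations at least as extreme as the observed difference.
    if (Math.abs(mean(permA) - mean(permB)) >= observed) extreme++;
  }
  return extreme / iterations;
}
```

A small p-value means the score gap between variants is unlikely to arise from random shuffling alone.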

CI Integration

// Compare against baseline
const experiment = new Experiment({
  name: 'regression-check',
  agent: currentAgent,
  dataset,
  baseline: previousAgent,
});

const results = await experiment.run();
if (results.regressionDetected) {
  console.error('Performance regression detected!');
  process.exit(1);
}