Evals
Evaluate agent performance with built-in and custom scorers
Topics
- Overview - Introduction to evals
- Built-in Scorers - Pre-built evaluation metrics
- Custom Scorers - Build your own evaluators
- Running in CI - Integrate evals in CI/CD
- Datasets - Test datasets for evaluation
Evals Overview
Evals provide systematic evaluation of agent performance.
Why Evals?
- Measure agent quality
- Catch regressions
- Compare versions
- Optimize prompts
Basic Usage
import { runEvals } from '@mastra/evals';
const results = await runEvals({
  agent: myAgent,
  dataset: testDataset,
  scorers: [relevance, coherence, accuracy],
});
Results
interface EvalResult {
  scorer: string;
  score: number;
  passed: boolean;
  details: Record<string, unknown>;
}

// results = [
//   { scorer: 'relevance', score: 0.95, passed: true, details: {...} },
//   { scorer: 'coherence', score: 0.88, passed: true, details: {...} },
// ]
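Given the `EvalResult` shape above, a run can be rolled up into aggregate numbers for logging or dashboards. The `summarize` helper below is an illustrative sketch, not a `@mastra/evals` API; it simply mirrors the interface shown above:

```typescript
// Mirrors the EvalResult shape documented above.
interface EvalResult {
  scorer: string;
  score: number;
  passed: boolean;
  details: Record<string, unknown>;
}

// Illustrative helper (not part of @mastra/evals): aggregate a run.
function summarize(results: EvalResult[]) {
  const passRate = results.filter(r => r.passed).length / results.length;
  const meanScore = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  return { passRate, meanScore };
}

const { passRate, meanScore } = summarize([
  { scorer: 'relevance', score: 0.95, passed: true, details: {} },
  { scorer: 'coherence', score: 0.88, passed: true, details: {} },
]);
// passRate is 1; meanScore is roughly 0.915
```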
Integration
// In your CI pipeline
const results = await runEvals(config);
if (results.some(r => !r.passed)) {
  process.exit(1);
}
Built-in Scorers
Pre-built evaluation metrics.
Available Scorers
Answer Relevancy
Measures how relevant the answer is to the question.
import { answerRelevancy } from '@mastra/evals';
const score = await answerRelevancy({
  question: 'What is AI?',
  answer: 'AI stands for Artificial Intelligence...',
});
Faithfulness
Measures factual accuracy against provided context.
import { faithfulness } from '@mastra/evals';
const score = await faithfulness({
  context: 'The sky is blue.',
  answer: 'The sky appears blue due to Rayleigh scattering.',
});
Context Relevance
Measures how relevant the retrieved context is to the question.
import { contextRelevance } from '@mastra/evals';
const score = await contextRelevance({
  question: 'What is photosynthesis?',
  context: 'Photosynthesis is the process by which plants...',
});
Hallucination
Detects potential hallucinations in responses.
import { hallucination } from '@mastra/evals';
const score = await hallucination({
  context: 'The earth orbits the sun.',
  answer: 'The earth orbits the sun once per year.',
});
Toxicity
Measures toxicity levels in responses.
import { toxicity } from '@mastra/evals';
const score = await toxicity({
  text: 'This is a normal response.',
});
Combining Scorers
const results = await runEvals({
  agent,
  dataset,
  scorers: [
    answerRelevancy,
    faithfulness,
    contextRelevance,
    toxicity,
  ],
});
Custom Scorers
Build your own evaluation metrics.
Creating a Scorer
import { createScorer } from '@mastra/evals';

const myScorer = createScorer({
  name: 'my-custom-scorer',
  description: 'Measures custom quality metric',
  evaluate: async ({ input, output, expected }) => {
    // Custom evaluation logic
    const score = calculateScore(output);
    return {
      score,
      passed: score >= 0.8,
      details: { reason: 'Custom evaluation' },
    };
  },
});
Using Custom Scorers
const results = await runEvals({
  agent: myAgent,
  dataset: testDataset,
  scorers: [myScorer, answerRelevancy],
});
Async Scorers
const apiScorer = createScorer({
  name: 'api-check',
  evaluate: async ({ output }) => {
    const result = await callExternalAPI(output);
    return {
      score: result.score,
      passed: result.score > 0.5,
    };
  },
});
Best Practices
- Return consistent scores (0-1)
- Provide meaningful details
- Handle edge cases
- Make scorers deterministic
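As an illustration of these practices, an `evaluate` body might clamp its score to the 0-1 range, handle empty output explicitly, and use a deterministic heuristic. This is a standalone sketch — `lengthScore` is a hypothetical function, not a library API; the `evaluate` callback shown earlier would wrap logic like this:

```typescript
// Shape returned by an evaluate callback (mirrors the examples above).
interface ScoreResult {
  score: number;
  passed: boolean;
  details: Record<string, unknown>;
}

// Hypothetical scorer logic demonstrating the best practices above:
// clamped 0-1 scores, explicit edge-case handling, deterministic output.
function lengthScore(output: string, threshold = 0.8): ScoreResult {
  // Edge case: score empty output as 0 with an explanatory detail
  if (output.trim().length === 0) {
    return { score: 0, passed: false, details: { reason: 'empty output' } };
  }
  // Deterministic heuristic: prefer answers near a 200-character target
  const raw = 1 - Math.abs(output.length - 200) / 200;
  const score = Math.min(1, Math.max(0, raw)); // clamp to 0-1
  return { score, passed: score >= threshold, details: { length: output.length } };
}
```

Because the function has no randomness or external calls, the same output always produces the same score, which keeps eval runs comparable across commits.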
Running in CI
Integrate evals into your CI/CD pipeline.
GitHub Actions
name: Evals
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install
      - run: npm run test:evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
CI Script
// scripts/run-evals.ts
import { runEvals } from '@mastra/evals';
// mastra, datasets, and scorers are assumed to be imported from your project

async function main() {
  const results = await runEvals({
    agent: mastra.getAgent('myAgent'),
    dataset: datasets.qualityTests,
    scorers: [
      scorers.answerRelevancy,
      scorers.faithfulness,
      scorers.toxicity,
    ],
  });

  // Fail the CI job if any scorer fails
  const failed = results.filter(r => !r.passed);
  if (failed.length > 0) {
    console.error('Eval failures:', failed);
    process.exit(1);
  }
}

main();
Threshold Configuration
const results = await runEvals({
  agent,
  dataset,
  scorers: [answerRelevancy],
  thresholds: {
    answerRelevancy: 0.85, // Minimum score
  },
});
Reporting
// Generate report
const report = await generateEvalReport(results);
console.log(report.summary);
console.log(report.details);
Eval Datasets
Test datasets for evaluation.
Overview
Datasets contain test cases with inputs and expected outputs.
Creating a Dataset
import { Dataset } from '@mastra/evals';
const testDataset = new Dataset({
  name: 'customer-support-qa',
  items: [
    {
      id: 'q1',
      input: { question: 'How do I reset my password?' },
      expected: 'Instructions for password reset',
    },
    {
      id: 'q2',
      input: { question: 'What is your refund policy?' },
      expected: 'Refund policy information',
    },
  ],
});
Loading from JSON
const dataset = Dataset.fromJSON('./test-data/qa-tests.json');
Dataset Format
{
  "name": "test-dataset",
  "items": [
    {
      "id": "test-1",
      "input": { "text": "Hello" },
      "expected": "Hi there!"
    }
  ]
}
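Because dataset files are plain JSON, it can be useful to validate them before a run so malformed items fail fast rather than mid-eval. The `isDatasetFile` type guard below is an illustrative sketch based on the format above, not a `@mastra/evals` API:

```typescript
// Types mirroring the dataset format documented above.
interface DatasetItem {
  id: string;
  input: Record<string, unknown>;
  expected: string;
}

interface DatasetFile {
  name: string;
  items: DatasetItem[];
}

// Illustrative helper (not part of @mastra/evals): check parsed JSON
// against the dataset format before handing it to an eval run.
function isDatasetFile(data: unknown): data is DatasetFile {
  if (typeof data !== 'object' || data === null) return false;
  const d = data as Record<string, unknown>;
  return (
    typeof d.name === 'string' &&
    Array.isArray(d.items) &&
    d.items.every(
      (item: any) =>
        typeof item?.id === 'string' &&
        typeof item?.input === 'object' &&
        item?.input !== null &&
        typeof item?.expected === 'string'
    )
  );
}

const parsed = JSON.parse(
  '{"name":"test-dataset","items":[{"id":"test-1","input":{"text":"Hello"},"expected":"Hi there!"}]}'
);
// isDatasetFile(parsed) is true; isDatasetFile({ name: 1, items: [] }) is false
```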
Using Datasets
const results = await runEvals({
  agent: myAgent,
  dataset: testDataset,
  scorers: [myScorer],
});
Running Experiments
Compare agent versions with experiments.
Creating Experiments
import { Experiment } from '@mastra/evals';
const experiment = new Experiment({
  name: 'prompt-v1-vs-v2',
  agent: myAgent,
  dataset: testDataset,
  variants: [
    { name: 'v1', config: { prompt: 'v1 prompt' } },
    { name: 'v2', config: { prompt: 'v2 prompt' } },
  ],
});
Running an Experiment
const results = await experiment.run();
console.log(results.v1.score); // 0.85
console.log(results.v2.score); // 0.92
Statistical Significance
import { analyzeSignificance } from '@mastra/evals';
const analysis = analyzeSignificance(results.v1, results.v2);
console.log(analysis.significant); // true
console.log(analysis.pValue); // 0.023
CI Integration
// Compare against baseline
const experiment = new Experiment({
  name: 'regression-check',
  agent: currentAgent,
  dataset,
  baseline: previousAgent,
});

const results = await experiment.run();

if (results.regressionDetected) {
  console.error('Performance regression detected!');
  process.exit(1);
}