Voice

Build voice agents with text-to-speech, speech-to-text, and real-time speech-to-speech.

Topics

Overview - Introduction to voice agents
Text to Speech - Generate speech from text
Speech to Text - Transcribe speech to text
Speech to Speech - Real-time voice conversations

Voice Overview

Build voice-enabled agents for conversational AI.

Voice Providers

OpenAI
ElevenLabs
Google
Azure
Deepgram
And more...

Capabilities

Text-to-Speech (TTS) - Convert text to spoken audio
Speech-to-Text (STT) - Transcribe audio to text
Speech-to-Speech (STS) - Real-time voice conversations
Voice Activity Detection - Detect when someone is speaking
Background Noise Suppression - Reduce background noise for cleaner audio
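
To make voice activity detection concrete, here is a minimal energy-based sketch. It is not how any listed provider implements detection. The function names, the 16-bit PCM frame format, and the 0.02 threshold are all assumptions chosen for illustration.

```typescript
// Energy-based voice activity detection over 16-bit PCM frames.
// All names and the default threshold are illustrative assumptions,
// not part of any provider or library API.

/** Root-mean-square level of a frame, normalized to the range 0..1. */
function frameRms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  const sumSquares = frame.reduce((sum, sample) => {
    const normalized = sample / 32768;
    return sum + normalized * normalized;
  }, 0);
  return Math.sqrt(sumSquares / frame.length);
}

/** Treats a frame as speech when its RMS level reaches the threshold. */
function isSpeech(frame: Int16Array, threshold = 0.02): boolean {
  return frameRms(frame) >= threshold;
}

// Example frames: 20 ms of silence and 20 ms of a 440 Hz tone at 16 kHz.
const silence = new Int16Array(320);
const tone = new Int16Array(320);
for (let i = 0; i < tone.length; i++) {
  tone[i] = Math.round(Math.sin((2 * Math.PI * 440 * i) / 16000) * 16000);
}
```

A fixed threshold is the simplest choice and tends to misfire in noisy rooms. Production detectors usually adapt the threshold to the noise floor or use a trained model.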

Getting Started

import { VoiceAgent } from '@mastra/voice';

const agent = new VoiceAgent({
  name: 'voice-assistant',
  model: 'gpt-4',
  voice: {
    provider: 'openai',
    voice: 'alloy',
  },
});

Basic Usage

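// getMicrophoneStream() and speaker stand in for your platform's audio input and output
const audioStream = getMicrophoneStream();

// Send the microphone audio to the agent and receive a streamed reply
const response = await agent.stream({
  audio: audioStream,
});

// pipe() is a placeholder for routing the reply audio to the output device
pipe(response.audio, speaker);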

Text to Speech

Generate speech from text.

Configuration

const tts = new TextToSpeech({
  provider: 'openai',
  voice: 'alloy',
  model: 'tts-1',
  speed: 1.0,
});

Generating Speech

const audioBuffer = await tts.speak({
  text: 'Hello, how can I help you today?',
});

playAudio(audioBuffer);

Voice Options

// OpenAI: built-in voices are selected by name
const voices = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer'];

// ElevenLabs: voices are looked up by ID ('voice-id' is a placeholder)
const voice = await elevenLabs.getVoice('voice-id');

Streaming

const stream = await tts.stream({
  text: 'This is a long response that will be streamed...',
});

pipe(stream, speaker);

SSML Support

const audio = await tts.speak({
  text: `<speak>
    <prosody rate="slow">Hello</prosody>
    <break time="500ms"/>
    <prosody rate="fast">How are you?</prosody>
  </speak>`,
  ssml: true,
});
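
When SSML is assembled from dynamic text, characters such as `&` and `<` need escaping or the markup becomes invalid. The sketch below builds a `<speak>` document from segments. `escapeXml`, `buildSsml`, and `SsmlSegment` are hypothetical helpers, not part of the API above, and SSML support varies by provider.

```typescript
// Hypothetical helpers for assembling SSML safely; not part of any library API.

/** Escapes the five XML special characters so text cannot break the markup. */
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Named rate values defined for <prosody> in the SSML specification.
type ProsodyRate = 'x-slow' | 'slow' | 'medium' | 'fast' | 'x-fast';

interface SsmlSegment {
  text: string;
  rate?: ProsodyRate;
  pauseAfterMs?: number;
}

/** Wraps each segment in optional <prosody> and appends an optional <break>. */
function buildSsml(segments: SsmlSegment[]): string {
  const body = segments
    .map(({ text, rate, pauseAfterMs }) => {
      const escaped = escapeXml(text);
      const spoken = rate ? `<prosody rate="${rate}">${escaped}</prosody>` : escaped;
      const pause = pauseAfterMs !== undefined ? `<break time="${pauseAfterMs}ms"/>` : '';
      return spoken + pause;
    })
    .join('');
  return `<speak>${body}</speak>`;
}

// Reproduces the markup from the example above, without the whitespace.
const ssml = buildSsml([
  { text: 'Hello', rate: 'slow', pauseAfterMs: 500 },
  { text: 'How are you?', rate: 'fast' },
]);
```

The resulting string can be passed as `text` with `ssml: true`.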

Speech to Text

Transcribe speech to text.

Configuration

const stt = new SpeechToText({
  provider: 'deepgram',
  model: 'nova-2',
  language: 'en',
});

Transcribing

const audioBuffer = await microphone.record();

const transcription = await stt.transcribe({
  audio: audioBuffer,
});

console.log(transcription.text);
console.log(transcription.words);
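
Word-level results are typically used for captions or highlighting. The sketch below groups words into caption segments wherever the speaker pauses. The shape of each word (`word`, `start`, `end` in seconds) is an assumption, since the structure of `transcription.words` is not specified here. `groupIntoCaptions` is a hypothetical helper.

```typescript
// Assumed word shape; check what your provider actually returns.
interface TimedWord {
  word: string;
  start: number; // seconds from the start of the audio
  end: number; // seconds from the start of the audio
}

interface Caption {
  text: string;
  start: number;
  end: number;
}

/** Starts a new caption whenever the silence between words exceeds maxGapSeconds. */
function groupIntoCaptions(words: TimedWord[], maxGapSeconds = 0.6): Caption[] {
  const captions: Caption[] = [];
  let current: Caption | undefined;
  for (const w of words) {
    if (current && w.start - current.end <= maxGapSeconds) {
      // Short gap: extend the caption in progress.
      current.text += ` ${w.word}`;
      current.end = w.end;
    } else {
      // First word or long pause: open a new caption.
      current = { text: w.word, start: w.start, end: w.end };
      captions.push(current);
    }
  }
  return captions;
}

const captions = groupIntoCaptions([
  { word: 'hello', start: 0.0, end: 0.4 },
  { word: 'there', start: 0.5, end: 0.9 },
  { word: 'goodbye', start: 2.0, end: 2.6 },
]);
```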

Streaming Transcription

const stream = await stt.stream();
const websocket = stream.getWebsocket();

// Register the handler before sending audio so no early transcripts are missed
websocket.on('transcript', (text) => {
  console.log(text);
});

microphone.pipe(websocket);

Real-time Recognition

const recognizer = stt.createRecognizer();

recognizer.on('transcript', (event) => {
  console.log(event.text);
  console.log(event.isFinal);
});

recognizer.start(microphoneStream);
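
The `isFinal` flag usually separates interim hypotheses, which may still change, from finalized text. The sketch below shows one way to combine the two for display, assuming each interim event replaces the previous one and each final event commits a segment. `TranscriptAccumulator` is a hypothetical helper, and those semantics are an assumption to verify against your provider.

```typescript
// Event shape taken from the handler above (text, isFinal).
interface TranscriptEvent {
  text: string;
  isFinal: boolean;
}

// Hypothetical helper; assumes interim results replace each other
// and final results are appended permanently.
class TranscriptAccumulator {
  private finalized: string[] = [];
  private interim = '';

  handle(event: TranscriptEvent): void {
    if (event.isFinal) {
      this.finalized.push(event.text.trim());
      this.interim = '';
    } else {
      this.interim = event.text.trim();
    }
  }

  /** Committed text followed by the latest interim hypothesis, if any. */
  display(): string {
    return [...this.finalized, this.interim].filter((part) => part.length > 0).join(' ');
  }
}

const accumulator = new TranscriptAccumulator();
accumulator.handle({ text: 'hello wor', isFinal: false });
const whileSpeaking = accumulator.display();
accumulator.handle({ text: 'Hello world.', isFinal: true });
accumulator.handle({ text: 'how are', isFinal: false });
const afterFinal = accumulator.display();
```

To use it with the recognizer, call `accumulator.handle(event)` inside the `transcript` handler and render `accumulator.display()`.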

Features

Multiple languages
Punctuation
Timestamps
Speaker detection
Noise reduction

Speech to Speech

Real-time voice conversations.

Overview

Speech-to-speech enables natural, real-time voice interactions with agents.

Configuration

const sts = new SpeechToSpeech({
  tts: {
    provider: 'openai',
    voice: 'alloy',
  },
  stt: {
    provider: 'deepgram',
    model: 'nova-2',
  },
  model: 'gpt-4o-realtime',
});

Creating a Voice Session

const session = await sts.createSession({
  agent: myAgent,
  voice: 'alloy',
});

session.on('user_speech', (audio) => {
  // User is speaking
});

session.on('agent_response', (audio) => {
  // Agent is responding
});

session.on('error', (error) => {
  console.error('Session error:', error);
});

Starting a Conversation

await session.start();

// Feed microphone audio into the session
const audioStream = microphone.start();
audioStream.pipe(session.getInputStream());

// Play the agent's audio as it arrives
session.getOutputStream().pipe(speaker);

Interrupting

// Stop the agent's current response, for example when the user starts speaking over it
session.interrupt();
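
Interruption ("barge-in") is commonly triggered automatically: if user speech is detected while the agent is talking, the response is cut off. The sketch below shows that decision logic against a minimal stand-in interface. `BargeInController` and its method names are hypothetical, and how you learn that the agent started or stopped speaking depends on the events your session exposes.

```typescript
// Minimal stand-in for the one session method this sketch needs.
interface InterruptibleSession {
  interrupt(): void;
}

// Hypothetical controller; not part of any library API.
class BargeInController {
  private session: InterruptibleSession;
  private agentSpeaking = false;

  constructor(session: InterruptibleSession) {
    this.session = session;
  }

  onAgentResponseStart(): void {
    this.agentSpeaking = true;
  }

  onAgentResponseEnd(): void {
    this.agentSpeaking = false;
  }

  /** Call when user speech is detected. Returns true if the agent was interrupted. */
  onUserSpeechDetected(): boolean {
    if (!this.agentSpeaking) return false;
    this.session.interrupt();
    this.agentSpeaking = false;
    return true;
  }
}

// Fake session that counts interrupt() calls, used to exercise the logic.
let interruptCount = 0;
const fakeSession: InterruptibleSession = {
  interrupt: () => {
    interruptCount += 1;
  },
};
const controller = new BargeInController(fakeSession);

const ignored = controller.onUserSpeechDetected(); // agent is idle, nothing to interrupt
controller.onAgentResponseStart();
const interrupted = controller.onUserSpeechDetected(); // agent is speaking, interrupt it
```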

Session Management

// Pause the session
await session.pause();

// Resume the session
await session.resume();

// End the session
await session.end();
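
Calling these methods in an invalid order (for example resuming an ended session) is a common source of bugs. One way to guard against it in application code is a small state machine. The states and allowed transitions below are assumptions for illustration. The library's actual lifecycle rules are not documented here.

```typescript
// Assumed lifecycle: idle -> active <-> paused, and any live state -> ended.
type SessionState = 'idle' | 'active' | 'paused' | 'ended';
type SessionAction = 'start' | 'pause' | 'resume' | 'end';

const transitions: Record<SessionState, Partial<Record<SessionAction, SessionState>>> = {
  idle: { start: 'active', end: 'ended' },
  active: { pause: 'paused', end: 'ended' },
  paused: { resume: 'active', end: 'ended' },
  ended: {},
};

/** Returns the state after applying the action, or throws if the action is not allowed. */
function nextState(state: SessionState, action: SessionAction): SessionState {
  const target = transitions[state][action];
  if (target === undefined) {
    throw new Error(`Cannot ${action} a session that is ${state}`);
  }
  return target;
}
```

Track the current state next to the session and call `nextState` before each `start`, `pause`, `resume`, or `end` so invalid calls fail early with a clear message.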

Use Cases

Voice assistants
Customer support
Interactive demos
Accessibility tools