Voice
Voice agents with text-to-speech and speech-to-text capabilities
Topics
- Overview - Introduction to voice agents
- Text to Speech - Generate speech from text
- Speech to Text - Transcribe speech to text
- Speech to Speech - Real-time voice conversations
Voice Overview
Build voice-enabled agents for conversational AI.
Voice Providers
- OpenAI
- ElevenLabs
- Azure
- Deepgram
- And more...
Capabilities
- Text-to-Speech (TTS) - Convert text to spoken audio
- Speech-to-Text (STT) - Transcribe audio to text
- Speech-to-Speech (STS) - Real-time voice conversations
- Voice Activity Detection - Detect when someone is speaking
- Background Noise Suppression - Filter background noise to improve audio quality
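The page does not say how voice activity detection is implemented, so the following is only a generic sketch of the idea: split PCM audio into fixed-size frames and flag a frame as speech when its RMS energy crosses a threshold. The names `frameRms` and `detectSpeech`, the frame size, and the threshold are all hypothetical and are not part of any provider's API.

```typescript
// Hypothetical helpers for illustration; not part of the library described here.

// Root-mean-square energy of one frame of PCM samples in the range [-1, 1].
function frameRms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    sumSquares += frame[i] * frame[i];
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Returns one flag per complete frame: true for speech, false for silence.
// A trailing partial frame is ignored.
function detectSpeech(
  samples: Float32Array,
  frameSize: number,
  threshold: number,
): boolean[] {
  const flags: boolean[] = [];
  for (let start = 0; start + frameSize <= samples.length; start += frameSize) {
    const frame = samples.subarray(start, start + frameSize);
    flags.push(frameRms(frame) >= threshold);
  }
  return flags;
}
```

Production detectors usually add smoothing (for example, requiring several consecutive speech frames) so that short clicks do not count as speech.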
Getting Started
import { VoiceAgent } from '@mastra/voice';

const agent = new VoiceAgent({
  name: 'voice-assistant',
  model: 'gpt-4',
  voice: {
    provider: 'openai',
    voice: 'alloy',
  },
});
Basic Usage
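// getMicrophoneStream, pipe, and speaker are app-provided audio helpers (not shown)
const audioStream = getMicrophoneStream();
const response = await agent.stream({
  audio: audioStream,
});
// Play the agent's spoken reply
pipe(response.audio, speaker);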
Text to Speech
Generate speech from text.
Configuration
const tts = new TextToSpeech({
  provider: 'openai',
  voice: 'alloy',
  model: 'tts-1',
  speed: 1.0,
});
Generating Speech
const audioBuffer = await tts.speak({
  text: 'Hello, how can I help you today?',
});
playAudio(audioBuffer);
Voice Options
// OpenAI voices
const voices = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer'];
// ElevenLabs: look up a voice by its ID
const voice = await elevenLabs.getVoice('voice-id');
Streaming
const stream = await tts.stream({
  text: 'This is a long response that will be streamed...',
});
pipe(stream, speaker);
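When streaming a long response, one common approach is to split the text into sentences and synthesize them one at a time, so playback can begin before the whole response has been processed. The source does not state whether `tts.stream` does this internally; the helper below, `splitIntoSentences`, is a hypothetical, self-contained sketch of the splitting step only.

```typescript
// Hypothetical helper for illustration; not part of the library described here.
// Splits text at sentence-ending punctuation (., !, ?) and keeps the punctuation.
function splitIntoSentences(text: string): string[] {
  // First alternative: a run of text followed by terminators.
  // Second alternative: trailing text with no terminator.
  const matches: string[] = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  return matches
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}
```

Each returned sentence could then be passed to a speech call in order. This simple rule also splits after abbreviations such as "Dr.", so real text may need a more careful segmenter.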
SSML Support
const audio = await tts.speak({
  text: `<speak>
    <prosody rate="slow">Hello</prosody>
    <break time="500ms"/>
    <prosody rate="fast">How are you?</prosody>
  </speak>`,
  ssml: true,
});
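When SSML is assembled from user-supplied or model-generated text, characters such as `&` and `<` need escaping or the markup becomes invalid. The sketch below builds a `<speak>` document from typed parts using the same `<prosody>` and `<break>` elements as the example above. `escapeXml`, `SsmlPart`, and `buildSsml` are hypothetical names introduced here, not library functions.

```typescript
// Hypothetical helpers for illustration; not part of the library described here.

// Escapes the five characters that are special in XML text and attributes.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

type SsmlPart =
  | { kind: 'text'; text: string; rate?: 'slow' | 'medium' | 'fast' }
  | { kind: 'break'; ms: number };

// Joins the parts into a single <speak> document.
function buildSsml(parts: SsmlPart[]): string {
  const body = parts
    .map((part) => {
      if (part.kind === 'break') {
        return `<break time="${part.ms}ms"/>`;
      }
      const text = escapeXml(part.text);
      return part.rate
        ? `<prosody rate="${part.rate}">${text}</prosody>`
        : text;
    })
    .join('');
  return `<speak>${body}</speak>`;
}
```

The resulting string could be passed as the `text` option with `ssml: true`, as in the example above.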
Speech to Text
Transcribe speech to text.
Configuration
const stt = new SpeechToText({
  provider: 'deepgram',
  model: 'nova-2',
  language: 'en',
});
Transcribing
const audioBuffer = await microphone.record();
const transcription = await stt.transcribe({
  audio: audioBuffer,
});
console.log(transcription.text);  // full transcript
console.log(transcription.words); // per-word details
Streaming Transcription
const stream = await stt.stream();
const websocket = stream.getWebsocket();
microphone.pipe(websocket);
websocket.on('transcript', (text) => {
  console.log(text);
});
Real-time
const recognizer = stt.createRecognizer();
recognizer.on('transcript', (event) => {
  console.log(event.text);
  console.log(event.isFinal); // false for interim results, true once the segment is final
});
recognizer.start(microphoneStream);
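Real-time recognizers typically emit interim results that are later replaced by a final one, which is what the `isFinal` flag above distinguishes. A minimal sketch of how a client might accumulate such events into a running transcript follows. The `TranscriptEvent` shape mirrors the `text` and `isFinal` fields shown above, and `TranscriptBuffer` is a hypothetical helper, not a library class.

```typescript
// Hypothetical helper for illustration; not part of the library described here.

interface TranscriptEvent {
  text: string;
  isFinal: boolean;
}

class TranscriptBuffer {
  private finalized: string[] = [];
  private interim = '';

  // Interim events overwrite each other; a final event is committed
  // and clears the pending interim text.
  push(event: TranscriptEvent): void {
    const text = event.text.trim();
    if (event.isFinal) {
      this.finalized.push(text);
      this.interim = '';
    } else {
      this.interim = text;
    }
  }

  // Committed segments followed by the current interim guess, if any.
  get text(): string {
    return [...this.finalized, this.interim]
      .filter((segment) => segment.length > 0)
      .join(' ');
  }
}
```

A handler such as `recognizer.on('transcript', (event) => buffer.push(event))` could feed this buffer, with `buffer.text` rendered after each event.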
Features
- Multiple languages
- Punctuation
- Timestamps
- Speaker detection
- Noise reduction
Speech to Speech
Real-time voice conversations.
Overview
Speech-to-speech enables natural, real-time voice interactions with agents.
Configuration
const sts = new SpeechToSpeech({
  tts: {
    provider: 'openai',
    voice: 'alloy',
  },
  stt: {
    provider: 'deepgram',
    model: 'nova-2',
  },
  model: 'gpt-4o-realtime',
});
Creating a Voice Session
const session = await sts.createSession({
  agent: myAgent,
  voice: 'alloy',
});
session.on('user_speech', (audio) => {
  // User is speaking
});
session.on('agent_response', (audio) => {
  // Agent is responding
});
session.on('error', (error) => {
  console.error('Session error:', error);
});
Starting a Conversation
await session.start();
// Send microphone audio to the session
const audioStream = microphone.start();
audioStream.pipe(session.getInputStream());
// Play the agent's audio on the speaker
session.getOutputStream().pipe(speaker);
Interrupting
// User interrupts the agent
session.interrupt();
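The source does not describe what `session.interrupt()` does internally. On the client side, interruption (often called barge-in) usually means discarding agent audio that is queued but not yet played, so the agent stops talking as soon as the user starts. The following is a sketch of that idea under that assumption; `PlaybackQueue` is a hypothetical class, not part of the session API.

```typescript
// Hypothetical helper for illustration; not part of the library described here.

class PlaybackQueue {
  private pending: Uint8Array[] = [];

  // Adds an audio chunk received from the agent.
  enqueue(chunk: Uint8Array): void {
    this.pending.push(chunk);
  }

  // Returns the next chunk to play, or undefined when nothing is queued.
  next(): Uint8Array | undefined {
    return this.pending.shift();
  }

  // Number of chunks waiting to be played.
  get size(): number {
    return this.pending.length;
  }

  // Drops all unplayed chunks and reports how many were discarded.
  interrupt(): number {
    const dropped = this.pending.length;
    this.pending = [];
    return dropped;
  }
}
```

A client might call `queue.interrupt()` alongside `session.interrupt()` when the user begins speaking.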
Session Management
// Pause the session
await session.pause();
// Resume the session
await session.resume();
// End the session
await session.end();
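The calls above imply a lifecycle in which a session is started, may be paused and resumed, and is eventually ended. The source does not list the valid transitions, so the table below is an assumption, sketched as a small state machine that an application could use to guard its own calls. `SessionState`, `SessionAction`, and `nextState` are hypothetical names.

```typescript
// Hypothetical lifecycle model for illustration; the allowed transitions
// are an assumption, not documented behavior.

type SessionState = 'idle' | 'active' | 'paused' | 'ended';
type SessionAction = 'start' | 'pause' | 'resume' | 'end';

const transitions: Record<SessionState, Partial<Record<SessionAction, SessionState>>> = {
  idle: { start: 'active', end: 'ended' },
  active: { pause: 'paused', end: 'ended' },
  paused: { resume: 'active', end: 'ended' },
  ended: {},
};

// Returns the state after applying the action, or throws if the action
// is not allowed in the current state.
function nextState(state: SessionState, action: SessionAction): SessionState {
  const next = transitions[state][action];
  if (next === undefined) {
    throw new Error(`Cannot ${action} a session that is ${state}`);
  }
  return next;
}
```

Checking `nextState` before calling `session.pause()` or `session.resume()` lets the application reject out-of-order calls, such as resuming a session that has already ended.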
Use Cases
- Voice assistants
- Customer support
- Interactive demos
- Accessibility tools