elevenlabs-tts
ElevenLabs TTS (Text-to-Speech) with emotional audio tags for expressive voice synthesis. WhatsApp-compatible voice messages with Opus conversion. Supports 7...
Install via ClawdBot CLI:
clawdbot install Shaharsha/elevenlabs-tts
Generate expressive voice messages using ElevenLabs v3 with audio tags.
ELEVENLABS_API_KEY: Required. Get one at elevenlabs.io → Profile → API Keys. Configure in openclaw.json under messages.tts.elevenlabs.apiKey.
Storytelling (emotional journey):
[soft] It started like any other day... [pause] But something felt different. [nervous] My hands were shaking as I opened the envelope. [gasps] I got in! [excited] I actually got in! [laughs] [happy] This changes everything!
Horror/Suspense (building dread):
[whispers] The house has been empty for years... [pause] At least, that's what they told me. [nervous] But I keep hearing footsteps. [scared] They're getting closer. [gasps] [panicking] The door-- it's opening by itself!
Conversation with reactions:
[curious] So what happened at the meeting? [pause] [surprised] Wait, they fired him?! [gasps] [sad] That's terrible... [sighs] He had a family. [thoughtful] I wonder what he'll do now.
Hebrew (romantic moment):
[soft] היא עמדה שם, מול השקיעה... [pause] הלב שלי פעם כל כך חזק. [nervous] לא ידעתי מה להגיד. [hesitates] אני... [breathes] [tender] את יודעת שאני אוהב אותך, נכון?
Spanish (celebration to reflection):
[excited] ¡Lo logramos! [laughs] [happy] No puedo creerlo... [pause] [thoughtful] Fueron tantos años de trabajo. [emotional] [soft] Gracias a todos los que creyeron en mí. [sighs] [content] Valió la pena cada momento.
In openclaw.json, configure TTS under messages.tts:
{
  "messages": {
    "tts": {
      "provider": "elevenlabs",
      "elevenlabs": {
        "apiKey": "sk_your_api_key_here",
        "voiceId": "pNInz6obpgDQGcFmaJgB",
        "modelId": "eleven_v3",
        "languageCode": "en",
        "voiceSettings": {
          "stability": 0.5,
          "similarityBoost": 0.75,
          "style": 0,
          "useSpeakerBoost": true,
          "speed": 1
        }
      }
    }
  }
}
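As a rough illustration of how these settings might reach ElevenLabs, the sketch below turns the messages.tts.elevenlabs block into a request body without sending anything. It assumes the camelCase config keys map onto the snake_case fields of the ElevenLabs text-to-speech REST endpoint (model_id, voice_settings.similarity_boost, and so on). How OpenClaw performs this mapping internally is not described here, and build_tts_request is a hypothetical helper name.

```python
API_BASE = "https://api.elevenlabs.io/v1"

# Mirrors the openclaw.json example above (API key omitted on purpose).
EXAMPLE_CONFIG = {
    "messages": {"tts": {"provider": "elevenlabs", "elevenlabs": {
        "voiceId": "pNInz6obpgDQGcFmaJgB",
        "modelId": "eleven_v3",
        "languageCode": "en",
        "voiceSettings": {"stability": 0.5, "similarityBoost": 0.75,
                          "style": 0, "useSpeakerBoost": True, "speed": 1},
    }}}
}


def build_tts_request(config: dict, text: str) -> tuple[str, dict]:
    """Return (url, json_body) for a TTS call. Hypothetical helper; sends nothing."""
    el = config["messages"]["tts"]["elevenlabs"]
    vs = el.get("voiceSettings", {})
    body = {
        "text": text,
        "model_id": el.get("modelId", "eleven_v3"),
        "language_code": el.get("languageCode", "en"),
        # Assumed camelCase -> snake_case mapping of the voice settings.
        "voice_settings": {
            "stability": vs.get("stability", 0.5),
            "similarity_boost": vs.get("similarityBoost", 0.75),
            "style": vs.get("style", 0),
            "use_speaker_boost": vs.get("useSpeakerBoost", True),
            "speed": vs.get("speed", 1),
        },
    }
    return f"{API_BASE}/text-to-speech/{el['voiceId']}", body


url, body = build_tts_request(EXAMPLE_CONFIG, "[excited] This is amazing! [pause]")
print(url)
print(body["voice_settings"])
```

Keeping the payload builder separate from any network call makes it easy to inspect what a given config would send before spending API credits.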
Getting your API Key:
These premade voices are optimized for v3 and work well with audio tags:
| Voice | ID | Gender | Accent | Best For |
|-------|-----|--------|--------|----------|
| Adam | pNInz6obpgDQGcFmaJgB | Male | American | Deep narration, general use |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Female | American | Calm narration, conversational |
| Brian | nPczCjzI2devNBz1zQrb | Male | American | Deep narration, podcasts |
| Charlotte | XB0fDUnXU5powFXDhCwa | Female | English-Swedish | Expressive, video games |
| George | JBFqnCBsd6RMkjVDRZzb | Male | British | Raspy narration, storytelling |
Finding more voices:
GET https://api.elevenlabs.io/v1/voices
Voice selection tips:
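The voices endpoint above can be queried with nothing but the standard library. This is a minimal sketch, assuming the API key is passed in the xi-api-key header and that the response is a JSON object with a "voices" array whose entries carry "name" and "voice_id" fields. The function names are illustrative, not part of the skill.

```python
import json
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def build_voices_request(api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) the GET request for the voices list."""
    return urllib.request.Request(
        VOICES_URL, headers={"xi-api-key": api_key}, method="GET"
    )


def summarize_voices(payload: dict) -> list[tuple[str, str]]:
    """Reduce the assumed response shape to (name, voice_id) pairs."""
    return [(v["name"], v["voice_id"]) for v in payload.get("voices", [])]


def list_voices(api_key: str) -> list[tuple[str, str]]:
    """Perform the network call; requires a valid API key."""
    with urllib.request.urlopen(build_voices_request(api_key)) as resp:
        return summarize_voices(json.load(resp))


# Offline example using a hand-written payload in the assumed shape.
sample = {"voices": [{"name": "Adam", "voice_id": "pNInz6obpgDQGcFmaJgB"}]}
print(summarize_voices(sample))
```

A voice_id taken from this list can be dropped straight into the voiceId field of openclaw.json.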
eleven_v3 (alpha) - ONLY model supporting audio tags
| Mode | Stability | Description |
|------|-----------|-------------|
| Creative | 0.3-0.5 | More emotional/expressive, may hallucinate |
| Natural | 0.5-0.7 | Balanced, closest to original voice |
| Robust | 0.7-1.0 | Highly stable, less responsive to tags |
For audio tags, use Creative (0.5) or Natural. Higher stability reduces tag responsiveness.
Speed range: 0.7 (slow) to 1.2 (fast); default 1.0.
Extreme values affect quality. For pacing, prefer audio tags like [rushed] or [drawn out].
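The stability bands and speed range above can be checked before a request is made. The sketch below is illustrative only: the table's bands overlap at 0.5 and 0.7, so it assumes boundary values belong to the lower band (0.5 counts as Creative, matching the recommendation above) and that anything below 0.3 is also treated as Creative. Both helper names are hypothetical.

```python
def stability_mode(stability: float) -> str:
    """Map a stability value to the mode names used in the table above."""
    if not 0.0 <= stability <= 1.0:
        raise ValueError("stability must be between 0.0 and 1.0")
    if stability <= 0.5:  # assumption: 0.5 counts as Creative
        return "Creative"
    if stability <= 0.7:  # assumption: 0.7 counts as Natural
        return "Natural"
    return "Robust"


def speed_in_range(speed: float) -> bool:
    """True when speed is within the documented 0.7 to 1.2 range."""
    return 0.7 <= speed <= 1.2


for value in (0.4, 0.6, 0.9):
    print(value, stability_mode(value))
print(speed_in_range(1.0), speed_in_range(1.5))
```

A check like this can warn when a Robust setting is combined with tag-heavy text, since higher stability makes tags less effective.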
How many tags to use:
Where to place tags:
Context matters:
[nervous] I... I'm not sure about this. What if it doesn't work? works better than [nervous] Hello.
Combine tags for nuance:
[nervously][whispers] = nervous whispering
[excited][laughs] = excited laughter
Regenerate for best results:
Match tag to voice:
[shouts] on a whispering voice
[whispers] on a loud/energetic voice
v3 does NOT support SSML break tags. Use audio tags and punctuation instead.
Punctuation enhances audio tags:
[nervous] I... I don't know...
[excited] That's AMAZING!
[explaining] So what you do is-- [interrupting] Wait!
[nervous] Are you sure about this?
[happy] We did it!
Combine tags + punctuation for maximum effect:
[tired] It was a long day... [sighs] Nobody listens anymore.
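When reviewing a script before generation, it can help to see which tags it contains and how long the spoken text is without them. The following is a small sketch that assumes a tag is any bracketed span with no nested brackets. The helper names are hypothetical and not part of the skill.

```python
import re

# Assumption: an audio tag is any [bracketed] span without nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> list[str]:
    """Return tag names in the order they appear."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Return the spoken text with tags removed and spacing tidied."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


line = "[tired] It was a long day... [sighs] Nobody listens anymore."
print(extract_tags(line))
print(strip_tags(line))
```

Comparing the tag count with the length of the stripped text gives a quick sense of whether a passage is over-tagged.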
tts tool (returns MP3)
message tool
1. Generate TTS (add [pause] at end to prevent cutoff):
tts text="[excited] This is amazing! [pause]" channel=whatsapp
Returns: MEDIA:/tmp/tts-xxx/voice-123.mp3
2. Convert MP3 → Opus:
ffmpeg -i /tmp/tts-xxx/voice-123.mp3 -c:a libopus -b:a 64k -vbr on -application voip /tmp/tts-xxx/voice-123.ogg
3. Send the Opus file:
Note: The message field below contains a Unicode Left-to-Right Mark (U+200E) between the quotes.
This is intentional: WhatsApp requires a non-empty message body to send voice notes.
The LTR mark is invisible but satisfies this requirement without displaying any text.
message action=send channel=whatsapp target="+972..." filePath="/tmp/tts-xxx/voice-123.ogg" asVoice=true message="‎"
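The conversion step can be scripted so the ffmpeg flags stay consistent between runs. This sketch assumes ffmpeg with libopus is on the PATH. The helper names are hypothetical, and the -y flag (overwrite without prompting) is an addition not present in the command above, included so a non-interactive run cannot hang on a prompt.

```python
import os
import subprocess

# Invisible Left-to-Right Mark used as the non-empty message body.
LTR_MARK = "\u200e"


def opus_command(mp3_path: str) -> list[str]:
    """Build the ffmpeg argument list for MP3 -> Opus (.ogg)."""
    ogg_path = os.path.splitext(mp3_path)[0] + ".ogg"
    return [
        "ffmpeg", "-y",            # -y is an addition: overwrite without asking
        "-i", mp3_path,
        "-c:a", "libopus",
        "-b:a", "64k",
        "-vbr", "on",
        "-application", "voip",
        ogg_path,
    ]


def convert_to_opus(mp3_path: str) -> str:
    """Run ffmpeg and return the path of the .ogg file it wrote."""
    cmd = opus_command(mp3_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd[-1]


print(" ".join(opus_command("/tmp/tts-xxx/voice-123.mp3")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems if the temporary path ever contains spaces.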
| Format | iOS | Android | Transcribe |
|--------|-----|---------|------------|
| MP3 | Works | May fail | No |
| Opus (.ogg) | Works | Works | Yes |
Always convert to Opus - it's the only format that:
ElevenLabs sometimes cuts off the last word. Always add [pause] or ... at the end:
[excited] This is amazing! [pause]
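The trailing-pause rule is easy to apply automatically. A minimal sketch follows, with ensure_trailing_pause as a hypothetical helper that leaves text alone when it already ends in [pause] or an ellipsis:

```python
def ensure_trailing_pause(text: str) -> str:
    """Append [pause] unless the text already ends with [pause] or '...'."""
    stripped = text.rstrip()
    if stripped.endswith("[pause]") or stripped.endswith("..."):
        return stripped
    return stripped + " [pause]"


print(ensure_trailing_pause("[excited] This is amazing!"))
print(ensure_trailing_pause("[soft] It started like any other day..."))
```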
For content >800 chars:
tts tool
cat > list.txt << EOF
file '/path/file1.mp3'
file '/path/file2.mp3'
EOF
ffmpeg -f concat -safe 0 -i list.txt -c copy final.mp3
Important: Don't mention "part 2" or "chapter" - keep it seamless.
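The splitting and concat-list steps for long content can be sketched as follows. This assumes chunks should break at sentence-ending punctuation and stay under the 800-character mark. A single sentence longer than the limit is kept whole rather than cut mid-sentence. Both helper names are hypothetical.

```python
import re


def split_for_tts(text: str, limit: int = 800) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `limit` chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def concat_list(paths: list[str]) -> str:
    """Build the contents of list.txt for ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in paths)


story = "[soft] It was late. " * 60
parts = split_for_tts(story)
print(len(parts), [len(p) for p in parts])
print(concat_list(["/path/file1.mp3", "/path/file2.mp3"]), end="")
```

Each chunk would then go through the tts tool separately, and the resulting MP3 paths would be written to list.txt for the ffmpeg concat command above.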
v3 can handle multiple characters in one generation:
Jessica: [whispers] Did you hear that?
Chris: [interrupting] --I heard it too!
Jessica: [panicking] We need to hide!
Dialogue tags: [interrupting], [overlapping], [cuts in], [interjecting]
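A multi-speaker script in the layout shown above can be assembled from structured turns. This is only a sketch of the text formatting: format_dialogue is a hypothetical helper, and it assumes each turn carries a single leading tag, or an empty string when no tag is wanted.

```python
def format_dialogue(turns: list[tuple[str, str, str]]) -> str:
    """Render (speaker, tag, line) turns as 'Speaker: [tag] line' rows."""
    rows = []
    for speaker, tag, line in turns:
        prefix = f"[{tag}] " if tag else ""
        rows.append(f"{speaker}: {prefix}{line}")
    return "\n".join(rows)


script = format_dialogue([
    ("Jessica", "whispers", "Did you hear that?"),
    ("Chris", "interrupting", "--I heard it too!"),
    ("Jessica", "panicking", "We need to hide!"),
])
print(script)
```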
| Category | Tags | When to Use |
|----------|------|-------------|
| Emotions | [excited], [happy], [sad], [angry], [nervous], [curious] | Main emotional state - use 1 per section |
| Delivery | [whispers], [shouts], [soft], [rushed], [drawn out] | Volume/speed changes |
| Reactions | [laughs], [sighs], [gasps], [clears throat], [gulps] | Natural human moments - sprinkle sparingly |
| Pacing | [pause], [hesitates], [stammers], [breathes] | Dramatic timing |
| Character | [French accent], [British accent], [robotic tone] | Character voice shifts |
| Dialogue | [interrupting], [overlapping], [cuts in] | Multi-speaker conversations |
Most effective tags (reliable results):
[excited], [nervous], [sad], [happy]
[laughs], [sighs], [whispers]
[pause]
Less reliable (test and regenerate):
[explosion], [gunshot]
Full tag list: See references/audio-tags.md
Tags read aloud?
eleven_v3 model
Voice inconsistent?
WhatsApp won't play?
No emotion despite tags?
Generated Mar 1, 2026
Create expressive audio content for podcasts by generating narration with emotional audio tags like [excited] or [thoughtful], enabling dynamic storytelling without voice actors. Ideal for indie podcasters or media companies needing multilingual episodes with consistent voice quality.
Enhance communication on WhatsApp by converting text messages into realistic voice notes with emotional nuances, such as [happy] or [sad], for personal or business use. Useful for customer service bots or social interactions where tone matters.
Develop engaging learning materials by generating voiceovers for e-learning modules, tutorials, or language lessons with multilingual support and audio tags like [curious] or [explaining]. Supports educators and edtech platforms in creating accessible audio content.
Produce immersive audio experiences for horror or suspense genres by using tags like [whispers] and [scared] to build tension, suitable for audiobooks, games, or interactive media. Appeals to content creators in entertainment and gaming industries.
Implement AI-driven voice responses for customer support systems in multiple languages with emotional cues like [helpful] or [apologetic], improving user experience and efficiency. Targets businesses in retail, tech, or hospitality sectors.
Offer tiered subscription plans for developers or businesses to access the ElevenLabs TTS skill, with pricing based on usage limits, voice options, and support levels. Revenue streams include monthly fees and overage charges for high-volume users.
Provide a free basic version for individual creators to generate voice content with limited tags, while charging for premium features like advanced audio tags, higher quality voices, and bulk processing. Monetizes through upgrades and in-app purchases.
Partner with companies to integrate the TTS skill into their platforms, such as WhatsApp bots or e-learning systems, offering customization, training, and ongoing support. Revenue comes from licensing fees, setup costs, and maintenance contracts.
Integration Tip
Ensure ffmpeg is installed for audio conversion and set the ELEVENLABS_API_KEY in openclaw.json to avoid errors during voice generation.