Best OpenClaw Skills for Voice & Audio: Whisper STT, ElevenLabs TTS & Voice Agents
Voice is the most natural interface for AI, yet it remains one of the harder engineering problems: audio pipelines require local models or API keys, format handling (OGG, MP3, WAV, M4A) varies across platforms, and latency matters in ways that don't apply to text. OpenClaw's speech ecosystem has matured significantly, with skills covering the full audio stack from speech recognition to voice agents to AI music generation.
Note: Install and download figures in text descriptions reflect stats at the time of writing and may be outdated. All skill tables are live — they fetch current data from the ClawHub database on every page load. Treat table values as authoritative.
By the Numbers
| Metric | Value |
|---|---|
| Skills in this guide | 20 |
| Workflow stages covered | 4 |
| Top skill by installs | openai-whisper (356 installs) |
| Top skill by downloads | openai-whisper (13,413 downloads) |
| STT skills | 6 |
| TTS skills | 6 |
| Voice agent skills | 5 |
1. Speech-to-Text: Whisper & Transcription
Whisper has become the de facto standard for speech-to-text in the OpenClaw ecosystem. openai-whisper leads the entire speech category with 356 installs and 13,413 downloads — it runs the Whisper CLI locally with no API required, which explains the dominance: no cost per minute, no data leaving the device, and support for dozens of languages. faster-whisper (22 installs, 5,072 downloads) is the performance variant — it uses the CTranslate2 runtime to achieve 4-6× faster transcription than standard Whisper on the same hardware. transcribe (27 installs, 2,614 downloads) is the simpler wrapper for single-file transcription without the full Whisper CLI setup.
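As a rough illustration of what local transcription involves, the sketch below builds and runs a Whisper CLI command from Python. It assumes the `whisper` command from the openai-whisper package is on the PATH. The helper names `build_whisper_command` and `transcribe` and the `transcripts` output directory are hypothetical choices for this example, not part of any skill described here.

```python
import shutil
import subprocess
from pathlib import Path


def build_whisper_command(audio_path, model="base", language=None,
                          output_dir="transcripts"):
    """Return the argument list for one local Whisper CLI run (plain-text output)."""
    cmd = [
        "whisper", str(audio_path),
        "--model", model,
        "--output_dir", str(output_dir),
        "--output_format", "txt",
    ]
    if language:
        # Passing the language explicitly skips Whisper's automatic detection step.
        cmd += ["--language", language]
    return cmd


def transcribe(audio_path, model="base", language=None, output_dir="transcripts"):
    """Run Whisper locally; return the transcript, or None if the CLI is not installed."""
    if shutil.which("whisper") is None:
        return None
    cmd = build_whisper_command(audio_path, model, language, output_dir)
    subprocess.run(cmd, check=True)
    # The CLI names its output after the input file's base name.
    transcript_file = Path(output_dir) / (Path(audio_path).stem + ".txt")
    return transcript_file.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(build_whisper_command("meeting.m4a", model="small", language="en"))
```

Keeping command construction separate from execution makes the flags easy to inspect before any audio is processed, and nothing here requires an API key.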
2. Text-to-Speech: ElevenLabs, OpenAI & Edge
TTS skills split between cloud APIs and local/free options. sag (272 installs, 6,811 downloads) is the ElevenLabs integration with a macOS say-style UX — it wraps ElevenLabs' API into a familiar interface that feels native to the terminal. elevenlabs-voices (20 installs, 5,571 downloads) provides 18 persona voices and 32 language options with fine-grained control over stability and style. For teams without an ElevenLabs subscription, edge-tts (25 installs, 3,810 downloads) provides free TTS via Microsoft Edge's text-to-speech service with no API key required — a zero-cost alternative for non-critical use cases. openai-tts (22 installs, 3,474 downloads) covers OpenAI's Audio Speech API for teams already in the OpenAI ecosystem.
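To make the "stability and style" controls concrete, here is a minimal sketch of a request to ElevenLabs' text-to-speech HTTP endpoint using only the standard library. It illustrates the underlying API rather than how sag or elevenlabs-voices call it, which this guide does not show. The function names, the default `model_id`, the fixed `similarity_boost`, and the output filename are assumptions for the example, and `voice_id` and `api_key` must come from your own account.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"


def build_tts_request(text, voice_id, api_key, stability=0.5, style=0.0,
                      model_id="eleven_multilingual_v2"):
    """Build (but do not send) a text-to-speech request with explicit voice settings."""
    for name, value in (("stability", stability), ("style", style)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    payload = {
        "text": text,
        "model_id": model_id,  # assumed default; pick the model your plan supports
        "voice_settings": {
            "stability": stability,
            "similarity_boost": 0.75,  # fixed here to keep the sketch short
            "style": style,
        },
    }
    return urllib.request.Request(
        f"{API_BASE}/{voice_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, voice_id, api_key, out_path="speech.mp3", **settings):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_tts_request(text, voice_id, api_key, **settings)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

Validating the settings before the call catches out-of-range values locally instead of spending a request on them.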
3. Voice Agents & Conversational Voice
Voice agent skills go beyond transcription and synthesis: they create persistent voice-interactive experiences. jarvis-voice (27 installs, 4,390 downloads) turns the agent into a JARVIS-style assistant with voice input and witty spoken responses. voice-wake-say (36 installs, 5,852 downloads) adds macOS voice output to any agent using the system's built-in say command, with zero API cost and offline operation. vocal-chat (12 installs, 2,640 downloads) handles voice-to-voice conversations on WhatsApp, bridging the phone messaging interface to AI voice responses. discord-voice (14 installs, 3,329 downloads) enables real-time voice conversations in Discord voice channels, and is the only Discord voice integration covered in this guide.
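The listen, think, speak loop behind a voice agent can be sketched in a few lines. This is a generic outline under stated assumptions, not the implementation of any skill above: `run_voice_turns` and `speak_with_say` are hypothetical names, the transcription and response steps are passed in as plain callables, and the speech helper only works on macOS where the `say` command exists.

```python
import platform
import shutil
import subprocess
from typing import Callable, Iterable, List, Optional, Tuple


def speak_with_say(text: str, voice: Optional[str] = None) -> bool:
    """Speak text through macOS `say`; return False where the command is unavailable."""
    if platform.system() != "Darwin" or shutil.which("say") is None:
        return False
    cmd = ["say"]
    if voice:
        cmd += ["-v", voice]
    cmd.append(text)
    subprocess.run(cmd, check=True)
    return True


def run_voice_turns(
    clips: Iterable[str],
    transcribe: Callable[[str], str],
    respond: Callable[[str], str],
    speak: Callable[[str], object],
) -> List[Tuple[str, str]]:
    """For each audio clip: speech-to-text, agent reply, then spoken output."""
    history = []
    for clip in clips:
        heard = transcribe(clip).strip()
        if not heard:
            continue  # skip silence or failed recognition instead of replying to nothing
        reply = respond(heard)
        speak(reply)
        history.append((heard, reply))
    return history


if __name__ == "__main__":
    # Stand-in callables so the loop can be exercised without audio hardware.
    turns = run_voice_turns(
        ["question.wav"],
        transcribe=lambda path: "what time is it",
        respond=lambda text: f"You asked: {text}",
        speak=print,
    )
    print(turns)
```

Passing the three stages in as callables means a local Whisper transcriber, a cloud TTS call, or `speak_with_say` can be swapped in without touching the loop itself.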
4. Audio Generation & Music
AI-generated music is an emerging category with a small but active footprint. elevenlabs-music (13 installs, 2,538 downloads) generates music from text prompts using ElevenLabs' music generation API, covering instrumental generation, sound effects, and voice cloning for music applications. alexa-cli (22 installs, 3,566 downloads) sits in an adjacent space: it controls Amazon Alexa devices and smart home speakers from the CLI, so its focus is audio playback control rather than generation.
Recommended Combinations
| Your situation | Recommended stack |
|---|---|
| Local transcription, no API costs | openai-whisper |
| Fast transcription on limited hardware | faster-whisper |
| High-quality TTS voices (ElevenLabs) | sag + elevenlabs-voices |
| Free TTS without API key | edge-tts, or voice-wake-say (macOS only) |
| Full voice agent (input + output) | openai-whisper + sag + jarvis-voice |
| WhatsApp voice conversations | vocal-chat |
| Discord voice integration | discord-voice |
| AI music from text prompts | elevenlabs-music |
A Few Observations
Whisper has won the STT category outright. With 356 installs, openai-whisper has more than 10× the installs of its nearest competitors in transcription. The combination of local execution, no per-minute cost, and strong multilingual performance has made it the default — to the point where most other STT approaches don't register meaningfully.
faster-whisper is the production upgrade path. Once teams commit to Whisper, the next question is latency. faster-whisper at 4-6× speed improvement on the same hardware is a compelling drop-in replacement. The 22 installs vs 356 for standard Whisper suggests most users start with standard and don't bother optimizing, which means there's significant untapped performance available.
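For a sense of what the switch looks like in code, here is a minimal sketch using the faster-whisper Python package directly. It assumes the package is installed; `transcribe_fast` and `format_segments` are hypothetical helper names, and the `int8` compute type is just one reasonable choice for CPU-only machines. Note that the Python API is not identical to openai-whisper's, so "drop-in" applies to the workflow more than to the exact function calls.

```python
def format_segments(segments):
    """Render timestamped segments as one line each: [start -> end] text."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")
    return "\n".join(lines)


def transcribe_fast(audio_path, model_size="base", device="cpu", compute_type="int8"):
    """Transcribe one file with faster-whisper; returns (detected language, transcript)."""
    # Imported lazily so format_segments stays usable without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    segments, info = model.transcribe(audio_path, beam_size=5)
    # `segments` is a generator, so decoding actually runs while it is consumed here.
    return info.language, format_segments(segments)
```

Calling `transcribe_fast("meeting.m4a")` would return the detected language code and a timestamped transcript; actual speedups depend on hardware, model size, and compute type.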
ElevenLabs dominates TTS, but free alternatives are real. sag (272 installs) dwarfs everything else in TTS. But edge-tts and voice-wake-say have meaningful install counts, suggesting a segment of users who need voice output but can't or won't pay for ElevenLabs. The free tier is a real category, not just a fallback.
Voice + messaging platform integrations are underexplored. vocal-chat (WhatsApp) and discord-voice (Discord) are the only skills that bridge AI voice to a messaging platform's native voice feature. Given how much human voice communication happens in these apps, this is a surprisingly thin category — most AI voice integrations assume a standalone terminal or desktop app.
The jarvis-voice framing matters. One of the most-downloaded voice agent skills (4,390 downloads, second only to voice-wake-say's 5,852) is named after the Marvel AI assistant. This is probably not an accident: "JARVIS" is the cultural shorthand for "AI that sounds like it belongs in your life." Skills that frame themselves around familiar AI archetypes appear to get higher adoption than functionally equivalent tools with technical names.
Data source: ClawHub platform install and download counts as of April 12, 2026. Visit clawhub-skills.com to search for more skills.