Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
Convert text to natural speech, transcribe audio to text, and translate spoken content with AI.
These skills handle the full audio pipeline — synthesizing voices with ElevenLabs and OpenAI TTS, transcribing recordings with Whisper, translating spoken content across languages, and processing audio files. Used by podcasters, developers, and accessibility teams.
Convert text to natural-sounding speech with AI voice models and custom voice styles.
Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
Local speech-to-text with the Whisper CLI (no API key).
ElevenLabs text-to-speech with mac-style say UX.
Speak responses aloud on macOS using the built-in `say` command when user input indicates Voice Wake/voice recognition (for example, messages starting with "User talked via voice recognition on <device>").
Transcribe audio files to text using local Whisper (Docker). Use when receiving voice messages, audio files (.mp3, .m4a, .ogg, .wav, .webm), or when asked to transcribe audio content.
Turn your AI into JARVIS. Voice, wit, and personality — the complete package. Humor cranked to maximum.
Quick install — most popular voice & audio ai skill:
clawdbot install steipete/openai-whisper-api553 skills found
Page 1 of 24
Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
Local speech-to-text with the Whisper CLI (no API key).
ElevenLabs text-to-speech with mac-style say UX.
Speak responses aloud on macOS using the built-in `say` command when user input indicates Voice Wake/voice recognition (for example, messages starting with "User talked via voice recognition on <device>").
Transcribe audio files to text using local Whisper (Docker). Use when receiving voice messages, audio files (.mp3, .m4a, .ogg, .wav, .webm), or when asked to transcribe audio content.
Turn your AI into JARVIS. Voice, wit, and personality — the complete package. Humor cranked to maximum.
Text-to-speech conversion using node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) User requests audio/voice output with the "tts" trigger or keyword. (2) Content needs to be spoken rather than read (multitasking, accessibility, driving, cooking). (3) User wants a specific voice, speed, pitch, or format for TTS output.
End-to-end encrypted agent-to-agent private messaging via Moltbook dead drops. Use when agents need to communicate privately, exchange secrets, or coordinate without human visibility.
Local Voice Input/Output for Agents using the AI Voice Agent API.
Text-to-speech via OpenAI Audio Speech API.
Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. SRT...
ElevenLabs TTS (Text-to-Speech) with emotional audio tags for expressive voice synthesis. WhatsApp-compatible voice messages with Opus conversion. Supports 7...
Transcribe audio to text with Whisper models via inference.sh CLI. Models: Fast Whisper Large V3, Whisper V3 Large. Capabilities: transcription, translation,...
High-quality voice synthesis with 18 personas, 32 languages, sound effects, batch processing, and voice design using ElevenLabs API.
Control Amazon Alexa devices and smart home via the `alexacli` CLI. Use when a user asks to speak/announce on Echo devices, control lights/thermostats/locks, send voice commands, or query Alexa.
AI audio generation powered by CellCog. Text-to-speech, voice synthesis, voiceovers, podcast audio, narration, music generation, background music, sound design. Professional audio creation with AI.
Real-time voice conversations in Discord voice channels with Claude AI
Transcribe audio files using OpenAI's gpt-4o-mini-transcribe model with vocabulary hints and text replacements. Requires uv (https://docs.astral.sh/uv/).
Generate music from text prompts using ElevenLabs Eleven Music API. Use when creating songs, soundtracks, jingles, lullabies, or any audio music from descriptions. Supports vocals with AI-generated lyrics, instrumental tracks, and multiple genres/styles. Requires paid ElevenLabs plan.
Local text-to-speech using Qwen3-TTS-12Hz-1.7B-CustomVoice. Use when generating audio from text, creating voice messages, or when TTS is requested. Supports 10 languages including Italian, 9 premium speaker voices, and instruction-based voice control (emotion, tone, style). Alternative to cloud-based TTS services like ElevenLabs. Runs entirely offline after initial model download.
通用语音识别 Skill。支持多种音频格式(ogg/mp3/wav/m4a),使用硅基流动 SenseVoice API 进行语音转文字。当用户发送语音消息、音频文件,或需要转录音频时触发。
本地生成 Telegram 语音消息,支持自动清洗、分段与临时文件管理。
Handles voice-to-voice conversations on WhatsApp. Automatically transcribes incoming audio and responds with local TTS audio. Use when the user wants to "talk" instead of type.
Generate spoken audio from text using the local Kokoro TTS engine. Use when the user asks to "say" something, requests a voice message, or wants text converted to speech.
Skills integrate with ElevenLabs, OpenAI TTS, Murf, Play.ht, Azure Cognitive Services, and Google Cloud TTS — covering natural voices, voice cloning, and custom voice styles.
Yes. Whisper-based skills handle long-form audio by chunking and processing in segments, with speaker diarization and timestamp output. Cloud-based skills handle files up to several hours.