text-to-speech: Convert text to natural speech with DIA TTS, Kokoro TTS, Chatterbox, Higgs Audio, VibeVoice, and more via the inference.sh CLI.
Install via ClawdBot CLI:
clawdbot install okaris/text-to-speech
# Install CLI
curl -fsSL https://cli.inference.sh | sh && infsh login
# Generate speech
infsh app run infsh/kokoro-tts --input '{"text": "Hello, welcome to our product demo."}'
Install note: The install script only detects your OS/architecture, downloads the matching binary from dist.inference.sh, and verifies its SHA-256 checksum. No elevated permissions or background processes. Manual install & verification available.
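The checksum step can also be reproduced by hand. A minimal sketch of the verification half, assuming a Linux-style `sha256sum` (on macOS, `shasum -a 256` is the equivalent); the installer performs this check automatically:

```shell
# Compare a file's SHA-256 digest against an expected value.
# Succeeds (exit 0) only when they match.
verify_sha256() {
  file="$1"
  expected="$2"
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}
```

With the downloaded binary and the digest published on dist.inference.sh in hand: `verify_sha256 infsh "<published-digest>" || echo "checksum mismatch"`.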
| Model | App ID | Best For |
|-------|--------|----------|
| DIA TTS | infsh/dia-tts | Conversational, expressive |
| Kokoro TTS | infsh/kokoro-tts | Fast, natural |
| Chatterbox | infsh/chatterbox | General purpose |
| Higgs Audio | infsh/higgs-audio | Emotional control |
| VibeVoice | infsh/vibevoice | Podcasts, long-form |
# Browse audio apps
infsh app list --category audio
infsh app run infsh/kokoro-tts --input '{"text": "Welcome to our tutorial."}'
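Inline JSON like the example above breaks as soon as the text itself contains quotes or newlines. One way around that (assuming `jq` is available; it is not part of the infsh CLI) is to let jq do the escaping:

```shell
# Build the --input payload from raw text, with jq handling JSON escaping
tts_input() {
  jq -cn --arg t "$1" '{text: $t}'
}
```

Usage: `infsh app run infsh/kokoro-tts --input "$(tts_input 'She said "hello" and left.')"`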
infsh app sample infsh/dia-tts --save input.json
# Edit input.json:
# {
# "text": "Hey! How are you doing today? I'm really excited to share this with you.",
# "voice": "conversational"
# }
infsh app run infsh/dia-tts --input input.json
infsh app sample infsh/vibevoice --save input.json
# Edit input.json with your podcast script
infsh app run infsh/vibevoice --input input.json
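Long-form scripts may be worth splitting into smaller runs. A sketch under the assumption that per-run input limits exist (check the model's sample for any real limit), using `split` plus jq 1.6+'s `--rawfile`:

```shell
# Split a long script file into ~20-line chunks, emitting one input JSON
# per chunk (requires jq 1.6+ for --rawfile)
chunk_script() {
  script="$1"
  split -l 20 "$script" chunk_
  for f in chunk_*; do
    jq -cn --rawfile t "$f" '{text: $t}' > "$f.json"
  done
}
```

Usage: `chunk_script podcast_script.txt && for f in chunk_*.json; do infsh app run infsh/vibevoice --input "$f"; done`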
infsh app sample infsh/higgs-audio --save input.json
# {
# "text": "This is absolutely incredible!",
# "emotion": "excited"
# }
infsh app run infsh/higgs-audio --input input.json
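Hand-editing input.json works interactively; for scripted runs the same edit can be expressed with jq. A sketch that assumes only the `text` and `emotion` fields shown in the sample above; any other fields in the file pass through untouched:

```shell
# Overwrite text and emotion in a saved sample input, preserving other fields
set_emotion() {
  file="$1"; text="$2"; emotion="$3"
  jq --arg t "$text" --arg e "$emotion" \
     '.text = $t | .emotion = $e' "$file"
}
```

Usage: `set_emotion input.json "This is absolutely incredible!" excited > run.json && infsh app run infsh/higgs-audio --input run.json`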
Generate speech, then create a talking head video:
# 1. Generate speech
infsh app run infsh/kokoro-tts --input '{"text": "Your script here"}' > speech.json
# 2. Use the audio URL with OmniHuman for avatar video
infsh app run bytedance/omnihuman-1-5 --input '{
  "image_url": "<portrait-image-url>",
"audio_url": "<audio-url-from-step-1>"
}'
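The two steps can be glued together once the audio URL is pulled out of the run result. A sketch assuming the result is JSON with the URL at `.output.audio.uri`; that field path is a guess, so inspect your own speech.json and adjust:

```shell
# Build the OmniHuman input from a portrait URL and a saved TTS result.
# The .output.audio.uri path is an assumption about the result's shape.
build_avatar_input() {
  image_url="$1"
  speech_json="$2"
  audio_url=$(jq -r '.output.audio.uri' "$speech_json")
  jq -n --arg img "$image_url" --arg aud "$audio_url" \
    '{image_url: $img, audio_url: $aud}'
}
```

Usage: `infsh app run bytedance/omnihuman-1-5 --input "$(build_avatar_input "<portrait-image-url>" speech.json)"`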
# Full platform skill (all 150+ apps)
npx skills add inference-sh/skills@inference-sh
# AI avatars (combine TTS with talking heads)
npx skills add inference-sh/skills@ai-avatar-video
# AI music generation
npx skills add inference-sh/skills@ai-music-generation
# Speech-to-text (transcription)
npx skills add inference-sh/skills@speech-to-text
# Video generation
npx skills add inference-sh/skills@ai-video-generation
Browse all apps: infsh app list
Generated Mar 1, 2026
Educators and e-learning platforms can convert textbooks and course materials into audiobooks or narrated videos, improving accessibility for students with visual impairments or different learning preferences. This supports remote learning by providing audio versions of lectures and tutorials.
Marketing agencies can generate voiceovers for product demos, explainer videos, and social media ads using expressive models like DIA TTS or Higgs Audio. This reduces costs and time compared to hiring voice actors, enabling rapid iteration on campaigns.
Media companies and independent creators can automate podcast episode generation with VibeVoice for long-form content, scripting dialogues with multi-speaker capabilities. This streamlines production for news briefs, storytelling, or branded podcasts.
Businesses can integrate TTS into IVR systems for phone prompts or voice assistants, using conversational models to provide natural-sounding interactions. This improves customer experience in sectors like retail, banking, and healthcare.
Developers and nonprofits can build tools to convert websites, documents, and apps into speech for visually impaired users, leveraging fast models like Kokoro TTS. This promotes inclusivity in digital content across various industries.
Offer a cloud-based TTS service with tiered pricing based on usage volume, such as characters processed per month, targeting businesses and creators. Revenue is generated through monthly or annual subscriptions, with premium features like voice cloning.
License the TTS technology via an API to software developers and enterprises for integration into their applications, charging per API call or with enterprise contracts. This model scales with client usage and supports custom deployments.
Operate a service that produces audiobooks, podcasts, and video narrations for clients using the skill, charging per project or hourly rates. Revenue comes from production fees and potential royalties on distributed content.
💬 Integration Tip
Start by installing the CLI and testing basic commands with sample inputs to understand model outputs before integrating into workflows.
Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
Local speech-to-text with the Whisper CLI (no API key).
ElevenLabs text-to-speech with mac-style say UX.
Text-to-speech conversion using node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) User requests audio/voice output with the "tts" trigger or keyword. (2) Content needs to be spoken rather than read (multitasking, accessibility, driving, cooking). (3) User wants a specific voice, speed, pitch, or format for TTS output.
Text-to-speech via OpenAI Audio Speech API.