pocket-tts

Generate high-quality English speech offline on CPU using 8 built-in voices or custom voice cloning with Kyutai's Pocket TTS model.
Install via ClawdBot CLI:
clawdbot install sherajdev/pocket-tts

Fully local, offline text-to-speech using Kyutai's Pocket TTS model. Generate high-quality audio from text without any API calls or internet connection. Features 8 built-in voices, voice cloning support, and runs entirely on CPU.
```shell
# 1. Accept the model license on Hugging Face:
#    https://huggingface.co/kyutai/pocket-tts

# 2. Install the package
pip install pocket-tts

# Or use uv for automatic dependency management
uvx pocket-tts generate "Hello world"
```
```shell
# Basic usage
pocket-tts "Hello, I am your AI assistant"

# With a specific voice
pocket-tts "Hello" --voice alba --output hello.wav

# With a custom voice file (voice cloning)
pocket-tts "Hello" --voice-file myvoice.wav --output output.wav

# Adjust speed
pocket-tts "Hello" --speed 1.2

# Start local server
pocket-tts --serve

# List available voices
pocket-tts --list-voices
```
```python
from pocket_tts import TTSModel
import scipy.io.wavfile

# Load model
tts_model = TTSModel.load_model()

# Get voice state
voice_state = tts_model.get_state_for_audio_prompt(
    "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
)

# Generate audio
audio = tts_model.generate_audio(voice_state, "Hello world!")

# Save to WAV
scipy.io.wavfile.write("output.wav", tts_model.sample_rate, audio.numpy())

# Check sample rate
print(f"Sample rate: {tts_model.sample_rate} Hz")
```
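For longer passages it can help to synthesize sentence-sized chunks and write one WAV per chunk. A minimal sketch building on the API shown above; the `split_text` helper is illustrative and not part of pocket-tts:

```python
import re


def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def synthesize_long_text(text: str, voice_prompt: str, out_prefix: str = "part"):
    """Generate one WAV per chunk using the pocket_tts API shown above."""
    # Imported lazily so the pure text-splitting helper works without the model.
    from pocket_tts import TTSModel
    import scipy.io.wavfile

    model = TTSModel.load_model()
    state = model.get_state_for_audio_prompt(voice_prompt)
    for i, chunk in enumerate(split_text(text)):
        audio = model.generate_audio(state, chunk)
        scipy.io.wavfile.write(f"{out_prefix}_{i:03d}.wav",
                               model.sample_rate, audio.numpy())
```

Whether per-sentence chunking improves quality or latency depends on the model; treat the 200-character default as a starting point, not a recommendation from Kyutai.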
| Voice | Description |
|-------|-------------|
| alba | Casual female voice |
| marius | Male voice |
| javert | Clear male voice |
| jean | Natural male voice |
| fantine | Female voice |
| cosette | Female voice |
| eponine | Female voice |
| azelma | Female voice |
Or use --voice-file /path/to/wav.wav for custom voice cloning.
| Option | Description | Default |
|--------|-------------|---------|
| text | Text to convert | Required |
| -o, --output | Output WAV file | output.wav |
| -v, --voice | Voice preset | alba |
| -s, --speed | Speech speed (0.5-2.0) | 1.0 |
| --voice-file | Custom WAV for cloning | None |
| --serve | Start HTTP server | False |
| --list-voices | List all voices | False |
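The flags above can be combined programmatically. A hypothetical wrapper that assembles the argv list from the documented options (the `build_cmd` helper is illustrative; run the result with `subprocess.run` once pocket-tts is installed):

```python
def build_cmd(text, voice=None, speed=None, output=None, voice_file=None):
    """Build a pocket-tts argv list from the documented CLI options."""
    cmd = ["pocket-tts", text]
    if voice is not None:
        cmd += ["--voice", voice]
    if speed is not None:
        cmd += ["--speed", str(speed)]  # documented range: 0.5-2.0
    if output is not None:
        cmd += ["--output", output]
    if voice_file is not None:
        cmd += ["--voice-file", voice_file]
    return cmd
```

For example, `subprocess.run(build_cmd("Hello", voice="marius", output="hi.wav"), check=True)` mirrors the CLI invocation shown earlier.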
Generated Mar 1, 2026
Enables text-to-speech in educational apps for students in low-connectivity areas, such as language learning platforms or e-readers. It supports voice cloning for personalized narration without internet dependency.
Integrates into on-premise customer service systems for businesses needing offline voice responses, like retail kiosks or factory assistance tools. Uses built-in voices or clones brand-specific tones.
Provides speech synthesis for accessibility features in software, such as screen readers for visually impaired users in offline environments. Runs on standard CPUs without GPU requirements.
Adds voice output to IoT devices like smart home assistants or industrial sensors that operate offline. Its low latency and CPU-only design suit resource-constrained hardware.
Supports local audio generation for content creators, such as podcasters or video editors needing voiceovers without cloud APIs. Voice cloning allows for custom character voices.
Offer a subscription-based service where businesses integrate Pocket TTS into their software for offline TTS capabilities. Revenue comes from monthly fees based on usage tiers or enterprise licenses.
Bundle the skill with hardware products like educational tablets or IoT devices that require offline speech synthesis. Revenue is generated through product sales and licensing agreements with manufacturers.
Provide a free basic version for developers, with premium features like advanced voice cloning or priority support. Revenue comes from paid upgrades and consulting services for custom integrations.
💬 Integration Tip
Ensure the Hugging Face model license is accepted before installation, and use the CLI for quick testing before Python API integration.
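One way to confirm the license has been accepted before downloading anything is to query the gated repo via `huggingface_hub`. The helper below is an illustrative sketch: it treats any lookup failure as "no access", and the `api` parameter exists only so a stub can be injected for testing:

```python
def can_access_model(repo_id: str, api=None) -> bool:
    """Return True if the (possibly gated) Hugging Face repo is accessible."""
    if api is None:
        # Imported lazily; requires `pip install huggingface_hub` and a logged-in token.
        from huggingface_hub import HfApi
        api = HfApi()
    try:
        api.model_info(repo_id)
        return True
    except Exception:
        # Gated repo without accepted license, missing token, or network error.
        return False
```

For example, `can_access_model("kyutai/pocket-tts")` returning False usually means the license page has not been accepted or no token is configured.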