qwen-tts: Local text-to-speech using Qwen3-TTS-12Hz-1.7B-CustomVoice. Use when generating audio from text, creating voice messages, or when TTS is requested. Supports 10 languages including Italian, 9 premium speaker voices, and instruction-based voice control (emotion, tone, style). Alternative to cloud-based TTS services like ElevenLabs. Runs entirely offline after the initial model download.
Install via ClawdBot CLI:
clawdbot install paki81/qwen-tts
Local text-to-speech using Hugging Face's Qwen3-TTS-12Hz-1.7B-CustomVoice model.
Generate speech from text:
scripts/tts.py "Ciao, come va?" -l Italian -o output.wav
With voice instruction (emotion/style):
scripts/tts.py "Sono felice!" -i "Parla con entusiasmo" -l Italian -o happy.wav
Different speaker:
scripts/tts.py "Hello world" -s Ryan -l English -o hello.wav
First-time setup (one-time):
cd skills/public/qwen-tts
bash scripts/setup.sh
This creates a local virtual environment and installs the qwen-tts package (~500MB).
Note: First synthesis downloads ~1.7GB model from Hugging Face automatically.
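If you'd rather not block the first synthesis on that download, the model can be pre-fetched with the Hugging Face CLI. This is a sketch: the repo id below is an assumption inferred from the model name in this skill, so check the actual id on the Hub before running.

```shell
# Optional: pre-fetch the model so the first run doesn't stall on a ~1.7GB download.
# Repo id is assumed from the model name used by this skill; verify it on the Hub.
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
```

The download lands in the shared Hugging Face cache, so the skill's first synthesis will pick it up without re-downloading.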
scripts/tts.py [options] "Text to speak"
-o, --output PATH - Output file path (default: qwen_output.wav)
-s, --speaker NAME - Speaker voice (default: Vivian)
-l, --language LANG - Language (default: Auto)
-i, --instruct TEXT - Voice instruction (emotion, style, tone)
--list-speakers - Show available speakers
--model NAME - Model name (default: CustomVoice 1.7B)
Basic Italian speech:
scripts/tts.py "Benvenuto nel futuro del text-to-speech" -l Italian -o welcome.wav
With emotion/instruction:
scripts/tts.py "Sono molto felice di vederti!" -i "Parla con entusiasmo e gioia" -l Italian -o happy.wav
Different speaker:
scripts/tts.py "Hello, nice to meet you" -s Ryan -l English -o ryan.wav
List available speakers:
scripts/tts.py --list-speakers
The CustomVoice model includes 9 premium voices:
| Speaker | Language | Description |
|---------|----------|-------------|
| Vivian | Chinese | Bright, slightly edgy young female |
| Serena | Chinese | Warm, gentle young female |
| Uncle_Fu | Chinese | Seasoned male, low mellow timbre |
| Dylan | Chinese (Beijing) | Youthful Beijing male, clear |
| Eric | Chinese (Sichuan) | Lively Chengdu male, husky |
| Ryan | English | Dynamic male, rhythmic |
| Aiden | English | Sunny American male |
| Ono_Anna | Japanese | Playful female, light nimble |
| Sohee | Korean | Warm female, rich emotion |
Recommendation: Use each speaker's native language for best quality, though all speakers support all 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian).
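To compare voices directly, a quick loop can render the same line with several speakers. A minimal sketch using the two English speakers from the table above:

```shell
# Render one line with each English speaker so the voices can be compared
# side by side. Speaker names are taken from the table above.
for s in Ryan Aiden; do
  scripts/tts.py "Quick comparison test" -s "$s" -l English -o "compare_${s}.wav"
done
```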
Use -i, --instruct to control emotion, tone, and style:
Italian examples:
"Parla con entusiasmo""Tono serio e professionale""Voce calma e rilassante""Leggi come un narratore"English examples:
"Speak with excitement""Very happy and energetic""Calm and soothing voice""Read like a narrator"The script outputs the audio file path to stdout (last line), making it compatible with OpenClaw's TTS workflow:
# OpenClaw captures the output path
cd skills/public/qwen-tts
OUTPUT=$(scripts/tts.py "Ciao" -s Vivian -l Italian -o /tmp/audio.wav 2>/dev/null | tail -n 1)
# OUTPUT = /tmp/audio.wav
Setup fails:
# Ensure Python 3.10-3.12 is available
python3.12 --version
# Re-run setup
cd skills/public/qwen-tts
rm -rf venv
bash scripts/setup.sh
Model download slow/fails:
# Use mirror (China mainland)
export HF_ENDPOINT=https://hf-mirror.com
scripts/tts.py "Test" -o test.wav
Out of memory (GPU):
The model automatically falls back to CPU if GPU memory is insufficient.
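If the automatic fallback misbehaves, CPU inference can usually be forced by hiding the GPU. This sketch assumes the underlying runtime is PyTorch, which honors `CUDA_VISIBLE_DEVICES`:

```shell
# Hide all CUDA devices so inference runs on the CPU (assumes a
# PyTorch-based runtime that respects CUDA_VISIBLE_DEVICES).
CUDA_VISIBLE_DEVICES="" scripts/tts.py "Test" -o cpu_test.wav
```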
Audio quality issues:
Try a different speaker (--list-speakers)
Add a clarity instruction: -i "Speak clearly and slowly"
Set the language explicitly, e.g. -l Italian for Italian text
Generated Feb 24, 2026
Content creators and marketers can generate voiceovers for videos, podcasts, or social media in multiple languages without relying on cloud services. This is ideal for producing Italian or other language content with emotion control for engaging storytelling.
Developers can integrate this TTS into applications to provide text-to-speech features for visually impaired users or language learners. The offline capability ensures privacy and reliability in educational or assistive technology tools.
Businesses can use this skill to generate automated voice responses or interactive voice systems in customer support, with support for 10 languages and customizable tones. It offers a cost-effective alternative to cloud-based TTS for localized service.
Individuals or small teams can create personalized voice messages for communication apps or notifications in different languages, leveraging the premium speaker voices and instruction-based emotion control for expressive audio.
AI researchers and hobbyists can quickly prototype TTS functionalities in projects like chatbots or virtual assistants, using the local model to avoid API costs and latency issues during development phases.
Offer a basic version of this TTS skill for free in open-source projects or tools, with premium features like additional speaker voices or advanced emotion controls available via subscription. This attracts users while generating recurring revenue from power users.
License the TTS technology to companies for internal use in applications like training modules or automated systems, with custom support and integration services. This model leverages the offline and multilingual capabilities for secure, scalable solutions.
Create a platform where users can generate and sell voiceovers or audio content using this skill, taking a commission on transactions. This taps into the growing demand for localized and emotive audio in media production.
💬 Integration Tip
Use the script's stdout output path for seamless integration with workflows like OpenClaw, ensuring audio files are captured automatically for further processing.
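As a sketch of that pattern, a hypothetical `speak` helper (not part of the skill) could wrap the script, keep only the last stdout line, and fail silently if no file was produced:

```shell
# Hypothetical wrapper: speaks text and prints only the generated file path.
# Relies on tts.py printing the output path as its last stdout line.
speak() {
  local out
  out=$(scripts/tts.py "$1" -l "${2:-Auto}" -o "${3:-/tmp/qwen_out.wav}" 2>/dev/null | tail -n 1)
  [ -f "$out" ] && printf '%s\n' "$out"
}
```

Usage: `speak "Ciao" Italian /tmp/ciao.wav` prints `/tmp/ciao.wav` on success, which downstream steps can capture.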
Related skills:
Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
Local speech-to-text with the Whisper CLI (no API key).
ElevenLabs text-to-speech with mac-style say UX.
Text-to-speech conversion using node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) User requests audio/voice output with the "tts" trigger or keyword. (2) Content needs to be spoken rather than read (multitasking, accessibility, driving, cooking). (3) User wants a specific voice, speed, pitch, or format for TTS output.
End-to-end encrypted agent-to-agent private messaging via Moltbook dead drops. Use when agents need to communicate privately, exchange secrets, or coordinate without human visibility.
Text-to-speech via OpenAI Audio Speech API.