qwenspeak: Text-to-speech generation via Qwen3-TTS over SSH. Preset voices, voice cloning, voice design. Use when the user wants to generate speech audio, clone voices,...
Install via ClawdBot CLI:
clawdbot install psyb0t/qwenspeak
YAML-driven text-to-speech over SSH using Qwen3-TTS models.
For installation and deployment, see references/setup.md.
Use scripts/qwenspeak.sh for all commands. It handles host, port, and host key acceptance via QWENSPEAK_HOST and QWENSPEAK_PORT env vars.
scripts/qwenspeak.sh <command> [args]
scripts/qwenspeak.sh <command> < input_file
scripts/qwenspeak.sh <command> > output_file
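Before the first call, point the wrapper at your deployment. The host and port below are placeholders, not real defaults:

```shell
# Placeholder values; substitute your own server's address and SSH port.
export QWENSPEAK_HOST=tts.example.internal
export QWENSPEAK_PORT=2222
```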
Submit YAML, get a job UUID back immediately, and poll for progress. Jobs run sequentially: one at a time, the rest queue up.
# Get the YAML template
scripts/qwenspeak.sh "tts print-yaml" > job.yaml
# Submit job
scripts/qwenspeak.sh "tts" < job.yaml
# {"id": "550e8400-...", "status": "queued", "total_steps": 3, "total_generations": 7}
# Check progress
scripts/qwenspeak.sh "tts get-job 550e8400"
# Follow job log
scripts/qwenspeak.sh "tts get-job-log 550e8400 -f"
# Download result
scripts/qwenspeak.sh "get hello.wav" > hello.wav
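Since the submit response is JSON, the job id can be captured for later polling. A minimal sketch using POSIX `sed` (the response literal below is the sample from above; in real use it comes from `scripts/qwenspeak.sh "tts" < job.yaml`):

```shell
# Sample submit response; in practice: resp=$(scripts/qwenspeak.sh "tts" < job.yaml)
resp='{"id": "550e8400-e29b-41d4-a716-446655440000", "status": "queued"}'

# Extract the "id" field without external JSON tooling.
job_id=$(printf '%s' "$resp" | sed -n 's/.*"id": *"\([^"]*\)".*/\1/p')
echo "$job_id"
```

The full UUID is captured here, but per the job-management notes a short prefix (e.g. the first 8 characters) also works in later `get-job` calls.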
Global settings + a list of steps. Each step loads a model, runs all its generations, then unloads it. Settings cascade: global > step > generation, with the most specific value winning.
steps:
- mode: custom-voice
model_size: 1.7b
speaker: Ryan
language: English
generate:
- text: "Hello world"
output: hello.wav
- text: "I cannot believe this!"
speaker: Vivian
instruct: "Speak angrily"
output: angry.wav
- mode: voice-design
generate:
- text: "Welcome to our store."
instruct: "A warm, friendly young female voice with a cheerful tone"
output: welcome.wav
- mode: voice-clone
model_size: 1.7b
ref_audio: ref.wav
ref_text: "Transcript of reference"
generate:
- text: "First line in cloned voice"
output: clone1.wav
- text: "Second line"
output: clone2.wav
custom-voice: pick from 9 preset speakers. 1.7B supports emotion/style via instruct.
voice-design: describe the voice in natural language via instruct. 1.7B only.
voice-clone: clone from reference audio. Set ref_audio and ref_text at the step level to reuse them across generations. x_vector_only: true skips the transcript.
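For completeness, a voice-clone step using x_vector_only, a sketch assuming the file names below are your own: with the flag set, only the speaker embedding is used, so no ref_text is needed.

```yaml
steps:
  - mode: voice-clone
    ref_audio: refs/sample.wav
    x_vector_only: true   # speaker embedding only; no transcript required
    generate:
      - text: "Cloned without a reference transcript"
        output: clone_xvec.wav
```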
To clone different emotions, upload one reference recording per emotion and give each its own step:
scripts/qwenspeak.sh "create-dir refs"
scripts/qwenspeak.sh "put refs/happy.wav" < me_happy.wav
scripts/qwenspeak.sh "put refs/angry.wav" < me_angry.wav
steps:
- mode: voice-clone
ref_audio: refs/happy.wav
ref_text: "transcript of happy ref"
generate:
- text: "Great news everyone!"
output: happy1.wav
- mode: voice-clone
ref_audio: refs/angry.wav
ref_text: "transcript of angry ref"
generate:
- text: "This is unacceptable"
output: angry1.wav
scripts/qwenspeak.sh "tts list-jobs" # list all
scripts/qwenspeak.sh "tts list-jobs --json" # JSON output
scripts/qwenspeak.sh "tts get-job <id>" # job details
scripts/qwenspeak.sh "tts get-job-log <id>" # view log
scripts/qwenspeak.sh "tts get-job-log <id> -f" # follow log
scripts/qwenspeak.sh "tts cancel-job <id>" # cancel
Statuses: queued → running → completed | failed | cancelled
Completed jobs auto-cleaned after 1 day, all jobs after 1 week. UUID prefixes work (e.g. first 8 chars).
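A minimal polling sketch built on the lifecycle above. The `get_job` helper here is a stub standing in for `scripts/qwenspeak.sh "tts get-job <id>"` so the loop's shape is runnable in isolation; swap in the real wrapper call in practice:

```shell
# Stub for illustration; in real use:
#   get_job() { scripts/qwenspeak.sh "tts get-job $1"; }
get_job() { echo '{"id": "550e8400", "status": "completed"}'; }

job=550e8400
while :; do
  # Pull the status field out of the JSON job record.
  status=$(get_job "$job" | sed -n 's/.*"status": *"\([a-z]*\)".*/\1/p')
  case "$status" in
    completed|failed|cancelled) break ;;  # terminal statuses
  esac
  sleep 2  # poll interval between checks
done
echo "final status: $status"
```

With the stub the loop exits on the first pass; against a live server it keeps polling until the job leaves the queued/running states.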
All paths are relative to the work directory. Traversal is blocked.
| Command | Description |
| ---------------------- | ---------------------------------- |
| put | Upload file from stdin |
| get | Download file to stdout |
| list-files [--json] | List directory |
| remove-file | Delete a file |
| create-dir | Create directory |
| remove-dir | Remove empty directory |
| move-file | Move or rename |
| copy-file | Copy a file |
| file-exists | Check if file exists (true/false) |
| search-files | Glob search (** recursive) |
| Speaker | Gender | Language | Description |
| -------- | ------ | -------- | ---------------------------------------------- |
| Vivian | Female | Chinese | Bright, slightly edgy young voice |
| Serena | Female | Chinese | Warm, gentle young voice |
| Uncle_Fu | Male | Chinese | Seasoned, low mellow timbre |
| Dylan | Male | Chinese | Youthful Beijing dialect, clear natural timbre |
| Eric | Male | Chinese | Lively Chengdu/Sichuan dialect, slightly husky |
| Ryan | Male | English | Dynamic with strong rhythmic drive |
| Aiden | Male | English | Sunny American, clear midrange |
| Ono_Anna | Female | Japanese | Playful, light nimble timbre |
| Sohee | Female | Korean | Warm with rich emotion |
All settings cascade: global > step > generation.
| Field | Default | Description |
| -------------------- | --------- | ------------------------------------------------------------------- |
| dtype | float32 | float32, float16, bfloat16 (float16/bfloat16 GPU only) |
| flash_attn           | auto      | FlashAttention-2: auto-detects, auto-switches float32 → bfloat16    |
| temperature | 0.9 | Sampling temperature |
| top_k | 50 | Top-k sampling |
| top_p | 1.0 | Top-p / nucleus sampling |
| repetition_penalty | 1.05 | Repetition penalty |
| max_new_tokens | 2048 | Max codec tokens to generate |
| no_sample | false | Greedy decoding |
| streaming | false | Streaming mode (lower latency) |
| mode | required | Step only: custom-voice, voice-design, or voice-clone |
| model_size | 1.7b | Step only: 1.7b or 0.6b |
| text | required | Text to synthesize |
| output | required | Output file path |
| speaker | Vivian | custom-voice: speaker name |
| language | Auto | Language for synthesis |
| instruct | - | custom-voice: emotion/style; voice-design: voice description |
| ref_audio | - | voice-clone: reference audio file path |
| ref_text | - | voice-clone: transcript of reference audio |
| x_vector_only | false | voice-clone: use speaker embedding only |
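The cascade described in the table above can be sketched in YAML: values set at the top level apply everywhere, a step can override them, and a single generation can override the step. File names and values here are illustrative:

```yaml
temperature: 0.8          # global: applies to every generation below
steps:
  - mode: custom-voice
    speaker: Ryan         # step-level: shared by this step's generations
    generate:
      - text: "Uses the global temperature"
        output: a.wav
      - text: "Overrides temperature for this line only"
        temperature: 1.1  # generation-level: most specific, wins
        output: b.wav
```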
Generated Mar 1, 2026
Generate custom voice prompts for IVR systems or chatbots in multiple languages using preset speakers like Ryan for English and Vivian for Chinese. This ensures consistent brand voice across global customer interactions without hiring voice actors.
Create narrated audio for e-learning modules or audiobooks by cloning a teacher's voice or designing friendly voices via voice-design mode. This allows scalable production of personalized educational materials in various tones and languages.
Produce dynamic voiceovers for commercials or social media ads using emotion-specific cloned references or preset speakers like Aiden for sunny American tones. Enables rapid iteration on ad campaigns with tailored emotional delivery.
Convert text documents or web content into speech audio using voice-clone mode to replicate a familiar voice for personalized assistance. Supports multiple languages and emotional styles to enhance user experience.
Design unique character voices for games or animations using voice-design mode with natural language descriptions. Allows creators to prototype and produce diverse vocal performances without studio recordings.
Offer a cloud-based API service where users pay monthly for access to qwenspeak's TTS features, including voice cloning and design. Revenue comes from tiered plans based on usage limits and advanced features like emotion control.
Provide bespoke integration and voice training services for businesses needing branded or cloned voices. Charge project-based fees for setup, customization, and ongoing support, targeting industries like customer service and education.
Operate a platform where users pay per audio file generated, with pricing based on factors like model size or voice mode. Attract indie creators and small businesses with low upfront costs and scalable usage.
💬 Integration Tip
Ensure SSH access and environment variables are configured properly; use YAML templates for batch processing to streamline job submissions and management.
Related skills:
- Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
- Local speech-to-text with the Whisper CLI (no API key).
- ElevenLabs text-to-speech with mac-style say UX.
- Text-to-speech conversion using the node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) the user requests audio/voice output with the "tts" trigger or keyword; (2) content needs to be spoken rather than read (multitasking, accessibility, driving, cooking); (3) the user wants a specific voice, speed, pitch, or format for TTS output.
- End-to-end encrypted agent-to-agent private messaging via Moltbook dead drops. Use when agents need to communicate privately, exchange secrets, or coordinate without human visibility.
- Text-to-speech via OpenAI Audio Speech API.