qwenspeak: Text-to-speech generation via Qwen3-TTS over SSH. Preset voices, voice cloning, voice design. Use when the user wants to generate speech audio, clone voices,...
Install via ClawdBot CLI:
clawdbot install psyb0t/qwenspeak
YAML-driven text-to-speech over SSH using Qwen3-TTS models.
For installation and deployment, see references/setup.md.
Use scripts/qwenspeak.sh for all commands. It handles host, port, and host key acceptance via QWENSPEAK_HOST and QWENSPEAK_PORT env vars.
scripts/qwenspeak.sh <command> [args]
scripts/qwenspeak.sh <command> < input_file
scripts/qwenspeak.sh <command> > output_file
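Before the first call, point the wrapper at your deployment. The host and port below are placeholders, not real defaults:

```shell
# Placeholder values; substitute your own server's address and SSH port.
export QWENSPEAK_HOST=tts.example.internal
export QWENSPEAK_PORT=2222
```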
Submit YAML, get a job UUID back immediately, and poll for progress. Jobs run sequentially: one at a time, the rest queue up.
# Get the YAML template
scripts/qwenspeak.sh "tts print-yaml" > job.yaml
# Submit job
scripts/qwenspeak.sh "tts" < job.yaml
# {"id": "550e8400-...", "status": "queued", "total_steps": 3, "total_generations": 7}
# Check progress
scripts/qwenspeak.sh "tts get-job 550e8400"
# Follow job log
scripts/qwenspeak.sh "tts get-job-log 550e8400 -f"
# Download result
scripts/qwenspeak.sh "get hello.wav" > hello.wav
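Since the submit response is JSON, the job id can be captured for later polling. A minimal sketch using POSIX `sed` (the response literal below is the sample from above; in real use it comes from `scripts/qwenspeak.sh "tts" < job.yaml`):

```shell
# Sample submit response; in practice: resp=$(scripts/qwenspeak.sh "tts" < job.yaml)
resp='{"id": "550e8400-e29b-41d4-a716-446655440000", "status": "queued"}'

# Extract the "id" field without external JSON tooling.
job_id=$(printf '%s' "$resp" | sed -n 's/.*"id": *"\([^"]*\)".*/\1/p')
echo "$job_id"
```

The full UUID is captured here, but per the job-management notes a short prefix (e.g. the first 8 characters) also works in later `get-job` calls.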
Global settings + a list of steps. Each step loads a model, runs all its generations, then unloads it. Settings cascade: global > step > generation, with the most specific value winning.
steps:
- mode: custom-voice
model_size: 1.7b
speaker: Ryan
language: English
generate:
- text: "Hello world"
output: hello.wav
- text: "I cannot believe this!"
speaker: Vivian
instruct: "Speak angrily"
output: angry.wav
- mode: voice-design
generate:
- text: "Welcome to our store."
instruct: "A warm, friendly young female voice with a cheerful tone"
output: welcome.wav
- mode: voice-clone
model_size: 1.7b
ref_audio: ref.wav
ref_text: "Transcript of reference"
generate:
- text: "First line in cloned voice"
output: clone1.wav
- text: "Second line"
output: clone2.wav
custom-voice: pick from 9 preset speakers. 1.7B supports emotion/style via instruct.
voice-design: describe the voice in natural language via instruct. 1.7B only.
voice-clone: clone from reference audio. Set ref_audio and ref_text at the step level to reuse them across generations. x_vector_only: true skips the transcript.
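For completeness, a voice-clone step using x_vector_only, a sketch assuming the file names below are your own: with the flag set, only the speaker embedding is used, so no ref_text is needed.

```yaml
steps:
  - mode: voice-clone
    ref_audio: refs/sample.wav
    x_vector_only: true   # speaker embedding only; no transcript required
    generate:
      - text: "Cloned without a reference transcript"
        output: clone_xvec.wav
```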
To clone different emotions, upload one reference recording per emotion and give each its own step:
scripts/qwenspeak.sh "create-dir refs"
scripts/qwenspeak.sh "put refs/happy.wav" < me_happy.wav
scripts/qwenspeak.sh "put refs/angry.wav" < me_angry.wav
steps:
- mode: voice-clone
ref_audio: refs/happy.wav
ref_text: "transcript of happy ref"
generate:
- text: "Great news everyone!"
output: happy1.wav
- mode: voice-clone
ref_audio: refs/angry.wav
ref_text: "transcript of angry ref"
generate:
- text: "This is unacceptable"
output: angry1.wav
scripts/qwenspeak.sh "tts list-jobs" # list all
scripts/qwenspeak.sh "tts list-jobs --json" # JSON output
scripts/qwenspeak.sh "tts get-job <id>" # job details
scripts/qwenspeak.sh "tts get-job-log <id>" # view log
scripts/qwenspeak.sh "tts get-job-log <id> -f" # follow log
scripts/qwenspeak.sh "tts cancel-job <id>" # cancel
Statuses: queued → running → completed | failed | cancelled
Completed jobs auto-cleaned after 1 day, all jobs after 1 week. UUID prefixes work (e.g. first 8 chars).
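A minimal polling sketch built on the lifecycle above. The `get_job` helper here is a stub standing in for `scripts/qwenspeak.sh "tts get-job <id>"` so the loop's shape is runnable in isolation; swap in the real wrapper call in practice:

```shell
# Stub for illustration; in real use:
#   get_job() { scripts/qwenspeak.sh "tts get-job $1"; }
get_job() { echo '{"id": "550e8400", "status": "completed"}'; }

job=550e8400
while :; do
  # Pull the status field out of the JSON job record.
  status=$(get_job "$job" | sed -n 's/.*"status": *"\([a-z]*\)".*/\1/p')
  case "$status" in
    completed|failed|cancelled) break ;;  # terminal statuses
  esac
  sleep 2  # poll interval between checks
done
echo "final status: $status"
```

With the stub the loop exits on the first pass; against a live server it keeps polling until the job leaves the queued/running states.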
All paths are relative to the work directory. Traversal is blocked.
| Command | Description |
| ---------------------- | ---------------------------------- |
| put | Upload file from stdin |
| get | Download file to stdout |
| list-files [--json] | List directory |
| remove-file | Delete a file |
| create-dir | Create directory |
| remove-dir | Remove empty directory |
| move-file | Move or rename |
| copy-file | Copy a file |
| file-exists | Check if file exists (true/false) |
| search-files | Glob search (** recursive) |
| Speaker | Gender | Language | Description |
| -------- | ------ | -------- | ---------------------------------------------- |
| Vivian | Female | Chinese | Bright, slightly edgy young voice |
| Serena | Female | Chinese | Warm, gentle young voice |
| Uncle_Fu | Male | Chinese | Seasoned, low mellow timbre |
| Dylan | Male | Chinese | Youthful Beijing dialect, clear natural timbre |
| Eric | Male | Chinese | Lively Chengdu/Sichuan dialect, slightly husky |
| Ryan | Male | English | Dynamic with strong rhythmic drive |
| Aiden | Male | English | Sunny American, clear midrange |
| Ono_Anna | Female | Japanese | Playful, light nimble timbre |
| Sohee | Female | Korean | Warm with rich emotion |
All settings cascade: global > step > generation.
| Field | Default | Description |
| -------------------- | --------- | ------------------------------------------------------------------- |
| dtype | float32 | float32, float16, bfloat16 (float16/bfloat16 GPU only) |
| flash_attn           | auto      | FlashAttention-2: auto-detects, auto-switches float32 → bfloat16    |
| temperature | 0.9 | Sampling temperature |
| top_k | 50 | Top-k sampling |
| top_p | 1.0 | Top-p / nucleus sampling |
| repetition_penalty | 1.05 | Repetition penalty |
| max_new_tokens | 2048 | Max codec tokens to generate |
| no_sample | false | Greedy decoding |
| streaming | false | Streaming mode (lower latency) |
| mode | required | Step only: custom-voice, voice-design, or voice-clone |
| model_size | 1.7b | Step only: 1.7b or 0.6b |
| text | required | Text to synthesize |
| output | required | Output file path |
| speaker | Vivian | custom-voice: speaker name |
| language | Auto | Language for synthesis |
| instruct | - | custom-voice: emotion/style; voice-design: voice description |
| ref_audio | - | voice-clone: reference audio file path |
| ref_text | - | voice-clone: transcript of reference audio |
| x_vector_only | false | voice-clone: use speaker embedding only |
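The cascade described in the table above can be sketched in YAML: values set at the top level apply everywhere, a step can override them, and a single generation can override the step. File names and values here are illustrative:

```yaml
temperature: 0.8          # global: applies to every generation below
steps:
  - mode: custom-voice
    speaker: Ryan         # step-level: shared by this step's generations
    generate:
      - text: "Uses the global temperature"
        output: a.wav
      - text: "Overrides temperature for this line only"
        temperature: 1.1  # generation-level: most specific, wins
        output: b.wav
```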
Generated Mar 1, 2026
Generate custom voice prompts for IVR systems or chatbots in multiple languages using preset speakers like Ryan for English and Vivian for Chinese. This ensures consistent brand voice across global customer interactions without hiring voice actors.
Create narrated audio for e-learning modules or audiobooks by cloning a teacher's voice or designing friendly voices via voice-design mode. This allows scalable production of personalized educational materials in various tones and languages.
Produce dynamic voiceovers for commercials or social media ads using emotion-specific cloned references or preset speakers like Aiden for sunny American tones. Enables rapid iteration on ad campaigns with tailored emotional delivery.
Convert text documents or web content into speech audio using voice-clone mode to replicate a familiar voice for personalized assistance. Supports multiple languages and emotional styles to enhance user experience.
Design unique character voices for games or animations using voice-design mode with natural language descriptions. Allows creators to prototype and produce diverse vocal performances without studio recordings.
Offer a cloud-based API service where users pay monthly for access to qwenspeak's TTS features, including voice cloning and design. Revenue comes from tiered plans based on usage limits and advanced features like emotion control.
Provide bespoke integration and voice training services for businesses needing branded or cloned voices. Charge project-based fees for setup, customization, and ongoing support, targeting industries like customer service and education.
Operate a platform where users pay per audio file generated, with pricing based on factors like model size or voice mode. Attract indie creators and small businesses with low upfront costs and scalable usage.
💬 Integration Tip
Ensure SSH access and environment variables are configured properly; use YAML templates for batch processing to streamline job submissions and management.
Related skills:
- Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
- Local speech-to-text with the Whisper CLI (no API key).
- ElevenLabs text-to-speech with mac-style say UX.
- Text-to-speech conversion using the node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) the user requests audio/voice output with the "tts" trigger or keyword; (2) content needs to be spoken rather than read (multitasking, accessibility, driving, cooking); (3) the user wants a specific voice, speed, pitch, or format for TTS output.
- End-to-end encrypted agent-to-agent private messaging via Moltbook dead drops. Use when agents need to communicate privately, exchange secrets, or coordinate without human visibility.
- Text-to-speech via OpenAI Audio Speech API.