alicloud-ai-audio-tts-realtime

Real-time speech synthesis with Alibaba Cloud Model Studio Qwen TTS Realtime models. Use when low-latency interactive speech is required, including instruction-controlled realtime synthesis.
Install via ClawdBot CLI:
clawdbot install cinience/alicloud-ai-audio-tts-realtime

Category: provider
Use realtime TTS models for low-latency streaming speech output.
Use one of these exact model strings:
- qwen3-tts-flash-realtime
- qwen3-tts-instruct-flash-realtime
- qwen3-tts-instruct-flash-realtime-2026-01-22

Set up a Python environment and install the DashScope SDK:

python3 -m venv .venv
. .venv/bin/activate
python -m pip install dashscope
Authentication: set DASHSCOPE_API_KEY in your environment, or add dashscope_api_key to ~/.alibabacloud/credentials.

Inputs:
- text (string, required)
- voice (string, required)
- instruction (string, optional)
- sample_rate (int, optional)

Outputs:
- audio_base64_pcm_chunks (array)
- sample_rate (int)
- finish_reason (string)

Realtime support depends on your installed SDK's MultiModalConversation. Use the probe script to verify realtime compatibility in your current SDK/runtime, and optionally fall back to a non-realtime model for immediate output:
.venv/bin/python skills/ai/audio/alicloud-ai-audio-tts-realtime/scripts/realtime_tts_demo.py \
--text "θΏζ―δΈδΈͺ realtime θ―ι³ζΌη€Ίγ" \
--fallback \
--output output/ai-audio-tts-realtime/audio/fallback-demo.wav
Strict mode (for CI / gating):
.venv/bin/python skills/ai/audio/alicloud-ai-audio-tts-realtime/scripts/realtime_tts_demo.py \
--text "realtime health check" \
--strict
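Strict mode is intended for gating: when realtime synthesis is unavailable, the probe exits nonzero so CI can branch on the return code. A minimal sketch of that gating pattern, using a stand-in command in place of the real probe script (the stand-in and the `realtime_supported` helper are illustrative, not part of the skill):

```python
import subprocess
import sys

def realtime_supported(probe_cmd):
    """Return True when the strict probe command exits with status 0."""
    return subprocess.run(probe_cmd).returncode == 0

# Stand-in for: .venv/bin/python .../realtime_tts_demo.py --text "..." --strict
ok = realtime_supported([sys.executable, "-c", "raise SystemExit(0)"])
print("realtime OK" if ok else "realtime unavailable; try --fallback")
```

The same check works from a shell script or CI step by testing the probe's exit status directly.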
OUTPUT_DIR: output/ai-audio-tts-realtime/audio/
References: .references/sources.md

Generated Mar 1, 2026
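The output fields described above (base64-encoded PCM chunks plus a sample rate) can be assembled into a playable WAV file with the Python standard library. A minimal sketch, assuming 16-bit signed mono PCM (the channel count and sample width are assumptions, not confirmed by the skill's docs):

```python
import base64
import wave

def chunks_to_wav(audio_base64_pcm_chunks, sample_rate, path):
    # Decode each base64 chunk, concatenate the raw PCM, and wrap in a WAV header.
    pcm = b"".join(base64.b64decode(c) for c in audio_base64_pcm_chunks)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # assumption: mono output
        wav.setsampwidth(2)        # assumption: 16-bit signed PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)

# Example with synthetic silence in place of real model output:
chunks = [base64.b64encode(b"\x00\x00" * 1600).decode()]  # 0.1 s at 16 kHz
chunks_to_wav(chunks, 16000, "demo.wav")
```

Writing the WAV header yourself avoids an extra dependency; if you only need raw PCM for downstream processing, skip the `wave` wrapper and keep the concatenated bytes.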
Enables real-time speech synthesis for voice assistants in smart home devices or customer service bots, allowing immediate vocal responses to user queries. Low latency ensures natural, conversational interactions without noticeable delays.
Supports dynamic voice generation for live streams or video games, such as real-time commentary or character dialogue. The streaming capability allows for on-the-fly audio updates based on user inputs or game events.
Facilitates interactive learning by providing instant speech feedback in language learning apps or virtual tutors. Instruction-controlled models can adapt pronunciation or tone based on learner progress.
Powers real-time text-to-speech for visually impaired users in applications like screen readers or navigation aids. Low latency ensures timely audio feedback for enhanced usability and independence.
Integrates into interactive voice response (IVR) systems for call centers, enabling dynamic speech synthesis based on caller inputs. This reduces pre-recorded audio needs and allows for personalized responses.
Monetize the skill by offering it as a pay-per-use API for developers, charging based on the number of requests or audio minutes generated. This model scales with usage and targets businesses needing real-time TTS without infrastructure overhead.
Provide the skill as part of a subscription-based software platform for industries like customer service or education, with tiered pricing based on features or volume. This ensures recurring revenue and long-term customer engagement.
License the skill to enterprises for integration into their proprietary products, such as smart devices or internal tools, with one-time or ongoing licensing fees. This model targets large organizations seeking customized, branded solutions.
π¬ Integration Tip
Ensure compatibility by testing with the provided demo script before deployment, and use WebSocket endpoints for optimal real-time performance.