# local-piper-tts-multilang-secure

Local offline text-to-speech via Piper TTS. Self-contained setup, automatic language detection, per-call voice selection. Extensible to any language.
Install via ClawdBot CLI:
```shell
clawdbot install szafranski/local-piper-tts-multilang-secure
```

Local (offline) text-to-speech via Piper.
Purpose: generate audio files (OGG/Opus by default) from text, fully offline.
No sending is performed by the skill — sending is handled by the agent after the file is ready.
- `setup()` — installs Piper into an isolated venv, no system-wide changes
- `voice` parameter — per-call voice selection
- `downloadVoices()` — no models bundled, choose what you need
- `removeVoice()` — clean up voices you no longer want

Voices are `.onnx` model files.

Follow this sequence exactly when the user asks to use TTS for the first time in a setup context.
1. Check status:

```javascript
const s = await status();
```

2. If `s.stage` is `not-setup` or `no-piper`: ask the user for confirmation, then call `setup()`. Call `status()` again after setup completes.

3. If `s.stage` is `no-model` (Piper installed but no `.onnx` files):
3a. Offer English defaults:
Explain that two English voices are available as defaults (~65 MB each):
- `en_US-ryan-medium` — male, American
- `en_US-amy-medium` — female, American

Ask which they want, or both: "Which English voice(s) should I download? Ryan (male), Amy (female), or both?"
3b. Ask about other languages:
After the English choice, ask: "Do you need any other languages? For example German, French, Spanish, Polish, Italian, Portuguese, Russian… Just tell me and I'll check what's available."
If the user names a language, look up the available models at https://github.com/rhasspy/piper/blob/master/VOICES.md and list the options. Download whatever the user picks using the same downloadVoices() call.
3c. Download everything at once:
```javascript
const result = await downloadVoices(['en_US-ryan-medium', 'en_US-amy-medium', /* + any others */]);
// result.downloaded — stems that succeeded
// result.failed — [{stem, error}] for any that failed
```
Each voice requires internet access. Download takes ~1–2 min per voice on a typical connection.
If any downloads fail, report the entries in `result.failed` to the user before continuing.
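One minimal way to turn the `downloadVoices()` result into a user-facing summary is sketched below. The `summarizeDownloads` helper is illustrative, not part of the skill API; only `result.downloaded` and `result.failed` come from the skill.

```javascript
// Turn a downloadVoices() result into a short user-facing summary.
// Hypothetical helper — only the result object's shape comes from the skill.
function summarizeDownloads(result) {
  const lines = result.downloaded.map((stem) => `downloaded: ${stem}`);
  for (const { stem, error } of result.failed) {
    lines.push(`failed: ${stem} (${error})`);
  }
  return lines.join('\n');
}
```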
After downloading, generate a short audio sample for each downloaded voice and send it to the user.
For each voice, use a greeting in the voice's language:
- English: "Hello, I'm [name]. How can I help you today?"
- German: "Hallo, ich heiße [Name]. Wie kann ich Ihnen helfen?"
- French: "Bonjour, je m'appelle [prénom]. Comment puis-je vous aider?"
- Spanish: "Hola, me llamo [nombre]. ¿Cómo puedo ayudarte?"
- Polish: "Cześć, mam na imię [imię]. Jak mogę Ci pomóc?"
- Italian: "Ciao, mi chiamo [nome]. Come posso aiutarti?"
- Portuguese: "Olá, meu nome é [nome]. Como posso ajudar?"
- Russian: "Привет, меня зовут [имя]. Чем могу помочь?"

Replace [name] with the voice name (e.g. Ryan, Amy, Thorsten).
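Picking the right greeting can be keyed off the language prefix of the voice stem. A minimal sketch, assuming stems always begin with a two-letter language code; `greetingFor` and the `GREETINGS` table are illustrative, not part of the skill API:

```javascript
// Sample greetings keyed by two-letter language code (subset shown).
const GREETINGS = {
  en: "Hello, I'm [name]. How can I help you today?",
  de: 'Hallo, ich heiße [Name]. Wie kann ich Ihnen helfen?',
  fr: "Bonjour, je m'appelle [prénom]. Comment puis-je vous aider?",
  ru: 'Привет, меня зовут [имя]. Чем могу помочь?',
};

// Pick the greeting for a voice stem like 'de_DE-thorsten-medium'
// and substitute the voice's display name for the placeholder.
function greetingFor(stem, name) {
  const lang = stem.slice(0, 2);                  // language prefix of the model stem
  const template = GREETINGS[lang] || GREETINGS.en; // fall back to English
  return template.replace(/\[[^\]]+\]/, name);
}
```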
```javascript
const sample = await tts({ text: "Hello, I'm Ryan. How can I help you today?", voice: 'en_US-ryan-medium' });
// send sample.path to the user as a voice message
```
Send all samples, then ask: "Which voice do you prefer? Or shall I download a different one?"
After the user picks a voice, ask:
"How fast should I speak? Normal is 100%. Some options: 125% (faster), 115% (slightly faster), 100% (normal), 80% (slower) — or tell me a percentage."
Always present speed as a percentage to the user. Never mention lengthScale directly.
lengthScale is the internal duration multiplier — lower = faster. To convert: lengthScale = 1 / (speed% / 100).
Examples:

- 125% → lengthScale 0.8
- 115% → lengthScale ≈0.87
- 100% → lengthScale 1.0
- 80% → lengthScale 1.25
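The percentage-to-lengthScale conversion can be sketched as a small helper; the function name is illustrative, not part of the skill API:

```javascript
// Convert a user-facing speed percentage into Piper's lengthScale.
// lengthScale is a duration multiplier, so lower values mean faster speech.
function speedToLengthScale(speedPercent) {
  if (speedPercent <= 0) throw new RangeError('speed must be a positive percentage');
  // Round to two decimals so e.g. 115% gives 0.87 rather than 0.8695652...
  return Math.round((1 / (speedPercent / 100)) * 100) / 100;
}

// speedToLengthScale(125) → 0.8
// speedToLengthScale(80)  → 1.25
```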
Generate a short sample at the chosen speed so the user can hear the difference:
```javascript
const sample = await tts({ text: 'This is how I sound at this speed.', voice: 'chosen-voice', lengthScale: 0.8 });
// send sample.path to the user
```
Confirm with the user, then offer to save it permanently:
"Should I save this as your default speed? It'll be used automatically every session."
If the user agrees:
```javascript
await saveConfig({ lengthScale: 0.8 });
```
Once saved, tts() reads it from config.json in the skill directory automatically — no need to pass lengthScale on every call.
Once confirmed, remember both voice and lengthScale for the session. Pass them to every subsequent tts() call unless the user asks to change them.
Always call status() before the first tts() call in a session to determine what is needed.
| stage | Meaning | What to do |
|---|---|---|
| ready | Fully installed, at least one voice model present | Proceed with tts() |
| not-setup | Piper not installed | Ask user for confirmation, then call setup() |
| no-piper | Venv exists but piper binary missing | Ask user for confirmation, then call setup() |
| no-model | Piper installed but no voice model downloaded | Follow Steps 3–5 of first-run flow above |
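The table's stage-to-action mapping can be sketched as a small dispatcher. This is an illustration only: `setup()`, `status()`, and `downloadVoices()` are the skill's real entry points, while `actionForStage` is a hypothetical helper.

```javascript
// Map a status() stage to the action the agent should take next.
// The stage names come from the table above; the dispatcher is illustrative.
function actionForStage(stage) {
  switch (stage) {
    case 'ready':     return 'proceed';          // at least one voice installed
    case 'not-setup': return 'confirm-setup';    // ask user, then setup()
    case 'no-piper':  return 'confirm-setup';    // venv exists but binary missing
    case 'no-model':  return 'download-voices';  // follow first-run steps 3–5
    default:          throw new Error(`unknown stage: ${stage}`);
  }
}
```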
IMPORTANT: Always ask the user for confirmation before calling setup().
It installs the piper-tts package from PyPI into a venv inside the skill directory.
tts() parameters: `text`, optional `format` ("ogg" or "wav"), optional `voice` (model stem), optional `lengthScale` (speech speed, default 1.0). Default format is `ogg`.

To list installed voices, call `listVoices()` — returns stems of all installed `.onnx` models.
Never assume a fixed list; it varies per user and installation.
Auto-detection (no voice param):
The script detects language from the text using character and script analysis:
Auto-detection is best-effort. For reliable results with a specific language, always pass the voice parameter explicitly.
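The kind of character-and-script heuristic described above could look like the following sketch. This is an illustration of the approach, not the skill's actual detector; the rules and language set are assumptions.

```javascript
// Best-effort language guess from the script of the input text.
// Illustrative only — the skill's real detector may use different rules.
function guessLanguage(text) {
  if (/[\u0400-\u04FF]/.test(text))    return 'ru'; // Cyrillic block
  if (/[ąćęłńśźż]/i.test(text))        return 'pl'; // Polish diacritics
  if (/[äöüß]/.test(text))             return 'de'; // German umlauts / eszett
  if (/[àâçéèêëîïôùûœ]/i.test(text))   return 'fr'; // French accents
  if (/[áéíñóú¿¡]/i.test(text))        return 'es'; // Spanish accents / punctuation
  return 'en';                                       // fall back to English
}
```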
Explicit override: set PIPER_VOICE_MODEL env var to a full .onnx path (overrides everything).
When the user requests a specific voice or language:
- Call `listVoices()` to see what is installed
- Pass `voice` to `tts()`, e.g. `voice: "en_US-amy-medium"`
- If the voice is not installed, download it with `downloadVoices([stem])`

To switch back to auto-detect, omit the voice parameter.
The user may say things like "I don't like this voice, use a female one" or
"Download a German voice". When this happens:
- Look up the model stem on the VOICES.md page (e.g. `de_DE-thorsten-medium`) and call `downloadVoices([stem])`
- Verify with `listVoices()` — the new voice is immediately usable

The user may say "remove that voice" or "I don't need the German voice anymore". When this happens:
- Call `listVoices()` to confirm which voices are installed
- Call `removeVoice(stem)` — e.g. `removeVoice('de_DE-thorsten-medium')`
- Returns `{ removed, filesDeleted }` on success

Never remove the last remaining voice without warning the user that TTS will stop working.
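The last-voice guard can be sketched as a pure check over the `listVoices()` result. The `canRemoveVoice` helper is illustrative, not part of the skill API:

```javascript
// Decide whether a voice can be removed safely (illustrative helper).
// Blocks removal of the last voice, since TTS would stop working.
function canRemoveVoice(installed, stem) {
  if (!installed.includes(stem)) return { ok: false, reason: 'not-installed' };
  if (installed.length === 1)    return { ok: false, reason: 'last-voice' };
  return { ok: true };
}
```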
The user may say things like "speak faster", "too slow", or "speed it up". When this happens:
- Convert the requested speed: `lengthScale = 1 / (speed% / 100)`
- Apply it: `await tts({ text: '...', voice: 'current-voice', lengthScale: 0.8 })`
- Offer to persist it with `saveConfig({ lengthScale: 0.8 })`
- Remember the `lengthScale` for all subsequent `tts()` calls in the session

Output files are written to:

- `OPENCLAW_WORKSPACE/tts/` if the `OPENCLAW_WORKSPACE` env var is set
- `~/.openclaw/workspace/tts/` otherwise

Requirements:

- `python3` (3.8+) — required for `setup()` to create the venv
- `ffmpeg` — for WAV → OGG/Opus conversion
- `espeak-ng` — system library used by Piper internally; `setup()` checks for it and warns if missing. Install: `sudo apt install espeak-ng` (Debian/Ubuntu), `sudo dnf install espeak-ng` (Fedora), `brew install espeak` (macOS)
Each voice is a `.onnx` + `.onnx.json` model pair in the skill directory.

To uninstall:

```shell
rm -rf ~/.openclaw/skills/local-piper-tts-multilang-secure
```
This removes everything: skill code, venv, and all voice models.
Generated Feb 27, 2026
Teachers and e-learning platforms can generate audio versions of study materials in multiple languages, enhancing accessibility for students with visual impairments or different learning preferences. The offline capability ensures privacy and reliability in schools with restricted internet access.
Businesses can integrate this skill into call centers or chatbots to provide voice responses in local languages without relying on cloud APIs, reducing costs and latency. It supports per-call voice selection for personalized interactions based on customer demographics.
Hospitals and clinics can use it to create audio instructions or reminders for patients in their native languages, improving comprehension and adherence to treatment plans. The offline operation ensures data security and compliance with health regulations like HIPAA.
Content creators and game developers can generate voiceovers for videos, podcasts, or interactive media in various languages, allowing for rapid prototyping and testing without external dependencies. The extensible model system supports niche languages for broader audience reach.
Manufacturers can embed this skill into smart speakers or home automation systems to provide offline voice feedback in multiple languages, enhancing user experience in areas with poor internet connectivity. The self-contained setup minimizes installation overhead on resource-constrained devices.
Offer a premium subscription for access to a curated library of high-quality or exclusive voice models, with regular updates and support. This generates recurring revenue while allowing users to download only the voices they need, reducing initial costs.
Sell enterprise licenses to large organizations for integrating the skill into their proprietary systems, with custom support, training, and SLA guarantees. This targets industries like healthcare and education where reliability and compliance are critical.
Provide a free version with basic voices and limited languages, while charging for advanced features such as faster processing, additional language packs, or priority downloads. This attracts individual users and small businesses before upselling to premium tiers.
💬 Integration Tip
Ensure ffmpeg and python3 are pre-installed on the target system, and guide users through the first-run flow step-by-step to avoid setup errors.