# local-stt

Local STT with selectable backends - Parakeet (best accuracy) or Whisper (fastest, multilingual).

Install via ClawdBot CLI:

```
clawdbot install araa47/local-stt
```

Unified local speech-to-text using ONNX Runtime with int8 quantization. Choose your backend:
```
# Default: Parakeet v2 (best English accuracy)
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg

# Explicit backend selection
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg -b whisper
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg -b parakeet -m v3

# Quiet mode (suppress progress)
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg --quiet
```
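For scripted use, the CLI calls above can be wrapped in a small helper. A minimal sketch that only assembles the argument list (the flag names come from the options documented on this page; `build_stt_command` is a hypothetical helper, not part of the skill):

```python
def build_stt_command(audio_path, backend="parakeet", model=None, quiet=False):
    """Assemble the local-stt argument list using the documented flags."""
    cmd = [
        "~/.openclaw/skills/local-stt/scripts/local-stt.py",
        audio_path,
        "-b", backend,
    ]
    if model is not None:
        cmd += ["-m", model]          # e.g. "v3" for Parakeet multilingual
    if quiet:
        cmd.append("--quiet")         # suppress progress output
    return cmd

# The resulting list can be handed to subprocess.run(cmd, capture_output=True, text=True).
print(build_stt_command("audio.ogg", backend="parakeet", model="v3"))
```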
Options:

- `-b/--backend`: parakeet (default), whisper
- `-m/--model`: Model variant (see below)
- `--no-int8`: Disable int8 quantization
- `-q/--quiet`: Suppress progress
- `--room-id`: Matrix room ID for direct message

Parakeet models:

| Model | Description |
|-------|-------------|
| v2 (default) | English only, best accuracy |
| v3 | Multilingual |
Whisper models:

| Model | Description |
|-------|-------------|
| tiny | Fastest, lower accuracy |
| base (default) | Good balance |
| small | Better accuracy |
| large-v3-turbo | Best quality, slower |
Benchmarks:

| Backend/Model | Time | RTF (real-time factor) | Notes |
|---------------|------|-----|-------|
| Whisper Base int8 | 0.43s | 0.018x | Fastest |
| Parakeet v2 int8 | 0.60s | 0.025x | Best accuracy |
| Parakeet v3 int8 | 0.63s | 0.026x | Multilingual |
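RTF is processing time divided by audio duration, so values well below 1.0x mean much faster than real time. As a worked check of the table (the benchmark clip's length is not stated above; ~24 seconds is an assumption that makes the numbers consistent):

```python
def rtf(processing_seconds, audio_seconds):
    """Real-time factor: processing time / audio duration (lower = faster)."""
    return processing_seconds / audio_seconds

# Assumed ~24 s benchmark clip: 0.43 s of processing gives RTF ~ 0.018,
# matching the Whisper Base int8 row.
print(round(rtf(0.43, 24.0), 3))
```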
OpenClaw configuration:

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/skills/local-stt/scripts/local-stt.py",
            "args": ["--quiet", "{{MediaPath}}"],
            "timeoutSeconds": 30
          }
        ]
      }
    }
  }
}
```
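Before deploying, the config fragment can be sanity-checked programmatically. A sketch assuming the JSON is embedded as shown above (the key names mirror that snippet):

```python
import json

config_text = """
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/skills/local-stt/scripts/local-stt.py",
            "args": ["--quiet", "{{MediaPath}}"],
            "timeoutSeconds": 30
          }
        ]
      }
    }
  }
}
"""

config = json.loads(config_text)           # fails loudly on malformed JSON
audio = config["tools"]["media"]["audio"]
model = audio["models"][0]
assert audio["enabled"] is True
assert model["type"] == "cli"
assert "{{MediaPath}}" in model["args"]    # placeholder OpenClaw substitutes
print("config OK:", model["command"])
```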
Generated Mar 1, 2026
Use cases:

Healthcare professionals use the skill to transcribe patient consultations and medical notes locally, ensuring privacy and accuracy with Parakeet's high English precision. It integrates into electronic health record systems via CLI, streamlining documentation workflows.
Call centers employ the skill with Whisper backend to transcribe support calls in multiple languages quickly, enabling real-time analysis and archiving. The fast inference supports high-volume environments while maintaining data security on-premises.
Educational institutions use the skill to generate accurate captions for lectures and presentations, leveraging Parakeet for clear English or Whisper for multilingual content. It aids accessibility and content creation without relying on cloud services.
Law firms and courts utilize the skill to transcribe legal proceedings and depositions locally, ensuring confidentiality and compliance with data regulations. Parakeet's accuracy captures nuanced legal terminology effectively.
Media companies integrate the skill into video editing pipelines to create subtitles for films and broadcasts, using Whisper for speed with multilingual support or Parakeet for high-quality English transcripts.
Monetization ideas:

Offer the skill as a cloud-based or on-premise subscription service with tiered pricing based on usage volume and backend features. Revenue is generated through monthly fees from businesses needing reliable, private transcription.
Sell perpetual licenses to large organizations for integration into their internal systems, with support and customization options. Revenue comes from one-time license fees and ongoing maintenance contracts.
Provide a free basic version with limited backends or features, and charge for advanced options like multilingual models, higher accuracy, or priority support. Revenue is generated from upgrades and add-ons.
💬 Integration Tip
Ensure ffmpeg is installed and test with sample audio files first to verify backend performance before full deployment.
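A preflight check along these lines can confirm ffmpeg is on PATH before the first run (a sketch; the skill's own dependency handling may differ):

```python
import shutil

def ffmpeg_available():
    """True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found - install it before running local-stt")
```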
Related skills:

- Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
- Local speech-to-text with the Whisper CLI (no API key).
- ElevenLabs text-to-speech with mac-style `say` UX.
- Text-to-speech via the node-edge-tts npm package: multiple voices and languages, speed and pitch control, subtitle generation.
- End-to-end encrypted agent-to-agent private messaging via Moltbook dead drops.
- Text-to-speech via OpenAI Audio Speech API.