speech-to-text
Transcribe audio to text with Whisper models via inference.sh CLI. Models: Fast Whisper Large V3, Whisper V3 Large. Capabilities: transcription, translation,...
Install via ClawdBot CLI:
clawdbot install okaris/speech-to-text
Transcribe audio to text via inference.sh CLI.
curl -fsSL https://cli.inference.sh | sh && infsh login
infsh app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://audio.mp3"}'
Install note: The install script only detects your OS/architecture, downloads the matching binary from dist.inference.sh, and verifies its SHA-256 checksum. No elevated permissions or background processes. Manual install & verification available.
| Model | App ID | Best For |
|-------|--------|----------|
| Fast Whisper Large V3 | infsh/fast-whisper-large-v3 | Fast transcription |
| Whisper V3 Large | infsh/whisper-v3-large | Highest accuracy |
infsh app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://meeting.mp3"}'
infsh app sample infsh/fast-whisper-large-v3 --save input.json
# {
# "audio_url": "https://podcast.mp3",
# "timestamps": true
# }
infsh app run infsh/fast-whisper-large-v3 --input input.json
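If you would rather generate the input file than edit the saved sample by hand, a small shell helper can write it. This is a minimal sketch: `write_input` is a hypothetical function, not part of the infsh CLI, and it assumes the only fields you need are the `audio_url` and `timestamps` keys shown in the sample above.

```shell
#!/bin/sh
# write_input FILE AUDIO_URL [TIMESTAMPS]
# Hypothetical helper (not an infsh command): writes a JSON input file
# containing the two fields shown in the sample above.
# The URL is inserted verbatim, so it must not contain double quotes
# or backslashes.
write_input() {
    file=$1
    audio_url=$2
    timestamps=${3:-true}   # must be the literal word true or false

    case $timestamps in
        true|false) ;;
        *) echo "timestamps must be true or false" >&2; return 1 ;;
    esac

    printf '{\n  "audio_url": "%s",\n  "timestamps": %s\n}\n' \
        "$audio_url" "$timestamps" > "$file"
}
```

Calling `write_input input.json "https://podcast.mp3"` produces a file that can be passed to the run command above with `--input input.json`.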
infsh app run infsh/whisper-v3-large --input '{
"audio_url": "https://french-audio.mp3",
"task": "translate"
}'
# Extract audio from video first
infsh app run infsh/video-audio-extractor --input '{"video_url": "https://video.mp4"}' > audio.json
# Transcribe the extracted audio (replace <audio-url> with the audio URL saved in audio.json)
infsh app run infsh/fast-whisper-large-v3 --input '{"audio_url": "<audio-url>"}'
# 1. Transcribe video audio
infsh app run infsh/fast-whisper-large-v3 --input '{
"audio_url": "https://video.mp4",
"timestamps": true
}' > transcript.json
# 2. Use transcript for captions
infsh app run infsh/caption-videos --input '{
"video_url": "https://video.mp4",
"captions": "<transcript-from-step-1>"
}'
Whisper supports 99+ languages including:
English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, and many more.
Returns JSON with:
text: Full transcription
segments: Timestamped segments (if requested)
language: Detected language
# Full platform skill (all 150+ apps)
npx skills add inference-sh/skills@inference-sh
# Text-to-speech (reverse direction)
npx skills add inference-sh/skills@text-to-speech
# Video generation (add captions)
npx skills add inference-sh/skills@ai-video-generation
# AI avatars (lipsync with transcripts)
npx skills add inference-sh/skills@ai-avatar-video
Browse all audio apps: infsh app list --category audio
Generated Mar 1, 2026
Transcribe recorded meetings for businesses to create searchable archives and minutes. This enables efficient review and compliance documentation, especially useful for remote teams and legal proceedings.
Generate accurate transcripts for podcast episodes to improve accessibility and SEO. This helps content creators reach wider audiences, including those with hearing impairments, and enhances discoverability through text-based search.
Create synchronized subtitles for educational videos to support diverse learners and language accessibility. This aids in comprehension for non-native speakers and complies with accessibility standards in online courses.
Transcribe doctors' voice notes into structured text for electronic health records. This streamlines documentation, reduces manual entry errors, and improves patient record management in clinical settings.
Transcribe qualitative interviews for academic or market research projects. This facilitates data analysis, coding, and reporting by converting audio recordings into text for detailed review and insights extraction.
Offer a monthly subscription for API access to transcription services, with tiered pricing based on usage volume. This provides predictable revenue and caters to businesses needing regular transcription for meetings or content creation.
Provide free limited transcription with paid upgrades for higher accuracy, faster processing, or additional features like translation. This attracts individual users and small teams, converting them to paying customers as needs grow.
License the transcription technology to enterprises for integration into their own platforms, such as video conferencing tools or content management systems. This generates high-value contracts through customization and support services.
💬 Integration Tip
Use the provided CLI examples to test basic transcription first, then automate workflows by scripting the commands for batch processing or integrating with webhooks.
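As one way to script the commands for batch processing, the loop below runs the transcription command once per URL listed in a text file. It is a sketch under stated assumptions: `transcribe_batch` and the `INFSH_BIN` override are hypothetical names introduced here, the URL file format (one URL per line, `#` for comments) is an arbitrary choice, and the only infsh invocation used is the `app run ... --input` form shown in the examples above.

```shell
#!/bin/sh
# transcribe_batch URL_FILE OUT_DIR
# Hypothetical helper: transcribes every URL in URL_FILE (one per line)
# and saves each result as OUT_DIR/transcript-N.json.
# INFSH_BIN is an assumed override so the loop can be dry-run,
# for example with INFSH_BIN=echo. It defaults to infsh.
transcribe_batch() {
    url_file=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1

    n=0
    while IFS= read -r url || [ -n "$url" ]; do
        # Skip blank lines and comment lines.
        case $url in ''|'#'*) continue ;; esac
        n=$((n + 1))

        # URLs are inserted verbatim, so they must not contain double quotes.
        payload=$(printf '{"audio_url": "%s"}' "$url")

        # </dev/null keeps the command from consuming the URL list on stdin.
        "${INFSH_BIN:-infsh}" app run infsh/fast-whisper-large-v3 \
            --input "$payload" > "$out_dir/transcript-$n.json" </dev/null \
            || echo "failed: $url" >&2
    done < "$url_file"

    # Print how many URLs were processed.
    echo "$n"
}
```

Running `transcribe_batch urls.txt transcripts` writes one JSON file per URL. Failures are reported on stderr without stopping the loop, so a single bad URL does not abort the whole batch.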
Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
Local speech-to-text with the Whisper CLI (no API key).
ElevenLabs text-to-speech with mac-style say UX.
Text-to-speech conversion using node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) User requests audio/voice output with the "tts" trigger or keyword. (2) Content needs to be spoken rather than read (multitasking, accessibility, driving, cooking). (3) User wants a specific voice, speed, pitch, or format for TTS output.
End-to-end encrypted agent-to-agent private messaging via Moltbook dead drops. Use when agents need to communicate privately, exchange secrets, or coordinate without human visibility.
Text-to-speech via OpenAI Audio Speech API.