faster-whisperLocal speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. SRT...
Install via ClawdBot CLI:
clawdbot install ThePlasmak/faster-whisperLocal speech-to-text using faster-whisper ā a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).
Use this skill when you need to:
--diarize)--rss fetches and transcribes episodes--translate--language-map assigns a different language per file--multilingual for mixed-language audio--initial-prompt for jargon-heavy content or any other terms to look out for--normalize and --denoise before transcription--stream shows segments as they're transcribed--clip-timestamps to transcribe specific sections--search "term" finds all timestamps where a word/phrase appears--detect-chapters finds section breaks from silence gaps--export-speakers DIR saves each speaker's turns as separate WAV files--format csv produces a properly-quoted CSV with timestampsTrigger phrases:
"transcribe this audio", "convert speech to text", "what did they say", "make a transcript",
"audio to text", "subtitle this video", "who's speaking", "translate this audio", "translate to English",
"find where X is mentioned", "search transcript for", "when did they say", "at what timestamp",
"add chapters", "detect chapters", "find breaks in the audio", "table of contents for this recording",
"TTML subtitles", "DFXP subtitles", "broadcast format subtitles", "Netflix format",
"ASS subtitles", "aegisub format", "advanced substation alpha", "mpv subtitles",
"LRC subtitles", "timed lyrics", "karaoke subtitles", "music player lyrics",
"HTML transcript", "confidence-colored transcript", "color-coded transcript",
"separate audio per speaker", "export speaker audio", "split by speaker",
"transcript as CSV", "spreadsheet output", "transcribe podcast", "podcast RSS feed",
"different languages in batch", "per-file language",
"transcribe in multiple formats", "srt and txt at the same time", "output both srt and text",
"remove filler words", "clean up ums and uhs", "strip hesitation sounds", "remove you know and I mean",
"transcribe left channel", "transcribe right channel", "stereo channel", "left track only",
"wrap subtitle lines", "character limit per line", "max chars per subtitle",
"detect paragraphs", "paragraph breaks", "group into paragraphs", "add paragraph spacing"
ā ļø Agent guidance ā keep invocations minimal:
_CORE RULE: default command (./scripts/transcribe audio.mp3) is the fastest path ā add flags only when the user explicitly asks for that capability._
Transcription:
--diarize if the user asks "who said what" / "identify speakers" / "label speakers"--format srt/vtt/ass/lrc/ttml if the user asks for subtitles/captions in that format--format csv if the user asks for CSV or spreadsheet output--word-timestamps if the user needs word-level timing--initial-prompt if there's domain-specific jargon to prime--translate if the user wants non-English audio translated to English--normalize/--denoise if the user mentions bad audio quality or noise--stream if the user wants live/progressive output for long files--clip-timestamps if the user wants a specific time range--temperature 0.0 if the model is hallucinating on music/silence--vad-threshold if VAD is aggressively cutting speech or including noise--min-speakers/--max-speakers when you know the speaker count--hf-token if the token is not cached at ~/.cache/huggingface/token--max-words-per-line for subtitle readability on long segments--filter-hallucinations if the transcript contains obvious artifacts (music markers, duplicates)--merge-sentences if the user asks for sentence-level subtitle cues--clean-filler if the user asks to remove filler words (um, uh, you know, I mean, hesitation sounds)--channel left|right if the user mentions stereo tracks, dual-channel recordings, or asks for a specific channel--max-chars-per-line N when the user specifies a character limit per subtitle line (e.g., "Netflix format", "42 chars per line"); takes priority over --max-words-per-line--detect-paragraphs if the user asks for paragraph breaks or structured text output; --paragraph-gap (default 3.0s) only if they want a custom gap--speaker-names "Alice,Bob" when the user provides real names to replace SPEAKER_1/2 ā always requires --diarize--hotwords WORDS when the user names specific rare terms not well served by --initial-prompt; prefer --initial-prompt for general domain jargon--prefix TEXT when the user knows the exact words the audio starts with--detect-language-only when the user only wants to identify the language, not transcribe--stats-file PATH if the user asks for performance stats, RTF, or benchmark info--parallel N for large CPU batch jobs; GPU handles one file efficiently on its own ā don't add for single files or small batches--retries N for unreliable inputs (URLs, network files) where transient failures are expected--burn-in OUTPUT only when user explicitly asks to embed/burn subtitles into the video; requires ffmpeg and a video file input--keep-temp when the user may re-process the same URL to avoid re-downloading--output-template when user specifies a custom naming pattern in batch mode--format srt,text): only when user explicitly wants multiple formats in one pass; always pair with -o --diarize adds ~20-30s on top of thatSearch:
--search "term" when the user asks to find/locate/search for a specific word or phrase in audio--search replaces the normal transcript output ā it prints only matching segments with timestamps--search-fuzzy only when the user mentions approximate/partial matching or typos-o results.txtChapter detection:
--detect-chapters when the user asks for chapters, sections, a table of contents, or "where does the topic change"--chapter-gap 8 (8-second silence = new chapter) works for most podcasts/lectures; tune down for dense content--chapter-format youtube (default) outputs YouTube-ready timestamps; use json for programmatic use--chapters-file PATH when combining chapters with a transcript output ā avoids mixing chapter markers into the transcript text-o /dev/null and use --chapters-file--chapters-file takes a single path ā in batch mode, each file's chapters overwrite the previous. For batch chapter detection, omit --chapters-file (chapters print to stdout under === CHAPTERS (N) ===) or use a separate run per fileSpeaker audio export:
--export-speakers DIR when the user explicitly asks to save each speaker's audio separately--diarize ā it silently skips if no speaker labels are presentSPEAKER_1.wav, SPEAKER_2.wav, etc. (or real names if --speaker-names is set)Language map:
--language-map in batch mode when the user has confirmed different languages across files"interview.mp3=en,lecture.mp3=fr" ā fnmatch globs on filename@/path/to/map.json where the file is {"pattern": "lang_code"}RSS / Podcast:
--rss URL when the user provides a podcast RSS feed URL--rss-latest 0 for all; --skip-existing to resume safely-o with --rss ā without it, all episode transcripts print to stdout concatenated, which is hard to use; each episode gets its own file when -o is setOutput format for agent relay:
--search) ā print directly to user; output is human-readable--chapters-file, chapters appear in stdout under === CHAPTERS (N) === header after the transcript; with --format json, chapters are also embedded in the JSON under "chapters" key-o file; tell the user the output path, never paste raw subtitle content-o file; tell the user the output path, don't paste raw XML/CSV/HTML--format srt,text) ā requires -o ; each format goes to a separate file; tell user all paths written--stats-file) ā summarise key fields (duration, processing time, RTF) for the user rather than pasting raw JSON--detect-language-only) ā print the result directly; it's a single lineWhen NOT to use:
faster-whisper vs whisperx:
This skill covers everything whisperx does ā diarization (--diarize), word-level timestamps (--word-timestamps), SRT/VTT subtitles ā so whisperx is not needed. Use whisperx only if you specifically need its pyannote pipeline or batch-GPU features not covered here.
| Task | Command | Notes |
| ------------------------------ | -------------------------------------------------------------------------------------- | --------------------------------------------------- |
| Basic transcription | ./scripts/transcribe audio.mp3 | Batched inference, VAD on, distil-large-v3.5 |
| SRT subtitles | ./scripts/transcribe audio.mp3 --format srt -o subs.srt | Word timestamps auto-enabled |
| VTT subtitles | ./scripts/transcribe audio.mp3 --format vtt -o subs.vtt | WebVTT format |
| Word timestamps | ./scripts/transcribe audio.mp3 --word-timestamps --format srt | wav2vec2 aligned (~10ms) |
| Speaker diarization | ./scripts/transcribe audio.mp3 --diarize | Requires pyannote.audio |
| Translate ā English | ./scripts/transcribe audio.mp3 --translate | Any language ā English |
| Stream output | ./scripts/transcribe audio.mp3 --stream | Live segments as transcribed |
| Clip time range | ./scripts/transcribe audio.mp3 --clip-timestamps "30,60" | Only 30sā60s |
| Denoise + normalize | ./scripts/transcribe audio.mp3 --denoise --normalize | Clean up noisy audio first |
| Reduce hallucination | ./scripts/transcribe audio.mp3 --hallucination-silence-threshold 1.0 | Skip hallucinated silence |
| YouTube/URL | ./scripts/transcribe https://youtube.com/watch?v=... | Auto-downloads via yt-dlp |
| Batch process | ./scripts/transcribe *.mp3 -o ./transcripts/ | Output to directory |
| Batch with skip | ./scripts/transcribe *.mp3 --skip-existing -o ./out/ | Resume interrupted batches |
| Domain terms | ./scripts/transcribe audio.mp3 --initial-prompt 'Kubernetes gRPC' | Boost rare terminology |
| Hotwords boost | ./scripts/transcribe audio.mp3 --hotwords 'JIRA Kubernetes' | Bias decoder toward specific words |
| Prefix conditioning | ./scripts/transcribe audio.mp3 --prefix 'Good morning,' | Seed the first segment with known opening words |
| Pin model version | ./scripts/transcribe audio.mp3 --revision v1.2.0 | Reproducible transcription with a pinned revision |
| Debug library logs | ./scripts/transcribe audio.mp3 --log-level debug | Show faster_whisper internal logs |
| Turbo model | ./scripts/transcribe audio.mp3 -m turbo | Alias for large-v3-turbo |
| Faster English | ./scripts/transcribe audio.mp3 --model distil-medium.en -l en | English-only, 6.8x faster |
| Maximum accuracy | ./scripts/transcribe audio.mp3 --model large-v3 --beam-size 10 | Full model |
| JSON output | ./scripts/transcribe audio.mp3 --format json -o out.json | Programmatic access with stats |
| Filter noise | ./scripts/transcribe audio.mp3 --min-confidence 0.6 | Drop low-confidence segments |
| Hybrid quantization | ./scripts/transcribe audio.mp3 --compute-type int8_float16 | Save VRAM, minimal quality loss |
| Reduce batch size | ./scripts/transcribe audio.mp3 --batch-size 4 | If OOM on GPU |
| TSV output | ./scripts/transcribe audio.mp3 --format tsv -o out.tsv | OpenAI Whisperācompatible TSV |
| Fix hallucinations | ./scripts/transcribe audio.mp3 --temperature 0.0 --no-speech-threshold 0.8 | Lock temperature + skip silence |
| Tune VAD sensitivity | ./scripts/transcribe audio.mp3 --vad-threshold 0.6 --min-silence-duration 500 | Tighter speech detection |
| Known speaker count | ./scripts/transcribe meeting.wav --diarize --min-speakers 2 --max-speakers 3 | Constrain diarization |
| Subtitle word wrapping | ./scripts/transcribe audio.mp3 --format srt --word-timestamps --max-words-per-line 8 | Split long cues |
| Private/gated model | ./scripts/transcribe audio.mp3 --hf-token hf_xxx | Pass token directly |
| Show version | ./scripts/transcribe --version | Print faster-whisper version |
| Upgrade in-place | ./setup.sh --update | Upgrade without full reinstall |
| System check | ./setup.sh --check | Verify GPU, Python, ffmpeg, venv, yt-dlp, pyannote |
| Detect language only | ./scripts/transcribe audio.mp3 --detect-language-only | Fast language ID, no transcription |
| Detect language JSON | ./scripts/transcribe audio.mp3 --detect-language-only --format json | Machine-readable language detection |
| LRC subtitles | ./scripts/transcribe audio.mp3 --format lrc -o lyrics.lrc | Timed lyrics format for music players |
| ASS subtitles | ./scripts/transcribe audio.mp3 --format ass -o subtitles.ass | Advanced SubStation Alpha (Aegisub, mpv, VLC) |
| Merge sentences | ./scripts/transcribe audio.mp3 --format srt --merge-sentences | Join fragments into sentence chunks |
| Stats sidecar | ./scripts/transcribe audio.mp3 --stats-file stats.json | Write perf stats JSON after transcription |
| Batch stats | ./scripts/transcribe *.mp3 --stats-file ./stats/ | One stats file per input in dir |
| Template naming | ./scripts/transcribe audio.mp3 -o ./out/ --output-template "{stem}_{lang}.{ext}" | Custom batch output filenames |
| Stdin input | ffmpeg -i input.mp4 -f wav - \| ./scripts/transcribe - | Pipe audio directly from stdin |
| Custom model dir | ./scripts/transcribe audio.mp3 --model-dir ~/my-models | Custom HuggingFace cache dir |
| Local model | ./scripts/transcribe audio.mp3 -m ./my-model-ct2 | CTranslate2 model dir |
| HTML transcript | ./scripts/transcribe audio.mp3 --format html -o out.html | Confidence-colored |
| Burn subtitles | ./scripts/transcribe video.mp4 --burn-in output.mp4 | Requires ffmpeg + video input |
| Name speakers | ./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob" | Replaces SPEAKER_1/2 |
| Filter hallucinations | ./scripts/transcribe audio.mp3 --filter-hallucinations | Removes artifacts |
| Keep temp files | ./scripts/transcribe https://... --keep-temp | For URL re-processing |
| Parallel batch | ./scripts/transcribe *.mp3 --parallel 4 -o ./out/ | CPU multi-file |
| RTX 3070 recommended | ./scripts/transcribe audio.mp3 --compute-type int8_float16 | Saves ~1GB VRAM, minimal quality loss |
| CPU thread count | ./scripts/transcribe audio.mp3 --threads 8 | Force CPU thread count (default: auto) |
| Podcast RSS (latest 5) | ./scripts/transcribe --rss https://feeds.example.com/podcast.xml | Downloads & transcribes newest 5 episodes |
| Podcast RSS (all episodes) | ./scripts/transcribe --rss https://... --rss-latest 0 -o ./episodes/ | All episodes, one file each |
| Podcast + SRT subtitles | ./scripts/transcribe --rss https://... --format srt -o ./subs/ | Subtitle all episodes |
| Retry on failure | ./scripts/transcribe *.mp3 --retries 3 -o ./out/ | Retry up to 3Ć with backoff on error |
| CSV output | ./scripts/transcribe audio.mp3 --format csv -o out.csv | Spreadsheet-ready with header row; properly quoted |
| CSV with speakers | ./scripts/transcribe audio.mp3 --diarize --format csv -o out.csv | Adds speaker column |
| Language map (inline) | ./scripts/transcribe .mp3 --language-map "interview.mp3=en,lecture.wav=fr" | Per-file language in batch |
| Language map (JSON) | ./scripts/transcribe *.mp3 --language-map @langs.json | JSON file: {"pattern": "lang"} |
| Batch with ETA | ./scripts/transcribe *.mp3 -o ./out/ | Automatic ETA shown for each file in batch |
| TTML subtitles | ./scripts/transcribe audio.mp3 --format ttml -o subtitles.ttml | Broadcast-standard DFXP/TTML (Netflix, BBC, Amazon) |
| TTML with speaker labels | ./scripts/transcribe audio.mp3 --diarize --format ttml -o subtitles.ttml | Speaker-labeled TTML |
| Search transcript | ./scripts/transcribe audio.mp3 --search "keyword" | Find timestamps where keyword appears |
| Search to file | ./scripts/transcribe audio.mp3 --search "keyword" -o results.txt | Save search results |
| Fuzzy search | ./scripts/transcribe audio.mp3 --search "aproximate" --search-fuzzy | Approximate/partial matching |
| Detect chapters | ./scripts/transcribe audio.mp3 --detect-chapters | Auto-detect chapters from silence gaps |
| Chapter gap tuning | ./scripts/transcribe audio.mp3 --detect-chapters --chapter-gap 5 | Chapters on gaps ā„5s (default: 8s) |
| Chapters to file | ./scripts/transcribe audio.mp3 --detect-chapters --chapters-file ch.txt | Save YouTube-format chapter list |
| Chapters JSON | ./scripts/transcribe audio.mp3 --detect-chapters --chapter-format json | Machine-readable chapter list |
| Export speaker audio | ./scripts/transcribe audio.mp3 --diarize --export-speakers ./speakers/ | Save each speaker's audio to separate WAV files |
| Multi-format output | ./scripts/transcribe audio.mp3 --format srt,text -o ./out/ | Write SRT + TXT in one pass |
| Remove filler words | ./scripts/transcribe audio.mp3 --clean-filler | Strip um/uh/er/ah/hmm and discourse markers |
| Left channel only | ./scripts/transcribe audio.mp3 --channel left | Extract left stereo channel before transcribing |
| Right channel only | ./scripts/transcribe audio.mp3 --channel right | Extract right stereo channel |
| Max chars per line | ./scripts/transcribe audio.mp3 --format srt --max-chars-per-line 42 | Character-based subtitle wrapping |
| Detect paragraphs | ./scripts/transcribe audio.mp3 --detect-paragraphs | Insert paragraph breaks in text output |
| Paragraph gap tuning | ./scripts/transcribe audio.mp3 --detect-paragraphs --paragraph-gap 5.0 | Tune gap threshold (default 3.0s) |
Choose the right model for your needs:
digraph model_selection {
rankdir=LR;
node [shape=box, style=rounded];
start [label="Start", shape=doublecircle];
need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
multilingual [label="Multilingual\ncontent?", shape=diamond];
resource_constrained [label="Resource\nconstraints?", shape=diamond];
large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
distil_large [label="distil-large-v3.5\n(default)", style="rounded,filled", fillcolor=lightgreen];
distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];
start -> need_accuracy;
need_accuracy -> large_v3 [label="yes"];
need_accuracy -> multilingual [label="no"];
multilingual -> large_turbo [label="yes"];
multilingual -> resource_constrained [label="no (English)"];
resource_constrained -> distil_small [label="mobile/edge"];
resource_constrained -> distil_medium [label="some limits"];
resource_constrained -> distil_large [label="no"];
}
| Model | Size | Speed | Accuracy | Use Case |
| ---------------------- | ----- | --------- | --------- | ---------------------------------- |
| tiny / tiny.en | 39M | Fastest | Basic | Quick drafts |
| base / base.en | 74M | Very fast | Good | General use |
| small / small.en | 244M | Fast | Better | Most tasks |
| medium / medium.en | 769M | Moderate | High | Quality transcription |
| large-v1/v2/v3 | 1.5GB | Slower | Best | Maximum accuracy |
| large-v3-turbo | 809M | Fast | Excellent | High accuracy (slower than distil) |
| Model | Size | Speed vs Standard | Accuracy | Use Case |
| ----------------------- | ---- | ----------------- | --------- | ---------------------------------- |
| distil-large-v3.5 | 756M | ~6.3x faster | 7.08% WER | Default, best balance |
| distil-large-v3 | 756M | ~6.3x faster | 7.53% WER | Previous default |
| distil-large-v2 | 756M | ~5.8x faster | 10.1% WER | Fallback |
| distil-medium.en | 394M | ~6.8x faster | 11.1% WER | English-only, resource-constrained |
| distil-small.en | 166M | ~5.6x faster | 12.1% WER | Mobile/edge devices |
.en models are English-only and slightly faster/better for English content.
Note for distil models: HuggingFace recommends disablingcondition_on_previous_textfor all distil models to prevent repetition loops. The script auto-applies--no-condition-on-previous-textwhenever adistil-*model is detected. Pass--condition-on-previous-textto override if needed.
WhisperModel accepts local CTranslate2 model directories and HuggingFace repo names ā no code changes needed.
./scripts/transcribe audio.mp3 --model /path/to/my-model-ct2
pip install ctranslate2
ct2-transformers-converter \
--model openai/whisper-large-v3 \
--output_dir whisper-large-v3-ct2 \
--copy_files tokenizer.json preprocessor_config.json \
--quantization float16
./scripts/transcribe audio.mp3 --model ./whisper-large-v3-ct2
./scripts/transcribe audio.mp3 --model username/whisper-large-v3-ct2
By default, models are cached in ~/.cache/huggingface/. Use --model-dir to override:
./scripts/transcribe audio.mp3 --model-dir ~/my-models
# Base install (creates venv, installs deps, auto-detects GPU)
./setup.sh
# With speaker diarization support
./setup.sh --diarize
Requirements:
--burn-in, --normalize, and --denoise.--diarize, installed via setup.sh --diarize)| Platform | Acceleration | Speed |
| ---------------------- | ------------ | ---------------- |
| Linux + NVIDIA GPU | CUDA | ~20x realtime š |
| WSL2 + NVIDIA GPU | CUDA | ~20x realtime š |
| macOS Apple Silicon | CPU\* | ~3-5x realtime |
| macOS Intel | CPU | ~1-2x realtime |
| Linux (no GPU) | CPU | ~1x realtime |
\*faster-whisper uses CTranslate2 which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.
The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use GPU if available ā CPU transcription is extremely slow.
| Hardware | Speed | 9-min video |
| -------------- | -------------- | ----------- |
| RTX 3070 (GPU) | ~20x realtime | ~27 sec |
| CPU (int8) | ~0.3x realtime | ~30 min |
RTX 3070 tip: Use --compute-type int8_float16 for hybrid quantization ā saves ~1GB VRAM with minimal quality loss. Ideal for running diarization alongside transcription.
If setup didn't detect your GPU, manually install PyTorch with CUDA:
# For CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121
# For CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118
# Basic transcription
./scripts/transcribe audio.mp3
# SRT subtitles
./scripts/transcribe audio.mp3 --format srt -o subtitles.srt
# WebVTT subtitles
./scripts/transcribe audio.mp3 --format vtt -o subtitles.vtt
# Transcribe from YouTube URL
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ --language en
# Speaker diarization
./scripts/transcribe meeting.wav --diarize
# Diarized VTT subtitles
./scripts/transcribe meeting.wav --diarize --format vtt -o meeting.vtt
# Prime with domain terminology
./scripts/transcribe lecture.mp3 --initial-prompt "Kubernetes, gRPC, PostgreSQL, NGINX"
# Batch process a directory
./scripts/transcribe ./recordings/ -o ./transcripts/
# Batch with glob, skip already-done files
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/
# Filter low-confidence segments
./scripts/transcribe noisy-audio.mp3 --min-confidence 0.6
# JSON output with full metadata
./scripts/transcribe audio.mp3 --format json -o result.json
# Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en
Input:
AUDIO Audio file(s), directory, glob pattern, or URL
Accepts: mp3, wav, m4a, flac, ogg, webm, mp4, mkv, avi, wma, aac
URLs auto-download via yt-dlp (YouTube, direct links, etc.)
Model & Language:
-m, --model NAME Whisper model (default: distil-large-v3.5; "turbo" = large-v3-turbo)
--revision REV Model revision (git branch/tag/commit) to pin a specific version
-l, --language CODE Language code, e.g. en, es, fr (auto-detects if omitted)
--initial-prompt TEXT Prompt to condition the model (terminology, formatting style)
--prefix TEXT Prefix to condition the first segment (e.g. known starting words)
--hotwords WORDS Space-separated hotwords to boost recognition
--translate Translate any language to English (instead of transcribing)
--multilingual Enable multilingual/code-switching mode (helps smaller models)
--hf-token TOKEN HuggingFace token for private/gated models and diarization
--model-dir PATH Custom model cache directory (default: ~/.cache/huggingface/)
Output Format:
-f, --format FMT text | json | srt | vtt | tsv | lrc | html | ass | ttml (default: text)
Accepts comma-separated list: --format srt,text writes both in one pass
Multi-format requires -o <dir> when saving to files
--word-timestamps Include word-level timestamps (wav2vec2 aligned automatically)
--stream Output segments as they are transcribed (disables diarize/alignment)
--max-words-per-line N For SRT/VTT, split segments into sub-cues of at most N words
--max-chars-per-line N For SRT/VTT/ASS/TTML, split lines so each fits within N characters
Takes priority over --max-words-per-line when both are set
--clean-filler Remove hesitation fillers (um, uh, er, ah, hmm, hm) and discourse markers
(you know, I mean, you see) from transcript text. Off by default.
--detect-paragraphs Insert paragraph breaks (blank lines) in text output at natural boundaries.
A new paragraph starts when: silence gap ā„ --paragraph-gap, OR the previous
segment ends a sentence AND the gap ā„ 1.5s.
--paragraph-gap SEC Minimum silence gap in seconds to start a new paragraph (default: 3.0).
Used with --detect-paragraphs.
--channel {left,right,mix}
Stereo channel to transcribe: left (c0), right (c1), or mix (default: mix).
Extracts the channel via ffmpeg before transcription. Requires ffmpeg.
--merge-sentences Merge consecutive segments into sentence-level chunks
(improves SRT/VTT readability; groups by terminal punctuation or >2s gap)
-o, --output PATH Output file or directory (directory for batch mode)
--output-template TEMPLATE
Batch output filename template. Variables: {stem}, {lang}, {ext}, {model}
Example: "{stem}_{lang}.{ext}" ā "interview_en.srt"
Inference Tuning:
--beam-size N Beam search size; higher = more accurate but slower (default: 5)
--temperature T Sampling temperature or comma-separated fallback list, e.g.
'0.0' or '0.0,0.2,0.4' (default: faster-whisper's schedule)
--no-speech-threshold PROB
Probability threshold to mark segments as silence (default: 0.6)
--batch-size N Batched inference batch size (default: 8; reduce if OOM)
--no-vad Disable voice activity detection (on by default)
--vad-threshold T VAD speech probability threshold (default: 0.5)
--vad-neg-threshold T VAD negative threshold for ending speech (default: auto)
--vad-onset T Alias for --vad-threshold (legacy)
--vad-offset T Alias for --vad-neg-threshold (legacy)
--min-speech-duration MS Minimum speech segment duration in ms (default: 0)
--max-speech-duration SEC Maximum speech segment duration in seconds (default: unlimited)
--min-silence-duration MS Minimum silence before splitting a segment in ms (default: 2000)
--speech-pad MS Padding around speech segments in ms (default: 400)
--no-batch Disable batched inference (use standard WhisperModel)
--hallucination-silence-threshold SEC
Skip silent sections where model hallucinates (e.g. 1.0)
--no-condition-on-previous-text
Don't condition on previous text (reduces repetition/hallucination loops;
auto-enabled for distil models per HuggingFace recommendation)
--condition-on-previous-text
Force-enable conditioning on previous text (overrides auto-disable for distil models)
--compression-ratio-threshold RATIO
Filter segments above this compression ratio (default: 2.4)
--log-prob-threshold PROB
Filter segments below this avg log probability (default: -1.0)
--max-new-tokens N Maximum tokens per segment (prevents runaway generation)
--clip-timestamps RANGE
Transcribe specific time ranges: '30,60' or '0,30;60,90' (seconds)
--progress Show transcription progress bar
--best-of N Candidates when sampling with non-zero temperature (default: 5)
--patience F Beam search patience factor (default: 1.0)
--repetition-penalty F Penalty for repeated tokens (default: 1.0)
--no-repeat-ngram-size N Prevent n-gram repetitions of this size (default: 0 = off)
Advanced Inference:
--no-timestamps Output text without timing info (faster; incompatible with
--word-timestamps, --format srt/vtt/tsv, --diarize)
--chunk-length N Audio chunk length in seconds for batched inference (default: auto)
--language-detection-threshold T
Confidence threshold for language auto-detection (default: 0.5)
--language-detection-segments N
Audio segments to sample for language detection (default: 1)
--length-penalty F Beam search length penalty; >1 favors longer, <1 favors shorter (default: 1.0)
--prompt-reset-on-temperature T
Reset initial prompt when temperature fallback hits threshold (default: 0.5)
--no-suppress-blank Disable blank token suppression (may help soft/quiet speech)
--suppress-tokens IDS Comma-separated token IDs to suppress in addition to default -1
--max-initial-timestamp T
Maximum timestamp for the first segment in seconds (default: 1.0)
--prepend-punctuations CHARS
Punctuation characters merged into preceding word (default: "'Āæ([{-)
--append-punctuations CHARS
Punctuation characters merged into following word (default: "'.ć,ļ¼!ļ¼?ļ¼:ļ¼")]}ć")
Preprocessing:
--normalize Normalize audio volume (EBU R128 loudnorm) before transcription
--denoise Apply noise reduction (high-pass + FFT denoise) before transcription
Advanced:
--diarize Speaker diarization (requires pyannote.audio)
--min-speakers N Minimum number of speakers hint for diarization
--max-speakers N Maximum number of speakers hint for diarization
--speaker-names NAMES Comma-separated names to replace SPEAKER_1, SPEAKER_2 (e.g. 'Alice,Bob')
Requires --diarize
--min-confidence PROB Filter segments below this avg word confidence (0.0ā1.0)
--skip-existing Skip files whose output already exists (batch mode)
--detect-language-only
Detect language and exit (no transcription). Output: "Language: en (probability: 0.984)"
With --format json: {"language": "en", "language_probability": 0.984}
--stats-file PATH Write JSON stats sidecar after transcription (processing time, RTF, word count, etc.)
Directory path ā writes {stem}.stats.json inside; file path ā exact path
--burn-in OUTPUT Burn subtitles into the original video (single-file mode only; requires ffmpeg)
--filter-hallucinations
Filter common Whisper hallucinations: music/applause markers, duplicate segments,
'Thank you for watching', lone punctuation, etc.
--keep-temp Keep temp files from URL downloads (useful for re-processing without re-downloading)
--parallel N Number of parallel workers for batch processing (default: sequential)
--retries N Retry failed files up to N times with exponential backoff (default: 0;
incompatible with --parallel)
Batch ETA:
Automatically shown for sequential batch jobs (no flag needed). After each file completes,
the next file's progress line includes: [current/total] filename | ETA: Xm Ys
ETA is calculated from average time per file Ć remaining files.
Shown to stderr (surfaced to users via OpenClaw/Clawdbot output).
Language Map (per-file language override):
--language-map MAP Per-file language override for batch mode. Two forms:
Inline: "interview*.mp3=en,lecture.wav=fr,keynote.wav=de"
JSON file: "@/path/to/map.json" (must be {pattern: lang} dict)
Patterns support fnmatch globs on filename or stem.
Priority: exact filename > exact stem > glob on filename > glob on stem > fallback.
Files not matched fall back to --language (or auto-detect if not set).
Transcript Search:
--search TERM Search the transcript for TERM and print matching segments with timestamps.
Replaces normal transcript output (use -o to save results to a file).
Case-insensitive exact substring match by default.
--search-fuzzy Enable fuzzy/approximate matching with --search (useful for typos, phonetic
near-misses, or partial words; uses SequenceMatcher ratio ā„ 0.6)
Chapter Detection:
--detect-chapters Auto-detect chapter/section breaks from silence gaps and print chapter markers.
Output is printed after the transcript (or to --chapters-file).
--chapter-gap SEC Minimum silence gap in seconds between consecutive segments to start a new
chapter (default: 8.0). Tune down for dense speech, up for sparse content.
--chapters-file PATH Write chapter markers to this file (default: stdout after transcript)
--chapter-format FMT youtube | text | json ā chapter output format:
youtube: "0:00 Chapter 1" (YouTube description ready)
text: "Chapter 1: 00:00:00"
json: JSON array with chapter, start, title fields
(default: youtube)
Speaker Audio Export:
--export-speakers DIR After diarization, export each speaker's audio turns concatenated into
separate WAV files saved in DIR. Requires --diarize and ffmpeg.
Output: SPEAKER_1.wav, SPEAKER_2.wav, ⦠(or real names if --speaker-names set)
RSS / Podcast:
--rss URL Podcast RSS feed URL ā extracts audio enclosures and transcribes them.
AUDIO positional is optional when --rss is used.
--rss-latest N Number of most-recent episodes to process (default: 5; 0 = all episodes)
Device:
--device DEV auto | cpu | cuda (default: auto)
--compute-type TYPE auto | int8 | int8_float16 | float16 | float32 (default: auto)
int8_float16 = hybrid mode for GPU (saves VRAM, minimal quality loss)
--threads N CPU thread count for CTranslate2 (default: auto)
-q, --quiet Suppress progress and status messages
--log-level LEVEL Set faster_whisper library logging level: debug | info | warning | error
(default: warning; use debug to see CTranslate2/VAD internals)
Utility:
--version Print installed faster-whisper version and exit
--update Upgrade faster-whisper in the skill venv and exit
Plain transcript text. With --diarize, speaker labels are inserted:
[SPEAKER_1]
Hello, welcome to the meeting.
[SPEAKER_2]
Thanks for having me.
--format json)Full metadata including segments, timestamps, language detection, and performance stats:
{
"file": "audio.mp3",
"text": "Hello, welcome...",
"language": "en",
"language_probability": 0.98,
"duration": 600.5,
"segments": [...],
"speakers": ["SPEAKER_1", "SPEAKER_2"],
"stats": {
"processing_time": 28.3,
"realtime_factor": 21.2
}
}
--format srt)Standard subtitle format for video players:
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_1] Hello, welcome to the meeting.
2
00:00:02,800 --> 00:00:04,200
[SPEAKER_2] Thanks for having me.
--format vtt)WebVTT format for web video players:
WEBVTT
1
00:00:00.000 --> 00:00:02.500
[SPEAKER_1] Hello, welcome to the meeting.
2
00:00:02.800 --> 00:00:04.200
[SPEAKER_2] Thanks for having me.
--format tsv)Tab-separated values, OpenAI Whisperācompatible. Columns: start_ms, end_ms, text:
0 2500 Hello, welcome to the meeting.
2800 4200 Thanks for having me.
Useful for piping into other tools or spreadsheets. No header row.
--format ass)Advanced SubStation Alpha format ā supported by Aegisub, VLC, mpv, MPC-HC, and most video editors. Offers richer styling than SRT (font, size, color, position) via the [V4+ Styles] section:
[Script Info]
ScriptType: v4.00+
...
[V4+ Styles]
Style: Default,Arial,20,&H00FFFFFF,...
[Events]
Format: Layer, Start, End, Style, Name, ..., Text
Dialogue: 0,0:00:00.00,0:00:02.50,Default,,[SPEAKER_1] Hello, welcome.
Dialogue: 0,0:00:02.80,0:00:04.20,Default,,[SPEAKER_2] Thanks for having me.
Timestamps use H:MM:SS.cc (centiseconds). Edit the [V4+ Styles] block in Aegisub to customise font, color, and position without re-transcribing.
--format lrc)Timed lyrics format used by music players (e.g., Foobar2000, VLC, AIMP). Timestamps use [mm:ss.xx] where xx = centiseconds:
[00:00.50]Hello, welcome to the meeting.
[00:02.80]Thanks for having me.
With diarization, speaker labels are included:
[00:00.50][SPEAKER_1] Hello, welcome to the meeting.
[00:02.80][SPEAKER_2] Thanks for having me.
Default file extension: .lrc. Useful for music transcription, karaoke, and any workflow requiring timed text with music-player compatibility.
Identifies who spoke when using pyannote.audio.
Setup:
./setup.sh --diarize
Requirements:
~/.cache/huggingface/token (huggingface-cli login)Usage:
# Basic diarization (text output)
./scripts/transcribe meeting.wav --diarize
# Diarized subtitles
./scripts/transcribe meeting.wav --diarize --format srt -o meeting.srt
# Diarized JSON (includes speakers list)
./scripts/transcribe meeting.wav --diarize --format json
Speakers are labeled SPEAKER_1, SPEAKER_2, etc. in order of first appearance. Diarization runs on GPU automatically if CUDA is available.
Whenever word-level timestamps are computed (--word-timestamps, --diarize, or --min-confidence), a wav2vec2 forced alignment pass automatically refines them from Whisper's ~100-200ms accuracy to ~10ms. No extra flag needed.
# Word timestamps with automatic wav2vec2 alignment
./scripts/transcribe audio.mp3 --word-timestamps --format json
# Diarization also gets precise alignment automatically
./scripts/transcribe meeting.wav --diarize
# Precise subtitles
./scripts/transcribe audio.mp3 --word-timestamps --format srt -o subtitles.srt
Uses the MMS (Massively Multilingual Speech) model from torchaudio ā supports 1000+ languages. The model is cached after first load, so batch processing stays fast.
Pass any URL as input ā audio is downloaded automatically via yt-dlp:
# YouTube video
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ
# Direct audio URL
./scripts/transcribe https://example.com/podcast.mp3
# With options
./scripts/transcribe https://youtube.com/watch?v=... --language en --format srt -o subs.srt
Requires yt-dlp (checks PATH and ~/.local/share/pipx/venvs/yt-dlp/bin/yt-dlp).
Process multiple files at once with glob patterns, directories, or multiple paths:
# All MP3s in current directory
./scripts/transcribe *.mp3
# Entire directory (auto-filters audio files)
./scripts/transcribe ./recordings/
# Output to directory (one file per input)
./scripts/transcribe *.mp3 -o ./transcripts/
# Skip already-transcribed files (resume interrupted batch)
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/
# Mixed inputs
./scripts/transcribe file1.mp3 file2.wav ./more-recordings/
# Batch SRT subtitles
./scripts/transcribe *.mp3 --format srt -o ./subtitles/
When outputting to a directory, files are named {input-stem}.{ext} (e.g., audio.mp3 ā audio.srt).
Batch mode prints a summary after all files complete:
š Done: 12 files, 3h24m audio in 10m15s (19.9Ć realtime)
End-to-end pipelines for common use cases.
Fetch and transcribe the latest 5 episodes from any podcast RSS feed:
# Transcribe latest 5 episodes ā one .txt per episode
./scripts/transcribe --rss https://feeds.megaphone.fm/mypodcast -o ./transcripts/
# All episodes, as SRT subtitles
./scripts/transcribe --rss https://... --rss-latest 0 --format srt -o ./subtitles/
# Skip already-done episodes (safe to re-run)
./scripts/transcribe --rss https://... --skip-existing -o ./transcripts/
# With diarization (who said what) + retry on flaky network
./scripts/transcribe --rss https://... --diarize --retries 2 -o ./transcripts/
Transcribe a meeting recording with speaker labels, then output clean text:
# Diarize + name speakers (replace SPEAKER_1/2 with real names)
./scripts/transcribe meeting.wav --diarize --speaker-names "Alice,Bob" -o meeting.txt
# Diarized JSON for post-processing (summaries, action items)
./scripts/transcribe meeting.wav --diarize --format json -o meeting.json
# Stream live while it transcribes (long meetings)
./scripts/transcribe meeting.wav --stream
Generate ready-to-use subtitles for a video file:
# SRT subtitles with sentence merging (better readability)
./scripts/transcribe video.mp4 --format srt --merge-sentences -o subtitles.srt
# Burn subtitles directly into the video
./scripts/transcribe video.mp4 --format srt --burn-in video_subtitled.mp4
# Word-level SRT (karaoke-style), capped at 8 words per cue
./scripts/transcribe video.mp4 --format srt --word-timestamps --max-words-per-line 8 -o subs.srt
Transcribe multiple YouTube videos at once:
# One-liner: transcribe a playlist video + output SRT
./scripts/transcribe "https://youtube.com/watch?v=abc123" --format srt -o subs.srt
# Batch from a text file of URLs (one per line)
cat urls.txt | xargs ./scripts/transcribe -o ./transcripts/
# Download audio first, then transcribe (for re-use without re-downloading)
./scripts/transcribe https://youtube.com/watch?v=abc123 --keep-temp
Clean up poor-quality recordings before transcribing:
# Denoise + normalize, then transcribe
./scripts/transcribe interview.mp3 --denoise --normalize -o interview.txt
# Noisy batch with aggressive hallucination filtering
./scripts/transcribe *.mp3 --denoise --filter-hallucinations -o ./out/
Process a large folder with retries ā safe to re-run after failures:
# Retry each failed file up to 3 times, skip already-done
./scripts/transcribe ./recordings/ --skip-existing --retries 3 -o ./transcripts/
# Check what failed (printed in batch summary at the end)
# Re-run the same command ā skips successes, retries failures
speaches runs faster-whisper as an OpenAI-compatible /v1/audio/transcriptions endpoint ā drop-in replacement for OpenAI Whisper API with streaming, Docker support, and live transcription.
docker run --gpus all -p 8000:8000 ghcr.io/speaches-ai/speaches:latest-cuda
# Transcribe a file via the API (same format as OpenAI)
curl http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.mp3 \
-F model=Systran/faster-whisper-large-v3
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000", api_key="none")
with open("audio.mp3", "rb") as f:
result = client.audio.transcriptions.create(model="Systran/faster-whisper-large-v3", file=f)
print(result.text)
Useful when you want to expose transcription as a local API for other tools (Home Assistant, n8n, custom apps).
| Mistake | Problem | Solution |
| --------------------------------------------------- | ------------------------------------------- | -------------------------------------------------------------------------------- |
| Using CPU when GPU available | 10-20x slower transcription | Check nvidia-smi; verify CUDA installation |
| Not specifying language | Wastes time auto-detecting on known content | Use --language en when you know the language |
| Using wrong model | Unnecessary slowness or poor accuracy | Default distil-large-v3.5 is excellent; only use large-v3 if accuracy issues |
| Ignoring distilled models | Missing 6x speedup with <1% accuracy loss | Try distil-large-v3.5 before reaching for standard models |
| Forgetting ffmpeg | Setup fails or audio can't be processed | Setup script handles this; manual installs need ffmpeg separately |
| Out of memory errors | Model too large for available VRAM/RAM | Use smaller model, --compute-type int8, or --batch-size 4 |
| Over-engineering beam size | Diminishing returns past beam-size 5-7 | Default 5 is fine; try 10 for critical transcripts |
| --diarize without pyannote | Import error at runtime | Run setup.sh --diarize first |
| --diarize without HuggingFace token | Model download fails | Run huggingface-cli login and accept model agreements |
| URL input without yt-dlp | Download fails | Install: pipx install yt-dlp |
| --min-confidence too high | Drops good segments with natural pauses | Start at 0.5, adjust up; check JSON output for probabilities |
| Using --word-timestamps for basic transcription | Adds ~5-10s overhead for negligible benefit | Only use when word-level precision matters |
| Batch without -o directory | All output mixed in stdout | Use -o ./transcripts/ to write one file per input |
~/.cache/huggingface/ (one-time)BatchedInferencePipeline ā ~3x faster than standard mode; VAD on by defaultdistil-large-v3: ~2GB RAM / ~1GB VRAMlarge-v3-turbo: ~4GB RAM / ~2GB VRAMtiny/base: <1GB RAM--batch-size (try 4) if you hit out-of-memory errorsffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav converts to 16kHz mono WAV before transcription. Benefit is minimal (~5%) for one-off use since PyAV decodes efficiently ā most useful when re-processing the same file multiple times (research/experiments) or when a format causes PyAV decode issues. Note: --normalize and --denoise already perform this conversion automatically../setup.sh --update to get it.BatchedInferencePipeline (used by default). Upgrade with ./setup.sh --update to get this if you installed before August 2024.--speaker-names maps to real names--keep-temp preserves downloads for re-use--model-dir controls cache--filter-hallucinations strips music/applause markers and duplicates--parallel N for multi-threaded batch processing--burn-in overlays subtitles directly into video via ffmpegMulti-format output:
--format srt,text ā write multiple formats in one pass (e.g. SRT + plain text simultaneously)srt,vtt,json, srt,text, etc.-o when writing multiple formats; single format unchangedFiller word removal:
--clean-filler ā strip hesitation sounds (um, uh, er, ah, hmm, hm) and discourse markers(you know, I mean, you see) from transcript text; off by default
Stereo channel selection:
--channel left|right|mix ā extract a specific stereo channel before transcribing (default: mix)Character-based subtitle wrapping:
--max-chars-per-line N ā split subtitle cues so each line fits within N characters--max-words-per-lineParagraph detection:
--detect-paragraphs ā insert \n\n paragraph breaks in text output at natural boundaries--paragraph-gap SEC ā minimum silence gap for a paragraph (default: 3.0s)Subtitle formats:
--format ass ā Advanced SubStation Alpha (Aegisub, VLC, mpv, MPC-HC)--format lrc ā Timed lyrics format for music players--format html ā Confidence-colored HTML transcript (green/yellow/red per word)--format ttml ā W3C TTML 1.0 (DFXP) broadcast standard (Netflix, Amazon Prime, BBC)--format csv ā Spreadsheet-ready CSV with header row; RFC 4180 quoting; speaker column when diarizedTranscript tools:
--search TERM ā Find all timestamps where a word/phrase appears; replaces normal output; -o to save--search-fuzzy ā Approximate/partial matching with --search--detect-chapters ā Auto-detect chapter breaks from silence gaps; --chapter-gap SEC (default 8s)--chapters-file PATH ā Write chapters to file instead of stdout; --chapter-format youtube|text|json--export-speakers DIR ā After --diarize, save each speaker's turns as separate WAV files via ffmpegBatch improvements:
[N/total] filename | ETA: Xm Ys shown before each file in sequential batch; no flag needed--language-map "pat=lang,..." ā Per-file language override; fnmatch glob patterns; @file.json form--retries N ā Retry failed files with exponential backoff; failed-file summary at end--rss URL ā Transcribe podcast RSS feeds; --rss-latest N for episode count--skip-existing / --parallel N / --output-template / --stats-file / --merge-sentencesModel & inference:
distil-large-v3.5 default (replaced distil-large-v3)condition_on_previous_text for distil models (prevents repetition loops)--condition-on-previous-text to override; --log-level for library debug output--model-dir PATH ā Custom HuggingFace cache dir; local CTranslate2 model support--no-timestamps, --chunk-length, --length-penalty, --repetition-penalty, --no-repeat-ngram-size--clip-timestamps, --stream, --progress, --best-of, --patience, --max-new-tokens--hotwords, --prefix, --revision, --suppress-tokens, --max-initial-timestampSpeaker & quality:
--speaker-names "Alice,Bob" ā Replace SPEAKER_1/2 with real names (requires --diarize)--filter-hallucinations ā Remove music/applause markers, duplicates, "Thank you for watching"--burn-in OUTPUT ā Burn subtitles into video via ffmpeg--keep-temp ā Preserve URL-downloaded audio for re-processingSetup:
setup.sh --check ā System diagnostic: GPU, CUDA, Python, ffmpeg, pyannote, HuggingFace token (completes in ~12s)skill.json updated to reflect this (ffmpeg is now optionalBins)"CUDA not available ā using CPU": Install PyTorch with CUDA (see GPU Support above)
Setup fails: Make sure Python 3.10+ is installed
Out of memory: Use smaller model, --compute-type int8, or --batch-size 4
Slow on CPU: Expected ā use GPU for practical transcription
Model download fails: Check ~/.cache/huggingface/ permissions
Diarization model fails: Ensure HuggingFace token exists and model agreements accepted;
or pass token directly with --hf-token hf_xxx
URL download fails: Check yt-dlp is installed (pipx install yt-dlp)
No audio files in batch: Check file extensions match supported formats
Check installed version: Run ./scripts/transcribe --version
Upgrade faster-whisper: Run ./setup.sh --update (upgrades in-place, no full reinstall)
Hallucinations on silence/music: Try --temperature 0.0 --no-speech-threshold 0.8
VAD splits speech incorrectly: Tune with --vad-threshold 0.3 (lower) or --min-silence-duration 300
Improve speech detection: Run ./setup.sh --update to upgrade faster-whisper to the latest version (includes Silero VAD V6).
Generated Mar 1, 2026
Podcasters can transcribe episodes for SEO, accessibility, and show notes. The skill supports batch processing of RSS feeds, speaker diarization to label hosts and guests, and export of subtitles in formats like SRT for video platforms.
Educational institutions can transcribe lectures and seminars for student notes, closed captions, and archival. The skill handles multilingual content, detects chapters for easy navigation, and allows search within transcripts for study purposes.
Businesses can transcribe meetings, interviews, and conferences to create searchable records and action items. Features like speaker diarization identify participants, while CSV output facilitates analysis and reporting in spreadsheets.
Content creators and media companies can generate subtitles in multiple formats (e.g., SRT, VTT) for videos, including YouTube links. The skill supports translation to English and batch processing, enabling efficient localization for global audiences.
Healthcare providers can transcribe patient consultations and medical notes locally for privacy. The skill's initial-prompt feature helps with medical jargon, while denoising improves accuracy in noisy clinical environments.
Offer a cloud-based platform where users upload audio files for automated transcription. Leverage the skill's GPU acceleration for fast processing and charge per minute of audio, with premium features like speaker diarization and custom formats.
Sell licensed software packages to large organizations needing offline, secure transcription. Bundle with support and training, using the skill's batch processing and integration capabilities for internal workflows like legal or media production.
Operate a service where freelancers use the tool to transcribe audio for clients in industries like academia or podcasting. Scale by automating bulk jobs with batch processing and offering add-ons like subtitle formatting or translation.
š¬ Integration Tip
Ensure Python3 and optional tools like ffmpeg are installed; cache Hugging Face tokens for seamless model access to avoid interruptions.
Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
Local speech-to-text with the Whisper CLI (no API key).
ElevenLabs text-to-speech with mac-style say UX.
Text-to-speech conversion using node-edge-tts npm package for generating audio from text. Supports multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when: (1) User requests audio/voice output with the "tts" trigger or keyword. (2) Content needs to be spoken rather than read (multitasking, accessibility, driving, cooking). (3) User wants a specific voice, speed, pitch, or format for TTS output.
End-to-end encrypted agent-to-agent private messaging via Moltbook dead drops. Use when agents need to communicate privately, exchange secrets, or coordinate without human visibility.
Text-to-speech via OpenAI Audio Speech API.