video-captions

Generate professional captions and subtitles with multi-engine transcription, word-level timing, styling presets, and burn-in.
Install via ClawdBot CLI:
```shell
clawdbot install ivangdavila/video-captions
```

Requires:
User needs captions or subtitles for video content. Agent handles transcription, timing, formatting, styling, translation, and burn-in across all major formats and platforms.
| Topic | File |
|-------|------|
| Transcription engines | engines.md |
| Output formats | formats.md |
| Styling presets | styling.md |
| Platform requirements | platforms.md |
| Scenario | Engine | Why |
|----------|--------|-----|
| Default (recommended) | Whisper local | 100% offline, no data leaves machine |
| Apple Silicon | MLX Whisper | Native acceleration, still local |
| Word timestamps | whisper-timestamped | DTW alignment, still local |
Default: Whisper local (turbo model). See engines.md for optional cloud alternatives.
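The engine choice in the table above can be sketched as a small shell helper. The engine names match the CLI entry points mentioned in this document; the selection logic itself is an illustrative assumption, not part of the skill:

```shell
# Pick a local engine from the machine type (illustrative sketch).
# Apple Silicon reports "Darwin arm64" from `uname -sm`.
select_engine() {
  case "$1" in
    "Darwin arm64") echo "mlx_whisper" ;;  # native acceleration, still local
    *)              echo "whisper" ;;      # default: local Whisper (turbo)
  esac
}

select_engine "$(uname -sm)"
```

Word-level timing still goes through whisper-timestamped regardless of platform, so this helper only covers the base transcription path.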
| Platform | Format | Notes |
|----------|--------|-------|
| YouTube | VTT or SRT | VTT preferred |
| Netflix/Pro | TTML | Strict timing rules |
| Social (TikTok, IG) | Burn-in (ASS) | Embedded in video |
| General | SRT | Universal compatibility |
| Karaoke/effects | ASS | Advanced styling |
Ask user's target platform if not specified.
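The platform-to-format mapping above can be expressed as a lookup; the function name and platform keys below are illustrative assumptions:

```shell
# Map a target platform to its preferred caption format (sketch of the
# table above; falls back to SRT for universal compatibility).
format_for_platform() {
  case "$1" in
    youtube)           echo "vtt" ;;   # VTT preferred, SRT also accepted
    netflix)           echo "ttml" ;;  # strict timing rules
    tiktok|instagram)  echo "ass" ;;   # burn-in styling
    karaoke)           echo "ass" ;;   # advanced effects
    *)                 echo "srt" ;;
  esac
}
```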
Netflix-compliant (default):
Social media:
Break lines:
Never separate:
Use word timestamps for:
Enable with --word-timestamps flag.
For multi-speaker content:
[Speaker 1] or [Name] if known

JOHN: What do you think?

Before delivering:
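One pre-delivery check, the line-length limit, can be sketched as an awk lint. The 42-character limit follows the Netflix guideline mentioned elsewhere in this document; the function name and exact filtering are illustrative assumptions:

```shell
# Flag SRT text lines longer than a character limit (default 42).
# Skips cue numbers and timestamp lines; exits non-zero if any line fails.
check_line_length() {
  awk -v max="${2:-42}" '
    $0 ~ /-->/ || $0 ~ /^[0-9]+$/ { next }            # skip cues and timings
    length($0) > max { printf "line %d: %d chars\n", NR, length($0); bad = 1 }
    END { exit bad }' "$1"
}
```

Silent output (and exit status 0) means every text line is within the limit.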
```shell
# Auto-detect language, output SRT
whisper video.mp4 --model turbo --output_format srt

# Specify language
whisper video.mp4 --model turbo --language es --output_format srt

# Multiple formats
whisper video.mp4 --model turbo --output_format all
```
```shell
# Using whisper-timestamped
whisper_timestamped video.mp4 --model large-v3 --output_format srt

# With VAD pre-processing (reduces hallucinations)
whisper_timestamped video.mp4 --vad silero --accurate
```
```shell
# Generate SRT first, then convert with style
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Arial,FontSize=24,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" output.mp4

# TikTok/Instagram style (centered, bold)
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Montserrat-Bold,FontSize=32,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=3,Shadow=0,Alignment=10,MarginV=50'" output.mp4

# Netflix style (bottom, clean)
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Netflix Sans,FontSize=48,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" output.mp4
```
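The `&HFFFFFF`-style values in those `force_style` strings are ASS colours, which store bytes in blue-green-red order rather than web-style RRGGBB. A small conversion helper (the function name is an illustrative assumption):

```shell
# Convert web-style RRGGBB hex to the ASS &HBBGGRR colour format.
# ASS/SSA colour fields reverse the byte order relative to CSS hex.
rgb_to_ass() {
  local hex=$1
  echo "&H${hex:4:2}${hex:2:2}${hex:0:2}"
}
```

For example, pure red `FF0000` becomes `&H0000FF`, which is why naively pasting CSS colours into `force_style` swaps red and blue.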
```shell
# Transcribe + translate to English
whisper video.mp4 --model turbo --task translate --output_format srt
```
```shell
# SRT to VTT
ffmpeg -i video.srt video.vtt

# SRT to ASS (for styling)
ffmpeg -i video.srt video.ass
```
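If ffmpeg is unavailable, the SRT-to-VTT direction is mechanical enough to sketch in plain shell: add the `WEBVTT` header and change the comma decimal separator in timestamps to a dot. This is a simplification that ignores cue settings and styling:

```shell
# Minimal SRT -> VTT conversion without ffmpeg (illustrative sketch).
# Only rewrites the decimal separator inside HH:MM:SS,mmm timestamps,
# so commas in subtitle text are left untouched.
srt_to_vtt() {
  printf 'WEBVTT\n\n'
  sed -E 's/([0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})/\1.\2/g' "$1"
}
```

Usage: `srt_to_vtt video.srt > video.vtt`.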
- --language explicitly for mixed content
- --max_line_width 42 for Netflix compliance
- -b:v 8M)
- whisper video.mp4 --output_format vtt
- ffmpeg -i video.mp4 -vf "subtitles=video.ass" -c:a copy output.mp4
- [SPEAKER]: text
- [music], [laughter] descriptions
- --task translate for English

Default: 100% LOCAL processing. No network calls.
| Endpoint | Data Sent | When Used |
|----------|-----------|-----------|
| Whisper (local) | None (local) | Default — always |
| api.assemblyai.com | Audio file | Only if user sets ASSEMBLYAI_API_KEY |
| api.deepgram.com | Audio file | Only if user sets DEEPGRAM_API_KEY |
Cloud APIs are documented as alternatives but never used unless user explicitly provides API keys and requests cloud processing. By default, all processing stays on your machine.
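The opt-in rule can be sketched as an environment-variable check; the function name is an illustrative assumption, while the key names match the table above:

```shell
# Route to a cloud engine only when the user has explicitly set a key;
# otherwise everything stays local (the default).
pick_backend() {
  if [ -n "${ASSEMBLYAI_API_KEY:-}" ]; then
    echo "cloud: assemblyai"
  elif [ -n "${DEEPGRAM_API_KEY:-}" ]; then
    echo "cloud: deepgram"
  else
    echo "local: whisper"
  fi
}
```

With no keys exported, the local path is always chosen, which matches the "no network calls by default" guarantee.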
Default workflow is 100% offline:
Cloud APIs are OPTIONAL and OPT-IN:
ASSEMBLYAI_API_KEY or DEEPGRAM_API_KEY

This skill does NOT:
Install with clawhub install if user confirms:
- ffmpeg — video/audio processing
- video — general video tasks
- video-edit — video editing
- audio — audio processing

clawhub star video-captions
clawhub sync

Generated Feb 26, 2026
Creators need accurate, platform-compliant captions for videos to improve accessibility and SEO. This skill generates VTT or SRT files with professional timing standards, ready for upload to YouTube Studio, ensuring sync and character limits are met.
Marketers require burned-in, styled captions for TikTok and Instagram Reels to enhance engagement and accessibility. The skill provides word-level timestamps for animated effects and applies bold, centered styling via FFmpeg, optimized for mobile viewing.
Production studios need Netflix-compliant subtitles in TTML format for streaming platforms, adhering to strict timing and formatting rules. This skill uses high-accuracy engines like Whisper large-v3 and verifies line limits and gaps for quality assurance.
Podcasters and journalists require multi-speaker transcription with diarization to label speakers and format dialogue. The skill enables local processing for privacy, outputting SDH-compliant captions with speaker IDs and non-speech descriptions.
Offer basic local transcription for free to attract users, with premium features like cloud engine integration, advanced styling, and batch processing via subscription plans. Revenue comes from monthly fees for high-volume or enterprise users.
License the skill to video editing software companies or production studios as an embedded tool, providing custom integrations and support. Revenue is generated through one-time licensing fees or annual contracts based on usage tiers.
Deploy the skill as a cloud API for developers, charging per minute of video processed with options for different engines and formats. This model scales with usage and appeals to apps needing automated caption generation without local setup.
💬 Integration Tip
Integrate with existing video workflows by using command-line tools like FFmpeg and Whisper, ensuring compatibility across Linux and macOS; provide clear documentation for env vars and platform-specific setups.
Extract frames or short clips from videos using ffmpeg.
Download videos, audio, subtitles, and clean paragraph-style transcripts from YouTube and any other yt-dlp supported site. Use when asked to “download this video”, “save this clip”, “rip audio”, “get subtitles”, “get transcript”, or to troubleshoot yt-dlp/ffmpeg and formats/playlists.
Generate SRT subtitles from video/audio with translation support. Transcribes Hebrew (ivrit.ai) and English (whisper), translates between languages, burns subtitles into video. Use for creating captions, transcripts, or hardcoded subtitles for WhatsApp/social media.
Create AI videos with optimized prompts, motion control, and platform-ready output.
Automatically logs in to a Douyin account, uploads and publishes videos to the Douyin creator platform, and supports video tag management and login-status checks.
AI video generation workflow on Volcengine. Use when users need text-to-video, image-to-video, generation parameter tuning, or async task troubleshooting for video jobs.