transcribee
Transcribe YouTube videos and local audio/video files with speaker diarization. Use when the user asks to transcribe a YouTube URL, podcast, video, or audio file. Outputs clean, speaker-labeled transcripts ready for LLM analysis.
Install via ClawdBot CLI:
clawdbot install itsfabioroma/transcribee
Transcription and speaker diarization are powered by ElevenLabs.
# YouTube video
transcribee "https://www.youtube.com/watch?v=..."
# Local video
transcribee ~/path/to/video.mp4
# Local audio
transcribee ~/path/to/podcast.mp3
Always quote URLs containing & or special characters.
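A quick sketch of why quoting matters: unquoted, the shell splits the URL at `&` and backgrounds the command with only the first half. The video ID and playlist below are made up for illustration.

```shell
# Unquoted, the shell would split at '&' and run the command in the
# background with a truncated URL. Quote the whole thing:
url="https://www.youtube.com/watch?v=abc123&list=PLxyz"
printf '%s\n' "$url"   # stands in here for: transcribee "$url"
```

Single quotes work equally well when the URL contains no variables to expand.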
Transcripts save to: ~/Documents/transcripts/{category}/{title}-{date}/
| File | Use |
|------|-----|
| transcription.txt | Speaker-labeled transcript |
| transcription-raw.txt | Plain text, no speakers |
| transcription-raw.json | Word-level timings |
| metadata.json | Video info, language, category |
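The exact schema of transcription-raw.json is not documented here; assuming it holds a list of words with start/end times in seconds (a common shape for diarized speech-to-text output), a sketch like this could turn word timings into an SRT-style caption line. The keys and sample data below are hypothetical — adjust them to match your actual file.

```python
import json

# Hypothetical schema: {"words": [{"word", "start", "end", "speaker"}, ...]}.
# Real transcribee output may differ; check your transcription-raw.json.
sample = '''
{"words": [
  {"word": "Hello", "start": 0.0, "end": 0.4, "speaker": "speaker_0"},
  {"word": "world", "start": 0.5, "end": 0.9, "speaker": "speaker_0"}
]}
'''

def fmt(t: float) -> str:
    """Format seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

words = json.loads(sample)["words"]
line = " ".join(w["word"] for w in words)
print(f"{fmt(words[0]['start'])} --> {fmt(words[-1]['end'])}")
print(line)  # -> "Hello world"
```

The same word-level timings are what make the JSON output useful for precise citation (e.g. jumping to an exact moment in a deposition recording).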
Dependencies (install with Homebrew):
brew install yt-dlp ffmpeg
| Error | Fix |
|-------|-----|
| yt-dlp not found | brew install yt-dlp |
| ffmpeg not found | brew install ffmpeg |
| API errors | Check the .env file in the transcribee directory |
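The two most common failures in the table above are missing dependencies. A preflight check like this (a sketch, not part of transcribee itself) catches them before a transcription run starts:

```shell
# Preflight sketch: verify a required tool is on PATH, and suggest the
# Homebrew fix from the troubleshooting table if it is not.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1 (try: brew install $1)"
  fi
}

check_tool yt-dlp
check_tool ffmpeg
```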
Generated Mar 1, 2026
Researchers can transcribe interviews or lectures from YouTube or local recordings for qualitative analysis. Speaker diarization helps identify different participants, making it easier to code and analyze dialogue in studies.
Podcasters and video creators use this to generate accurate transcripts for subtitles, show notes, or repurposing content into blog posts. The speaker-labeled output streamlines editing and enhances accessibility for audiences.
Law firms transcribe depositions, court proceedings, or client meetings from audio/video files. The word-level timings in JSON provide precise references, aiding in case preparation and evidence organization.
Companies transcribe training sessions or webinars to create searchable archives and learning materials. Speaker diarization allows tracking of different trainers or participants, improving knowledge management and compliance.
Journalists transcribe interviews or press conferences from YouTube or local recordings to produce accurate articles and quotes. The clean transcripts save time on manual transcription and reduce errors in reporting.
Offer a free tier with limited transcriptions per month and paid plans for higher volume or advanced features like batch processing. This attracts individual users and scales with businesses needing regular transcription services.
License the transcription engine to other software platforms, such as video editing tools or learning management systems. This generates revenue through API calls and integration fees from enterprise clients.
Provide customized packages for large organizations in legal, academic, or corporate sectors, including on-premise deployment, dedicated support, and compliance features. This targets high-value clients with specific security and workflow needs.
💬 Integration Tip
Ensure the .env file is configured with your API keys, and that dependencies like yt-dlp and ffmpeg are installed, to avoid common errors during transcription.
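A minimal sketch of the .env check. Both the variable name ELEVENLABS_API_KEY and the install path are assumptions — the skill is backed by ElevenLabs, but consult transcribee's own documentation for the actual names:

```shell
# Sketch: confirm the .env defines a non-empty API key before running.
# ELEVENLABS_API_KEY and ~/transcribee are hypothetical; adjust to your setup.
env_file="$HOME/transcribee/.env"
if grep -q '^ELEVENLABS_API_KEY=.' "$env_file" 2>/dev/null; then
  echo "API key present"
else
  echo "missing or empty API key in $env_file"
fi
```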
Related skills:
- Transcribe audio via the OpenAI Audio Transcriptions API (Whisper).
- Local speech-to-text with the Whisper CLI (no API key required).
- ElevenLabs text-to-speech with macOS-style say UX.
- Text-to-speech via the node-edge-tts npm package: multiple voices, languages, speed adjustment, pitch control, and subtitle generation. Use when the user requests audio output with the "tts" trigger or keyword, when content should be spoken rather than read (multitasking, accessibility, driving, cooking), or when a specific voice, speed, pitch, or format is wanted.
- End-to-end encrypted agent-to-agent private messaging via Moltbook dead drops, for agents that need to communicate privately, exchange secrets, or coordinate without human visibility.
- Text-to-speech via the OpenAI Audio Speech API.