OpenAI Whisper (Local): Private, Offline Speech-to-Text — No API Key Required
13,400+ downloads and 70 stars — the Whisper (Local) Skill by @steipete brings OpenAI's industry-leading speech recognition to your Clawdbot workflow with zero cloud dependency and zero API key. Install the openai-whisper brew package, and every audio file on your machine becomes instantly transcribable.
Note: This is the local Whisper skill, which runs the model entirely on your device. If you need cloud-based transcription with the latest Whisper API and GPT-4o audio models, see the separate openai-whisper-api skill.
The Problem It Solves
Audio transcription is one of those tasks that seems simple until you hit the privacy wall. Uploading your meeting recordings, legal depositions, or research interviews to a cloud API means trusting a third party with potentially sensitive content. Local Whisper solves this completely — your audio never leaves your machine. And for heavy users, the cost math is simple: $0 per transcription, forever, regardless of volume.
How It Works
OpenAI released Whisper as an open-source model that runs on consumer hardware. The whisper CLI (installed via Homebrew's openai-whisper formula) loads the model locally and processes audio entirely on your CPU or GPU. The Clawdbot skill provides the context layer so your agent knows how to invoke it correctly.
The skill defaults to the turbo model — OpenAI's fastest accurate option — but you can choose from the full model ladder:
| Model | Size | Speed | Best For |
|---|---|---|---|
| tiny | ~39MB | Fastest | Quick drafts, real-time |
| base | ~74MB | Very fast | Short clips |
| small | ~244MB | Fast | General use |
| medium | ~769MB | Moderate | High accuracy |
| large | ~1.5GB | Slow | Maximum accuracy |
| turbo | ~809MB | Fast | Default — best balance |
Models download automatically to ~/.cache/whisper on first use and are cached for subsequent runs.
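To gauge disk impact before pulling several models, the sizes in the table above can be summed. The helper below is an illustrative sketch: MODEL_SIZES_MB holds the approximate figures from the table, and cached_models assumes the ~/.cache/whisper layout described here.

```python
from pathlib import Path

# Approximate download sizes in MB, taken from the model table above
MODEL_SIZES_MB = {"tiny": 39, "base": 74, "small": 244,
                  "medium": 769, "large": 1536, "turbo": 809}

def cached_models(cache_dir: Path = Path.home() / ".cache" / "whisper") -> list[str]:
    """List model checkpoints already downloaded to the Whisper cache."""
    if not cache_dir.is_dir():
        return []
    return sorted(p.stem for p in cache_dir.glob("*.pt"))

def download_estimate_mb(models: list[str]) -> int:
    """Total approximate download size for a set of models."""
    return sum(MODEL_SIZES_MB[m] for m in models)

print(download_estimate_mb(["turbo", "large"]))  # → 2345
```

Since models are cached after the first run, this cost is paid once per model, not per transcription.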
Installation
```shell
# Install the Whisper CLI via Homebrew (macOS)
brew install openai-whisper

# Verify installation
whisper --help
```
Clawdbot automatically handles installation via the skill's install manifest when you first use it.
Core Usage
Basic Transcription
```shell
# Transcribe to text file (simplest)
whisper /path/to/audio.mp3 --model turbo --output_format txt --output_dir .

# Transcribe to multiple formats at once
whisper /path/to/recording.m4a --model medium --output_format all --output_dir ./transcripts
```
Translation to English
```shell
# Translate non-English audio directly to English text
whisper /path/to/french-meeting.mp3 --task translate --output_format srt
```
Output Formats
| Format | Flag | Use Case |
|---|---|---|
| txt | --output_format txt | Plain text, copy-paste ready |
| srt | --output_format srt | Video subtitles |
| vtt | --output_format vtt | Web video subtitles |
| tsv | --output_format tsv | Spreadsheet import |
| json | --output_format json | Programmatic processing |
| all | --output_format all | All formats at once |
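The json format is the most useful for scripting. A minimal parsing sketch, assuming the top-level "segments" list (with start, end, and text fields) that Whisper writes with --output_format json:

```python
import json

def segments_to_transcript(whisper_json: str) -> list[tuple[float, float, str]]:
    """Extract (start, end, text) triples from Whisper's JSON output.

    Assumes the top-level "segments" list produced by --output_format json.
    """
    data = json.loads(whisper_json)
    return [(s["start"], s["end"], s["text"].strip()) for s in data["segments"]]

# A tiny stand-in for a real output file, matching the assumed shape
sample = json.dumps({
    "text": " Hello world.",
    "segments": [{"start": 0.0, "end": 2.4, "text": " Hello world."}],
    "language": "en",
})
print(segments_to_transcript(sample))  # → [(0.0, 2.4, 'Hello world.')]
```

The same triples map directly onto SRT or VTT cues if you need custom subtitle post-processing.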
Model Selection for Performance
```shell
# For speed (laptops, quick drafts)
whisper audio.mp3 --model tiny

# For accuracy (important recordings)
whisper audio.mp3 --model large

# Specify language explicitly for better accuracy
whisper audio.mp3 --model medium --language Japanese
```
Practical Workflows
Podcast episode transcription:
```shell
whisper episode-47.mp3 --model medium --output_format txt --output_dir ./show-notes
```
Batch transcription (multiple files):
```shell
whisper *.mp3 --model turbo --output_format srt --output_dir ./subtitles
```
Legal/medical transcription (maximum accuracy, keep local):
```shell
whisper deposition.m4a --model large --language en --output_format txt --output_dir .
```
How Clawdbot Uses It
With this skill installed, you can ask Clawdbot naturally:
- "Transcribe the meeting recording at ~/Downloads/standup.mp3"
- "Translate this French audio file to English text"
- "Create SRT subtitles for my video using the medium model"
Clawdbot handles the command construction and file path resolution automatically.
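Under the hood, this amounts to assembling a whisper invocation like the ones shown earlier. A hedged sketch of that command construction follows; build_whisper_command is an illustrative helper, not the skill's actual code.

```python
import shlex
from pathlib import Path

def build_whisper_command(audio: str, model: str = "turbo",
                          fmt: str = "txt", out_dir: str = ".",
                          task: str = "transcribe") -> list[str]:
    """Construct a whisper CLI invocation from natural-language intent.

    Flag names mirror the CLI usage shown above; defaults match the
    skill's turbo-model default.
    """
    cmd = ["whisper", str(Path(audio).expanduser()),
           "--model", model,
           "--output_format", fmt,
           "--output_dir", out_dir]
    if task != "transcribe":  # transcribe is the CLI default
        cmd += ["--task", task]
    return cmd

print(shlex.join(build_whisper_command("standup.mp3", fmt="srt")))
# → whisper standup.mp3 --model turbo --output_format srt --output_dir .
```

Path expansion (the expanduser call) is the file-path resolution step; everything else is flag selection based on the request.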
Choosing Between Local Whisper Skills
ClawHub has three local Whisper skills — here's how they compare:
| | openai-whisper (this) | faster-whisper | local-whisper (araa47) |
|---|---|---|---|
| Backend | OpenAI Whisper CLI (PyTorch) | CTranslate2 | Whisper via uv + Python venv |
| API key required | No | No | No |
| Speed | Moderate | 4–6x faster (CPU), ~20x (GPU) | Similar to this skill |
| Speaker diarization | No | Yes | No |
| Word-level timestamps | No | Yes | Yes |
| YouTube URL input | No | Yes | No |
| Output formats | txt/srt/vtt/tsv/json | SRT/VTT/ASS/LRC + more | JSON + quiet mode |
| Setup | brew install | More complex | uv based |
Recommendation:
- This skill — simplest path: one brew install, done. Good for most users.
- faster-whisper — power users who need speed, multi-speaker recordings, or batch processing.
- local-whisper — if you prefer uv/venv isolation and need JSON output or quiet mode.
Local vs. API: Cost and Privacy
| | Local (this skill) | API (openai-whisper-api) |
|---|---|---|
| Cost | Free forever | $0.006/min |
| 10-hour archive | $0 | $3.60 |
| Privacy | 100% on-device | Audio sent to OpenAI |
| Speed | Depends on hardware | Fast, consistent |
| Latest models | Whisper only | GPT-4o transcribe, diarization |
| Internet required | No | Yes |
| Setup | brew install | API key required |
Use local if: you handle sensitive audio, transcribe in bulk, or want zero ongoing cost. Use API if: you need speaker diarization, the very latest accuracy, or have limited compute.
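The break-even arithmetic behind the table is straightforward; a quick sketch using the $0.006/min API rate quoted above:

```python
def api_cost_usd(minutes: float, rate_per_min: float = 0.006) -> float:
    """Whisper API cost at the $0.006/min rate quoted above."""
    return round(minutes * rate_per_min, 2)

# The 10-hour archive from the table is 600 minutes of audio
print(api_cost_usd(600))  # → 3.6
```

The per-file cost is small, but it scales linearly with volume, while the local skill's marginal cost stays at zero.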
Considerations
- macOS only — the Homebrew install path is macOS-specific; Linux users can install openai-whisper via pip
- First run is slow — model download happens on first use; large models take several minutes on slow connections
- CPU-heavy — the large model on CPU can be very slow; M-series Macs handle it well via Metal GPU acceleration
- English bias — Whisper is trained heavily on English; accuracy varies by language, especially for low-resource languages
- No speaker diarization — local Whisper doesn't identify who said what; for multi-speaker recordings, the API version or separate tools are needed
The Bigger Picture
The openai-whisper local skill represents a broader trend: open-weight models making cloud services optional for an entire category of AI tasks. For speech recognition, OpenAI's decision to release Whisper publicly means teams handling HIPAA data, legal recordings, or proprietary business discussions can get state-of-the-art transcription without sending audio to a third party. Steipete — who also authored the popular apple-reminders skill — packaged this into a seamless Clawdbot integration that takes seconds to install.
View the skill on ClawHub: openai-whisper