OpenAI Whisper API Skill: One-Command Audio Transcription with Cloud-Grade Accuracy
14,224 downloads and 30 stars. The OpenAI Whisper API skill — by steipete (Peter Steinberger, founder of OpenClaw/PSPDFKit) — wraps OpenAI's /v1/audio/transcriptions endpoint in a clean bash script that any agent can call with a single command.
The setup is minimal: one environment variable (OPENAI_API_KEY), one curl dependency, one command. The output is a text file or JSON alongside your audio file.
The Problem It Solves
Audio is the last major unstructured format that most AI agent pipelines don't handle natively. Meeting recordings, podcast episodes, voice memos, interview recordings, customer support calls — all of these exist as audio, and most agent workflows have no way to read them.
Whisper, OpenAI's speech recognition model, is among the most accurate transcription systems available: ~92% accuracy overall (8% Word Error Rate), strong performance on accented speech and technical vocabulary, and support for 99 languages. The Whisper API skill makes this accessible to any OpenClaw agent with no infrastructure setup.
Note: steipete published two complementary Whisper skills. This one (`openai-whisper-api`) calls OpenAI's cloud API and requires an `OPENAI_API_KEY`. The other (`openai-whisper`) runs Whisper locally with no API key required — 38,900+ downloads and 187 stars. Choose based on whether you want zero cost (local) or zero infrastructure (API).
Core Concept: curl-Based Transcription
The skill is intentionally simple. A bash script (scripts/transcribe.sh) wraps a curl call to OpenAI's audio transcription endpoint. The only system requirement is curl — no Python, no Node.js, no local model downloads.
```shell
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a
```

Defaults:

- Model: `whisper-1` (OpenAI's hosted Whisper)
- Output: `<input>.txt` next to the input file
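Under the hood, the script is a thin wrapper over a multipart curl upload. Below is a minimal sketch of the equivalent request, following OpenAI's documented /v1/audio/transcriptions endpoint; the `transcribe_sketch` function and `DRY_RUN` flag are illustrative names for this example, not part of the skill:

```shell
#!/bin/sh
# Illustrative sketch of the request the skill's script wraps, based on
# OpenAI's documented /v1/audio/transcriptions endpoint. The function
# name and DRY_RUN flag are made up for this example.
transcribe_sketch() {
  audio="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the request instead of sending it (no API key needed).
    echo "curl -s https://api.openai.com/v1/audio/transcriptions" \
         "-H 'Authorization: Bearer \$OPENAI_API_KEY'" \
         "-F file=@$audio -F model=whisper-1"
  else
    curl -s https://api.openai.com/v1/audio/transcriptions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -F "file=@$audio" \
      -F "model=whisper-1"
  fi
}

DRY_RUN=1 transcribe_sketch /path/to/audio.m4a
```

With `DRY_RUN=1` the function prints the request instead of sending it, which is handy for inspecting what would go over the wire.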
Deep Dive
Basic Transcription
```shell
# Transcribe and save to audio.m4a.txt
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a

# Specify output path
{baseDir}/scripts/transcribe.sh /path/to/audio.ogg --out /tmp/transcript.txt
```

Language Hint
Providing a language hint improves both speed and accuracy:
```shell
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a --language en
```

Supported languages include all major European, Asian, and Middle Eastern languages. Whisper auto-detects the language if the hint is omitted, but providing it eliminates detection overhead.
Custom Prompt for Proper Nouns
Whisper's accuracy on technical terms, speaker names, and brand names can be improved with a prompt:
```shell
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a --prompt "Speaker names: Peter, Daniel. Company: PSPDFKit."
```

The prompt helps Whisper bias toward the vocabulary you expect. Use it for meetings with specific participants, technical domains, or proprietary names.
JSON Output with Timestamps
For workflows that need timestamped segments (building a table of contents, linking to specific moments):
```shell
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a --json --out /tmp/transcript.json
```

The JSON response includes word-level or segment-level timestamps depending on the `timestamp_granularities` parameter.
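As a sketch of what timestamped output enables, the loop below turns segment start times into a rough table of contents. The JSON file here is a hand-written stand-in for a segment-style response (field names assumed, not real API output), and the parsing relies on shell pattern matching only to stay dependency-free; a real pipeline should prefer `jq`:

```shell
#!/bin/sh
# Illustrative only: /tmp/transcript.json below is a hand-made sample,
# not real API output; use jq or a proper JSON parser in production.
cat > /tmp/transcript.json <<'EOF'
{"segments":[{"start":0.0,"end":4.2,"text":" Welcome to the meeting."},{"start":272.5,"end":280.1,"text":" Next: the Q3 roadmap."}]}
EOF

# Emit one "MM:SS text" line per segment.
grep -o '"start":[0-9.]*,"end":[0-9.]*,"text":"[^"]*"' /tmp/transcript.json |
while IFS= read -r seg; do
  start=${seg#*\"start\":}; start=${start%%,*}; secs=${start%.*}
  text=${seg#*\"text\":\"}; text=${text%\"}
  printf '%02d:%02d%s\n' $((secs / 60)) $((secs % 60)) "$text"
done > /tmp/toc.txt

cat /tmp/toc.txt
```

The result is one `MM:SS` line per segment, ready to paste into meeting notes or link back to moments in the recording.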
API Key Configuration
Two options:
Environment variable (recommended):
```shell
export OPENAI_API_KEY="sk-..."
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a
```

Config file:
```json
// ~/.clawdbot/clawdbot.json
{
  "skills": {
    "openai-whisper-api": {
      "apiKey": "YOUR_OPENAI_API_KEY"
    }
  }
}
```

Supported Audio Formats
OpenAI's Whisper API accepts: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
Maximum file size: 25MB. For longer recordings, split the audio before transcribing (tools like ffmpeg work well for this).
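A pre-flight size check avoids a rejected upload. This is a sketch built around the documented 25MB cap; `check_size` is a hypothetical helper, not part of the skill:

```shell
#!/bin/sh
# Pre-flight check against the API's 25MB cap. check_size is a
# hypothetical helper, not part of the skill.
check_size() {
  file="$1"
  max=$((25 * 1024 * 1024))
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt "$max" ]; then
    echo "too large ($size bytes): split with ffmpeg first"
    return 1
  fi
  echo "ok: $size bytes"
}

# Demo on a tiny placeholder file (stands in for a real recording).
printf 'short sample audio placeholder' > /tmp/sample.bin
check_size /tmp/sample.bin
```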
Comparison: Whisper API vs. Alternatives
| Solution | Accuracy (WER) | Languages | No Infra | Cost |
|---|---|---|---|---|
| OpenAI Whisper API | ~8% | ✅ 99 | ✅ | $0.006/min |
| AssemblyAI Universal-2 | ~8.4% (best accuracy) | ✅ | ✅ | $0.0025/min |
| Deepgram Nova-2 | Competitive | ✅ 30+ | ✅ | $0.0043/min |
| Local Whisper (large-v3) | ~8% | ✅ 99 | ❌ Needs GPU | Free at scale |
| Google Speech-to-Text | Higher WER | ✅ 125 | ✅ | $0.016/min |
| OpenAI gpt-4o-transcribe | Lower WER than Whisper | ✅ | ✅ | Higher |
Note: AssemblyAI Universal-2 is now slightly cheaper and slightly more accurate than Whisper API for pre-recorded content. Whisper's main advantages are 99-language breadth and the simplicity of staying within the OpenAI ecosystem.
At $0.006/minute, Whisper API is cost-competitive while requiring zero local infrastructure. A 1-hour meeting costs $0.36 to transcribe.
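That arithmetic is easy to automate. Below is a hypothetical helper in whole US cents (shell arithmetic has no floats), assuming the $0.006/min price above:

```shell
#!/bin/sh
# Transcription cost in whole US cents at $0.006/min (0.6 cents/min).
# cost_cents is a hypothetical helper; integer math truncates fractions.
cost_cents() {
  minutes="$1"
  echo $(( minutes * 6 / 10 ))
}

cost_cents 60   # 1-hour meeting  -> 36 (i.e. $0.36)
cost_cents 25   # 25-minute chunk -> 15 (i.e. $0.15)
```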
How to Install
```shell
clawhub install openai-whisper-api
```

Requirements:

- `curl` (standard on macOS/Linux; Windows users may need to verify)
- `OPENAI_API_KEY` environment variable
No Python, no Node.js, no model downloads. The skill is self-contained.
Practical Tips
- Always provide `--language` for non-English audio. Auto-detection adds latency and occasional errors on short clips. If you know the language, specify it.
- Use `--prompt` for domain-specific content. Technical interviews, medical consultations, and engineering discussions all have vocabulary that Whisper may misrecognize. A 20-word prompt dramatically improves accuracy for these cases.
- Split long recordings before sending. The 25MB limit means approximately 25 minutes of typical MP3 audio (128kbps). Use `ffmpeg -i input.mp3 -f segment -segment_time 900 -c copy output_%03d.mp3` to split by time.
- Combine with Claude for meeting summaries. The typical workflow: transcribe with Whisper → pass `full_text` to Claude with a summarization prompt → extract action items. The skill handles step one reliably.
- Save both `.txt` and `--json` formats. Plain text for immediate summarization; JSON for building search indexes or linking to specific timestamps.
- The `--prompt` field persists across the whole transcription. It's not a prefix — Whisper uses it as style guidance throughout. Keep prompts factual (names, terms) rather than instructional.
Considerations
- API key required. Unlike the `yahoo-finance` or `web-search` skills, this one requires a paid OpenAI API key. Whisper API is not available on the free tier.
- 25MB file size limit. This covers approximately 25–30 minutes of standard quality audio. For longer recordings, implement a split-and-concatenate workflow in your agent.
- Not real-time. The Whisper API processes complete files, not streaming audio. For real-time transcription, use OpenAI's Realtime API (different endpoint, different skill).
- Accuracy degrades with background noise. In controlled environments (meetings, voice memos), Whisper is extremely accurate. In noisy environments (street recordings, crowded rooms), expect significant degradation.
- No speaker diarization. Whisper doesn't identify who is speaking. For multi-speaker recordings that need speaker labels, consider Pyannote (open source) or AssemblyAI (paid) for post-processing.
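The split-and-concatenate workflow mentioned above can be sketched as follows. The ffmpeg and transcription calls are shown as comments and stubbed with placeholder files, so the chunk-ordering and concatenation logic runs without ffmpeg or an API key:

```shell
#!/bin/sh
# Sketch of a split-and-concatenate workflow. The ffmpeg and API calls
# are shown as comments; chunk transcripts are stubbed with placeholder
# text so the ordering/concatenation logic runs standalone.
workdir=$(mktemp -d)

# Real pipeline, step 1 - split into 15-minute chunks:
#   ffmpeg -i input.mp3 -f segment -segment_time 900 -c copy "$workdir/chunk_%03d.mp3"

for i in 000 001 002; do
  # Real pipeline, step 2 - transcribe each chunk:
  #   {baseDir}/scripts/transcribe.sh "$workdir/chunk_$i.mp3" --out "$workdir/chunk_$i.txt"
  echo "transcript of chunk $i" > "$workdir/chunk_$i.txt"
done

# Step 3 - concatenate; chunk_%03d names sort lexicographically, so the
# glob preserves chronological order.
cat "$workdir"/chunk_*.txt > "$workdir/full_transcript.txt"
wc -l < "$workdir/full_transcript.txt"
```

Zero-padded chunk names are what make the final `cat` safe: without padding, `chunk_10` would sort before `chunk_2`.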
The Bigger Picture
The Whisper API skill represents the simplest possible implementation of a powerful capability: converting the audio world into text that agents can reason about. By wrapping the API in a curl script, steipete made it dependency-free and universally accessible — the same pattern he applied with nano-pdf (Gemini image editing) and other lightweight utility skills.
For AI agents, the practical impact is significant. Meeting recordings, podcasts, voice messages, and phone call recordings can now become first-class input for analysis, summarization, and action item extraction. A 1-hour strategy call costs 36 cents to process and produces text that Claude can analyze in seconds.
View the skill on ClawHub: openai-whisper-api