OpenAI Whisper API Skill: One-Command Audio Transcription with Cloud-Grade Accuracy
14,224 downloads and 30 stars. The OpenAI Whisper API skill — by steipete (Peter Steinberger, founder of OpenClaw/PSPDFKit) — wraps OpenAI's /v1/audio/transcriptions endpoint in a clean bash script that any agent can call with a single command.
The setup is minimal: one environment variable (OPENAI_API_KEY), one curl dependency, one command. The output is a text file or JSON alongside your audio file.
The Problem It Solves
Audio is the last major unstructured format that most AI agent pipelines don't handle natively. Meeting recordings, podcast episodes, voice memos, interview recordings, customer support calls — all of these exist as audio, and most agent workflows have no way to read them.
Whisper, OpenAI's speech recognition model, is among the most accurate transcription systems available: ~92% accuracy overall (8% Word Error Rate), strong performance on accented speech and technical vocabulary, and support for 99 languages. The Whisper API skill makes this accessible to any OpenClaw agent with no infrastructure setup.
Note: steipete published two complementary Whisper skills. This one (`openai-whisper-api`) calls OpenAI's cloud API and requires an `OPENAI_API_KEY`. The other (`openai-whisper`) runs Whisper locally with no API key required — 38,900+ downloads and 187 stars. Choose based on whether you want zero cost (local) or zero infrastructure (API).
Core Concept: curl-Based Transcription
The skill is intentionally simple. A bash script (scripts/transcribe.sh) wraps a curl call to OpenAI's audio transcription endpoint. The only system requirement is curl — no Python, no Node.js, no local model downloads.
```shell
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a
```

Defaults:

- Model: `whisper-1` (OpenAI's hosted Whisper)
- Output: `<input>.txt` next to the input file
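Under the hood, the script is a thin wrapper over a multipart curl upload. Below is a minimal sketch of the equivalent request, following OpenAI's documented /v1/audio/transcriptions endpoint; the `transcribe_sketch` function and `DRY_RUN` flag are illustrative names for this example, not part of the skill:

```shell
#!/bin/sh
# Illustrative sketch of the request the skill's script wraps, based on
# OpenAI's documented /v1/audio/transcriptions endpoint. The function
# name and DRY_RUN flag are made up for this example.
transcribe_sketch() {
  audio="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the request instead of sending it (no API key needed).
    echo "curl -s https://api.openai.com/v1/audio/transcriptions" \
         "-H 'Authorization: Bearer \$OPENAI_API_KEY'" \
         "-F file=@$audio -F model=whisper-1"
  else
    curl -s https://api.openai.com/v1/audio/transcriptions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -F "file=@$audio" \
      -F "model=whisper-1"
  fi
}

DRY_RUN=1 transcribe_sketch /path/to/audio.m4a
```

With `DRY_RUN=1` the function prints the request instead of sending it, which is handy for inspecting what would go over the wire.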
Deep Dive
Basic Transcription
```shell
# Transcribe and save to audio.m4a.txt
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a

# Specify output path
{baseDir}/scripts/transcribe.sh /path/to/audio.ogg --out /tmp/transcript.txt
```

Language Hint
Providing a language hint improves both speed and accuracy:
```shell
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a --language en
```

Supported languages include all major European, Asian, and Middle Eastern languages. Whisper auto-detects the language if the hint is omitted, but providing it eliminates detection overhead.
Custom Prompt for Proper Nouns
Whisper's accuracy on technical terms, speaker names, and brand names can be improved with a prompt:
```shell
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a --prompt "Speaker names: Peter, Daniel. Company: PSPDFKit."
```

The prompt helps Whisper bias toward the vocabulary you expect. Use it for meetings with specific participants, technical domains, or proprietary names.
JSON Output with Timestamps
For workflows that need timestamped segments (building a table of contents, linking to specific moments):
```shell
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a --json --out /tmp/transcript.json
```

The JSON response includes word-level or segment-level timestamps depending on the `timestamp_granularities` parameter.
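As a sketch of what timestamped output enables, the loop below turns segment start times into a rough table of contents. The JSON file here is a hand-written stand-in for a segment-style response (field names assumed, not real API output), and the parsing relies on shell pattern matching only to stay dependency-free; a real pipeline should prefer `jq`:

```shell
#!/bin/sh
# Illustrative only: /tmp/transcript.json below is a hand-made sample,
# not real API output; use jq or a proper JSON parser in production.
cat > /tmp/transcript.json <<'EOF'
{"segments":[{"start":0.0,"end":4.2,"text":" Welcome to the meeting."},{"start":272.5,"end":280.1,"text":" Next: the Q3 roadmap."}]}
EOF

# Emit one "MM:SS text" line per segment.
grep -o '"start":[0-9.]*,"end":[0-9.]*,"text":"[^"]*"' /tmp/transcript.json |
while IFS= read -r seg; do
  start=${seg#*\"start\":}; start=${start%%,*}; secs=${start%.*}
  text=${seg#*\"text\":\"}; text=${text%\"}
  printf '%02d:%02d%s\n' $((secs / 60)) $((secs % 60)) "$text"
done > /tmp/toc.txt

cat /tmp/toc.txt
```

The result is one `MM:SS` line per segment, ready to paste into meeting notes or link back to moments in the recording.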
API Key Configuration
Two options:
Environment variable (recommended):
```shell
export OPENAI_API_KEY="sk-..."
{baseDir}/scripts/transcribe.sh /path/to/audio.m4a
```

Config file:
```json
// ~/.clawdbot/clawdbot.json
{
  "skills": {
    "openai-whisper-api": {
      "apiKey": "YOUR_OPENAI_API_KEY"
    }
  }
}
```

Supported Audio Formats
OpenAI's Whisper API accepts: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
Maximum file size: 25MB. For longer recordings, split the audio before transcribing (tools like ffmpeg work well for this).
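A pre-flight size check avoids a rejected upload. This is a sketch built around the documented 25MB cap; `check_size` is a hypothetical helper, not part of the skill:

```shell
#!/bin/sh
# Pre-flight check against the API's 25MB cap. check_size is a
# hypothetical helper, not part of the skill.
check_size() {
  file="$1"
  max=$((25 * 1024 * 1024))
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt "$max" ]; then
    echo "too large ($size bytes): split with ffmpeg first"
    return 1
  fi
  echo "ok: $size bytes"
}

# Demo on a tiny placeholder file (stands in for a real recording).
printf 'short sample audio placeholder' > /tmp/sample.bin
check_size /tmp/sample.bin
```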
Comparison: Whisper API vs. Alternatives
| Solution | Accuracy (WER) | Languages | No Infra | Cost |
|---|---|---|---|---|
| OpenAI Whisper API | ~8% | ✅ 99 | ✅ | $0.006/min |
| AssemblyAI Universal-2 | ~8.4% (best accuracy) | ✅ | ✅ | $0.0025/min |
| Deepgram Nova-2 | Competitive | ✅ 30+ | ✅ | $0.0043/min |
| Local Whisper (large-v3) | ~8% | ✅ 99 | ❌ Needs GPU | Free at scale |
| Google Speech-to-Text | Higher WER | ✅ 125 | ✅ | $0.016/min |
| OpenAI gpt-4o-transcribe | Lower WER than Whisper | ✅ | ✅ | Higher |
Note: AssemblyAI Universal-2 is now slightly cheaper and slightly more accurate than Whisper API for pre-recorded content. Whisper's main advantages are 99-language breadth and the simplicity of staying within the OpenAI ecosystem.
At $0.006/minute, Whisper API is cost-competitive while requiring zero local infrastructure. A 1-hour meeting costs $0.36 to transcribe.
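That arithmetic is easy to automate. Below is a hypothetical helper in whole US cents (shell arithmetic has no floats), assuming the $0.006/min price above:

```shell
#!/bin/sh
# Transcription cost in whole US cents at $0.006/min (0.6 cents/min).
# cost_cents is a hypothetical helper; integer math truncates fractions.
cost_cents() {
  minutes="$1"
  echo $(( minutes * 6 / 10 ))
}

cost_cents 60   # 1-hour meeting  -> 36 (i.e. $0.36)
cost_cents 25   # 25-minute chunk -> 15 (i.e. $0.15)
```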
How to Install
```shell
clawhub install openai-whisper-api
```

Requirements:

- `curl` (standard on macOS/Linux; Windows users may need to verify)
- `OPENAI_API_KEY` environment variable
No Python, no Node.js, no model downloads. The skill is self-contained.
Practical Tips
- Always provide `--language` for non-English audio. Auto-detection adds latency and occasional errors on short clips. If you know the language, specify it.
- Use `--prompt` for domain-specific content. Technical interviews, medical consultations, and engineering discussions all have vocabulary that Whisper may misrecognize. A 20-word prompt dramatically improves accuracy for these cases.
- Split long recordings before sending. The 25MB limit means approximately 25 minutes of typical MP3 audio (128kbps). Use `ffmpeg -i input.mp3 -f segment -segment_time 900 -c copy output_%03d.mp3` to split by time.
- Combine with Claude for meeting summaries. The typical workflow: transcribe with Whisper → pass `full_text` to Claude with a summarization prompt → extract action items. The skill handles step one reliably.
- Save both `.txt` and `--json` formats. Plain text for immediate summarization; JSON for building search indexes or linking to specific timestamps.
- The `--prompt` field persists across the whole transcription. It's not a prefix — Whisper uses it as style guidance throughout. Keep prompts factual (names, terms) rather than instructional.
Considerations
- API key required. Unlike the `yahoo-finance` or `web-search` skills, this one requires a paid OpenAI API key. Whisper API is not available on the free tier.
- 25MB file size limit. This covers approximately 25–30 minutes of standard quality audio. For longer recordings, implement a split-and-concatenate workflow in your agent.
- Not real-time. The Whisper API processes complete files, not streaming audio. For real-time transcription, use OpenAI's Realtime API (different endpoint, different skill).
- Accuracy degrades with background noise. In controlled environments (meetings, voice memos), Whisper is extremely accurate. In noisy environments (street recordings, crowded rooms), expect significant degradation.
- No speaker diarization. Whisper doesn't identify who is speaking. For multi-speaker recordings that need speaker labels, consider Pyannote (open source) or AssemblyAI (paid) for post-processing.
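The split-and-concatenate workflow mentioned above can be sketched as follows. The ffmpeg and transcription calls are shown as comments and stubbed with placeholder files, so the chunk-ordering and concatenation logic runs without ffmpeg or an API key:

```shell
#!/bin/sh
# Sketch of a split-and-concatenate workflow. The ffmpeg and API calls
# are shown as comments; chunk transcripts are stubbed with placeholder
# text so the ordering/concatenation logic runs standalone.
workdir=$(mktemp -d)

# Real pipeline, step 1 - split into 15-minute chunks:
#   ffmpeg -i input.mp3 -f segment -segment_time 900 -c copy "$workdir/chunk_%03d.mp3"

for i in 000 001 002; do
  # Real pipeline, step 2 - transcribe each chunk:
  #   {baseDir}/scripts/transcribe.sh "$workdir/chunk_$i.mp3" --out "$workdir/chunk_$i.txt"
  echo "transcript of chunk $i" > "$workdir/chunk_$i.txt"
done

# Step 3 - concatenate; chunk_%03d names sort lexicographically, so the
# glob preserves chronological order.
cat "$workdir"/chunk_*.txt > "$workdir/full_transcript.txt"
wc -l < "$workdir/full_transcript.txt"
```

Zero-padded chunk names are what make the final `cat` safe: without padding, `chunk_10` would sort before `chunk_2`.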
The Bigger Picture
The Whisper API skill represents the simplest possible implementation of a powerful capability: converting the audio world into text that agents can reason about. By wrapping the API in a curl script, steipete made it dependency-free and universally accessible — the same pattern he applied with nano-pdf (Gemini image editing) and other lightweight utility skills.
For AI agents, the practical impact is significant. Meeting recordings, podcasts, voice messages, and phone call recordings can now become first-class input for analysis, summarization, and action item extraction. A 1-hour strategy call costs 36 cents to process and produces text that Claude can analyze in seconds.
View the skill on ClawHub: openai-whisper-api