mlx-audio-server
Local 24x7 OpenAI-compatible API server for STT/TTS, powered by MLX on your Mac.
Install via ClawdBot CLI:
clawdbot install guoqiao/mlx-audio-server
mlx-audio: The best audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.
guoqiao/tap/mlx-audio-server: Homebrew Formula to install mlx-audio with brew, and run mlx_audio.server as a LaunchAgent service on macOS.
mlx: macOS with Apple Silicon
brew: used to install deps if not available
bash ${baseDir}/install.sh
This script will:
mlx-audio-server from guoqiao/tap
mlx-audio-server
STT/Speech-To-Text (default model: mlx-community/glm-asr-nano-2512-8bit):
# input is converted to wav with ffmpeg if it is not already wav.
# output will be transcript text only.
bash ${baseDir}/run_stt.sh <audio_or_video_path>
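The steps in the comments above can be sketched roughly as follows. This is not the shipped `run_stt.sh`, only an illustration of the flow. The server address `http://localhost:8000`, the `/v1/audio/transcriptions` path and the `text` response field (both borrowed from the OpenAI API convention), the `MLX_AUDIO_URL` and `STT_MODEL` variables, and the helper names are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, NOT the shipped run_stt.sh.
# Assumed: server address, endpoint path and response shape (OpenAI convention).
MLX_AUDIO_URL="${MLX_AUDIO_URL:-http://localhost:8000}"
STT_MODEL="${STT_MODEL:-mlx-community/glm-asr-nano-2512-8bit}"

# Print the wav file name for an input path: directory dropped, extension replaced.
wav_name_for() {
  base=$(basename "$1")
  printf '%s.wav\n' "${base%.*}"
}

# Convert the input to wav if needed, upload it, and print the transcript text only.
transcribe() {
  input=$1
  wav=$input
  case "$input" in
    *.wav) ;;  # already wav, upload as-is
    *)
      wav="$(mktemp -d)/$(wav_name_for "$input")"
      ffmpeg -loglevel error -y -i "$input" "$wav" </dev/null
      ;;
  esac
  curl -sS --fail -X POST "$MLX_AUDIO_URL/v1/audio/transcriptions" \
    -F "file=@$wav" -F "model=$STT_MODEL" | jq -r '.text'
}

# Run only when a path is given on the command line.
if [ "$#" -gt 0 ]; then transcribe "$1"; fi
```

Converting into a temporary directory keeps the original file untouched, and `jq -r '.text'` is what would reduce the response to plain transcript text if the server follows the OpenAI response shape.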
TTS/Text-To-Speech (default model: mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16):
# audio is saved to a tmp dir with the default name `speech.wav`, and its path is printed to stdout.
bash ${baseDir}/run_tts.sh "Hello, Human!"
# or you can specify an output dir
bash ${baseDir}/run_tts.sh "Hello, Human!" ./output
# output will be audio path only.
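A request of this kind could look roughly like the sketch below. It is not the shipped `run_tts.sh`. The server address `http://localhost:8000`, the `/v1/audio/speech` path, the `response_format` field (both following the OpenAI API convention), the `MLX_AUDIO_URL` and `TTS_MODEL` variables, and the helper names are assumptions. The server may also expect further fields such as a voice.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, NOT the shipped run_tts.sh.
# Assumed: server address, endpoint path and request fields (OpenAI convention).
MLX_AUDIO_URL="${MLX_AUDIO_URL:-http://localhost:8000}"
TTS_MODEL="${TTS_MODEL:-mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16}"

# Print the target audio path: <dir>/speech.wav, using a fresh tmp dir
# when no directory is given. The directory is created if missing.
speech_path() {
  out_dir=${1:-$(mktemp -d)}
  mkdir -p "$out_dir"
  printf '%s/speech.wav\n' "$out_dir"
}

# Send the text to the server, save the audio, and print the audio path only.
synthesize() {
  text=$1
  out=$(speech_path "${2:-}")
  # jq builds the JSON body so quotes and newlines in the text are escaped.
  jq -cn --arg model "$TTS_MODEL" --arg input "$text" \
      '{model: $model, input: $input, response_format: "wav"}' |
    curl -sS --fail -X POST "$MLX_AUDIO_URL/v1/audio/speech" \
      -H 'Content-Type: application/json' --data-binary @- -o "$out" &&
    printf '%s\n' "$out"
}

# Run only when text is given on the command line.
if [ "$#" -gt 0 ]; then synthesize "$@"; fi
```

Building the body with `jq` rather than string concatenation avoids broken JSON when the text contains quotes, and printing the path only on success keeps the output safe to pipe into other tools.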
You can use both scripts directly, or as a reference for your own integration.
Generated Mar 1, 2026
Podcasters and video creators can use this skill to transcribe audio or video files locally on their Mac without relying on cloud services, ensuring privacy and reducing costs. It's ideal for generating subtitles, show notes, or repurposing content into text-based formats like blog posts.
Schools and universities can deploy this on Mac mini servers to provide speech-to-text and text-to-speech services for students with disabilities, such as converting lecture recordings to text or creating audio versions of study materials. It offers a low-cost, on-premise solution that complies with data privacy regulations.
Developers building AI or voice applications can use this skill as a local, OpenAI-compatible API server to test speech recognition and synthesis features without internet dependency. It accelerates prototyping for apps like voice assistants, transcription tools, or interactive media on Apple Silicon devices.
Small teams can run this skill on a shared MacBook to transcribe internal meetings or customer calls locally, keeping sensitive discussions secure and avoiding subscription fees. The output can be used for minutes, action items, or training documentation.
Offer a free version with basic STT/TTS models and charge for premium features like advanced models, higher accuracy, or commercial licensing. This targets developers and small businesses looking for cost-effective, privacy-focused alternatives to cloud APIs.
Partner with Apple resellers to pre-install this skill on Mac mini or MacBook devices sold as dedicated transcription or accessibility workstations. This provides an out-of-the-box solution for industries like education or healthcare, with support and maintenance contracts.
Provide consulting and integration services to large organizations needing tailored STT/TTS solutions, such as integrating with existing workflows or training custom models. This leverages the local, secure nature of the skill for compliance-heavy sectors like finance or legal.
💬 Integration Tip
Ensure ffmpeg and jq are installed via brew for audio processing, and use the provided scripts as examples to integrate STT/TTS into custom applications via the local API server.
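The dependency check in this tip can be sketched as below. The helper names `missing_tools` and `install_missing` are hypothetical and not part of the skill; the sketch only assumes that Homebrew formulae named `ffmpeg` and `jq` provide the commands of the same name.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the skill's own scripts.

# Print each named command that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Install whichever of ffmpeg and jq is missing, using Homebrew.
install_missing() {
  for tool in $(missing_tools ffmpeg jq); do
    brew install "$tool"
  done
}
```

Calling `install_missing` once before the first STT or TTS request is enough; it does nothing when both tools are already present.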