Package Exports
- @tuan_son.dinh/agent-voice-mcp
- @tuan_son.dinh/agent-voice-mcp/bin/claude-voice.js
agent-voice-mcp: Voice MCP Server
A Model Context Protocol (MCP) server that provides voice input/output integration for MCP clients such as Claude Code and Codex, enabling hands-free interaction and voice-driven workflows on macOS.
Talk to Claude Code or Codex by voice.
Typical flow:
You: Enable voice mode
Claude: Now in voice mode. What would you like to do?
You: Open the auth flow and fix the login bug OVER
Claude: I’ll inspect the auth flow and the login path now. Anything else?
You: No, continue OVER

How it works:
- Claude asks questions aloud
- you answer by voice
- OPTIONAL: say OVER when you want your reply submitted right away; otherwise wait 3 seconds
- if you pause briefly, Claude can still wait for continuation before finalizing the turn
Quick Install
1. Install system dependencies on macOS
```shell
brew install espeak-ng ffmpeg
```
2. Register the MCP server in Claude Code
```shell
claude mcp add agent-voice-mcp npx @tuan_son.dinh/agent-voice-mcp
```
This is the main install path. No repo clone is required.
Or add manually to ~/.claude.json under mcpServers:
```json
{
  "agent-voice-mcp": {
    "command": "npx",
    "args": ["@tuan_son.dinh/agent-voice-mcp"]
  }
}
```
On first run the server auto-installs its Python dependencies (~500MB). Subsequent starts are instant.
3. Restart Claude Code
If Claude Code is already running, restart it so the new MCP server is loaded.
That is enough for installation. No CLAUDE.md changes are required just to make the MCP server available.
Optional: Global Voice-Mode Prompt
The MCP server works without any prompt-file changes, but users will not get the same hands-free workflow unless their global prompt tells the agent when to enter voice mode and how to stay there.
If you want the same experience, add a section like this to your global CLAUDE.md:
## Voice I/O
- `agent-voice-mcp` MCP provides voice input/output integration for hands-free workflows
- Tool: `mcp__agent-voice-mcp__ask_user_voice` speaks questions aloud and records voice replies
- Tool: `mcp__agent-voice-mcp__speak_message` speaks short confirmations or updates
### Voice Mode Toggle
- Voice mode is OFF by default
- Voice mode is activated when your preferred trigger fires, for example a keybinding or explicit user request
- On activation, call `mcp__agent-voice-mcp__ask_user_voice` with: "Now in voice mode. What would you like to do?"
- Stay in voice mode for the rest of the conversation until the user explicitly exits it
### Voice Mode Rules
- While in voice mode, never ask for typed replies
- Use `mcp__agent-voice-mcp__ask_user_voice` for all follow-up questions
- Use `mcp__agent-voice-mcp__speak_message` for short spoken confirmations when useful
- Exit voice mode only when the user says something like "exit voice mode", "stop voice mode", or "back to text"
- Do not leave voice mode because of errors or tool failures; keep interacting by voice unless the user explicitly switches back to text

This prompt setup is recommended if you want the full voice-mode behavior. MCP installation alone only makes the tools available.
How To Use
Once the MCP server is installed, there are two common ways to start using it:
1. Ask Claude to enable voice mode
If your prompt setup includes voice-mode instructions, tell Claude something like:
Enable voice mode
or:
Switch to voice mode
Claude should then start using `ask_user_voice` for spoken interaction instead of asking for typed replies.
2. Trigger voice mode with a hook or keybinding
If you have a local hook configured, you can bind a key such as V to trigger voice mode.
Typical behavior:
- press V
- your hook tells Claude to enter voice mode
- Claude calls `ask_user_voice` with an initial spoken prompt such as: "Now in voice mode. What would you like to do?"
This is optional client-side setup. The MCP server does not create the V keybinding by itself; your local Claude workflow or hook needs to do that.
Leaving voice mode
If your prompt setup mirrors the example in this README, exit by saying something like:
exit voice mode
stop voice mode
back to text
How Voice Replies Are Submitted
Users should be told explicitly how reply submission works.
The important rule is:
Say OVER to submit your response.

More precisely, the current behavior is:
- the listener closes one spoken segment after about 500ms of silence
- after that, the server waits up to 3 seconds for more speech
- if the last spoken segment ends with OVER, the response is submitted immediately after that segment closes
- OVER is removed from the final transcript and treated as a submit keyword, not part of the answer
Practical examples:
"Please update the login page OVER"
This submits the reply as soon as the segment closes.

"Please update the login page"
This does not submit immediately. The server will still wait briefly for continuation before finalizing the reply.

"Please update the login page ... and also fix the header OVER"
This can span multiple spoken segments. The reply is finalized when the final segment ends with OVER.
Recommended wording for users:
- say your answer normally
- end with OVER when you want Claude to submit it
- if you do not say OVER, Claude may wait for a short continuation window before treating the response as finished
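The timing rules above can be sketched as a small collection loop. This is an illustrative model of the behavior, not the server's actual code; `get_next_segment` is a hypothetical stand-in for the real listener, and the 3-second window is the continuation wait described above.

```python
import re

CONTINUATION_S = 3.0  # how long to wait for more speech after a segment closes

def finalize_reply(get_next_segment):
    """Collect spoken segments into one reply.

    `get_next_segment(timeout)` is a hypothetical callable: it blocks up to
    `timeout` seconds and returns the next transcribed segment (already
    closed by ~500ms of silence), or None if no further speech arrived.
    """
    parts = []
    while True:
        segment = get_next_segment(CONTINUATION_S)
        if segment is None:  # continuation window expired: finalize
            break
        parts.append(segment.strip())
        if re.search(r"\bOVER\s*$", parts[-1]):
            # OVER is a submit keyword: strip it and submit immediately.
            parts[-1] = re.sub(r"\s*\bOVER\s*$", "", parts[-1])
            break
    return " ".join(p for p in parts if p)
```

So "Please update the login page OVER" comes back as "Please update the login page", while a reply without the keyword stays open until the continuation window runs out.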
Other MCP Clients
Any stdio MCP client can run this server. For local development, the generic command is:
```json
{
  "agent-voice-mcp": {
    "command": "uv",
    "args": ["--directory", "/absolute/path/to/agent-voice-mcp", "run", "python", "-m", "lazy_claude"]
  }
}
```
Use your MCP client's normal server-registration flow and substitute the correct local path.
Features
- Voice Input: Speak questions and receive transcribed responses automatically submitted
- Text-to-Speech: Convert text responses to spoken audio with natural-sounding output
- Real-Time Recording: Stream audio input with automatic silence detection and submission
- Fast Model Loading: Efficient ML model management for speech recognition and synthesis
- Fully Integrated: Registered as the `agent-voice-mcp` MCP server in Claude Code
Prerequisites
System Requirements
- macOS (Apple Silicon or Intel); Linux/Windows supported via the fallback audio path
- Python 3.12+
- System Tools:
brew install espeak-ng ffmpeg
Dependencies
Automatically installed via pyproject.toml:
- `kokoro` — High-quality text-to-speech engine
- `pywhispercpp` — Fast speech-to-text (Whisper.cpp binding)
- `sounddevice` — Audio input/output (fallback path)
- `soundfile` — Audio file handling
- `onnxruntime` — ML inference for TTS and VAD
- `mcp[cli]` — Model Context Protocol server framework
- `numpy` — Numerical computing
macOS-specific (optional, for system AEC):
- `pyobjc-framework-AVFoundation` — AVAudioEngine bindings
- `pyobjc-framework-Foundation` — Device change notifications
Install with macOS support:
```shell
pip install -e '.[macos]'
# or
uv sync --all-extras
```
Setup
1. Clone or Navigate to Project
```shell
cd /Users/sonwork/Workspace/agent-voice-mcp
```
2. Install System Dependencies
```shell
brew install espeak-ng ffmpeg
```
3. Set Up Python Environment
```shell
# Sync dependencies using uv
uv sync

# Or use pip
pip install -e .
```
4. Register with Your MCP Client
Claude Code
The server can be registered in ~/.claude.json:
```json
{
  "agent-voice-mcp": {
    "command": "uv",
    "args": ["--directory", "/Users/sonwork/Workspace/agent-voice-mcp", "run", "python", "-m", "lazy_claude"]
  }
}
```
If not present, add this entry to the mcpServers object in ~/.claude.json.
Codex and Other MCP Clients
Register the same stdio command in your client's MCP config:
```json
{
  "agent-voice-mcp": {
    "command": "uv",
    "args": ["--directory", "/absolute/path/to/agent-voice-mcp", "run", "python", "-m", "lazy_claude"]
  }
}
```
No CLAUDE.md update is required. If your client supports project instructions, you can optionally add the short voice-tool guidance shown above.
Usage
Starting the Server
The server runs automatically when Claude Code loads the agent-voice-mcp MCP. No manual startup needed.
Available Tools
ask_user_voice(questions: list[str]) → str
Reads questions aloud and records voice replies with automatic submission.
Example:
```python
result = ask_user_voice([
    "What is your name?",
    "What task should I help you with?"
])
# Output: "Q: What is your name?\nA: John\nQ: What task should I help you with?\nA: Fix the bug in the login page"
```
Parameters:
- `questions` — List of questions to ask the user

Returns: Formatted string with Q/A pairs (one per line): `Q: question\nA: answer`
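Given that return format, a caller can split the string back into pairs. This is a client-side sketch; `parse_qa` is a hypothetical helper, not part of the package:

```python
def parse_qa(result: str) -> list[tuple[str, str]]:
    """Split an ask_user_voice result ("Q: ..." / "A: ..." lines, one pair
    per question) back into (question, answer) tuples."""
    lines = [ln for ln in result.splitlines() if ln.strip()]
    pairs = []
    for i in range(0, len(lines) - 1, 2):
        q = lines[i].removeprefix("Q: ")
        a = lines[i + 1].removeprefix("A: ")
        pairs.append((q, a))
    return pairs
```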
speak_message(text: str) → dict
Converts text to speech and plays it aloud.
Example:
result = speak_message("Your task has been completed successfully")
# Output: {"status": "spoken", "chars": 45}Parameters:
text— Message to speak
Returns: {"status": "spoken", "chars": <character count>}
toggle_listening(enabled: bool) → dict
Enable or disable the voice input listening mode (useful for background noise or multi-tasking).
Example:
```python
result = toggle_listening(enabled=False)
# Output: {"listening": False}
```
Parameters:
- `enabled` — `True` to enable, `False` to disable

Returns: `{"listening": <bool>}` — the current listening state after the change
Voice Workflow Integration
In Claude Code
If you want Claude Code to behave like a hands-free voice assistant, add the global prompt section above. A minimal version is:
Use the ask_user_voice tool to get the user's voice input on whether to proceed with deployment.

Best Practices
- Ask Single Questions: Use clear, concise questions (one per call)
- Batch When Appropriate: Group related questions in a single call to reduce back-and-forth
- Mic Is Turn-Based: Voice capture is activated only during `ask_user_voice(...)`, not as an always-listening input mode
- Handle Pauses: The listener segments after about 500ms of silence, but the server keeps the reply open for up to 3s waiting for continuation
- Use OVER To Submit: Ending a segment with OVER followed by the normal 500ms segment-close pause submits immediately and removes OVER from the returned transcript
- Provide Context: Speak response confirmations back using `speak_message` for feedback loops
Architecture
The server supports two audio backends, automatically selected at startup:
macOS System AEC Path (Preferred)
On macOS, the server uses AVAudioEngine with the Voice Processing IO audio unit
(same tech as FaceTime) for instant, hardware-level echo cancellation:
- A single `AVAudioBackend` instance manages both mic input and TTS output through one `AVAudioEngine` with voice processing enabled
- Instant echo cancellation — no convergence time, system AEC handles it all
- Mic tap at native hardware rate (44.1 or 48 kHz) → resample to 16 kHz → deliver 512-sample chunks to VAD
- No custom AEC code — system provides automatic noise suppression + AGC
- No ReferenceBuffer — system AEC sees the output automatically
- Device change recovery — engine restarts on speaker/headphone/Bluetooth switches
See ARCHITECTURE.md for detailed macOS path flow and AVAudioBackend internals.
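The mic-tap pipeline above (native hardware rate → 16 kHz → 512-sample VAD chunks) can be sketched in numpy. This is an illustration only: it uses a naive linear-interpolation resampler and drops the sub-chunk tail, whereas the real backend converts inside AVAudioEngine and buffers the remainder.

```python
import numpy as np

VAD_RATE = 16_000
VAD_CHUNK = 512  # samples per chunk delivered to the VAD

def resample_to_vad_rate(mic: np.ndarray, hw_rate: int) -> np.ndarray:
    """Naive linear-interpolation resample from the hardware rate
    (44.1 or 48 kHz) down to 16 kHz. Illustrative; a production path
    would use a proper polyphase or AVAudioConverter resampler."""
    n_out = int(len(mic) * VAD_RATE / hw_rate)
    t_out = np.linspace(0, len(mic) - 1, n_out)
    return np.interp(t_out, np.arange(len(mic)), mic).astype(np.float32)

def chunks_for_vad(mic: np.ndarray, hw_rate: int):
    """Yield fixed 512-sample chunks at 16 kHz; the tail shorter than
    one chunk is dropped here (a real pipeline would carry it over)."""
    x = resample_to_vad_rate(mic, hw_rate)
    for i in range(0, len(x) - VAD_CHUNK + 1, VAD_CHUNK):
        yield x[i:i + VAD_CHUNK]
```

One second of 48 kHz audio becomes 16,000 samples at 16 kHz, i.e. 31 full VAD chunks.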
Fallback Path (Non-macOS or on Failure)
When macOS AVAudioEngine is unavailable or fails, the server falls back to sounddevice
with a custom adaptive AEC pipeline:
- PBFDLMS adaptive filter — Partitioned-block frequency-domain LMS filter that learns and cancels the speaker-to-microphone acoustic path in real time.
- Residual Echo Suppression (RES) — Spectral subtraction post-filter that catches echo components the adaptive filter misses (especially during early convergence).
- Geigel Double-Talk Detector (DTD) — Freezes filter adaptation when user speech and TTS overlap, preventing filter divergence.
- Fallback gate — Suppresses chunks with high residual power during TTS playback when AEC has not yet converged (safety net for the first few seconds).
- AEC calibration chirp — At startup, a quiet 1.5-second chirp (200–4000 Hz) plays to train the filter before real TTS utterances.
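Of these components, the Geigel detector is the simplest to illustrate: it compares the near-end (mic) peak against the recent far-end (TTS reference) peak and declares double-talk when the mic is too loud to be only echo. A minimal sketch, assuming non-empty buffers; the 0.5 threshold is the textbook value, not necessarily the constant aec.py uses:

```python
import numpy as np

def geigel_double_talk(mic: np.ndarray, ref_history: np.ndarray,
                       threshold: float = 0.5) -> bool:
    """Flag double-talk when the mic peak exceeds `threshold` times the
    peak of the recent reference (speaker) signal. While this returns
    True, the adaptive filter's weight update should be skipped."""
    return bool(np.max(np.abs(mic)) > threshold * np.max(np.abs(ref_history)))
```

In the filter's update loop, adaptation is frozen for frames where this fires, so user speech overlapping TTS cannot corrupt the learned echo path.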
The mic stays open throughout TTS playback — no hard muting. Barge-in works by running VAD on the AEC-cleaned signal; when the user speaks, VAD fires and TTS stops.
Fallback components:
- `audio.py` — ContinuousListener with sounddevice + VAD + barge-in
- `tts.py` — TTSEngine with sounddevice output + ReferenceBuffer for AEC
- `aec.py` — PBFDLMS filter + RES + DTD + device change handling
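The startup calibration chirp described above can be sketched as a quiet linear frequency sweep. The 200–4000 Hz range, 1.5 s duration, and low level follow the description; the exact sweep shape and amplitude in aec.py may differ:

```python
import numpy as np

def calibration_chirp(rate: int = 16_000, duration: float = 1.5,
                      f0: float = 200.0, f1: float = 4000.0,
                      amplitude: float = 0.1) -> np.ndarray:
    """Generate a quiet linear sweep from f0 to f1 Hz, used to pre-train
    the adaptive filter before the first real TTS utterance."""
    t = np.arange(int(rate * duration)) / rate
    # Phase of a linear chirp: 2*pi*(f0*t + (f1 - f0)*t^2 / (2*duration))
    phase = 2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * duration))
    return (amplitude * np.sin(phase)).astype(np.float32)
```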
Shared Components
- `server.py` — MCP server setup, path detection, tool registration
- `av_audio.py` — AVAudioBackend, MacOSContinuousListener, MacOSTTSEngine (macOS only)
- `stt.py` — Speech-to-text using Whisper.cpp (both paths)
- `stdout_guard.py` — Prevents audio interference with Claude Code output
Manual AEC Test Script
To generate a before/after comparison of the echo cancellation:
```shell
# Uses a synthetic 440 Hz tone (no WAV needed):
python scripts/test_aec_manual.py

# Or pass your own WAV file:
python scripts/test_aec_manual.py /path/to/speech.wav --duration 10

# Output written to ./aec_test_output/
#   raw_mic.wav     - unprocessed microphone signal
#   cleaned_aec.wav - after adaptive filter only
#   cleaned_res.wav - after adaptive filter + residual echo suppression
```
Model Downloads
Models are cached locally:
- Whisper model → `~/.cache/huggingface/...`
- Kokoro TTS → `~/.cache/kokoro/...`
First-run setup will download models (~500MB total). Subsequent runs use cached models.
Troubleshooting
Issue: No audio input detected
Solution:
- Check microphone permissions: Settings > Security & Privacy > Microphone
- Test audio device: `python -m sounddevice`
- Verify `espeak-ng` and `ffmpeg`: `espeak-ng --version`, `ffmpeg -version`
Issue: Server fails to start
Solution:
- Verify Python 3.12+: `python --version`
- Check dependencies: `uv sync` or `pip install -e .`
- Test manual startup: `uv run python -m lazy_claude`
Issue: Poor speech recognition
Solution:
- Speak clearly and at normal volume
- Reduce background noise
- Use `toggle_listening(False)` to disable listening in loud environments
Issue: TTS playback issues
Solution:
- Check speaker volume and mute status
- Verify audio device: `python -m sounddevice`
- Test playback: `speak_message("Test")`
First-Run Model Download Warnings
Warnings about missing HF_TOKEN or CUDA are normal. Models download from Hugging Face automatically.
Development
Running Tests
```shell
pytest tests/ -v
```
Running with Logging
```shell
uv run python -m lazy_claude
```
Project Structure
```
agent-voice-mcp/
├── lazy_claude/
│   ├── __main__.py        # Entry point
│   ├── server.py          # MCP server, path detection, tool definitions
│   ├── av_audio.py        # macOS: AVAudioBackend, MacOSContinuousListener, MacOSTTSEngine
│   ├── audio.py           # Fallback: ContinuousListener with sounddevice + VAD
│   ├── tts.py             # Fallback: TTSEngine with sounddevice + ReferenceBuffer
│   ├── aec.py             # Fallback: PBFDLMS + RES + DTD echo cancellation
│   ├── stt.py             # Speech-to-text (both paths)
│   ├── stdout_guard.py    # Output safety wrapper
│   └── models/            # Model files and utilities
├── scripts/
│   ├── test_aec_manual.py # Manual AEC before/after comparison (fallback path)
│   └── test_macos_aec.py  # Manual macOS AEC test (system path)
├── tests/                 # Test suite (unit + integration)
├── ARCHITECTURE.md        # Detailed architecture guide
├── pyproject.toml         # Project metadata and dependencies
└── README.md              # This file
```
Requirements Specification
Hardware
- Microphone and speakers (or headphones)
- Minimum 4GB RAM for model loading
Software
- macOS 10.15+
- Python 3.12+
- espeak-ng and ffmpeg
Network
- Internet connection (first-run model downloads)
- No ongoing network usage after model caching
License
Part of the agent-voice-mcp project. See LICENSE for details.
Support
For issues or feature requests, refer to the project repository or contact the maintainer.