The Speech Pipeline
| Stage | What it does | Where to configure |
|---|---|---|
| STT — Speech-to-Text | Converts the user’s voice into text | Settings → Configuration |
| TTT — Text-to-Text | Processes the transcribed text and generates a response | Settings → Configuration |
| TTS — Text-to-Speech | Converts the response back into spoken audio | Settings → Configuration |
STT — Speech-to-Text
The STT stage is the entry point of the Agent’s pipeline. It listens to the user’s voice and converts it into text that the Agent can process. The platform and model you choose here directly affects how accurately the Agent understands the user, how quickly it responds, and which languages it can handle.Deepgram
Deepgram is a real-time speech recognition platform built for low-latency, high-accuracy transcription. It is well suited for voice agents that need fast, reliable transcription across a wide range of languages and audio conditions.| Model | Description | Language support |
|---|---|---|
nova-3 | Deepgram’s most capable and most recent model. Delivers the lowest word error rate with real-time multilingual transcription and domain-specific terminology comprehension. | 10+ languages in multilingual mode including English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. 50+ additional languages available. |
nova-2 | Deepgram’s second-generation Nova model. A strong general-purpose option, recommended when a language not yet covered by nova-3 is required or when filler word identification is needed. | 25+ languages including English, Spanish, French, German, Chinese, Japanese, Korean, Hindi, Portuguese, Russian, and more. |
nova-2-phonecall | A variant of nova-2 optimized for phone call audio. Addresses the acoustic challenges of telephonic recordings. | English only. |
nova | The original first-generation Nova model. Suitable for straightforward English transcription use cases. | English and Spanish. |
nova-phonecall | The phone-call-specialized variant of the original Nova model. | English only. |
enhanced | Delivers lower word error rates than the base tier with high-accuracy timestamps and keyword boosting support. | ~17 languages including English, Spanish, French, German, Italian, Japanese, Korean, Hindi, and Portuguese. |
enhanced-phonecall | The phone-call-specialized variant of the Enhanced model. | English only. |
base | Deepgram’s foundational model tier. Recommended for large transcription volumes where high-accuracy timestamps are required. | ~22 languages including English, Spanish, French, German, Chinese, Japanese, Korean, Hindi, Portuguese, Russian, and Turkish. |
base-phonecall | The phone-call-specialized variant of the Base model, designed for telephonic audio at high volume. | English only. |
ElevenLabs Scribe
ElevenLabs Scribe is ElevenLabs’ real-time speech recognition model. It is optimized for live streaming, interactive AI agents, and any use case requiring near-instant transcription.| Model | Description | Language support |
|---|---|---|
scribe_v2_realtime | State-of-the-art real-time transcription with ~150ms latency. Delivers high accuracy in live and interactive settings with automatic language detection. | 90+ languages with automatic language detection. |
AssemblyAI
AssemblyAI is a speech recognition platform with a focus on accuracy and broad language coverage. It is a good alternative when wider language support or different accuracy characteristics are needed.| Model | Description | Language support |
|---|---|---|
slam_1 | AssemblyAI’s highest-accuracy model. Best for use cases that require the most precise transcription. | English, Spanish, French, German, Portuguese, Italian. |
universal | A versatile model balancing accuracy and language coverage. Suitable for most general transcription needs. | 99 languages. |
nano | A lightweight model optimized for speed and low resource usage. Best for cost-sensitive or latency-sensitive use cases. | 99 languages. |
A note on short utterances
STT engines use a Voice Activity Detector (VAD) to determine when you have finished speaking. The engine listens for a silence gap after your voice, and only once that gap is detected does it finalize the transcription and pass it to the Agent. When you say a single short word, the engine may not detect a clean silence gap quickly — especially if there is background noise — so the response can feel delayed. Speaking in short phrases produces faster, more reliable results.General interaction
Instead of saying a single word and waiting, speak a short, complete phrase:| Less reliable | More reliable |
|---|---|
| ”weather" | "what’s the weather like right now?" |
| "news" | "give me today’s headlines" |
| "alarm" | "set an alarm for 7am tomorrow" |
| "joke" | "tell me a quick joke" |
| "timer" | "start a 5 minute timer for me” |
Wake word
The same applies when using a wake word. Saying the wake word alone and pausing can cause a slow response because the engine waits for more audio to confirm the utterance is complete. Pair the wake word with a short phrase to help the engine finalize faster.| Less reliable | More reliable |
|---|---|
| ”openhome” (pause) | “openhome, what’s the weather?" |
| "hey openhome” (pause) | “hey openhome, I have a question" |
| "openhome” (pause) | “openhome, remind me about my meeting" |
| "openhome” (pause) | “openhome, how are you doing today?” |
Music mode
In music mode, background audio fills the silence the VAD is listening for, making single-word commands harder to detect. If you say just “stop” or “pause” while music is playing, the engine may not cleanly finalize the transcription before more audio arrives. Add a word before or after the command to form a short phrase:| Less reliable | More reliable |
|---|---|
| ”stop" | "please stop the music” / “openhome stop" |
| "pause" | "pause this for a second” / “openhome pause" |
| "play" | "play something chill” / “openhome play” |
TTT — Text-to-Text
The TTT stage is the Agent’s brain. Once the user’s speech has been transcribed by the STT module, the transcribed text is passed to a language model which generates the Agent’s response. The platform and model you choose here directly shapes how intelligently the Agent responds, how well it understands context, and how quickly it replies.OpenAI
OpenAI provides direct access to the GPT model family. Models are accessed using your own OpenAI API key.| Model | Description | Speed |
|---|---|---|
gpt-5.1 | OpenAI’s best model for reasoning and agentic tasks with configurable reasoning effort. Supports adjustable reasoning levels. | Medium |
gpt-4o | OpenAI’s versatile, high-intelligence flagship model. Accepts text and image inputs and supports function calling and streaming. Approximately twice as fast as GPT-4 Turbo. | Medium |
gpt-4 | An older high-intelligence GPT model for chat completions. Previous generation of advanced language model. | Medium |
gpt-3.5-turbo | A legacy GPT model for cost-efficient chat tasks. OpenAI now recommends gpt-4o as a replacement. | Medium |
OpenRouter
OpenRouter is a unified API gateway that provides access to models from multiple AI providers through a single API key. It lets you switch between providers and models without managing separate keys for each.| Model | Provider | Description | Speed |
|---|---|---|---|
openai/gpt-4o | OpenAI | Versatile flagship model. Accepts text and image inputs, supports function calling and streaming. | Medium |
openai/gpt-5-nano | OpenAI | The smallest and fastest variant in the GPT-5 system. Optimized for rapid interactions and ultra-low latency. | Fast |
anthropic/claude-sonnet-4 | Anthropic | Excels at coding and reasoning tasks with improved precision and controllability. Optimized for practical everyday use. | Medium |
anthropic/claude-sonnet-4.5 | Anthropic | Optimized for real-world agents and coding workflows with enhanced tool orchestration and context awareness. | Medium |
anthropic/claude-sonnet-4.6 | Anthropic | Frontier performance across coding, agents, and professional work. Strong at iterative development and complex project management. | Medium |
anthropic/claude-3.7-sonnet | Anthropic | Features hybrid reasoning: standard mode for quick responses and extended reasoning mode for demanding tasks. | Medium |
anthropic/claude-opus-4 | Anthropic | Anthropic’s most capable model. Benchmarked as a top coding model, suited for complex extended workflows. | Medium |
mistralai/mistral-7b-instruct:free | Mistral | A 7B parameter model optimized for speed and context length. Available at no cost. | Fast |
mistralai/mistral-small-3.2-24b-instruct | Mistral | A 24B parameter model optimized for instruction following, repetition reduction, and improved function calling. | Medium |
mistralai/magistral-small-2506 | Mistral | A 24B instruction-tuned reasoning model, enhanced through supervised fine-tuning and reinforcement learning. | Medium |
mistralai/magistral-medium-2506 | Mistral | Mistral’s first reasoning model. Suited for tasks requiring longer thought processing such as legal analysis, financial forecasting, and multi-step reasoning. | Medium |
x-ai/grok-3 | xAI | xAI’s flagship model excelling at enterprise tasks such as data extraction, coding, and text summarization. Strong domain knowledge in finance, healthcare, law, and science. | Medium |
x-ai/grok-4.1-fast | xAI | xAI’s best agentic tool-calling model. Designed for real-world use cases such as customer support and deep research. Reasoning can be toggled on or off. | Fast |
deepseek/deepseek-v3.2 | DeepSeek | Designed for high computational efficiency with strong reasoning and agentic tool-use performance. | Medium |
google/gemini-3-flash-preview | A high-speed thinking model designed for agentic workflows, multi-turn chat, and coding assistance. Strong reasoning with substantially lower latency than larger Gemini variants. | Fast | |
moonshotai/kimi-k2.5 | Moonshot | A multimodal model excelling in general reasoning, visual coding, and agentic tool-calling. | Medium |
minimax/minimax-m2.5 | MiniMax | Designed for real-world productivity, excelling at document generation and coding tasks. | Medium |
arcee-ai/trinity-large-preview:free | Arcee AI | A 400B sparse Mixture-of-Experts model with 13B active parameters per token. Suited for creative writing, storytelling, and agentic tasks. Available at no cost. | Medium |
LLM fine-tuning parameters
These parameters apply to the selected TTT model and affect how the Agent generates responses.| Parameter | Default | What it controls |
|---|---|---|
| Temperature | 0.9 | Controls response randomness. Lower values (closer to 0) produce focused, deterministic responses. Higher values produce more varied and creative responses. |
| Frequency Penalty | 0.2 | Reduces repetition of words and phrases. Higher values discourage the model from repeating the same content. |
| Presence Penalty | 0 | Discourages the model from introducing irrelevant topics. Higher values keep the Agent on the subject at hand. |
TTS — Text-to-Speech
The TTS stage is the Agent’s voice. Once the language model has generated a response, the TTS module converts that text into spoken audio. The platform and model you choose here affects how natural the Agent sounds, how quickly it starts speaking, and which languages it can speak in.ElevenLabs
ElevenLabs is a voice synthesis platform offering high-quality, natural-sounding speech with support for voice cloning and a wide range of languages. Models are accessed using your own ElevenLabs API key.| Model | Description | Latency | Language support |
|---|---|---|---|
eleven_flash_v2_5 | Ultra-fast model optimized for real-time use. The recommended model for low-latency voice agent interactions. | ~75ms | 32 languages including English, Spanish, French, German, Hindi, Japanese, Chinese, Portuguese, Italian, Korean, Dutch, Polish, and more. |
eleven_turbo_v2_5 | First-generation low-latency model. Functional but outclassed by eleven_flash_v2_5, which is recommended instead. | Low | 32 languages. |
eleven_turbo_v2 | First-generation low-latency model for English. Outclassed by eleven_flash_v2_5, which is recommended instead. | Low | English only. |
eleven_multilingual_v2 | ElevenLabs’ most lifelike model with rich emotional expression. Best for use cases where voice quality is the priority over response speed. | Higher | 29 languages including English, Spanish, French, German, Hindi, Japanese, Chinese, Portuguese, Italian, Korean, Arabic, Turkish, and more. |
Voice fine-tuning parameters
| Parameter | Default | What it controls |
|---|---|---|
| Voice Stability | 0.5 | Controls the consistency of the voice across utterances. Lower values produce more varied, expressive delivery. Higher values produce a more stable, consistent tone. |
| Voice Similarity Boost | 0.8 | Controls how closely the synthesized voice matches the original voice model. Higher values produce output that sounds closer to the original voice. |
Per-Agent Voice Configuration
The global configuration in Settings → Configuration applies to all Agents. When you need a specific Agent to use a different voice or pipeline, configure it individually in Pro Creation mode. When creating or editing an Agent in Pro Creation, configure voice under Personality Identity:- Voice Identity: Select from the available voices or enter a custom Voice ID from your TTS provider.
- Clone Voice: Use voice cloning to create a personalized voice.
- Preview: Play back the selected voice before saving.


Adding a Custom Voice
To add a custom voice from your TTS provider:- Go to the Agents dashboard and click the
button at the top right. - Fill in:
- Name: Identifies the voice in your list.
- Description: Tone, accent, or intended use.
- Voice ID: The Voice ID from your TTS provider (e.g., ElevenLabs).
- Click
to add the voice, or
to discard.

Getting a Voice ID
Before adding a voice in OpenHome, upload your custom voice to your preferred TTS provider (e.g., ElevenLabs). Once uploaded, you will receive a Voice ID to enter in the field above.Other Configuration Options
| Setting | Default | What it does |
|---|---|---|
| Wake Word Mode | On | When enabled, the Agent only responds to user turns that include one of the configured wake words. When disabled, the Agent responds to every user turn. Regardless of this setting, a wake word is always required to exit sleep mode. See Wake Word & Sleep Interaction. |
| Wake Word | hey, open home, wake up, hello | The word or phrase the Agent listens for before responding. Multiple wake words can be set by separating them with commas — for example, hello, open home. Any one of the configured wake words is sufficient to address the Agent. Changes require an Agent restart to take effect. |
| Play Filler Audios | Off | When enabled, plays a short audio clip while the Agent is processing a response, so the conversation does not feel silent during generation. |
| Auto Sleep | On | When enabled, the Agent automatically enters sleep mode after a period of inactivity. |
| Auto Sleep Timeout | 60s | The number of seconds of inactivity before the Agent enters sleep mode. Only applies when Auto Sleep is enabled. |
| FuzzyWuzzy Threshold | 80 | Controls how closely a spoken phrase must match a trigger word to activate an Ability. Higher values require a closer match. Lower values are more forgiving of mispronunciations or variations. |
| TTS Daily Credit Limit | 60000 | The maximum number of TTS characters the Agent can synthesize per day. Requests beyond this limit will not produce audio until the limit resets. |
| Utterance Threshold | 750 | The minimum duration in milliseconds that a voice input must last to be processed as a valid utterance. Inputs shorter than this threshold are ignored. |
| Play Audio with Web | Off | When enabled, the Agent plays audio responses through the web interface in addition to any connected device. |
API Keys for Providers
Each provider requires an API key. Manage these under Settings → API Keys.When you update API keys, your Agents consume credits from the associated services. OpenHome is not responsible for charges from third-party providers.See Dashboard for full API key management details.
See also
- Configuring Your Agent — conversation controls, identity, and behavior prompts
- Wake Word and Sleep Interaction — how the wake word interacts with STT and the short-utterance pattern
- SDK Reference — OpenRouter model table for use inside Abilities

