The three modes
Every Ability operates in one of three modes at any moment. Knowing which one you’re in is the first design decision.

| Mode | What it does | Key principle |
|---|---|---|
| Listening | Captures ambient audio, transcribes speech, identifies speakers, detects sounds, extracts meaning | The user may not even be talking to the device |
| Speaking | Interjects, responds, narrates, coaches, entertains | Voice is expensive — every word is a second the user can’t skip. Silence is often better. |
| Logging | Writes to persistent backends, companion apps, dashboards — silently | Accumulates intelligence over hours, days, weeks. The most powerful layer. |
Design rules
1. Keep it short
- 1–2 sentences per `speak()` call
- Give the headline first, offer to go deeper
- Progressive disclosure: “You have 3 meetings. Next one’s at 2 with Sarah. Want the full list?”
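A minimal sketch of progressive disclosure in code. `headline()` is a hypothetical helper (not platform API), and the meeting dict shape is an assumption:

```python
def headline(meetings: list[dict]) -> str:
    """Give the headline first; offer to go deeper instead of reading everything."""
    if not meetings:
        return "Nothing on your calendar today."
    nxt = meetings[0]
    noun = "meeting" if len(meetings) == 1 else "meetings"
    return (f"You have {len(meetings)} {noun}. "
            f"Next one's at {nxt['time']} with {nxt['with']}. Want the full list?")
```

The full list stays behind a question: one breath of speech, then the user decides whether to hear more.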
2. Fill the silence
- If an API call takes more than 1 second, say something first
- “One sec, pulling that up.” / “Hang on, checking.” / “Let me look into that.”
- Dead silence during processing feels like the conversation froze
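One way to sketch this: run the API call on a worker thread and only speak a filler line if it is still running after the threshold. `speak()` here is a local stand-in for the platform call, and `call_with_filler` is a hypothetical helper:

```python
import random
import threading
import time

FILLERS = ["One sec, pulling that up.", "Hang on, checking.", "Let me look into that."]
spoken: list[str] = []

def speak(text: str) -> None:
    """Stand-in for the platform's speak() call (assumed API)."""
    spoken.append(text)

def call_with_filler(api_call, threshold_s: float = 1.0):
    """Run the API call; if it's still going after threshold_s, fill the silence."""
    result: dict = {}
    worker = threading.Thread(target=lambda: result.setdefault("value", api_call()))
    worker.start()
    worker.join(threshold_s)
    if worker.is_alive():              # slow call: say something before the answer
        speak(random.choice(FILLERS))
        worker.join()
    return result["value"]

forecast = call_with_filler(lambda: time.sleep(0.2) or "72 and sunny", threshold_s=0.05)
```

Fast calls stay silent; only genuinely slow ones get the filler, so the phrase keeps its meaning.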
3. Confirm before acting
- Destructive or high-stakes actions need a voice confirmation
- “Cancel Team Standup? Say yes to confirm.”
- Low-stakes lookups can skip confirmation — just do it
Use `run_confirmation_loop()` for confirmations — it handles the yes/no loop for you.
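A sketch of the stakes check. The `DESTRUCTIVE` set is your own list (an assumption here), and `confirm` stands in for the platform's `run_confirmation_loop()`, whose exact signature is also an assumption:

```python
DESTRUCTIVE = {"cancel", "delete", "reschedule"}   # assumption: your own action list

def needs_confirmation(action: str) -> bool:
    """Destructive or high-stakes actions get a voice confirmation; lookups just run."""
    return action in DESTRUCTIVE

def handle(action: str, target: str, confirm, execute) -> str:
    # `confirm` stands in for run_confirmation_loop(); signature is assumed.
    if needs_confirmation(action):
        if not confirm(f"{action.title()} {target}? Say yes to confirm."):
            return "cancelled"
    execute()
    return "done"
```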
4. Expect messy input
- Transcription isn’t perfect. Users say “um”, trail off, repeat themselves
- Use the LLM to extract clean data from noisy transcription
- If you can’t parse it, ask again: “I didn’t catch that — could you say it again?”
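A sketch of the pipeline: a first-pass cleanup, then LLM extraction with a re-prompt fallback. `llm_extract` is a hypothetical stand-in for your LLM call (returning `None` when it cannot parse):

```python
import re

FILLER = r"\b(um+|uh+|you know)\b,?\s*"

def tidy(transcript: str) -> str:
    """First-pass cleanup before handing noisy transcription to the LLM."""
    text = re.sub(FILLER, "", transcript, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip(" ,")

def extract_or_reprompt(transcript: str, llm_extract):
    """llm_extract is an assumed LLM call; None means it couldn't parse."""
    data = llm_extract(tidy(transcript))
    if data is None:
        return {"say": "I didn't catch that — could you say it again?"}
    return data
```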
5. Handle exits
- If your Ability loops, give users a way out
- Check for exit words: `done`, `stop`, `bye`, `nothing else`, `I'm good`
- One idle cycle = keep going. Two = offer to leave.
6. Spell it out
TTS will mangle emails, URLs, and number formats.
- Say “at” not `@`, “dot” not `.`
- Read phone numbers digit by digit
- Say “10 AM”, not “10:00”
7. Silence is a feature
- Not every moment needs a response
- User said something interesting? Log it. Don’t acknowledge it.
- User paused for 5 seconds? That’s not a prompt for you to fill
- Voice is serial — never list more than 3 items without asking
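The three-item cap can be enforced mechanically. A sketch, with `speak_list` as a hypothetical helper:

```python
def speak_list(items: list[str], cap: int = 3) -> str:
    """Voice is serial: never read more than `cap` items before checking in."""
    if len(items) <= cap:
        return ", ".join(items) + "."
    head = ", ".join(items[:cap])
    return f"{head}, and {len(items) - cap} more. Want to hear the rest?"
```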
Sound design
Voice Abilities aren’t just speech — they’re audio experiences. A well-placed sound effect communicates faster than words. The difference between a toy and a product is sound design.

Sound effect types
| Type | When to use | Example |
|---|---|---|
| Confirmation tones | Action completes successfully. Low-stakes. | “Lights off” → [soft click] — no words needed |
| Transition sounds | Switching modes or states. <1 second. | Entering Ability → [whoosh] signals mode change |
| Intro music / themes | Companion and game Abilities. 2–4 sec. | Trivia → [game-show sting] = instant mode recognition |
| Feedback beeps | Correct/wrong, milestones, timers | Correct → [bright pip], wrong → [low tone] |
| Ambient audio | Atmosphere under speech. −20dB below voice. | Focus mode → [lo-fi beats], sleep → [rain sounds] |
| Alert / interrupt | Watcher Abilities breaking through | Timer done → [escalating soft alarm] |
Principles
Less is more
- A single well-chosen tone beats a symphony of effects
- If every action has a sound, nothing stands out — sound inflation kills meaning
Consistency builds trust
- Same action = same sound, every time
- Users learn the audio language: “I heard the ding, so I know it worked.”
Time of day awareness
- Morning sounds: bright, warm, energizing
- Evening sounds: soft, muted, calm
- Late night sounds: minimal, whisper-quiet, or absent
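The three buckets above as a gate. The hour boundaries here are an assumption; tune them to your users:

```python
def sound_profile(hour: int) -> str:
    """Pick a sound palette by local hour (boundaries are an assumption)."""
    if 6 <= hour < 12:
        return "bright"      # morning: bright, warm, energizing
    if 12 <= hour < 21:
        return "standard"
    if 21 <= hour < 23:
        return "soft"        # evening: soft, muted, calm
    return "silent"          # late night: whisper-quiet or absent
```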
Sound as progressive disclosure
- First interaction: sound + full speech confirmation
- After 5 uses: sound + abbreviated speech
- After 20 uses: sound only — user knows what it means
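The fade-out schedule above, as a sketch (the thresholds come straight from the list; the function name is illustrative):

```python
def confirmation_style(use_count: int) -> str:
    """Fade from full speech to sound-only as the user learns the audio language."""
    if use_count >= 20:
        return "sound only"
    if use_count >= 5:
        return "sound + abbreviated speech"
    return "sound + full speech"
```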
Anti-patterns
Sound effect on every speak() call
Becomes noise. Users stop hearing the cues.
Long intro music that delays the first useful word
Voice AI lives or dies on latency. Don’t add latency for flourish.
Loud alert sounds at 2 AM
Time-of-day gating is mandatory.
Sounds that mimic real-world alarms
Fire alarms, car horns, sirens — they cause panic. Don’t use them as notifications.
Musical loops that don't fade when speech starts
Mixing matters. Background audio must duck under voice.
Different sounds for the same action
Breaks learned association. Users stop trusting what they hear.
Trigger word design
Think in speech, not text
- Users won’t say “invoke calendar management system”
- They’ll say “what’s on my calendar”, “do I have a 3pm”, “am I free Tuesday”
Balance coverage vs. false positives
| Trigger risk | Examples | Strategy |
|---|---|---|
| Safe single words | calendar, reschedule, weather | Unambiguous — use freely |
| Dangerous single words | book, free, cancel | Multiple meanings — use phrase-level triggers |
| Phrase-level triggers | book a time, am I free, free on | Much safer than bare words |
| Full-sentence triggers | what's my day look like today | Catches indirect queries without keywords |
Trigger word checklist
- Include plural forms: `meeting` AND `meetings`
- Include regional variants: `what's in my diary` (UK) vs. `what's on my calendar` (US)
- Include indirect phrasings: “what’s my day look like” has no calendar keyword
- Include natural full sentences: “what am I doing today”
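A trigger set following the checklist, with a naive substring matcher. The phrase list mirrors the examples above; real matching happens in the platform's trigger system, so this is only an illustration:

```python
TRIGGERS = (
    "calendar", "meeting", "meetings",                   # safe single words
    "book a time", "am i free", "free on",               # phrase-level: bare "book"/"free" are risky
    "what's in my diary", "what's on my calendar",       # regional variants
    "what's my day look like", "what am i doing today",  # indirect / full sentences
)

def triggers(utterance: str) -> bool:
    """Naive substring match over the trigger phrases."""
    text = utterance.lower()
    return any(phrase in text for phrase in TRIGGERS)
```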
Read trigger context
When your Ability fires, the user was mid-conversation. Read that history to classify intent:

- “What’s on my calendar today?” → give today’s schedule
- “Create a meeting with Sarah at 3” → start creating immediately, no menus
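A sketch of that classification. Crude keyword routing is shown here for illustration; in practice you would let the LLM read the conversation history:

```python
CREATE_CUES = ("create", "schedule", "set up", "book")

def classify_intent(utterance: str) -> str:
    """Route to read vs. create; real code would let the LLM classify."""
    text = utterance.lower()
    if any(cue in text for cue in CREATE_CUES):
        return "create"   # start creating immediately, no menus
    return "read"         # default: give today's schedule
```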
Ability lifecycle
How it actually works
Key implications
- You can read conversation history from before your trigger
- Anything you say via `speak()` enters the Personality’s conversation history
- You cannot silently inject text — the agent has to say it out loud
- You must always hand control back or the Personality goes silent
Quick mode vs. full mode
Classify at trigger time — not after a menu. The user’s phrasing tells you which experience they expect.

| User says | Mode | Why |
|---|---|---|
| “Play jazz” | Quick — just do it | Phrasing is an instruction |
| “Help me build a playlist” | Full — enter an interactive loop | Phrasing invites collaboration |
| “Turn off the lights” | Quick | Direct command |
| “Set up my evening routine” | Full | Open-ended setup |
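The table rows as a classifier sketch. The cue list is an assumption drawn from the examples; an LLM classification is the more robust route:

```python
COLLABORATIVE_CUES = ("help me", "let's", "walk me through", "set up", "build")

def interaction_mode(utterance: str) -> str:
    """Instruction-shaped phrasing -> quick mode; invitations to collaborate -> full mode."""
    text = utterance.lower()
    return "full" if any(cue in text for cue in COLLABORATIVE_CUES) else "quick"
```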
The four Ability modes
| Mode | Trigger | Behavior | Examples |
|---|---|---|---|
| Interactive | User voice trigger | Takes over conversation, hands back when done | Weather, calendar, recipe walkthrough |
| Autonomous | Brain-triggered | No user initiation. System decides when to fire. | Proactive weather alert, smart reminder |
| Smart | Brain-triggered | Works silently, surfaces questions only when needed | Email draft needing approval |
| Watcher | Always running | Continuous. No user input ever. Monitors everything. | Meeting note-taker, life logger, alarm system |
The ability.md pattern
Every Ability ships with an `ability.md` file — YAML frontmatter (`name` + `description`) and a markdown body (instructions). The `description` field is the ONLY field the system reads to decide when to trigger.
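A minimal sketch of the shape. Only `name` and `description` are the fields named above; the specific wording and body content are illustrative:

```markdown
---
name: calendar
description: >
  Trigger when the user asks about their schedule, meetings, or availability:
  "what's on my calendar", "am I free Tuesday", "what's my day look like".
---

Read the conversation history to classify intent: a question gets today's
schedule read back; "create a meeting with Sarah at 3" starts creating
immediately, no menus.
```

Note that the description lists trigger phrasings, not internal behavior: it is the only thing the system reads when deciding to fire.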
Quality checklist
Before you ship:

- Every `speak()` call read aloud — does it flow?
- Filler text before every API call >1s
- `run_confirmation_loop()` before every destructive action
- Exit words handled in every loop
- `resume_normal_flow()` on every exit path
- Emails/URLs/numbers pronounced correctly
- Triggers tested by saying them out loud
- Sound effects only where they earn their place
- Time-of-day gating on any alert sound
- `ability.md` description describes when to trigger, not what the Ability does internally

