The three modes
Every Ability operates in one of three modes at any moment. Knowing which one you’re in is the first design decision.

| Mode | What it does | Key principle |
|---|---|---|
| Listening | Captures ambient audio, transcribes speech, identifies speakers, detects sounds, extracts meaning | The user may not even be talking to the device |
| Speaking | Interjects, responds, narrates, coaches, entertains | Voice is expensive — every word is a second the user can’t skip. Silence is often better. |
| Logging | Writes to persistent backends, companion apps, dashboards — silently | Accumulates intelligence over hours, days, weeks. The most powerful layer. |
Design rules
1. Keep it short
- 1–2 sentences per `speak()` call
- Give the headline first, offer to go deeper
- Progressive disclosure: “You have 3 meetings. Next one’s at 2 with Sarah. Want the full list?”
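A minimal sketch of progressive disclosure in code. `headline()` is a hypothetical helper (not platform API), and the meeting dict shape is an assumption:

```python
def headline(meetings: list[dict]) -> str:
    """Give the headline first; offer to go deeper instead of reading everything."""
    if not meetings:
        return "Nothing on your calendar today."
    nxt = meetings[0]
    noun = "meeting" if len(meetings) == 1 else "meetings"
    return (f"You have {len(meetings)} {noun}. "
            f"Next one's at {nxt['time']} with {nxt['with']}. Want the full list?")
```

The full list stays behind a question: one breath of speech, then the user decides whether to hear more.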
2. Fill the silence
- If an API call takes more than 1 second, say something first
- “One sec, pulling that up.” / “Hang on, checking.” / “Let me look into that.”
- Dead silence during processing feels like the conversation froze
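One way to sketch this: run the API call on a worker thread and only speak a filler line if it is still running after the threshold. `speak()` here is a local stand-in for the platform call, and `call_with_filler` is a hypothetical helper:

```python
import random
import threading
import time

FILLERS = ["One sec, pulling that up.", "Hang on, checking.", "Let me look into that."]
spoken: list[str] = []

def speak(text: str) -> None:
    """Stand-in for the platform's speak() call (assumed API)."""
    spoken.append(text)

def call_with_filler(api_call, threshold_s: float = 1.0):
    """Run the API call; if it's still going after threshold_s, fill the silence."""
    result: dict = {}
    worker = threading.Thread(target=lambda: result.setdefault("value", api_call()))
    worker.start()
    worker.join(threshold_s)
    if worker.is_alive():              # slow call: say something before the answer
        speak(random.choice(FILLERS))
        worker.join()
    return result["value"]

forecast = call_with_filler(lambda: time.sleep(0.2) or "72 and sunny", threshold_s=0.05)
```

Fast calls stay silent; only genuinely slow ones get the filler, so the phrase keeps its meaning.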
3. Confirm before acting
- Destructive or high-stakes actions need a voice confirmation
- “Cancel Team Standup? Say yes to confirm.”
- Low-stakes lookups can skip confirmation — just do it
Use `run_confirmation_loop()` for confirmations — it handles the yes/no loop for you.
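A sketch of the stakes check. The `DESTRUCTIVE` set is your own list (an assumption here), and `confirm` stands in for the platform's `run_confirmation_loop()`, whose exact signature is also an assumption:

```python
DESTRUCTIVE = {"cancel", "delete", "reschedule"}   # assumption: your own action list

def needs_confirmation(action: str) -> bool:
    """Destructive or high-stakes actions get a voice confirmation; lookups just run."""
    return action in DESTRUCTIVE

def handle(action: str, target: str, confirm, execute) -> str:
    # `confirm` stands in for run_confirmation_loop(); signature is assumed.
    if needs_confirmation(action):
        if not confirm(f"{action.title()} {target}? Say yes to confirm."):
            return "cancelled"
    execute()
    return "done"
```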
4. Expect messy input
- Transcription isn’t perfect. Users say “um”, trail off, repeat themselves
- Use the LLM to extract clean data from noisy transcription
- If you can’t parse it, ask again: “I didn’t catch that — could you say it again?”
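A sketch of the pipeline: a first-pass cleanup, then LLM extraction with a re-prompt fallback. `llm_extract` is a hypothetical stand-in for your LLM call (returning `None` when it cannot parse):

```python
import re

FILLER = r"\b(um+|uh+|you know)\b,?\s*"

def tidy(transcript: str) -> str:
    """First-pass cleanup before handing noisy transcription to the LLM."""
    text = re.sub(FILLER, "", transcript, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip(" ,")

def extract_or_reprompt(transcript: str, llm_extract):
    """llm_extract is an assumed LLM call; None means it couldn't parse."""
    data = llm_extract(tidy(transcript))
    if data is None:
        return {"say": "I didn't catch that — could you say it again?"}
    return data
```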
5. Handle exits
- If your Ability loops, give users a way out
- Check for exit words: `done`, `stop`, `bye`, `nothing else`, `I'm good`
- One idle cycle = keep going. Two = offer to leave.
6. Spell it out
TTS will mangle emails, URLs, and number formats.
- Say “at” not `@`, “dot” not `.`
- Read phone numbers digit by digit
- Say “10 AM”, not “10:00”
7. Silence is a feature
- Not every moment needs a response
- User said something interesting? Log it. Don’t acknowledge it.
- User paused for 5 seconds? That’s not a prompt for you to fill
- Voice is serial — never list more than 3 items without asking
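The three-item cap can be enforced mechanically. A sketch, with `speak_list` as a hypothetical helper:

```python
def speak_list(items: list[str], cap: int = 3) -> str:
    """Voice is serial: never read more than `cap` items before checking in."""
    if len(items) <= cap:
        return ", ".join(items) + "."
    head = ", ".join(items[:cap])
    return f"{head}, and {len(items) - cap} more. Want to hear the rest?"
```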
Sound design
Voice Abilities aren’t just speech — they’re audio experiences. A well-placed sound effect communicates faster than words. The difference between a toy and a product is sound design.

Sound effect types
| Type | When to use | Example |
|---|---|---|
| Confirmation tones | Action completes successfully. Low-stakes. | “Lights off” → [soft click] — no words needed |
| Transition sounds | Switching modes or states. <1 second. | Entering Ability → [whoosh] signals mode change |
| Intro music / themes | Companion and game Abilities. 2–4 sec. | Trivia → [game-show sting] = instant mode recognition |
| Feedback beeps | Correct/wrong, milestones, timers | Correct → [bright pip], wrong → [low tone] |
| Ambient audio | Atmosphere under speech. −20dB below voice. | Focus mode → [lo-fi beats], sleep → [rain sounds] |
| Alert / interrupt | Watcher Abilities breaking through | Timer done → [escalating soft alarm] |
Principles
Less is more
- A single well-chosen tone beats a symphony of effects
- If every action has a sound, nothing stands out — sound inflation kills meaning
Consistency builds trust
- Same action = same sound, every time
- Users learn the audio language: “I heard the ding, so I know it worked.”
Time of day awareness
- Morning sounds: bright, warm, energizing
- Evening sounds: soft, muted, calm
- Late night sounds: minimal, whisper-quiet, or absent
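The three buckets above as a gate. The hour boundaries here are an assumption; tune them to your users:

```python
def sound_profile(hour: int) -> str:
    """Pick a sound palette by local hour (boundaries are an assumption)."""
    if 6 <= hour < 12:
        return "bright"      # morning: bright, warm, energizing
    if 12 <= hour < 21:
        return "standard"
    if 21 <= hour < 23:
        return "soft"        # evening: soft, muted, calm
    return "silent"          # late night: whisper-quiet or absent
```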
Sound as progressive disclosure
- First interaction: sound + full speech confirmation
- After 5 uses: sound + abbreviated speech
- After 20 uses: sound only — user knows what it means
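The fade-out schedule above, as a sketch (the thresholds come straight from the list; the function name is illustrative):

```python
def confirmation_style(use_count: int) -> str:
    """Fade from full speech to sound-only as the user learns the audio language."""
    if use_count >= 20:
        return "sound only"
    if use_count >= 5:
        return "sound + abbreviated speech"
    return "sound + full speech"
```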
Anti-patterns
Sound effect on every speak() call
Becomes noise. Users stop hearing the cues.
Long intro music that delays the first useful word
Voice AI lives or dies on latency. Don’t add latency for flourish.
Loud alert sounds at 2 AM
Time-of-day gating is mandatory.
Sounds that mimic real-world alarms
Fire alarms, car horns, sirens — they cause panic. Don’t use them as notifications.
Musical loops that don't fade when speech starts
Mixing matters. Background audio must duck under voice.
Different sounds for the same action
Breaks learned association. Users stop trusting what they hear.
Trigger word design
Think in speech, not text
- Users won’t say “invoke calendar management system”
- They’ll say “what’s on my calendar”, “do I have a 3pm”, “am I free Tuesday”
Balance coverage vs. false positives
| Trigger risk | Examples | Strategy |
|---|---|---|
| Safe single words | calendar, reschedule, weather | Unambiguous — use freely |
| Dangerous single words | book, free, cancel | Multiple meanings — use phrase-level triggers |
| Phrase-level triggers | book a time, am I free, free on | Much safer than bare words |
| Full-sentence triggers | what's my day look like today | Catches indirect queries without keywords |
Trigger word checklist
- Include plural forms: `meeting` AND `meetings`
- Include regional variants: `what's in my diary` (UK) vs. `what's on my calendar` (US)
- Include indirect phrasings: “what’s my day look like” has no calendar keyword
- Include natural full sentences: “what am I doing today”
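A trigger set following the checklist, with a naive substring matcher. The phrase list mirrors the examples above; real matching happens in the platform's trigger system, so this is only an illustration:

```python
TRIGGERS = (
    "calendar", "meeting", "meetings",                   # safe single words
    "book a time", "am i free", "free on",               # phrase-level: bare "book"/"free" are risky
    "what's in my diary", "what's on my calendar",       # regional variants
    "what's my day look like", "what am i doing today",  # indirect / full sentences
)

def triggers(utterance: str) -> bool:
    """Naive substring match over the trigger phrases."""
    text = utterance.lower()
    return any(phrase in text for phrase in TRIGGERS)
```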
Read trigger context
When your Ability fires, the user was mid-conversation. Read that history to classify intent:

- “What’s on my calendar today?” → give today’s schedule
- “Create a meeting with Sarah at 3” → start creating immediately, no menus
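A sketch of that classification. Crude keyword routing is shown here for illustration; in practice you would let the LLM read the conversation history:

```python
CREATE_CUES = ("create", "schedule", "set up", "book")

def classify_intent(utterance: str) -> str:
    """Route to read vs. create; real code would let the LLM classify."""
    text = utterance.lower()
    if any(cue in text for cue in CREATE_CUES):
        return "create"   # start creating immediately, no menus
    return "read"         # default: give today's schedule
```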
Ability lifecycle
How it actually works
Key implications
- You can read conversation history from before your trigger
- Anything you say via `speak()` enters the Personality’s conversation history
- You cannot silently inject text — the agent has to say it out loud
- You must always hand control back or the Personality goes silent
Quick mode vs. full mode
Classify at trigger time — not after a menu. The user’s phrasing tells you which experience they expect.

| User says | Mode | Why |
|---|---|---|
| “Play jazz” | Quick — just do it | Phrasing is an instruction |
| “Help me build a playlist” | Full — enter an interactive loop | Phrasing invites collaboration |
| “Turn off the lights” | Quick | Direct command |
| “Set up my evening routine” | Full | Open-ended setup |
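The table rows as a classifier sketch. The cue list is an assumption drawn from the examples; an LLM classification is the more robust route:

```python
COLLABORATIVE_CUES = ("help me", "let's", "walk me through", "set up", "build")

def interaction_mode(utterance: str) -> str:
    """Instruction-shaped phrasing -> quick mode; invitations to collaborate -> full mode."""
    text = utterance.lower()
    return "full" if any(cue in text for cue in COLLABORATIVE_CUES) else "quick"
```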
The four Ability modes
| Mode | Trigger | Behavior | Examples |
|---|---|---|---|
| Interactive | User voice trigger | Takes over conversation, hands back when done | Weather, calendar, recipe walkthrough |
| Autonomous | Brain-triggered | No user initiation. System decides when to fire. | Proactive weather alert, smart reminder |
| Smart | Brain-triggered | Works silently, surfaces questions only when needed | Email draft needing approval |
| Watcher | Always running | Continuous. No user input ever. Monitors everything. | Meeting note-taker, life logger, alarm system |
The ability.md pattern
Every Ability ships with an `ability.md` file — YAML frontmatter (`name` + `description`) and a markdown body (instructions). The `description` field is the ONLY field the system reads to decide when to trigger.
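A minimal sketch of the shape. Only `name` and `description` are the fields named above; the specific wording and body content are illustrative:

```markdown
---
name: calendar
description: >
  Trigger when the user asks about their schedule, meetings, or availability:
  "what's on my calendar", "am I free Tuesday", "what's my day look like".
---

Read the conversation history to classify intent: a question gets today's
schedule read back; "create a meeting with Sarah at 3" starts creating
immediately, no menus.
```

Note that the description lists trigger phrasings, not internal behavior: it is the only thing the system reads when deciding to fire.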
Quality checklist
Before you ship:

- Every `speak()` call read aloud — does it flow?
- Filler text before every API call >1s
- `run_confirmation_loop()` before every destructive action
- Exit words handled in every loop
- `resume_normal_flow()` on every exit path
- Emails/URLs/numbers pronounced correctly
- Triggers tested by saying them out loud
- Sound effects only where they earn their place
- Time-of-day gating on any alert sound
- `ability.md` description describes when to trigger, not what the Ability does internally

