A well-built Ability feels like a person in the room, not a menu you’re navigating. These are the rules that keep it that way.

The three modes

Every Ability operates in one of three modes at any moment. Knowing which one you’re in is the first design decision.
| Mode | What it does | Key principle |
| --- | --- | --- |
| Listening | Captures ambient audio, transcribes speech, identifies speakers, detects sounds, extracts meaning | The user may not even be talking to the device |
| Speaking | Interjects, responds, narrates, coaches, entertains | Voice is expensive — every word is a second the user can’t skip. Silence is often better. |
| Logging | Writes to persistent backends, companion apps, dashboards — silently | Accumulates intelligence over hours, days, weeks. The most powerful layer. |

Design rules

1. Keep it short

  • 1–2 sentences per speak() call
  • Give the headline first, offer to go deeper
  • Progressive disclosure: “You have 3 meetings. Next one’s at 2 with Sarah. Want the full list?”
If you wouldn’t say it to someone standing next to you, it doesn’t belong in a speak() call.

2. Fill the silence

  • If an API call takes more than 1 second, say something first
  • “One sec, pulling that up.” / “Hang on, checking.” / “Let me look into that.”
  • Dead silence during processing feels like the conversation froze
Speak filler before the slow call, not after. The user hears words while the API loads.
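The ordering matters enough to sketch. In this minimal example, `speak()` and `fetch_calendar()` are hypothetical stubs standing in for the platform call and a slow API:

```python
spoken = []  # captured output so the ordering is easy to inspect

def speak(text):
    # Stand-in for the platform's speak() call (hypothetical stub).
    spoken.append(text)

def fetch_calendar():
    # Placeholder for a slow API call (assumed to take >1 second in production).
    return ["Team Standup at 2 PM"]

def answer_calendar_question():
    # Rule 2: speak filler BEFORE the slow call, never after.
    speak("One sec, pulling that up.")
    events = fetch_calendar()  # the slow work happens behind the filler words
    speak(f"Next up: {events[0]}.")

answer_calendar_question()
```

Reversing the two lines in `answer_calendar_question()` is the bug this rule exists to prevent: the user sits in silence while the API loads, then hears filler that is no longer needed.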

3. Confirm before acting

  • Destructive or high-stakes actions need a voice confirmation
  • “Cancel Team Standup? Say yes to confirm.”
  • Low-stakes lookups can skip confirmation — just do it
Use run_confirmation_loop() for confirmations — it handles the yes/no loop for you.
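For intuition, here is a rough sketch of the kind of yes/no loop a helper like run_confirmation_loop() runs for you; the word lists and turn limit are invented for illustration:

```python
YES_WORDS = ("yes", "yeah", "yep", "confirm", "do it")
NO_WORDS = ("no", "nope", "cancel", "never mind")

def confirmation_loop(prompt, listen, speak, max_turns=2):
    # Rough sketch of what a helper like run_confirmation_loop() does;
    # in a real Ability you call the platform helper instead.
    speak(prompt)
    for _ in range(max_turns):
        reply = listen().strip().lower()
        # Naive substring match; production code should tokenize properly.
        if any(w in reply for w in YES_WORDS):
            return True
        if any(w in reply for w in NO_WORDS):
            return False
        speak("Sorry, was that a yes or a no?")
    return False  # fail safe: no clear yes means don't act

# Simulated exchange: the user mumbles once, then confirms.
replies = iter(["hmm what", "yeah go ahead"])
log = []
confirmed = confirmation_loop("Cancel Team Standup? Say yes to confirm.",
                              listen=lambda: next(replies),
                              speak=log.append)
```

Note the fail-safe default: when the loop can't get a clear yes, it returns False, because for destructive actions "unclear" must mean "don't act".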

4. Expect messy input

  • Transcription isn’t perfect. Users say “um”, trail off, repeat themselves
  • Use the LLM to extract clean data from noisy transcription
  • If you can’t parse it, ask again: “I didn’t catch that — could you say it again?”
Never fail silently. A confused response is better than no response.
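A toy version of the clean-then-parse step, assuming the real parsing is done by the LLM rather than the regex used here:

```python
import re

FILLERS = re.compile(r"\b(um+|uh+|like|you know)\b", re.IGNORECASE)

def clean_transcript(raw):
    # Strip filler words, then collapse the leftover whitespace.
    text = FILLERS.sub("", raw)
    return re.sub(r"\s+", " ", text).strip()

def parse_time(raw):
    # Toy parser: a real Ability would hand the cleaned text to the LLM.
    text = clean_transcript(raw)
    match = re.search(r"\b(\d{1,2})\s*(am|pm)\b", text, re.IGNORECASE)
    if match:
        return f"{match.group(1)} {match.group(2).upper()}"
    return None  # caller should re-prompt: "I didn't catch that..."
```

A `None` return is the cue to ask again out loud, never to drop the interaction silently.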

5. Handle exits

  • If your Ability loops, give users a way out
  • Check for exit words: done, stop, bye, nothing else, I'm good
  • One idle cycle = keep going. Two = offer to leave.
Call resume_normal_flow() on every exit path — happy path, breaks, except blocks, timeouts. The #1 bug in Abilities is forgetting it somewhere.
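The exit-word check, idle counting, and the try/finally that guarantees the hand-back can be sketched together; resume_normal_flow is passed in here only to keep the sketch self-contained:

```python
EXIT_WORDS = ("done", "stop", "bye", "nothing else", "i'm good")

def is_exit(utterance):
    return any(word in utterance.strip().lower() for word in EXIT_WORDS)

def run_loop(listen, speak, handle, resume_normal_flow):
    # resume_normal_flow is injected for testability; a real Ability
    # calls the platform function directly.
    idle = 0
    try:
        while True:
            heard = listen()
            if not heard or not heard.strip():
                idle += 1
                if idle >= 2:                 # two idle cycles: offer to leave
                    speak("I'll leave you to it.")
                    break
                continue                      # one idle cycle: keep going
            idle = 0
            if is_exit(heard):
                speak("Okay, talk soon.")
                break
            handle(heard)
    finally:
        resume_normal_flow()  # rule 5: EVERY exit path hands control back

# Simulated session: one request, then the user goes quiet.
replies = iter(["remind me at 3", "", ""])
handled, spoken, resumed = [], [], []
run_loop(lambda: next(replies, ""), spoken.append, handled.append,
         lambda: resumed.append(True))
```

Because the hand-back lives in `finally`, it fires on the happy path, on break, on exceptions, and on timeouts alike.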

6. Spell it out

TTS will mangle emails, URLs, and number formats.
  • Say “at” not @, “dot” not .
  • Read phone numbers digit by digit
  • Say “10 AM”, not “10:00”
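These rules are easy to centralize in a few small formatters. A sketch (the function names are illustrative, not part of the platform API):

```python
def speakable_email(email):
    # "sarah@example.com" -> "sarah at example dot com"
    return email.replace("@", " at ").replace(".", " dot ")

def speakable_phone(number):
    # Read digit by digit: "415-555-0132" -> "4 1 5 5 5 5 0 1 3 2"
    return " ".join(ch for ch in number if ch.isdigit())

def speakable_time(hhmm):
    # "10:00" -> "10 AM"; "14:30" -> "2 30 PM". Rough sketch only: a real
    # formatter would also handle noon, midnight, and locale.
    hour, minute = (int(p) for p in hhmm.split(":"))
    suffix = "AM" if hour < 12 else "PM"
    hour12 = hour % 12 or 12
    return f"{hour12} {suffix}" if minute == 0 else f"{hour12} {minute} {suffix}"
```

Run every email, URL, phone number, and time through formatters like these before they reach a speak() call, so no raw "@" or "10:00" ever hits the TTS engine.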

7. Silence is a feature

  • Not every moment needs a response
  • User said something interesting? Log it. Don’t acknowledge it.
  • User paused for 5 seconds? That’s not a prompt for you to fill
  • Voice is serial — never list more than 3 items without asking
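The three-item cap can be enforced mechanically. A minimal sketch, with a hypothetical `speak_list` helper:

```python
def speak_list(items, speak, chunk=3):
    # Voice is serial: say at most `chunk` items, then offer the rest.
    head, rest = items[:chunk], items[chunk:]
    speak(", ".join(head) + ".")
    if rest:
        speak(f"Plus {len(rest)} more. Want to hear them?")

spoken = []
speak_list(["standup", "design review", "1:1 with Sarah", "retro", "demo"],
           spoken.append)
```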

Sound design

Voice Abilities aren’t just speech — they’re audio experiences. A well-placed sound effect communicates faster than words. The difference between a toy and a product is sound design.

Sound effect types

| Type | When to use | Example |
| --- | --- | --- |
| Confirmation tones | Action completes successfully. Low-stakes. | “Lights off” → [soft click] — no words needed |
| Transition sounds | Switching modes or states. <1 second. | Entering Ability → [whoosh] signals mode change |
| Intro music / themes | Companion and game Abilities. 2–4 sec. | Trivia → [game-show sting] = instant mode recognition |
| Feedback beeps | Correct/wrong, milestones, timers | Correct → [bright pip], wrong → [low tone] |
| Ambient audio | Atmosphere under speech. −20 dB below voice. | Focus mode → [lo-fi beats], sleep → [rain sounds] |
| Alert / interrupt | Watcher Abilities breaking through | Timer done → [escalating soft alarm] |

Principles

Less is more

  • A single well-chosen tone beats a symphony of effects
  • If every action has a sound, nothing stands out — sound inflation kills meaning

Consistency builds trust

  • Same action = same sound, every time
  • Users learn the audio language: “I heard the ding, so I know it worked.”

Time of day awareness

  • Morning sounds: bright, warm, energizing
  • Evening sounds: soft, muted, calm
  • Late night sounds: minimal, whisper-quiet, or absent
The same Ability should sound different at 7 AM vs. 11 PM. Time-of-day gating on alert sounds is mandatory.

Sound as progressive disclosure

  • First interaction: sound + full speech confirmation
  • After 5 uses: sound + abbreviated speech
  • After 20 uses: sound only — user knows what it means
Let the sound gradually replace the words as the user learns. This is how you train subconscious familiarity.
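The schedule above reduces to a small lookup. A hypothetical example for the lights-off confirmation (thresholds taken from the bullets):

```python
def lights_confirmation(use_count):
    # Pair the SAME sound with progressively less speech as the user learns.
    if use_count < 5:
        return ("ding", "Lights are off. That chime means it worked.")
    if use_count < 20:
        return ("ding", "Lights off.")
    return ("ding", None)  # sound only: the user knows what it means
```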

Anti-patterns

  • A sound for everything: becomes noise. Users stop hearing the cues.
  • Long, decorative sounds: voice AI lives or dies on latency. Don’t add latency for flourish.
  • Full-volume alerts at night: time-of-day gating is mandatory.
  • Panic-inducing sounds: fire alarms, car horns, sirens cause panic. Don’t use them as notifications.
  • Background audio at full volume: mixing matters. Background audio must duck under voice.
  • Inconsistent or repurposed sounds: breaks learned association. Users stop trusting what they hear.

Trigger word design

Think in speech, not text

  • Users won’t say “invoke calendar management system”
  • They’ll say “what’s on my calendar”, “do I have a 3pm”, “am I free Tuesday”
Test triggers by saying them out loud across a room. If it feels unnatural to say, nobody will say it.

Balance coverage vs. false positives

| Trigger risk | Examples | Strategy |
| --- | --- | --- |
| Safe single words | calendar, reschedule, weather | Unambiguous — use freely |
| Dangerous single words | book, free, cancel | Multiple meanings — use phrase-level triggers |
| Phrase-level triggers | book a time, am I free, free on | Much safer than bare words |
| Full-sentence triggers | what's my day look like today | Catches indirect queries without keywords |

Trigger word checklist

  • Include plural forms: meeting AND meetings
  • Include regional variants: what's in my diary (UK) vs. what's on my calendar (US)
  • Include indirect phrasings: “what’s my day look like” has no calendar keyword
  • Include natural full sentences: “what am I doing today”
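A trigger set built to this checklist, with a naive matcher (the specific words are examples, and the substring match is a simplification of real trigger matching):

```python
TRIGGERS = (
    # safe single words, plus plural and regional variants
    "calendar", "meeting", "meetings", "diary",
    # phrase-level triggers instead of the dangerous bare word "free"
    "am i free", "free on",
    # natural full sentences and indirect phrasings
    "what's my day look like", "what am i doing today",
)

def matches_trigger(utterance):
    text = utterance.lower()
    return any(trigger in text for trigger in TRIGGERS)
```

Note what the phrase-level entries buy you: "Am I free on Tuesday?" fires, while "feel free to start without me" does not.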

Read trigger context

When your Ability fires, the user was mid-conversation. Read that history to classify intent:
  • “What’s on my calendar today?” → give today’s schedule
  • “Create a meeting with Sarah at 3” → start creating immediately, no menus
Pattern: read trigger from history → classify intent with LLM → route to handler. Don’t treat every activation the same.
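The pattern can be sketched with a keyword classifier standing in for the LLM step (intent names and handlers here are invented for illustration):

```python
def classify_intent(utterance):
    # Keyword stand-in for the LLM classification step: a real Ability
    # would send the recent conversation history to the model instead.
    text = utterance.lower()
    if any(w in text for w in ("create", "schedule", "set up", "book")):
        return "create_event"
    if any(w in text for w in ("cancel", "delete")):
        return "cancel_event"
    return "read_schedule"

def route(utterance, handlers):
    # Read trigger -> classify -> route straight to a handler. No menus.
    return handlers[classify_intent(utterance)](utterance)

result = route("Create a meeting with Sarah at 3",
               {"create_event": lambda u: "creating",
                "cancel_event": lambda u: "cancelling",
                "read_schedule": lambda u: "reading"})
```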

Ability lifecycle

How it actually works

1. User is in Main Flow. Having a normal conversation with the Personality.
2. Trigger word matches. The user says something matching your Ability’s trigger.
3. Main Flow calls your call(). Your Ability takes over.
4. You speak, listen, act. Whatever logic your Ability runs.
5. Return control. Call resume_normal_flow() — the user is back in Main Flow.

Key implications

  • You can read conversation history from before your trigger
  • Anything you say via speak() enters the Personality’s conversation history
  • You cannot silently inject text — the agent has to say it out loud
  • You must always hand control back or the Personality goes silent
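Put together, a minimal Ability skeleton might look like this. The class shape is illustrative; only resume_normal_flow() and call() come from the lifecycle described above, and both are injected here as stubs so the sketch is self-contained:

```python
class CalendarAbility:
    # Hypothetical skeleton: Main Flow invokes call() with conversation
    # history, the Ability speaks, and control is ALWAYS handed back.
    def __init__(self, speak, resume_normal_flow):
        self.speak = speak
        self.resume_normal_flow = resume_normal_flow

    def call(self, history):
        try:
            trigger = history[-1]  # the utterance that fired the trigger
            self.speak(f"Looking at: {trigger}")
        finally:
            self.resume_normal_flow()  # even if speak() raised

spoken, resumed = [], []
ability = CalendarAbility(spoken.append, lambda: resumed.append(True))
ability.call(["Hi there", "What's on my calendar today?"])
```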

Quick mode vs. full mode

Classify at trigger time — not after a menu. The user’s phrasing tells you which experience they expect.
| User says | Mode | Why |
| --- | --- | --- |
| “Play jazz” | Quick — just do it | Phrasing is an instruction |
| “Help me build a playlist” | Full — enter an interactive loop | Phrasing invites collaboration |
| “Turn off the lights” | Quick | Direct command |
| “Set up my evening routine” | Full | Open-ended setup |
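A heuristic sketch of classifying at trigger time from phrasing alone (the cue list is invented; a real Ability would likely ask the LLM to classify):

```python
COLLABORATIVE_CUES = ("help me", "let's", "set up", "build", "plan",
                      "walk me through")

def choose_mode(utterance):
    # Collaborative phrasing -> full interactive loop; otherwise just do it.
    text = utterance.lower()
    return "full" if any(cue in text for cue in COLLABORATIVE_CUES) else "quick"
```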

The four Ability modes

| Mode | Trigger | Behavior | Examples |
| --- | --- | --- | --- |
| Interactive | User voice trigger | Takes over conversation, hands back when done | Weather, calendar, recipe walkthrough |
| Autonomous | Brain-triggered | No user initiation. System decides when to fire. | Proactive weather alert, smart reminder |
| Smart | Brain-triggered | Works silently, surfaces questions only when needed | Email draft needing approval |
| Watcher | Always running | Continuous. No user input ever. Monitors everything. | Meeting note-taker, life logger, alarm system |
See Ability Types and Background Abilities for the full reference.

The ability.md pattern

Every Ability ships with an ability.md file — YAML frontmatter (name + description) and markdown body (instructions). The description field is the ONLY field the system reads to decide when to trigger.
Bad description = never triggers, or triggers incorrectly. This is the single most important field for brain-triggered Abilities.
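The shape, sketched with invented values for a hypothetical calendar Ability (note the description says when to trigger, not what the Ability does internally):

```markdown
---
name: calendar
description: >
  Trigger when the user asks about their schedule, meetings, or
  availability, or wants to create, move, or cancel an event.
  Examples: "what's on my calendar", "am I free Tuesday",
  "book a time with Sarah".
---

You are a calendar assistant. Keep every spoken reply to 1-2 sentences.
Confirm before cancelling or moving any event.
```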

Quality checklist

Before you ship:
  • Every speak() call read aloud — does it flow?
  • Filler text before every API call >1s
  • run_confirmation_loop() before every destructive action
  • Exit words handled in every loop
  • resume_normal_flow() on every exit path
  • Emails/URLs/numbers pronounced correctly
  • Triggers tested by saying them out loud
  • Sound effects only where they earn their place
  • Time-of-day gating on any alert sound
  • ability.md description describes when to trigger, not what the Ability does internally