Every method below is accessed through self.capability_worker (the SDK) or self.worker (the Agent). This is the complete toolkit for building any Ability.

Twenty essential SDK methods

| # | Method | What it does | Async? | Object |
|---|--------|--------------|--------|--------|
| 1 | speak(text) | Speak text aloud using the Agent’s default voice | Yes | cap_worker |
| 2 | text_to_speech(text, voice_id) | Speak with a specific ElevenLabs voice ID | Yes | cap_worker |
| 3 | user_response() | Wait for the user’s next spoken input; returns a string | Yes | cap_worker |
| 4 | wait_for_complete_transcription() | Wait until the user fully finishes speaking before returning | Yes | cap_worker |
| 5 | run_io_loop(text) | Speak text, then wait for the user’s reply (speak + listen combo) | Yes | cap_worker |
| 6 | run_confirmation_loop(text) | Speak text, then loop until the user says yes or no; returns a bool | Yes | cap_worker |
| 7 | text_to_text_response(prompt, history, system) | Generate an LLM text response; synchronous, so no await | No | cap_worker |
| 8 | start_audio_recording() | Begin recording from the device mic (runs in the background) | No | cap_worker |
| 9 | stop_audio_recording() | Stop the current mic recording | No | cap_worker |
| 10 | get_audio_recording() | Return the recorded audio as .wav bytes | No | cap_worker |
| 11 | play_from_audio_file(filename) | Play an audio file bundled with your Ability | Yes | cap_worker |
| 12 | play_audio(file_content) | Play audio from bytes or a file-like object | Yes | cap_worker |
| 13 | resume_normal_flow() | Hand control back to the Personality; required on every main.py exit | No | cap_worker |
| 14 | send_interrupt_signal() | Stop the current assistant output; call before a daemon speak() | Yes | cap_worker |
| 15 | write_file(name, content, temp) | Write or append to persistent or session file storage | Yes | cap_worker |
| 16 | read_file(name, temp) | Read the contents of a stored file as a string | Yes | cap_worker |
| 17 | check_if_file_exists(name, temp) | Return a bool; always check before reading | Yes | cap_worker |
| 18 | get_full_message_history() | Full conversation transcript from the current session | No | cap_worker |
| 19 | get_timezone() | The user’s timezone string, e.g. "America/Chicago" | No | cap_worker |
| 20 | session_tasks.create(coro) | Launch a managed async task; use this instead of asyncio.create_task | No | worker |

Bonus methods

delete_file() · get_audio_recording_length() · flush_audio_recording() · send_data_over_websocket() · send_devkit_action() · get_token() · stream_init() / stream_end() · create_key() / update_key() / delete_key() / get_single_key() / get_all_keys() · update_personality_agent_prompt() · exec_local_command() · session_tasks.sleep()

OpenRouter models

Use OpenRouter (openrouter.ai) as a single API endpoint to access any model. Pick by job: fast/cheap for routing, multimodal for audio, high-quality for user-facing responses.
| # | Model | Speed | Best for | Notes |
|---|-------|-------|----------|-------|
| 1 | google/gemini-2.0-flash-001 | Very fast | Routing, general | Great all-rounder, cheap, supports audio input |
| 2 | google/gemini-2.5-flash-preview | Fast | Deep reasoning | Thinking model, more capable than 2.0 Flash |
| 3 | google/gemini-3-flash-preview | Fast | Audio analysis | Latest generation, strong multimodal |
| 4 | anthropic/claude-sonnet-4 | Medium | Quality responses | Excellent reasoning and tone control |
| 5 | anthropic/claude-haiku-4-5 | Very fast | Routing, speed | Cheapest Anthropic option, solid quality |
| 6 | openai/gpt-4o | Medium | General, vision | Strong all-rounder with multimodal support |
| 7 | openai/gpt-4o-mini | Very fast | Routing, cheap tasks | Fast and affordable for utility tasks |
| 8 | meta-llama/llama-3.3-70b-instruct | Fast | Open source | Great quality, fast via Groq/Cerebras |
| 9 | deepseek/deepseek-r1 | Slow | Deep analysis | Reasoning model, best for complex background tasks |
| 10 | mistralai/mistral-large-latest | Medium | Multilingual | Strong European language support |
Mix models in a single Ability. Use fast/cheap (Gemini Flash, Haiku, GPT-4o-mini) for intent routing and keyword extraction. Use quality models (Claude Sonnet, GPT-4o) for user-facing spoken responses. Use multimodal (Gemini Flash/Pro) for audio analysis.

Battle-tested prompt patterns

Each prompt below is designed for voice output — short, spoken, no markdown.

1. Intent router (JSON classification)

Classify this user input. Return ONLY valid JSON, nothing else.
{"intent": "weather|timer|music|chat", "confidence": 0.0-1.0}
User: {user_input}
Use with text_to_text_response(). Always strip markdown fences before parsing JSON.
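That fence-stripping step can be a small pure helper (plain Python, no SDK calls assumed):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Strip optional markdown code fences the model may add, then parse JSON."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)
```

Run every classifier response through this before touching the parsed fields; models often wrap JSON in fences even when told not to.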

2. Persona system prompt (voice character)

You are Marcus, a brutally honest venture capitalist. You speak in short,
punchy sentences. 2-4 sentences max. No markdown, no lists. This is spoken
aloud, not a blog post. Never say "as a VC" or "in my experience".
Use as system_prompt parameter. Keep persona prompts specific about length, format, and forbidden phrases.

3. Audio analysis — Pass 1 (general)

You are an expert audio analyst. Listen carefully to this recording and
provide a detailed analysis. Describe: what type of sound, environment,
acoustic characteristics (rhythm, pitch, texture, layers), and anything
unusual. Do NOT address the user. Write as pure third-person analysis.
Use with an OpenRouter audio-capable model (Gemini Flash/Pro). Send alongside base64 WAV.

4. Audio analysis — Pass 2 (specific with context)

Here is a general analysis already completed:
{general_analysis}

Now answer this specific question about the audio: "{user_question}"
Be precise. Use timestamps and specific details where possible.
Inject Pass 1 results as context. The two-pass pattern hides latency while providing deep answers.

5. Conversational response (with history)

You are [persona] in conversation about [topic].
--- ANALYSIS ---
{analysis}
--- CONVERSATION ---
{chat_history}

The user just said: "{user_input}"
Respond in 1-3 sentences, spoken aloud. Don't repeat yourself.
Inject accumulated analysis + full chat history. Context compounds with every turn.

6. LLM-driven time parser (alarm pattern)

You are an alarm time parser. Current: {now_iso}, Timezone: {tz_name}
If day/date missing, respond: QUESTION:at what day?
If time missing, respond: QUESTION:at what time?
When complete, return ONLY valid JSON:
{"target_iso": "...", "human_time": "...", "timezone": "..."}
Loop up to 6 rounds. If response starts with QUESTION:, ask the user and continue.
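The QUESTION: protocol reduces to one pure function; the surrounding loop would use run_io_loop() in a real Ability, and this sketch assumes the model obeys the output format:

```python
import json

def handle_parser_reply(reply: str):
    """Return ("ask", question) for a QUESTION: reply, else ("done", parsed alarm dict)."""
    reply = reply.strip()
    if reply.startswith("QUESTION:"):
        # Model needs more info: surface the question to the user
        return ("ask", reply[len("QUESTION:"):].strip())
    # Model returned the final JSON payload
    return ("done", json.loads(reply))
```

On "ask", speak the question, capture the answer, and append it to the next prompt; after 6 rounds with no "done", exit gracefully.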

7. Grocery list extractor

Extract a grocery list from this transcript. Organize by section
(produce, dairy, meat, pantry). Deduplicate and clean up.

Transcript: {transcript}
Grocery List:
Turns stream-of-consciousness rambling into structured, organized output.

8. Restart vs continue intent detection

A user is in a conversation about a sound they played. Determine if
they want to listen to a NEW sound (restart) or are asking about the
current sound (continue).

User said: "{user_input}"
Return ONLY valid JSON: {"intent": "restart or continue", "confidence": 0.0}
Two-tier approach: check fast keywords first, fall back to LLM only for ambiguous input.
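A sketch of the two-tier check (the keyword set here is illustrative, not from the source):

```python
RESTART_KEYWORDS = ("new sound", "another", "different one", "play something else")

def quick_route(user_input: str):
    """Tier 1: cheap keyword check. Returns "restart" on a hit, None if ambiguous."""
    lowered = user_input.lower()
    if any(keyword in lowered for keyword in RESTART_KEYWORDS):
        return "restart"
    return None  # ambiguous: fall through to the LLM classifier
```

Only when quick_route() returns None do you pay the latency and cost of an LLM call with the prompt above.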

9. Contextual voice assistant

You are a concise voice assistant for [domain] management.
USER: {name} | LOCATION: {city} | TIME: {current_time}

Rules: Keep responses to 2-4 sentences max. Be conversational.
Never say "as an AI" or "I don't have feelings".
Inject user context (name, location, time) for natural, personalized responses.

10. Farewell / exit summary

The conversation is ending. Here's the full history:
{history}

Give a 1-2 sentence parting thought. If the idea improved during the
conversation, acknowledge it. If not, give one last honest nudge.
Generate a contextual goodbye instead of a generic sign-off. Makes exits feel natural.

Architecture patterns

Ability categories

See Ability Types for the full breakdown.
| Category | Behavior |
|----------|----------|
| Skill | Trigger-word Ability. User says a hotword → runs a flow → exits with resume_normal_flow() |
| Brain Skill | The Personality’s brain auto-triggers it when it can’t fully answer or needs to delegate an action |
| Background Daemon | Auto-starts on session. Runs continuously, even in sleep mode. See Background Abilities |
| Local | Runs directly on Raspberry Pi hardware. Under development — see Local Ability |

File structure

| Type | Files | Description |
|------|-------|-------------|
| Standard interactive | main.py only | Triggered by hotwords, runs, exits with resume_normal_flow() |
| Standalone daemon | background.py only | Auto-starts on session. Background monitoring, logging, note-taking |
| Interactive + daemon | main.py + background.py | Interactive handles user requests; daemon monitors. Coordinate via shared files |

main.py vs background.py

| Aspect | main.py | background.py |
|--------|---------|---------------|
| call() signature | call(self, worker) | call(self, worker, background_daemon_mode) |
| CapabilityWorker init | CapabilityWorker(self) | CapabilityWorker(self) |
| Triggered by | User hotwords | Automatically on session start |
| Lifecycle | Runs once, then exits | Continuous while True loop |
| resume_normal_flow() | Required on every exit path | Not needed (runs in an independent thread) |
| Works in sleep mode | No | Yes |

Core patterns

The loop template (multi-turn conversation)

Greet → loop (listen → process → respond) → exit on command. Most common pattern for interactive Abilities.
```python
while True:
    user_input = await self.capability_worker.user_response()
    if any(word in user_input.lower() for word in EXIT_WORDS):
        break
    response = self.capability_worker.text_to_text_response(user_input)
    await self.capability_worker.speak(response)

self.capability_worker.resume_normal_flow()
```

The two-pass analysis pattern

Pass 1 fires in background immediately (general analysis). While it runs, the Ability talks to the user. Pass 2 fires with Pass 1 context injected, answering the user’s specific question from depth.
  • Pass 1: fire-and-forget via session_tasks.create(asyncio.to_thread(run_general))
  • Talk to user while Pass 1 runs (hides 10–15s of latency)
  • Pass 2: inject Pass 1 results as context, answer the specific question
  • Each follow-up turn fires a background re-analysis, enriching future turns
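The shape of the pattern in plain asyncio (a real Ability would use session_tasks.create and real model calls; run_general here is a stand-in for the blocking Pass 1 request):

```python
import asyncio

def run_general(audio_b64: str) -> str:
    # Stand-in for the slow Pass 1 model call (10-15s in practice)
    return "Steady rain on a metal roof, low rumble of distant thunder."

async def answer_question(audio_b64: str, user_question: str) -> str:
    # Pass 1 fires immediately in the background
    pass1 = asyncio.create_task(asyncio.to_thread(run_general, audio_b64))
    # ... speak to the user here while Pass 1 runs, hiding its latency ...
    general = await pass1
    # Pass 2: inject Pass 1 results as context for the specific question
    return (
        f"Here is a general analysis already completed:\n{general}\n\n"
        f'Now answer this specific question about the audio: "{user_question}"'
    )
```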

The rolling window pattern (ambient audio)

For always-on audio monitoring. Continuously record, slice the last N seconds, send to model on a fixed cadence. Fire-and-forget — never await inside the loop.
  • 10-second window, 3-second refresh cadence
  • API call fires as background task, poll loop never waits
  • Responses arrive asynchronously and log themselves
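Slicing the last N seconds out of the recording buffer is simple arithmetic; this sketch assumes mono 16-bit PCM at 16 kHz and ignores any WAV header:

```python
def last_n_seconds(pcm: bytes, n: int, sample_rate: int = 16000, sample_width: int = 2) -> bytes:
    """Return the tail of a raw PCM buffer covering the last n seconds."""
    tail = n * sample_rate * sample_width  # bytes per second x seconds
    return pcm[-tail:] if len(pcm) > tail else pcm
```

On each 3-second tick, slice the window, base64 it, and fire the model call as a background task so the poll loop never blocks.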

The coordination pattern (main.py + background.py)

Main writes data to persistent file storage. Background polls that file on a timer and acts on it. This is how alarms, reminders, and scheduled tasks work.
  • main.py: parse user input, write to JSON file, resume_normal_flow()
  • background.py: poll file every 15–30 seconds, check conditions, act
  • Use delete + write for JSON files — append corrupts JSON
  • Call send_interrupt_signal() before speaking from a daemon

The pending state pattern (multi-step collection)

Track what info you’re waiting for with a dictionary. Each loop iteration checks pending state first and routes input to the correct handler.
```python
self.pending_create = {"waiting_for": "title"}
# Next turn: user gives title → update to {"title": "X", "waiting_for": "time"}
# Next turn: user gives time → all info collected, execute the action
```
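One way to drive that dictionary (the field names and collection order here are illustrative):

```python
REQUIRED_FIELDS = ("title", "time")

def advance_pending(state: dict, user_input: str) -> dict:
    """Store the answer we were waiting for, then ask for the next missing field."""
    field = state.pop("waiting_for", None)
    if field:
        state[field] = user_input.strip()
    for needed in REQUIRED_FIELDS:
        if needed not in state:
            state["waiting_for"] = needed  # still collecting
            return state
    state["complete"] = True  # all info gathered, ready to execute
    return state
```

Each loop iteration calls advance_pending() with the latest transcription, then either asks for state["waiting_for"] or executes the action.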

Sandbox rules

Breaking these rules will fail the Ability scanner.
  • Never write register_capability() by hand — always use the platform tag
  • No import os, no import json at the top level outside the register block
  • No raw open() — use play_from_audio_file() for audio, the file storage API for data
  • No signal module — even in docstrings or comments, the scanner catches it
  • Always call resume_normal_flow() on every exit path in main.py
  • Use session_tasks.sleep() and session_tasks.create() — not raw asyncio
  • Wrap all blocking HTTP calls in asyncio.to_thread()
  • No print() — use editor_logging_handler
  • Blocked imports: redis, connection_manager, user_config, exec(), eval(), pickle

Voice UX best practices

  • Keep speak() to 1–2 sentences. This is voice, not text
  • Fill the silence: say “One sec” before any API call over 1 second
  • Read your speak() strings out loud before shipping
  • Handle messy voice input: use the LLM to extract clean data from noisy transcription
  • Offer exit at every loop iteration: check for “done”, “stop”, “quit”, etc.
  • Use run_confirmation_loop() before destructive actions (send, delete, cancel)
  • Idle detection: 1 empty response = keep going, 2 in a row = offer to leave
  • Namespace your filenames: smarthub_prefs.json not data.json
  • JSON persistence: always delete + write (append corrupts JSON)
  • API calls: always set timeout=10, wrap in try/except, speak errors to user