Every method below is accessed through self.capability_worker (the SDK) or self.worker (the Agent). This is the complete toolkit for building any Ability.
Twenty essential SDK methods
| # | Method | What it does | Async? | Object |
|---|---|---|---|---|
| 1 | speak(text) | Speak text aloud using the Agent’s default voice | Yes | cap_worker |
| 2 | text_to_speech(text, voice_id) | Speak with a specific ElevenLabs voice ID | Yes | cap_worker |
| 3 | user_response() | Wait for the user’s next spoken input, returns string | Yes | cap_worker |
| 4 | wait_for_complete_transcription() | Wait until the user fully finishes speaking before returning | Yes | cap_worker |
| 5 | run_io_loop(text) | Speak text, then wait for user reply (speak + listen combo) | Yes | cap_worker |
| 6 | run_confirmation_loop(text) | Speak text, loop until user says yes or no. Returns bool | Yes | cap_worker |
| 7 | text_to_text_response(prompt, history, system) | Generate LLM text response. The only sync method — no await | No | cap_worker |
| 8 | start_audio_recording() | Begin recording from device mic (runs in background) | No | cap_worker |
| 9 | stop_audio_recording() | Stop the current mic recording | No | cap_worker |
| 10 | get_audio_recording() | Returns recorded audio as .wav bytes | No | cap_worker |
| 11 | play_from_audio_file(filename) | Play an audio file bundled with your Ability | Yes | cap_worker |
| 12 | play_audio(file_content) | Play audio from bytes or file-like object | Yes | cap_worker |
| 13 | resume_normal_flow() | Hand control back to the Personality. Required on every main.py exit | No | cap_worker |
| 14 | send_interrupt_signal() | Stop current assistant output. Call before daemon speak() | Yes | cap_worker |
| 15 | write_file(name, content, temp) | Write or append to persistent or session file storage | Yes | cap_worker |
| 16 | read_file(name, temp) | Read contents of a stored file as string | Yes | cap_worker |
| 17 | check_if_file_exists(name, temp) | Returns bool — always check before reading | Yes | cap_worker |
| 18 | get_full_message_history() | Full conversation transcript from current session | No | cap_worker |
| 19 | get_timezone() | User’s timezone string, e.g. "America/Chicago" | No | cap_worker |
| 20 | session_tasks.create(coro) | Launch a managed async task. Use this instead of asyncio.create_task | No | worker |
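The methods above compose naturally. Below is a minimal sketch of how a few of them fit together in one flow; the method names match the table, but the `call_flow` helper and the surrounding scaffolding are simplified for illustration.

```python
import asyncio

# Illustrative flow: speak + listen, confirm, generate, speak, then exit.
async def call_flow(cap_worker):
    reply = await cap_worker.run_io_loop("What should I look up?")   # speak, then listen
    ok = await cap_worker.run_confirmation_loop(f"You said {reply}. Correct?")
    if ok:
        answer = cap_worker.text_to_text_response(reply)  # the one sync method: no await
        await cap_worker.speak(answer)
    cap_worker.resume_normal_flow()  # required on every exit path
```

Note that text_to_text_response() is called without await, matching row 7 of the table.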
Bonus methods
delete_file() · get_audio_recording_length() · flush_audio_recording() · send_data_over_websocket() · send_devkit_action() · get_token() · stream_init() / stream_end() · create_key() / update_key() / delete_key() / get_single_key() / get_all_keys() · update_personality_agent_prompt() · exec_local_command() · session_tasks.sleep()
OpenRouter models
Use OpenRouter (openrouter.ai) as a single API endpoint to access any model. Pick by job: fast/cheap for routing, multimodal for audio, high-quality for user-facing responses.
| # | Model | Speed | Best for | Notes |
|---|---|---|---|---|
| 1 | google/gemini-2.0-flash-001 | Very fast | Routing, general | Great all-rounder, cheap, supports audio input |
| 2 | google/gemini-2.5-flash-preview | Fast | Deep reasoning | Thinking model, more capable than 2.0 Flash |
| 3 | google/gemini-3-flash-preview | Fast | Audio analysis | Latest generation, strong multimodal |
| 4 | anthropic/claude-sonnet-4 | Medium | Quality responses | Excellent reasoning and tone control |
| 5 | anthropic/claude-haiku-4-5 | Very fast | Routing, speed | Cheapest Anthropic option, solid quality |
| 6 | openai/gpt-4o | Medium | General, vision | Strong all-rounder with multimodal support |
| 7 | openai/gpt-4o-mini | Very fast | Routing, cheap | Fast and affordable for utility tasks |
| 8 | meta-llama/llama-3.3-70b-instruct | Fast | Open source | Great quality, fast via Groq/Cerebras |
| 9 | deepseek/deepseek-r1 | Slow | Deep analysis | Reasoning model, best for complex background tasks |
| 10 | mistralai/mistral-large-latest | Medium | Multilingual | Strong European language support |
Mix models in a single Ability. Use fast/cheap (Gemini Flash, Haiku, GPT-4o-mini) for intent routing and keyword extraction. Use quality models (Claude Sonnet, GPT-4o) for user-facing spoken responses. Use multimodal (Gemini Flash/Pro) for audio analysis.
Battle-tested prompt patterns
Each prompt below is designed for voice output — short, spoken, no markdown.
1. Intent router (JSON classification)
Classify this user input. Return ONLY valid JSON, nothing else.
{"intent": "weather|timer|music|chat", "confidence": 0.0-1.0}
User: {user_input}
Use with text_to_text_response(). Always strip markdown fences before parsing JSON.
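Models often wrap JSON in markdown fences despite the "ONLY valid JSON" instruction. A small helper like this (the name `parse_intent_json` is illustrative) strips fences before parsing:

```python
import json

def parse_intent_json(raw: str) -> dict:
    """Strip any markdown code fences the model added, then parse the JSON."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("```")[1]        # keep the text between the fences
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]      # drop a "json" language tag
    return json.loads(cleaned.strip())
```

Pair it with the raw string returned by text_to_text_response().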
2. Persona system prompt (voice character)
You are Marcus, a brutally honest venture capitalist. You speak in short,
punchy sentences. 2-4 sentences max. No markdown, no lists. This is spoken
aloud, not a blog post. Never say "as a VC" or "in my experience".
Use as system_prompt parameter. Keep persona prompts specific about length, format, and forbidden phrases.
3. Audio analysis — Pass 1 (general)
You are an expert audio analyst. Listen carefully to this recording and
provide a detailed analysis. Describe: what type of sound, environment,
acoustic characteristics (rhythm, pitch, texture, layers), and anything
unusual. Do NOT address the user. Write as pure third-person analysis.
Use with an OpenRouter audio-capable model (Gemini Flash/Pro). Send alongside base64 WAV.
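A request body for this can be built in the OpenAI-compatible chat format that OpenRouter accepts. The exact "input_audio" content-part field names below are an assumption borrowed from the OpenAI chat schema; confirm them against OpenRouter's current docs before relying on them.

```python
import base64

def build_audio_payload(wav_bytes: bytes, prompt: str,
                        model: str = "google/gemini-2.0-flash-001") -> dict:
    # Chat-completions body in the OpenAI-compatible shape OpenRouter accepts.
    # The "input_audio" content part follows the OpenAI schema; verify the
    # field names against OpenRouter's docs, as they may differ.
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "input_audio",
                 "input_audio": {
                     "data": base64.b64encode(wav_bytes).decode("ascii"),
                     "format": "wav",
                 }},
            ],
        }],
    }
```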
4. Audio analysis — Pass 2 (specific with context)
Here is a general analysis already completed:
{general_analysis}
Now answer this specific question about the audio: "{user_question}"
Be precise. Use timestamps and specific details where possible.
Inject Pass 1 results as context. The two-pass pattern hides latency while providing deep answers.
5. Conversational response (with history)
You are [persona] in conversation about [topic].
--- ANALYSIS ---
{analysis}
--- CONVERSATION ---
{chat_history}
The user just said: "{user_input}"
Respond in 1-3 sentences, spoken aloud. Don't repeat yourself.
Inject accumulated analysis + full chat history. Context compounds with every turn.
6. LLM-driven time parser (alarm pattern)
You are an alarm time parser. Current: {now_iso}, Timezone: {tz_name}
If day/date missing, respond: QUESTION:at what day?
If time missing, respond: QUESTION:at what time?
When complete, return ONLY valid JSON:
{"target_iso": "...", "human_time": "...", "timezone": "..."}
Loop up to 6 rounds. If response starts with QUESTION:, ask the user and continue.
7. Grocery list extraction (structured output)
Extract a grocery list from this transcript. Organize by section
(produce, dairy, meat, pantry). Deduplicate and clean up.
Transcript: {transcript}
Grocery List:
Turns stream-of-consciousness rambling into structured, organized output.
8. Restart vs continue intent detection
A user is in a conversation about a sound they played. Determine if
they want to listen to a NEW sound (restart) or are asking about the
current sound (continue).
User said: "{user_input}"
Return ONLY valid JSON: {"intent": "restart or continue", "confidence": 0.0}
Two-tier approach: check fast keywords first, fall back to LLM only for ambiguous input.
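The two-tier approach might look like this; the keyword set and helper name are illustrative, and `llm_classify` stands in for a text_to_text_response() call with the prompt above.

```python
RESTART_WORDS = {"new", "another", "again", "different", "next"}  # illustrative list

def detect_restart(user_input: str, llm_classify) -> str:
    """Tier 1: free keyword check. Tier 2: LLM call, only for ambiguous input."""
    words = set(user_input.lower().split())
    if words & RESTART_WORDS:
        return "restart"          # unambiguous: no LLM call needed
    return llm_classify(user_input)
```

Most turns resolve in tier 1, so the LLM is only paid for when the wording is genuinely ambiguous.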
9. Contextual voice assistant
You are a concise voice assistant for [domain] management.
USER: {name} | LOCATION: {city} | TIME: {current_time}
Rules: Keep responses to 2-4 sentences max. Be conversational.
Never say "as an AI" or "I don't have feelings".
Inject user context (name, location, time) for natural, personalized responses.
10. Farewell / exit summary
The conversation is ending. Here's the full history:
{history}
Give a 1-2 sentence parting thought. If the idea improved during the
conversation, acknowledge it. If not, give one last honest nudge.
Generate a contextual goodbye instead of a generic sign-off. Makes exits feel natural.
Architecture patterns
Ability categories
See Ability Types for the full breakdown.
| Category | Behavior |
|---|---|
| Skill | Trigger-word Ability. User says hotword → runs a flow → exits with resume_normal_flow() |
| Brain Skill | Personality’s brain auto-triggers when it can’t fully answer or needs to delegate an action |
| Background Daemon | Auto-starts on session. Runs continuously. Works in sleep mode. See Background Abilities |
| Local | Runs directly on Raspberry Pi hardware. Under development — see Local Ability |
File structure
| Type | Files | Description |
|---|---|---|
| Standard interactive | main.py only | Triggered by hotwords, runs, exits with resume_normal_flow() |
| Standalone daemon | background.py only | Auto-starts on session. Background monitoring, logging, note-taking |
| Interactive + daemon | main.py + background.py | Interactive handles user requests. Daemon monitors. Coordinate via shared files |
main.py vs background.py
| Aspect | main.py | background.py |
|---|---|---|
| call() signature | call(self, worker) | call(self, worker, background_daemon_mode) |
| CapabilityWorker init | CapabilityWorker(self) | CapabilityWorker(self) |
| Triggered by | User hotwords | Automatically on session start |
| Lifecycle | Runs once, then exits | Continuous while True loop |
| resume_normal_flow() | Required on every exit path | Not needed (independent thread) |
| Works in sleep mode | No | Yes |
Core patterns
The loop template (multi-turn conversation)
Greet → loop (listen → process → respond) → exit on command. Most common pattern for interactive Abilities.
```python
while True:
    user_input = await self.capability_worker.user_response()
    if any(word in user_input.lower() for word in EXIT_WORDS):
        break
    response = self.capability_worker.text_to_text_response(user_input)
    await self.capability_worker.speak(response)
self.capability_worker.resume_normal_flow()
```
The two-pass analysis pattern
Pass 1 fires in background immediately (general analysis). While it runs, the Ability talks to the user. Pass 2 fires with Pass 1 context injected, answering the user’s specific question from depth.
- Pass 1: fire-and-forget via session_tasks.create(asyncio.to_thread(run_general))
- Talk to user while Pass 1 runs (hides 10–15s of latency)
- Pass 2: inject Pass 1 results as context, answer the specific question
- Each follow-up turn fires a background re-analysis, enriching future turns
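The steps above can be sketched as a small coroutine. All four callables are placeholders, and in a real Ability the first task would go through self.worker.session_tasks.create() rather than asyncio.create_task():

```python
import asyncio

async def two_pass_analysis(audio, question, analyze, refine, chat_filler):
    """Sketch of the two-pass pattern; callables stand in for real handlers."""
    pass1 = asyncio.create_task(asyncio.to_thread(analyze, audio))  # fire Pass 1
    await chat_filler()               # talk to the user while Pass 1 runs
    general = await pass1             # join: Pass 1 is usually finished by now
    return refine(general, question)  # Pass 2 answers from depth
```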
The rolling window pattern (ambient audio)
For always-on audio monitoring. Continuously record, slice the last N seconds, send to model on a fixed cadence. Fire-and-forget — never await inside the loop.
- 10-second window, 3-second refresh cadence
- API call fires as background task, poll loop never waits
- Responses arrive asynchronously and log themselves
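Slicing the window is simple byte math. The sketch below assumes headerless mono PCM at a known rate; note that get_audio_recording() returns .wav bytes, so in practice the WAV header would need to be stripped or re-attached first.

```python
def last_window(pcm: bytes, seconds: float, rate: int = 16000, width: int = 2) -> bytes:
    """Return the last N seconds of raw mono PCM (rate and sample width assumed)."""
    n = int(seconds * rate * width)   # bytes per window
    return pcm[-n:] if n else b""
```

Call it on each 3-second tick, hand the slice to a fire-and-forget task, and never await the API call inside the poll loop.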
The coordination pattern (main.py + background.py)
Main writes data to persistent file storage. Background polls that file on a timer and acts on it. This is how alarms, reminders, and scheduled tasks work.
- main.py: parse user input, write to JSON file, resume_normal_flow()
- background.py: poll file every 15–30 seconds, check conditions, act
- Use delete + write for JSON files — append corrupts JSON
- Call send_interrupt_signal() before speaking from a daemon
The pending state pattern (multi-step collection)
Track what info you’re waiting for with a dictionary. Each loop iteration checks pending state first and routes input to the correct handler.
```python
self.pending_create = {"waiting_for": "title"}
# Next turn: user gives title → update to {"title": "X", "waiting_for": "time"}
# Next turn: user gives time → all info collected, execute action
```
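The per-turn routing can be factored into a pure function, which also makes it easy to test. Field names here match the snippet above but the helper name is illustrative:

```python
def route_turn(pending: dict, user_input: str) -> dict:
    """Advance the pending-state machine by one conversational turn."""
    if pending.get("waiting_for") == "title":
        return {"title": user_input, "waiting_for": "time"}
    if pending.get("waiting_for") == "time":
        done = dict(pending, time=user_input)
        done.pop("waiting_for")
        return done   # all info collected: the caller executes the action
    return pending    # nothing pending: fall through to normal handling
```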
Sandbox rules
Breaking these rules will fail the Ability scanner.
- Never write register_capability() by hand — always use the platform tag
- No import os, no import json at the top level outside the register block
- No raw open() — use play_from_audio_file() for audio, the file storage API for data
- No signal module — even in docstrings or comments, the scanner catches it
- Always call resume_normal_flow() on every exit path in main.py
- Use session_tasks.sleep() and session_tasks.create() — not raw asyncio
- Wrap all blocking HTTP calls in asyncio.to_thread()
- No print() — use editor_logging_handler
- Blocked imports: redis, connection_manager, user_config, exec(), eval(), pickle
Voice UX best practices
- Keep speak() to 1–2 sentences. This is voice, not text
- Fill the silence: say “One sec” before any API call over 1 second
- Read your speak() strings out loud before shipping
- Handle messy voice input: use the LLM to extract clean data from noisy transcription
- Offer exit at every loop iteration: check for “done”, “stop”, “quit”, etc.
- Use run_confirmation_loop() before destructive actions (send, delete, cancel)
- Idle detection: 1 empty response = keep going, 2 in a row = offer to leave
- Namespace your filenames: smarthub_prefs.json not data.json
- JSON persistence: always delete + write (append corrupts JSON)
- API calls: always set timeout=10, wrap in try/except, speak errors to user