Every AI audio feature ever built does the same thing: it listens for words and converts them to text. The audio is stripped of everything except language — the grain of a voice, the breath, the room, the texture — all discarded. OpenHome + OpenRouter’s multimodal audio unlocks something different: an AI that genuinely listens.

What audio intelligence unlocks

| Domain | What the LLM hears |
| --- | --- |
| Music production | Space between notes, tempo drift, mix imbalance — what Rick Rubin hears |
| Home safety | Smoke alarms, breaking glass, CO alerts by acoustic signature, not keywords |
| Medical | Breath sounds — wheeze, crackle, deviations from baseline |
| Automotive | Engine knock, rattle, bearing wear before the warning light |
| Wildlife research | Species identification by call, behavioral patterns |
| Language learning | Pronunciation, prosody, accent drift |
None of these are transcription problems. They are listening problems.

The core pattern

```python
import base64
import requests

# 1. Capture audio with the hot mic
self.capability_worker.start_audio_recording()
await self.worker.session_tasks.sleep(10)  # or wait for a stop command
self.capability_worker.stop_audio_recording()
audio_bytes = self.capability_worker.get_audio_recording()

# 2. Send to a multimodal audio model via OpenRouter
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},
    json={
        "model": "google/gemini-2.5-flash-preview",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this sound. What's happening?"},
                {"type": "input_audio", "input_audio": {
                    "data": base64.b64encode(audio_bytes).decode(),
                    "format": "wav",
                }},
            ],
        }],
    },
    timeout=30,
)
response.raise_for_status()  # surface HTTP errors before parsing
analysis = response.json()["choices"][0]["message"]["content"]

# 3. Speak the analysis back to the user
await self.capability_worker.speak(analysis)
```
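If an Ability sends audio with more than one prompt, the request body above can be factored into a small helper so the base64 encoding and message shape live in one place. This is a sketch; `build_audio_request` is a hypothetical name for illustration, not part of the OpenHome SDK:

```python
import base64


def build_audio_request(audio_bytes: bytes, prompt: str,
                        model: str = "google/gemini-2.5-flash-preview",
                        audio_format: str = "wav") -> dict:
    """Build an OpenRouter chat-completions body for one audio clip.

    Hypothetical helper: wraps the text prompt and base64-encoded audio
    in the multimodal message format shown in the core pattern above.
    """
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "input_audio", "input_audio": {
                    "data": base64.b64encode(audio_bytes).decode(),
                    "format": audio_format,
                }},
            ],
        }],
    }


body = build_audio_request(b"\x00\x01", "Describe this sound.")
```

The same body then goes straight into `requests.post(..., json=body)`, keeping the capture and network code free of encoding details.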
| Use case | Model |
| --- | --- |
| General audio reasoning | `google/gemini-2.5-flash-preview` |
| Deepest audio analysis | `google/gemini-3-flash-preview` (latest multimodal) |
| Transcription-focused (speech only) | Deepgram Nova-3 |
For the full model matrix, see the SDK Reference.
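An Ability that serves several of these use cases can encode the table as a lookup and pass the result into the request's `model` field. The `pick_model` helper and its keys are illustrative, not SDK API:

```python
# Hypothetical mapping of use cases to OpenRouter model IDs from the table above.
MODEL_FOR_USE_CASE = {
    "general": "google/gemini-2.5-flash-preview",
    "deep": "google/gemini-3-flash-preview",
}


def pick_model(use_case: str) -> str:
    """Return the model ID for a use case, defaulting to general reasoning."""
    return MODEL_FOR_USE_CASE.get(use_case, MODEL_FOR_USE_CASE["general"])
```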

Two-pass analysis

Multimodal audio can be slow (10–15s). Use the two-pass pattern to hide latency:
  1. Pass 1 (fire-and-forget): send audio for general analysis while the Ability talks to the user
  2. Pass 2 (on-demand): when the user asks a specific question, inject Pass 1’s result and answer with depth
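The two passes map naturally onto an asyncio background task: Pass 1 is a task started as soon as audio is captured, and Pass 2 awaits it only when the user asks. A minimal sketch, with `analyze_audio` standing in for the OpenRouter call above (latency simulated, function names hypothetical):

```python
import asyncio


async def analyze_audio(audio_bytes: bytes) -> str:
    """Stand-in for the slow multimodal call; simulates 10-15s latency."""
    await asyncio.sleep(0.1)  # real call would take far longer
    return f"general analysis of {len(audio_bytes)} bytes"


async def two_pass_demo() -> str:
    audio = b"\x00" * 1024

    # Pass 1 (fire-and-forget): start the analysis, then keep talking
    # to the user while it runs in the background.
    pass1 = asyncio.create_task(analyze_audio(audio))
    # ... Ability converses with the user here, hiding the latency ...

    # Pass 2 (on-demand): the user asks a specific question, so await
    # the cached result and answer with depth.
    context = await pass1
    return f"Answering with context: {context}"


result = asyncio.run(two_pass_demo())
```

If the task finishes before the user asks anything, the `await` returns immediately, so the user never waits longer than the remaining analysis time.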

Next steps