Every AI audio feature ever built does the same thing: it listens for words and converts them to text. The audio is stripped of everything except language — the grain of a voice, the breath, the room, the texture — all discarded. OpenHome + OpenRouter’s multimodal audio unlocks something different: an AI that genuinely listens.

What audio intelligence unlocks

| Domain | What the LLM hears |
| --- | --- |
| Music production | Space between notes, tempo drift, mix imbalance — what Rick Rubin hears |
| Home safety | Smoke alarms, breaking glass, CO alerts by acoustic signature, not keywords |
| Medical | Breath sounds — wheeze, crackle, deviations from baseline |
| Automotive | Engine knock, rattle, bearing wear before the warning light |
| Wildlife research | Species identification by call, behavioral patterns |
| Language learning | Pronunciation, prosody, accent drift |
None of these are transcription problems. They are listening problems.

The core pattern

```python
import base64
import requests

# 1. Capture audio with the hot mic
self.capability_worker.start_audio_recording()
await self.worker.session_tasks.sleep(10)  # or wait for a stop command
self.capability_worker.stop_audio_recording()
audio_bytes = self.capability_worker.get_audio_recording()

# 2. Send to a multimodal audio model via OpenRouter
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},
    json={
        "model": "google/gemini-2.5-flash-preview",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this sound. What's happening?"},
                {"type": "input_audio", "input_audio": {
                    "data": base64.b64encode(audio_bytes).decode(),
                    "format": "wav",
                }},
            ],
        }],
    },
    timeout=30,
)
response.raise_for_status()  # surface HTTP errors before parsing
analysis = response.json()["choices"][0]["message"]["content"]

# 3. Speak the analysis back to the user
await self.capability_worker.speak(analysis)
```
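If an Ability sends audio with more than one prompt, the request body above can be factored into a small helper so the base64 encoding and message shape live in one place. This is a sketch; `build_audio_request` is a hypothetical name for illustration, not part of the OpenHome SDK:

```python
import base64


def build_audio_request(audio_bytes: bytes, prompt: str,
                        model: str = "google/gemini-2.5-flash-preview",
                        audio_format: str = "wav") -> dict:
    """Build an OpenRouter chat-completions body for one audio clip.

    Hypothetical helper: wraps the text prompt and base64-encoded audio
    in the multimodal message format shown in the core pattern above.
    """
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "input_audio", "input_audio": {
                    "data": base64.b64encode(audio_bytes).decode(),
                    "format": audio_format,
                }},
            ],
        }],
    }


body = build_audio_request(b"\x00\x01", "Describe this sound.")
```

The same body then goes straight into `requests.post(..., json=body)`, keeping the capture and network code free of encoding details.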
| Use case | Model |
| --- | --- |
| General audio reasoning | `google/gemini-2.5-flash-preview` |
| Deepest audio analysis | `google/gemini-3-flash-preview` (latest multimodal) |
| Transcription-focused (speech only) | Deepgram Nova-3 |
For the full model matrix, see the SDK Reference.
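An Ability that serves several of these use cases can encode the table as a lookup and pass the result into the request's `model` field. The `pick_model` helper and its keys are illustrative, not SDK API:

```python
# Hypothetical mapping of use cases to OpenRouter model IDs from the table above.
MODEL_FOR_USE_CASE = {
    "general": "google/gemini-2.5-flash-preview",
    "deep": "google/gemini-3-flash-preview",
}


def pick_model(use_case: str) -> str:
    """Return the model ID for a use case, defaulting to general reasoning."""
    return MODEL_FOR_USE_CASE.get(use_case, MODEL_FOR_USE_CASE["general"])
```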

Two-pass analysis

Multimodal audio can be slow (10–15s). Use the two-pass pattern to hide latency:
  1. Pass 1 (fire-and-forget): send audio for general analysis while the Ability talks to the user
  2. Pass 2 (on-demand): when the user asks a specific question, inject Pass 1’s result and answer with depth
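The two passes map naturally onto an asyncio background task: Pass 1 is a task started as soon as audio is captured, and Pass 2 awaits it only when the user asks. A minimal sketch, with `analyze_audio` standing in for the OpenRouter call above (latency simulated, function names hypothetical):

```python
import asyncio


async def analyze_audio(audio_bytes: bytes) -> str:
    """Stand-in for the slow multimodal call; simulates 10-15s latency."""
    await asyncio.sleep(0.1)  # real call would take far longer
    return f"general analysis of {len(audio_bytes)} bytes"


async def two_pass_demo() -> str:
    audio = b"\x00" * 1024

    # Pass 1 (fire-and-forget): start the analysis, then keep talking
    # to the user while it runs in the background.
    pass1 = asyncio.create_task(analyze_audio(audio))
    # ... Ability converses with the user here, hiding the latency ...

    # Pass 2 (on-demand): the user asks a specific question, so await
    # the cached result and answer with depth.
    context = await pass1
    return f"Answering with context: {context}"


result = asyncio.run(two_pass_demo())
```

If the task finishes before the user asks anything, the `await` returns immediately, so the user never waits longer than the remaining analysis time.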

Next steps