## What audio intelligence unlocks
| Domain | What the LLM hears |
|---|---|
| Music production | Space between notes, tempo drift, mix imbalance — what Rick Rubin hears |
| Home safety | Smoke alarms, breaking glass, CO alerts by acoustic signature, not keywords |
| Medical | Breath sounds — wheeze, crackle, deviations from baseline |
| Automotive | Engine knock, rattle, bearing wear before the warning light |
| Wildlife research | Species identification by call, behavioral patterns |
| Language learning | Pronunciation, prosody, accent drift |
## The core pattern
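In outline, the pattern is: capture audio, package it with a task-specific prompt, and send it to a multimodal model. A minimal sketch, assuming an OpenAI-compatible chat-completions API that accepts `input_audio` content parts (as OpenRouter does for the Gemini models below); the `build_audio_request` helper name is ours, not part of the SDK:

```python
import base64

def build_audio_request(audio_bytes: bytes, prompt: str,
                        model: str = "google/gemini-2.5-flash-preview",
                        fmt: str = "wav") -> dict:
    """Package raw audio plus an analysis prompt into a chat-completions payload."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "input_audio",
                 "input_audio": {"data": b64, "format": fmt}},
            ],
        }],
    }

payload = build_audio_request(b"\x00" * 16,
                              "Describe tempo drift and mix balance.")
```

The payload can then be POSTed to the completions endpoint with your usual client; only the `input_audio` content part differs from a text-only request.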
## Recommended models
| Use case | Model |
|---|---|
| General audio reasoning | google/gemini-2.5-flash-preview |
| Deepest audio analysis | google/gemini-3-flash-preview (latest multimodal) |
| Transcription-focused (speech only) | Deepgram Nova-3 |
## Two-pass analysis
Multimodal audio analysis can be slow (10–15 s). Use the two-pass pattern to hide the latency:

- Pass 1 (fire-and-forget): send the audio for general analysis while the Ability keeps talking to the user
- Pass 2 (on-demand): when the user asks a specific question, inject Pass 1's result and answer with depth
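The two passes can be sketched with `asyncio`: Pass 1 runs as a background task, and Pass 2 awaits it only when a specific question arrives. The `analyze_audio` stub stands in for the slow multimodal call; the function names are illustrative, not SDK APIs:

```python
import asyncio

async def analyze_audio(audio: bytes) -> str:
    # Placeholder for the slow multimodal call (10-15 s in practice).
    await asyncio.sleep(0.01)
    return "general analysis: steady tempo, slight low-mid buildup"

async def two_pass(audio: bytes, question_arrives: asyncio.Event,
                   question: str) -> str:
    # Pass 1 (fire-and-forget): start analysis; the conversation continues.
    pass1 = asyncio.create_task(analyze_audio(audio))
    await question_arrives.wait()   # user asks a specific question
    context = await pass1           # usually already finished by now
    # Pass 2 (on-demand): answer with Pass 1's result injected as context.
    return f"Q: {question}\nContext: {context}"

async def main() -> str:
    ev = asyncio.Event()
    ev.set()  # simulate the question arriving immediately
    return await two_pass(b"...", ev, "Is the tempo drifting?")

result = asyncio.run(main())
```

Because Pass 1 starts before the question is asked, the user-facing latency of Pass 2 is just the time remaining on the analysis, often near zero.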
## Next steps
- Hot Mic + Deepgram — the audio-recording API that powers all of this
- SDK Reference → Prompt patterns — prompts 3 and 4 are the audio-analysis workhorses
- Cookbook → Hot-mic + Deepgram showcase — 11 ideas already built on this pattern