@ariaflowagents/realtime-audio

Realtime audio pipeline for AriaFlow: the multi-provider foundation for speech-to-speech voice agents and their orchestration. The package currently ships provider clients for Google Gemini Live and OpenAI Realtime, a provider-agnostic RealtimeAudioClient interface that other providers can plug into, and a VoiceEngine / CallWorker pair that bridges any audio transport (WebSocket, LiveKit, etc.) to the chosen provider while handling tools, session state, and event logging via AriaFlow's Foundation primitives. (Renamed from @ariaflowagents/gemini-native-audio at v0.10.0; the "Gemini Live native audio" docs below describe the original Gemini-specific slice and remain accurate for that provider.)

What This Does

Unlike traditional voice pipelines (STT → LLM → TTS), Gemini Live accepts raw audio input and produces raw audio output in a single model call. This package wraps that capability for AriaFlow agents:

VoiceEngine — Call acceptor. Accepts incoming audio connections and creates per-call workers.
CallWorker — Per-call lifecycle manager. Bridges your audio transport (WebSocket, LiveKit, etc.) to a Gemini Live session. Handles tool calls, session state, and event logging using AriaFlow's Foundation primitives.
GeminiLiveSession — Thin wrapper around @google/genai ai.live.connect(). Manages the WebSocket connection to Gemini, audio encoding (base64 PCM ↔ Uint8Array), tool dispatch, and session resumption.
toolSetToGeminiDeclarations — Converts AriaFlow/AI SDK tool definitions (Zod schemas) to Gemini's FunctionDeclaration format.
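
To make the tool conversion concrete, here is a minimal sketch of the kind of mapping involved. It is not the package's implementation of toolSetToGeminiDeclarations: it assumes the tool's Zod schema has already been lowered to a plain JSON-Schema-like object, and the `JsonSchema` type and the `toGeminiSchema` / `toGeminiDeclaration` helpers are hypothetical names introduced for illustration. The only Gemini-specific detail it relies on is that Gemini schema nodes use uppercase type names.

```typescript
// Hypothetical, simplified JSON Schema subset (what a Zod schema might be lowered to).
type JsonSchema = {
  type: 'string' | 'number' | 'integer' | 'boolean' | 'array' | 'object';
  description?: string;
  enum?: string[];
  properties?: Record<string, JsonSchema>;
  required?: string[];
  items?: JsonSchema;
};

// Gemini schema node: same structure, uppercase type names.
type GeminiSchema = {
  type: 'STRING' | 'NUMBER' | 'INTEGER' | 'BOOLEAN' | 'ARRAY' | 'OBJECT';
  description?: string;
  enum?: string[];
  properties?: Record<string, GeminiSchema>;
  required?: string[];
  items?: GeminiSchema;
};

type GeminiFunctionDeclaration = {
  name: string;
  description?: string;
  parameters?: GeminiSchema;
};

// Recursively copy a schema node, upper-casing its type name.
function toGeminiSchema(schema: JsonSchema): GeminiSchema {
  const out: GeminiSchema = {
    type: schema.type.toUpperCase() as GeminiSchema['type'],
  };
  if (schema.description) out.description = schema.description;
  if (schema.enum) out.enum = [...schema.enum];
  if (schema.required) out.required = [...schema.required];
  if (schema.items) out.items = toGeminiSchema(schema.items);
  if (schema.properties) {
    out.properties = {};
    for (const [key, value] of Object.entries(schema.properties)) {
      out.properties[key] = toGeminiSchema(value);
    }
  }
  return out;
}

// Wrap a converted parameter schema with the tool's name and description.
function toGeminiDeclaration(
  name: string,
  description: string,
  parameters: JsonSchema,
): GeminiFunctionDeclaration {
  return { name, description, parameters: toGeminiSchema(parameters) };
}

// Example tool (hypothetical): book an appointment for a patient.
const bookAppointment = toGeminiDeclaration(
  'bookAppointment',
  'Book an appointment for a patient.',
  {
    type: 'object',
    properties: {
      patientName: { type: 'string' },
      date: { type: 'string', description: 'ISO 8601 date' },
    },
    required: ['patientName', 'date'],
  },
);
```

A real converter also has to deal with optional fields, unions, and nested Zod types; the sketch covers only the structural mapping.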

Architecture

┌─────────────┐     ┌─────────────┐     ┌────────────────────┐
│   Client     │────>│ CallWorker  │────>│ GeminiLiveSession  │
│  (WebSocket) │     │             │     │                    │
│              │<────│  audio +    │<────│  Gemini Live API   │
│  audio in/out│     │  tool calls │     │  (native audio)    │
└─────────────┘     └─────────────┘     └────────────────────┘
                          │
                          ├── ToolExecutor (runs AriaFlow tools)
                          ├── ConversationState (persists transcripts)
                          └── ConversationEventLog (records events)
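
The tool-call leg of the diagram can be sketched as follows. This is an illustration only, not AriaFlow's ToolExecutor: the `ToolFn` type, the `ToolOutcome` shape, and the `runToolWithTimeout` helper are hypothetical, and the timeout is implemented with a plain `Promise.race`.

```typescript
// Hypothetical tool signature: takes parsed arguments, returns any JSON-like result.
type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;

// Hypothetical result shape that would be sent back to the model.
type ToolOutcome =
  | { ok: true; result: unknown }
  | { ok: false; error: string };

async function runToolWithTimeout(
  tools: Record<string, ToolFn>,
  name: string,
  args: Record<string, unknown>,
  timeoutMs: number,
): Promise<ToolOutcome> {
  const tool = tools[name];
  if (!tool) return { ok: false, error: `Unknown tool: ${name}` };

  let timer: ReturnType<typeof setTimeout> | undefined;
  // Rejects after timeoutMs; races against the tool's own promise.
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Tool ${name} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });

  try {
    const result = await Promise.race([tool(args), timeout]);
    return { ok: true, result };
  } catch (err) {
    // Tool failures and timeouts are both reported as a failed outcome.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  } finally {
    // Always clear the timer so it cannot fire after the tool has settled.
    clearTimeout(timer);
  }
}

// Example registry (hypothetical tools).
const demoTools: Record<string, ToolFn> = {
  echo: async (args) => args,
  slow: () => new Promise((resolve) => setTimeout(() => resolve('late'), 200)),
};
```

Returning a failed outcome instead of throwing keeps the call alive: the model can be told the tool failed and carry on speaking. Note that `Promise.race` does not cancel the underlying tool, so a long-running tool would need its own cancellation mechanism.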

Usage

import { VoiceEngine } from '@ariaflowagents/realtime-audio';
import { createFoundation } from '@ariaflowagents/core/foundation';

const foundation = createFoundation({ /* ... */ });

const engine = new VoiceEngine({
  foundation,
  agents: [
    {
      id: 'receptionist',
      name: 'Hospital Receptionist',
      prompt: 'You are a hospital receptionist. Help patients schedule appointments.',
      voice: 'Charon', // Gemini voice preset
      tools: { /* AriaFlow tools */ },
    },
  ],
  defaultAgentId: 'receptionist',
  gemini: {
    apiKey: process.env.GOOGLE_API_KEY!,
    model: 'gemini-2.5-flash-native-audio-preview', // default
  },
});

// Accept a call from any audio transport
const worker = await engine.acceptCall({
  callId: crypto.randomUUID(), // Web Crypto global; where unavailable, import { randomUUID } from 'node:crypto'
  transport: myWebSocketTransport, // implements TransportSession
});

await worker.start();

TransportSession Interface

Implement this to connect any audio source/sink:

interface TransportSession {
  sendAudio(data: Uint8Array): void;       // Send audio to client
  onAudio(handler: (data: Uint8Array) => void): void;  // Receive audio from client
  onClose(handler: () => void): void;      // Handle disconnect
  close(): void;                           // Close the transport
}
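
As an illustration of the interface above, here is a minimal in-memory implementation that can stand in for a real transport in tests. The `InMemoryTransport` class and its `receiveFromClient` helper are hypothetical and not part of the package; the interface is repeated only so the sketch is self-contained.

```typescript
// Interface repeated from above so the sketch compiles on its own.
interface TransportSession {
  sendAudio(data: Uint8Array): void;
  onAudio(handler: (data: Uint8Array) => void): void;
  onClose(handler: () => void): void;
  close(): void;
}

// Hypothetical in-memory transport: no network, just arrays and callbacks.
class InMemoryTransport implements TransportSession {
  // Audio that was "sent to the client", kept for inspection in tests.
  readonly sent: Uint8Array[] = [];
  private audioHandlers: Array<(data: Uint8Array) => void> = [];
  private closeHandlers: Array<() => void> = [];
  private closed = false;

  sendAudio(data: Uint8Array): void {
    if (this.closed) return; // drop audio once the transport is closed
    this.sent.push(data);
  }

  onAudio(handler: (data: Uint8Array) => void): void {
    this.audioHandlers.push(handler);
  }

  onClose(handler: () => void): void {
    this.closeHandlers.push(handler);
  }

  close(): void {
    if (this.closed) return; // notify close handlers only once
    this.closed = true;
    for (const handler of this.closeHandlers) handler();
  }

  // Test helper (not part of the interface): simulate audio arriving from the client.
  receiveFromClient(data: Uint8Array): void {
    if (this.closed) return;
    for (const handler of this.audioHandlers) handler(data);
  }
}

// Usage: echo every incoming chunk straight back to the client.
const loopback = new InMemoryTransport();
let loopbackClosed = false;
loopback.onAudio((chunk) => loopback.sendAudio(chunk));
loopback.onClose(() => { loopbackClosed = true; });
loopback.receiveFromClient(new Uint8Array([1, 2, 3]));
loopback.close();
```

A WebSocket or LiveKit adapter would follow the same shape, with `sendAudio` writing to the socket and the socket's message and close events feeding the registered handlers.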

Events

GeminiLiveSession emits RealtimeEvents:

Event	Description
`audio`	Raw PCM audio from Gemini (send to client)
`transcript`	Text transcript (user or assistant)
`tool-call`	Gemini wants to call a tool
`tool-result`	Tool execution result
`turn-complete`	Model finished speaking
`interrupted`	User interrupted the model
`session-resumed`	Session resumption handle updated
`error`	Error from Gemini
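
One way to consume these events is a discriminated union with an exhaustive switch. The event names below come from the table; every payload field (`data`, `role`, `text`, `handle`, and so on), the `CallView` type, and the `applyEvent` function are assumptions made for this sketch, so check the package's exported types for the real shapes.

```typescript
// Event names from the table above; payload fields are assumed for illustration.
type RealtimeEvent =
  | { type: 'audio'; data: Uint8Array }
  | { type: 'transcript'; role: 'user' | 'assistant'; text: string }
  | { type: 'tool-call'; name: string; args: Record<string, unknown> }
  | { type: 'tool-result'; name: string; result: unknown }
  | { type: 'turn-complete' }
  | { type: 'interrupted' }
  | { type: 'session-resumed'; handle: string }
  | { type: 'error'; error: Error };

// Hypothetical per-call state kept by the consumer.
interface CallView {
  playback: Uint8Array[]; // audio queued for the client
  transcript: string[];
  resumptionHandle?: string;
  errors: string[];
}

function applyEvent(view: CallView, event: RealtimeEvent): void {
  switch (event.type) {
    case 'audio':
      view.playback.push(event.data);
      break;
    case 'transcript':
      view.transcript.push(`${event.role}: ${event.text}`);
      break;
    case 'interrupted':
      // The user spoke over the model: drop audio that has not been played yet.
      view.playback.length = 0;
      break;
    case 'session-resumed':
      view.resumptionHandle = event.handle;
      break;
    case 'error':
      view.errors.push(event.error.message);
      break;
    case 'tool-call':
    case 'tool-result':
    case 'turn-complete':
      // Nothing to record in this sketch.
      break;
    default: {
      // Compile-time check that every event type is handled.
      const unreachable: never = event;
      throw new Error(`Unhandled event: ${JSON.stringify(unreachable)}`);
    }
  }
}

// Usage: fold a short event sequence into the view.
const view: CallView = { playback: [], transcript: [], errors: [] };
applyEvent(view, { type: 'audio', data: new Uint8Array([1, 2]) });
applyEvent(view, { type: 'transcript', role: 'assistant', text: 'How can I help?' });
applyEvent(view, { type: 'interrupted' });
applyEvent(view, { type: 'session-resumed', handle: 'handle-1' });
```

Clearing the playback queue on `interrupted` is the important part for voice agents: without it the client keeps playing audio the model produced before the user cut in.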

Key Details

Audio format: 16-bit PCM at 24 kHz (Gemini Live's audio output rate)
Default model: gemini-2.5-flash-native-audio-preview
Session resumption: Automatic — GeminiLiveSession tracks resumption handles
Tool execution: Uses AriaFlow's ToolExecutor with timeout support
State persistence: Transcripts are saved to the session via ConversationState
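
For sizing buffers, the audio format above works out to 48,000 bytes per second of audio (24,000 samples × 2 bytes), assuming a single mono channel. The helpers below are a hypothetical sketch of that arithmetic plus a Float32-to-PCM16 conversion; they are not exported by the package, and the little-endian byte order is an assumption.

```typescript
const SAMPLE_RATE_HZ = 24000; // from "16-bit PCM at 24kHz"
const BYTES_PER_SAMPLE = 2;   // 16-bit samples, mono assumed

// Duration in milliseconds of a PCM buffer of the given byte length.
function pcmDurationMs(byteLength: number, sampleRateHz = SAMPLE_RATE_HZ): number {
  return (byteLength / BYTES_PER_SAMPLE / sampleRateHz) * 1000;
}

// Number of bytes needed to hold the given duration of audio.
function pcmBytesForMs(durationMs: number, sampleRateHz = SAMPLE_RATE_HZ): number {
  return Math.round((durationMs / 1000) * sampleRateHz) * BYTES_PER_SAMPLE;
}

// Convert float samples in [-1, 1] (e.g. from Web Audio) to little-endian PCM16.
function floatToPcm16(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * BYTES_PER_SAMPLE);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Negative samples scale to -32768, positive samples to 32767.
    const value = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * BYTES_PER_SAMPLE, Math.round(value), true);
  }
  return out;
}

const oneSecondMs = pcmDurationMs(48000);   // 1000
const twentyMsBytes = pcmBytesForMs(20);    // 960
const pcm = floatToPcm16(new Float32Array([0, 1, -1]));
```

A 20 ms frame is a common unit for realtime audio, so a transport would typically move chunks of about 960 bytes at this rate.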

Peer Dependencies

@ariaflowagents/core — Foundation primitives (ToolExecutor, ConversationState, etc.)
ai (v6+) — Vercel AI SDK
zod — Schema definitions for tools