Package Exports
- @speech-sdk/core
- @speech-sdk/core/cartesia
- @speech-sdk/core/conversation
- @speech-sdk/core/conversation/errors
- @speech-sdk/core/deepgram
- @speech-sdk/core/elevenlabs
- @speech-sdk/core/fal-ai
- @speech-sdk/core/fish-audio
- @speech-sdk/core/google
- @speech-sdk/core/hume
- @speech-sdk/core/inworld
- @speech-sdk/core/mistral
- @speech-sdk/core/murf
- @speech-sdk/core/openai
- @speech-sdk/core/resemble
- @speech-sdk/core/xai
Readme
Speech SDK
The Speech SDK is a lightweight, provider-agnostic TypeScript toolkit designed to help build text-to-speech powered applications using popular providers like OpenAI, ElevenLabs, Deepgram, Cartesia, Google, and more. Cross-platform (Node.js, Edge, Browser) with minimal dependencies.
To learn more about the Speech SDK, check out https://speechsdk.dev/.
Install
npm install @speech-sdk/core
Using an AI Coding Assistant?
Add the speech-sdk skill to give your AI assistant full knowledge of this library:
npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk
Quick Start
import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
model: 'openai/gpt-4o-mini-tts',
text: 'Hello from speech-sdk!',
voice: 'alloy',
});
// Access the audio
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy-computed)
result.audio.mediaType; // "audio/mpeg"Volume normalization
Pass volumeDbfs to RMS-normalize the output to an absolute target loudness (must be ≤ 0; lower is quieter; -20 is the broadcast/podcast voice convention with ~20 dB of peak headroom):
const result = await generateSpeech({
model: 'openai/gpt-4o-mini-tts',
text: 'Hello from speech-sdk!',
voice: 'alloy',
volumeDbfs: -20,
});
result.audio.mediaType; // "audio/wav" — re-encoded after normalizationWhen volumeDbfs is set the SDK transparently asks the provider for its decodable PCM/WAV mode, normalizes the samples, and returns 16-bit mono WAV — so the response mediaType switches to audio/wav regardless of the provider's native default. Throws VolumeAdjustmentUnsupportedError if the provider has no decodable output mode.
Streaming
Use streamSpeech() instead of generateSpeech() to receive audio bytes incrementally as the provider produces them. The result's audio field is a standard ReadableStream<Uint8Array> that works in Node, Edge runtimes, and browsers.
import { streamSpeech } from "@speech-sdk/core";
const { audio, mediaType } = await streamSpeech({
model: "openai/tts-1",
text: "Hello from the speech SDK!",
voice: "alloy",
});
Pipe to a file (Node)
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
const { audio } = await streamSpeech({
model: "elevenlabs/eleven_flash_v2_5",
text: "Hello world",
voice: "JBFqnCBsd6RMkjVDRZzb",
});
await new Promise((resolve, reject) => {
Readable.fromWeb(audio).pipe(createWriteStream("out.mp3")).on("finish", resolve).on("error", reject);
});
Forward to an HTTP response (Edge / Workers / Next.js Route Handler)
export async function GET() {
const { audio, mediaType } = await streamSpeech({
model: "cartesia/sonic-3",
text: "Streaming straight to the client.",
voice: "voice-id",
});
return new Response(audio, { headers: { "Content-Type": mediaType } });
}
Read chunks manually
const reader = audio.getReader();
while (true) {
const { value, done } = await reader.read();
if (done) break;
// value is a Uint8Array of audio bytes
}
Capability check
Check whether a model supports streaming before calling streamSpeech():
import { hasFeature } from "@speech-sdk/core";
// `provider` is assumed to be a provider instance that exposes its model list
const model = provider.models.find((m) => m.id === "tts-1");
if (hasFeature(model, "streaming")) {
// safe to call streamSpeech()
}
Calling streamSpeech() on a model that doesn't declare the "streaming" feature throws StreamingNotSupportedError.
Errors and retries
Retries apply only to the initial request, until response headers arrive. Once bytes start flowing, mid-stream errors propagate to the ReadableStream consumer as a stream error and are not retried. Pass maxRetries (default 2) and an abortSignal the same way as generateSpeech().
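A quick sketch of wiring both options into streamSpeech() (the timeout value is arbitrary):
import { streamSpeech } from "@speech-sdk/core";

// Cancel the request if it runs too long
const controller = new AbortController();
setTimeout(() => controller.abort(), 30_000);

const { audio } = await streamSpeech({
  model: "openai/tts-1",
  text: "Hello from the speech SDK!",
  voice: "alloy",
  maxRetries: 3, // applies only until response headers arrive
  abortSignal: controller.signal,
});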
Conversations
generateConversation() produces a single multi-voice audio clip from an ordered array of turns. It picks the best path automatically:
- Native dialogue — when every turn shares one model and that provider has a real multi-speaker dialogue endpoint, the SDK makes a single API call and returns the provider's natural mix. Works with ElevenLabs v3, Google Gemini TTS (exactly 2 voices), Hume Octave, Fish Audio S2-Pro, and fal Dia.
- Stitch fallback — when turns span multiple providers, or the chosen model has no native dialogue endpoint, the SDK calls generateSpeech() per turn in parallel, normalizes each result to PCM, RMS-levels them so quieter providers don't get drowned out, inserts a configurable silence between turns, and returns a single WAV.
import { generateConversation } from "@speech-sdk/core/conversation";
const result = await generateConversation({
turns: [
{ model: "openai/tts-1", voice: "nova", text: "Hi, I'm hosted by OpenAI." },
{ model: "elevenlabs/eleven_multilingual_v2", voice: "JBFqnCBsd6RMkjVDRZzb", text: "And I'm hosted by ElevenLabs." },
{ model: "google/gemini-3.1-flash-tts-preview", voice: "Kore", text: "I'm Gemini three-point-one flash TTS." },
{ model: "hume/octave-2", voice: "Kora", text: "And I'm Hume Octave. Thanks for listening." },
],
});
result.audio.uint8Array; // Uint8Array of one combined WAV
result.audio.mediaType; // "audio/wav"The return type is the standard SpeechResult, so it composes with everything else in the SDK.
Conversation options
generateConversation({
model?: string | ResolvedModel, // default model for all turns
turns: ConversationTurn[], // 1..N turns; any number of unique voices
gapMs?: number, // silence between turns (stitch path), default 300
normalizeVolume?: boolean, // RMS-level the output, default true
volumeDbfs?: number, // RMS target loudness in dBFS (≤0), default -20
maxConcurrency?: number, // cap parallel generateSpeech calls, default 6
maxRetries?: number, // per-turn retries, default 2
apiKey?: string,
providerOptions?: Record<string, unknown>, // forwarded to every provider; per-turn override available
abortSignal?: AbortSignal,
headers?: Record<string, string>,
});
interface ConversationTurn {
voice: Voice; // required
text: string; // required, non-empty
model?: string | ResolvedModel; // per-turn override of the top-level model
providerOptions?: Record<string, unknown>; // stitch path only; see note below
}
Per-turn providerOptions are merged with the top-level providerOptions on the stitch path — each turn's underlying generateSpeech() call receives { ...topLevel, ...turn, ...stitchDefaults }. On the native-dialogue path the provider renders the whole script in one API call, so per-turn overrides have no well-defined meaning; setting providerOptions on any turn throws ConversationInputError. Move the options to the top-level providerOptions (forwarded once to the dialogue call) instead.
Volume normalization
normalizeVolume: true (the default) RMS-normalizes the output to an absolute target loudness — broadcast/podcast voice convention — so two generateConversation calls produce comparable levels regardless of provider mix or content. The target defaults to −20 dBFS (~20 dB of peak headroom), and is configurable via volumeDbfs (must be ≤ 0; lower is quieter).
await generateConversation({
turns: [...],
volumeDbfs: -16, // a touch louder than the default
});
Normalization runs on both paths — stitched multi-provider conversations and single-provider native dialogue. On the native path the SDK transparently asks the provider for its decodable PCM/WAV mode (via getStitchOptions), levels the result, and re-encodes as 16-bit mono WAV — so the response mediaType becomes audio/wav whenever normalization runs. If a native dialogue provider can't emit decodable audio, the request still succeeds but a warning is appended explaining that volume normalization was skipped.
Pass normalizeVolume: false to skip normalization entirely (zero work) and keep the raw provider audio bytes and mediaType untouched.
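For example, to keep the result exactly as the provider returned it:
import { generateConversation } from "@speech-sdk/core/conversation";

const raw = await generateConversation({
  turns: [
    { model: "openai/tts-1", voice: "nova", text: "Raw provider audio, untouched." },
  ],
  normalizeVolume: false, // zero post-processing; bytes and mediaType come straight from the provider
});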
Errors
Conversation-specific errors (re-exported from @speech-sdk/core/conversation alongside generateConversation, or importable on their own from @speech-sdk/core/conversation/errors):
| Error | When |
|---|---|
| ConversationInputError | Validation failure — empty turns, blank text, or a turn missing a model |
| DialogueConstraintError | A native-dialogue provider was selected but the conversation violates its constraints (e.g. 3 voices on Gemini, which requires exactly 2) |
| StitchUnsupportedError | The stitch path was selected but a chosen provider/model can't emit PCM/WAV |
Native dialogue caps
| Provider | Native dialogue model | Voice constraints |
|---|---|---|
| ElevenLabs | eleven_v3 | 1–10 voices, ≤ 2,000 total chars |
| Google | gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts, gemini-3.1-flash-tts-preview | Exactly 2 voices (API requirement) |
| Hume | octave-1, octave-2 | 1–4 voices |
| Fish Audio | s2-pro | 1–4 voices |
| fal | dia-tts | 1–2 voices |
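For instance, a two-voice conversation that keeps every turn on eleven_v3 stays within those caps and takes the native-dialogue path in a single API call (the voice IDs are placeholders):
import { generateConversation } from "@speech-sdk/core/conversation";

const dialogue = await generateConversation({
  model: "elevenlabs/eleven_v3", // one model shared by every turn, so the native-dialogue path is used
  turns: [
    { voice: "voice-id-1", text: "Welcome back to the show." },
    { voice: "voice-id-2", text: "Happy to be here!" },
  ],
});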
Supported Providers
Use provider/model strings. Passing just the provider name uses its default model.
| Provider | String Prefix | Default Model | Env Var | Docs |
|---|---|---|---|---|
| OpenAI | openai | gpt-4o-mini-tts | OPENAI_API_KEY | API Reference |
| ElevenLabs | elevenlabs | eleven_multilingual_v2 | ELEVENLABS_API_KEY | API Reference |
| Deepgram | deepgram | aura-2 | DEEPGRAM_API_KEY | API Reference |
| Cartesia | cartesia | sonic-3 | CARTESIA_API_KEY | API Reference |
| Hume | hume | octave-2 | HUME_API_KEY | API Reference |
| Inworld | inworld | inworld-tts-1.5-max | INWORLD_API_KEY | API Reference |
| Google (Gemini TTS) | google | gemini-2.5-flash-preview-tts | GOOGLE_API_KEY | API Reference |
| Fish Audio | fish-audio | s2-pro | FISH_AUDIO_API_KEY | API Reference |
| Murf | murf | GEN2 | MURF_API_KEY | API Reference |
| Resemble | resemble | default | RESEMBLE_API_KEY | API Reference |
| fal | fal-ai | (user-specified) | FAL_API_KEY | API Reference |
| Mistral | mistral | voxtral-mini-tts-2603 | MISTRAL_API_KEY | API Reference |
| xAI | xai | grok-tts | XAI_API_KEY | API Reference |
generateSpeech({ model: 'openai/tts-1', text: '...', voice: 'alloy' });
generateSpeech({ model: 'elevenlabs/eleven_v3', text: '...', voice: 'voice-id' });
generateSpeech({ model: 'deepgram/aura-2', text: '...', voice: 'thalia-en' });
generateSpeech({ model: 'inworld/inworld-tts-1.5-max', text: '...', voice: 'Ashley' });
generateSpeech({ model: 'openai', text: '...', voice: 'alloy' }); // uses default model
Provider-specific API parameters can be passed via providerOptions — these are sent directly to the provider's API using the API's own field names.
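For instance, OpenAI's speech endpoint accepts a speed field and ElevenLabs accepts voice_settings; a sketch assuming both objects are forwarded verbatim as described above:
generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: '...',
  voice: 'alloy',
  providerOptions: { speed: 1.1 }, // OpenAI API field
});

generateSpeech({
  model: 'elevenlabs/eleven_multilingual_v2',
  text: '...',
  voice: 'voice-id',
  providerOptions: { voice_settings: { stability: 0.5, similarity_boost: 0.75 } }, // ElevenLabs API fields
});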
Custom Configuration
Use factory functions when you need custom API keys, base URLs, or fetch implementations:
import { generateSpeech } from '@speech-sdk/core';
import { createOpenAI } from '@speech-sdk/core/openai';
import { createElevenLabs } from '@speech-sdk/core/elevenlabs';
const myOpenAI = createOpenAI({
apiKey: 'sk-...',
baseURL: 'https://my-proxy.com/v1',
});
const result = await generateSpeech({
model: myOpenAI('gpt-4o-mini-tts'),
text: 'Hello!',
voice: 'alloy',
});
API Key Resolution
When using string models (e.g., 'openai/tts-1'), API keys are resolved from environment variables (see table above). Factory functions accept an explicit apiKey option which takes precedence.
Audio Tags
Use bracket syntax [tag] to add expressive audio cues like laughter, sighs, or emotions. Provider support varies — unsupported tags are automatically stripped with warnings returned in result.warnings.
const result = await generateSpeech({
model: 'elevenlabs/eleven_v3',
text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
voice: 'voice-id',
});
console.log(result.warnings); // undefined — eleven_v3 supports all tags
Provider behavior
| Provider | Behavior |
|---|---|
| OpenAI (gpt-4o-mini-tts) | Tags mapped to the instructions field for expressive delivery control |
| ElevenLabs (eleven_v3) | All [tag] passed through natively |
| Google (gemini-3.1-flash-tts-preview) | All [tag] passed through natively (e.g. [whispers], [shouting], [sighs], [laugh]) |
| Cartesia (sonic-3) | Emotion tags ([happy], [sad], [angry], etc.) converted to SSML; [laughter] passed through; unknown tags stripped |
| All others | Tags stripped and warnings returned |
// OpenAI gpt-4o-mini-tts — tags are mapped to the `instructions` field
const result = await generateSpeech({
model: 'openai/gpt-4o-mini-tts',
text: '[cheerfully] Hi John how are you? [soft] I\'m feeling great',
voice: 'alloy',
});
// Sent to OpenAI:
// input: "Hi John how are you? I'm feeling great"
// instructions: "Delivery shifts through the text in order: begin cheerfully, then soft."
console.log(result.warnings); // undefined
Voice Cloning
Some providers support voice cloning via reference audio. Pass a voice object instead of a string:
import { createMistral } from '@speech-sdk/core/mistral';
const mistral = createMistral();
// Clone from base64 audio
const result = await generateSpeech({
model: mistral(),
text: 'Hello!',
voice: { audio: 'base64-encoded-audio...' },
});
Clone from a URL (fal):
import { createFal } from '@speech-sdk/core/fal-ai';
const fal = createFal();
const result = await generateSpeech({
model: fal('fal-ai/chatterbox'),
text: 'Hello!',
voice: { url: 'https://example.com/reference.wav' },
});
Options
generateSpeech({
model: string | ResolvedModel, // required
text: string, // required
voice: Voice, // required
providerOptions?: object, // provider-specific API params
maxRetries?: number, // default: 2 (retries on 5xx/network errors)
abortSignal?: AbortSignal, // cancel the request
headers?: Record<string, string>, // additional HTTP headers
});
Result
interface SpeechResult {
audio: {
uint8Array: Uint8Array; // raw audio bytes
base64: string; // base64 encoded (lazy)
mediaType: string; // e.g. "audio/mpeg"
};
providerMetadata?: Record<string, unknown>;
}
Error Handling
import { generateSpeech, ApiError, SpeechSDKError } from '@speech-sdk/core';
try {
const result = await generateSpeech({ ... });
} catch (error) {
if (error instanceof ApiError) {
console.log(error.statusCode); // 401
console.log(error.model); // "openai/gpt-4o-mini-tts"
console.log(error.responseBody);
}
}
| Error | When |
|---|---|
| ApiError | Provider API returns a non-2xx response |
| NoSpeechGeneratedError | Provider returned empty audio |
| SpeechSDKError | Base class for all errors |
Retry
Built-in retry with exponential backoff via p-retry. Retries on 5xx and network errors. Does not retry 4xx errors. Default: 2 retries.
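A sketch of overriding this per call; setting maxRetries to 0 disables automatic retries:
import { generateSpeech, ApiError } from '@speech-sdk/core';

try {
  const result = await generateSpeech({
    model: 'openai/gpt-4o-mini-tts',
    text: 'Hello!',
    voice: 'alloy',
    maxRetries: 0, // fail fast on the first error
  });
} catch (error) {
  if (error instanceof ApiError && error.statusCode >= 500) {
    // provider-side failure surfaced immediately instead of being retried
  }
}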
Development
pnpm install
pnpm test # unit tests
pnpm run test:e2e # e2e tests (requires API keys)
pnpm run typecheck # type-check without emitting
E2E tests hit real provider APIs. Set the relevant API key environment variables in a .env file or export them in your shell.
Set SPEECH_SDK_E2E_OUTPUT_DIR to have the conversation e2e tests write their generated audio to disk (useful for sampling/comparing provider output):
SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos pnpm run test:e2e
License
MIT