Package Exports
- @speech-sdk/core
- @speech-sdk/core/cartesia
- @speech-sdk/core/conversation
- @speech-sdk/core/conversation/errors
- @speech-sdk/core/deepgram
- @speech-sdk/core/elevenlabs
- @speech-sdk/core/fal-ai
- @speech-sdk/core/fish-audio
- @speech-sdk/core/google
- @speech-sdk/core/hume
- @speech-sdk/core/inworld
- @speech-sdk/core/mistral
- @speech-sdk/core/murf
- @speech-sdk/core/openai
- @speech-sdk/core/resemble
- @speech-sdk/core/xai
Readme
Speech SDK
The Speech SDK is a lightweight, provider-agnostic TypeScript toolkit designed to help build text-to-speech powered applications using popular providers like OpenAI, ElevenLabs, Deepgram, Cartesia, Google, and more. Cross-platform (Node.js, Edge, Browser) with minimal dependencies.
To learn more about the Speech SDK, check out https://speechsdk.dev/.
Install
npm install @speech-sdk/core
Using an AI Coding Assistant?
Add the speech-sdk skill to give your AI assistant full knowledge of this library:
npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk
Quick Start
import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
model: 'openai/gpt-4o-mini-tts',
text: 'Hello from speech-sdk!',
voice: 'alloy',
});
// Access the audio
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy-computed)
result.audio.mediaType; // "audio/mpeg"Volume normalization
Pass volumeDbfs to RMS-normalize the output to an absolute target loudness (must be ≤ 0; lower is quieter; -20 is the broadcast/podcast voice convention with ~20 dB of peak headroom):
const result = await generateSpeech({
model: 'openai/gpt-4o-mini-tts',
text: 'Hello from speech-sdk!',
voice: 'alloy',
volumeDbfs: -20,
});
result.audio.mediaType; // "audio/wav" — re-encoded after normalizationWhen volumeDbfs is set the SDK transparently asks the provider for its decodable PCM/WAV mode, normalizes the samples, and returns 16-bit mono WAV — so the response mediaType switches to audio/wav regardless of the provider's native default. Throws VolumeAdjustmentUnsupportedError if the provider has no decodable output mode.
Streaming
Use streamSpeech() instead of generateSpeech() to receive audio bytes incrementally as the provider produces them. The result's audio field is a standard ReadableStream<Uint8Array> that works in Node, Edge runtimes, and browsers.
import { streamSpeech } from "@speech-sdk/core";
const { audio, mediaType } = await streamSpeech({
model: "openai/tts-1",
text: "Hello from the speech SDK!",
voice: "alloy",
});
Pipe to a file (Node)
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
const { audio } = await streamSpeech({
model: "elevenlabs/eleven_flash_v2_5",
text: "Hello world",
voice: "JBFqnCBsd6RMkjVDRZzb",
});
await new Promise((resolve, reject) => {
Readable.fromWeb(audio).pipe(createWriteStream("out.mp3")).on("finish", resolve).on("error", reject);
});
Forward to an HTTP response (Edge / Workers / Next.js Route Handler)
export async function GET() {
const { audio, mediaType } = await streamSpeech({
model: "cartesia/sonic-3",
text: "Streaming straight to the client.",
voice: "voice-id",
});
return new Response(audio, { headers: { "Content-Type": mediaType } });
}
Read chunks manually
const reader = audio.getReader();
while (true) {
const { value, done } = await reader.read();
if (done) break;
// value is a Uint8Array of audio bytes
}
Capability check
Check whether a model supports streaming before calling streamSpeech():
import { hasFeature } from "@speech-sdk/core";
// `provider` is assumed to be a provider instance that exposes its model list
const model = provider.models.find((m) => m.id === "tts-1");
if (hasFeature(model, "streaming")) {
// safe to call streamSpeech()
}
Calling streamSpeech() on a model that doesn't declare the "streaming" feature throws StreamingNotSupportedError.
Errors and retries
Retries apply only to the initial request, until response headers arrive. Once bytes start flowing, mid-stream errors propagate to the ReadableStream consumer as a stream error and are not retried. Pass maxRetries (default 2) and an abortSignal the same way as generateSpeech().
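A quick sketch of wiring both options into streamSpeech() (the timeout value is arbitrary):
import { streamSpeech } from "@speech-sdk/core";

// Cancel the request if it runs too long
const controller = new AbortController();
setTimeout(() => controller.abort(), 30_000);

const { audio } = await streamSpeech({
  model: "openai/tts-1",
  text: "Hello from the speech SDK!",
  voice: "alloy",
  maxRetries: 3, // applies only until response headers arrive
  abortSignal: controller.signal,
});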
Conversations
generateConversation() produces a single multi-voice audio clip from an ordered array of turns. It picks the best path automatically:
- Native dialogue — when every turn shares one model and that provider has a real multi-speaker dialogue endpoint, the SDK makes a single API call and returns the provider's natural mix. Works with ElevenLabs v3, Google Gemini TTS (exactly 2 voices), Hume Octave, Fish Audio S2-Pro, and fal Dia.
- Stitch fallback — when turns span multiple providers, or the chosen model has no native dialogue endpoint, the SDK calls generateSpeech() per turn in parallel, normalizes each result to PCM, RMS-levels them so quieter providers don't get drowned out, inserts a configurable silence between turns, and returns a single WAV.
import { generateConversation } from "@speech-sdk/core/conversation";
const result = await generateConversation({
turns: [
{ model: "openai/tts-1", voice: "nova", text: "Hi, I'm hosted by OpenAI." },
{ model: "elevenlabs/eleven_multilingual_v2", voice: "JBFqnCBsd6RMkjVDRZzb", text: "And I'm hosted by ElevenLabs." },
{ model: "google/gemini-3.1-flash-tts-preview", voice: "Kore", text: "I'm Gemini three-point-one flash TTS." },
{ model: "hume/octave-2", voice: "Kora", text: "And I'm Hume Octave. Thanks for listening." },
],
});
result.audio.uint8Array; // Uint8Array of one combined WAV
result.audio.mediaType; // "audio/wav"The return type is the standard SpeechResult, so it composes with everything else in the SDK.
Conversation options
generateConversation({
model?: string | ResolvedModel, // default model for all turns
turns: ConversationTurn[], // 1..N turns; any number of unique voices
gapMs?: number, // silence between turns (stitch path), default 300
normalizeVolume?: boolean, // RMS-level the output, default true
volumeDbfs?: number, // RMS target loudness in dBFS (≤0), default -20
maxConcurrency?: number, // cap parallel generateSpeech calls, default 6
maxRetries?: number, // per-turn retries, default 2
apiKey?: string,
providerOptions?: Record<string, unknown>, // forwarded to every provider; per-turn override available
abortSignal?: AbortSignal,
headers?: Record<string, string>,
});
interface ConversationTurn {
voice: Voice; // required
text: string; // required, non-empty
model?: string | ResolvedModel; // per-turn override of the top-level model
providerOptions?: Record<string, unknown>; // stitch path only; see note below
}
Per-turn providerOptions are merged with the top-level providerOptions on the stitch path — each turn's underlying generateSpeech() call receives { ...topLevel, ...turn, ...stitchDefaults }. On the native-dialogue path the provider renders the whole script in one API call, so per-turn overrides have no well-defined meaning; setting providerOptions on any turn throws ConversationInputError. Move the options to the top-level providerOptions (forwarded once to the dialogue call) instead.
Volume normalization
normalizeVolume: true (the default) RMS-normalizes the output to an absolute target loudness — broadcast/podcast voice convention — so two generateConversation calls produce comparable levels regardless of provider mix or content. The target defaults to −20 dBFS (~20 dB of peak headroom), and is configurable via volumeDbfs (must be ≤ 0; lower is quieter).
await generateConversation({
turns: [...],
volumeDbfs: -16, // a touch louder than the default
});
Normalization runs on both paths — stitched multi-provider conversations and single-provider native dialogue. On the native path the SDK transparently asks the provider for its decodable PCM/WAV mode (via getStitchOptions), levels the result, and re-encodes as 16-bit mono WAV — so the response mediaType becomes audio/wav whenever normalization runs. If a native dialogue provider can't emit decodable audio, the request still succeeds but a warning is appended explaining that volume normalization was skipped.
Pass normalizeVolume: false to skip normalization entirely (zero work) and keep the raw provider audio bytes and mediaType untouched.
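For example, to keep the result exactly as the provider returned it:
import { generateConversation } from "@speech-sdk/core/conversation";

const raw = await generateConversation({
  turns: [
    { model: "openai/tts-1", voice: "nova", text: "Raw provider audio, untouched." },
  ],
  normalizeVolume: false, // zero post-processing; bytes and mediaType come straight from the provider
});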
Errors
Conversation-specific errors (re-exported from @speech-sdk/core/conversation alongside generateConversation, or importable on their own from @speech-sdk/core/conversation/errors):
| Error | When |
|---|---|
| ConversationInputError | Validation failure — empty turns, blank text, or a turn missing a model |
| DialogueConstraintError | A native-dialogue provider was selected but the conversation violates its constraints (e.g. 3 voices on Gemini, which requires exactly 2) |
| StitchUnsupportedError | The stitch path was selected but a chosen provider/model can't emit PCM/WAV |
Native dialogue caps
| Provider | Native dialogue model | Voice constraints |
|---|---|---|
| ElevenLabs | eleven_v3 | 1–10 voices, ≤ 2,000 total chars |
| Google | gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts, gemini-3.1-flash-tts-preview | Exactly 2 voices (API requirement) |
| Hume | octave-1, octave-2 | 1–4 voices |
| Fish Audio | s2-pro | 1–4 voices |
| fal | dia-tts | 1–2 voices |
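For instance, a two-voice conversation that keeps every turn on eleven_v3 stays within those caps and takes the native-dialogue path in a single API call (the voice IDs are placeholders):
import { generateConversation } from "@speech-sdk/core/conversation";

const dialogue = await generateConversation({
  model: "elevenlabs/eleven_v3", // one model shared by every turn, so the native-dialogue path is used
  turns: [
    { voice: "voice-id-1", text: "Welcome back to the show." },
    { voice: "voice-id-2", text: "Happy to be here!" },
  ],
});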
Supported Providers
Use provider/model strings. Passing just the provider name uses its default model.
| Provider | String Prefix | Default Model | Env Var | Docs |
|---|---|---|---|---|
| OpenAI | openai | gpt-4o-mini-tts | OPENAI_API_KEY | API Reference |
| ElevenLabs | elevenlabs | eleven_multilingual_v2 | ELEVENLABS_API_KEY | API Reference |
| Deepgram | deepgram | aura-2 | DEEPGRAM_API_KEY | API Reference |
| Cartesia | cartesia | sonic-3 | CARTESIA_API_KEY | API Reference |
| Hume | hume | octave-2 | HUME_API_KEY | API Reference |
| Inworld | inworld | inworld-tts-1.5-max | INWORLD_API_KEY | API Reference |
| Google (Gemini TTS) | google | gemini-2.5-flash-preview-tts | GOOGLE_API_KEY | API Reference |
| Fish Audio | fish-audio | s2-pro | FISH_AUDIO_API_KEY | API Reference |
| Murf | murf | GEN2 | MURF_API_KEY | API Reference |
| Resemble | resemble | default | RESEMBLE_API_KEY | API Reference |
| fal | fal-ai | (user-specified) | FAL_API_KEY | API Reference |
| Mistral | mistral | voxtral-mini-tts-2603 | MISTRAL_API_KEY | API Reference |
| xAI | xai | grok-tts | XAI_API_KEY | API Reference |
generateSpeech({ model: 'openai/tts-1', text: '...', voice: 'alloy' });
generateSpeech({ model: 'elevenlabs/eleven_v3', text: '...', voice: 'voice-id' });
generateSpeech({ model: 'deepgram/aura-2', text: '...', voice: 'thalia-en' });
generateSpeech({ model: 'inworld/inworld-tts-1.5-max', text: '...', voice: 'Ashley' });
generateSpeech({ model: 'openai', text: '...', voice: 'alloy' }); // uses default model
Provider-specific API parameters can be passed via providerOptions — these are sent directly to the provider's API using the API's own field names.
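For instance, OpenAI's speech endpoint accepts a speed field and ElevenLabs accepts voice_settings; a sketch assuming both objects are forwarded verbatim as described above:
generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: '...',
  voice: 'alloy',
  providerOptions: { speed: 1.1 }, // OpenAI API field
});

generateSpeech({
  model: 'elevenlabs/eleven_multilingual_v2',
  text: '...',
  voice: 'voice-id',
  providerOptions: { voice_settings: { stability: 0.5, similarity_boost: 0.75 } }, // ElevenLabs API fields
});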
Custom Configuration
Use factory functions when you need custom API keys, base URLs, or fetch implementations:
import { generateSpeech } from '@speech-sdk/core';
import { createOpenAI } from '@speech-sdk/core/openai';
import { createElevenLabs } from '@speech-sdk/core/elevenlabs';
const myOpenAI = createOpenAI({
apiKey: 'sk-...',
baseURL: 'https://my-proxy.com/v1',
});
const result = await generateSpeech({
model: myOpenAI('gpt-4o-mini-tts'),
text: 'Hello!',
voice: 'alloy',
});
API Key Resolution
When using string models (e.g., 'openai/tts-1'), API keys are resolved from environment variables (see table above). Factory functions accept an explicit apiKey option which takes precedence.
Audio Tags
Use bracket syntax [tag] to add expressive audio cues like laughter, sighs, or emotions. Provider support varies — unsupported tags are automatically stripped with warnings returned in result.warnings.
const result = await generateSpeech({
model: 'elevenlabs/eleven_v3',
text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
voice: 'voice-id',
});
console.log(result.warnings); // undefined — eleven_v3 supports all tags
Provider behavior
| Provider | Behavior |
|---|---|
| OpenAI (gpt-4o-mini-tts) | Tags mapped to the instructions field for expressive delivery control |
| ElevenLabs (eleven_v3) | All [tag] passed through natively |
| Google (gemini-3.1-flash-tts-preview) | All [tag] passed through natively (e.g. [whispers], [shouting], [sighs], [laugh]) |
| Cartesia (sonic-3) | Emotion tags ([happy], [sad], [angry], etc.) converted to SSML; [laughter] passed through; unknown tags stripped |
| All others | Tags stripped and warnings returned |
// OpenAI gpt-4o-mini-tts — tags are mapped to the `instructions` field
const result = await generateSpeech({
model: 'openai/gpt-4o-mini-tts',
text: '[cheerfully] Hi John how are you? [soft] I\'m feeling great',
voice: 'alloy',
});
// Sent to OpenAI:
// input: "Hi John how are you? I'm feeling great"
// instructions: "Delivery shifts through the text in order: begin cheerfully, then soft."
console.log(result.warnings); // undefined
Voice Cloning
Some providers support voice cloning via reference audio. Pass a voice object instead of a string:
import { createMistral } from '@speech-sdk/core/mistral';
const mistral = createMistral();
// Clone from base64 audio
const result = await generateSpeech({
model: mistral(),
text: 'Hello!',
voice: { audio: 'base64-encoded-audio...' },
});
Clone from a URL (fal):
import { createFal } from '@speech-sdk/core/fal-ai';
const fal = createFal();
const result = await generateSpeech({
model: fal('fal-ai/chatterbox'),
text: 'Hello!',
voice: { url: 'https://example.com/reference.wav' },
});
Options
generateSpeech({
model: string | ResolvedModel, // required
text: string, // required
voice: Voice, // required
providerOptions?: object, // provider-specific API params
maxRetries?: number, // default: 2 (retries on 5xx/network errors)
abortSignal?: AbortSignal, // cancel the request
headers?: Record<string, string>, // additional HTTP headers
});
Result
interface SpeechResult {
audio: {
uint8Array: Uint8Array; // raw audio bytes
base64: string; // base64 encoded (lazy)
mediaType: string; // e.g. "audio/mpeg"
};
providerMetadata?: Record<string, unknown>;
}
Error Handling
import { generateSpeech, ApiError, SpeechSDKError } from '@speech-sdk/core';
try {
const result = await generateSpeech({ ... });
} catch (error) {
if (error instanceof ApiError) {
console.log(error.statusCode); // 401
console.log(error.model); // "openai/gpt-4o-mini-tts"
console.log(error.responseBody);
}
}
| Error | When |
|---|---|
| ApiError | Provider API returns a non-2xx response |
| NoSpeechGeneratedError | Provider returned empty audio |
| SpeechSDKError | Base class for all errors |
Retry
Built-in retry with exponential backoff via p-retry. Retries on 5xx and network errors. Does not retry 4xx errors. Default: 2 retries.
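A sketch of overriding this per call; setting maxRetries to 0 disables automatic retries:
import { generateSpeech, ApiError } from '@speech-sdk/core';

try {
  const result = await generateSpeech({
    model: 'openai/gpt-4o-mini-tts',
    text: 'Hello!',
    voice: 'alloy',
    maxRetries: 0, // fail fast on the first error
  });
} catch (error) {
  if (error instanceof ApiError && error.statusCode >= 500) {
    // provider-side failure surfaced immediately instead of being retried
  }
}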
Development
pnpm install
pnpm test # unit tests
pnpm run test:e2e # e2e tests (requires API keys)
pnpm run typecheck # type-check without emitting
E2E tests hit real provider APIs. Set the relevant API key environment variables in a .env file or export them in your shell.
Set SPEECH_SDK_E2E_OUTPUT_DIR to have the conversation e2e tests write their generated audio to disk (useful for sampling/comparing provider output):
SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos pnpm run test:e2e
License
MIT