## Package Exports

- `@speech-sdk/core`
- `@speech-sdk/core/cartesia`
- `@speech-sdk/core/deepgram`
- `@speech-sdk/core/elevenlabs`
- `@speech-sdk/core/fal-ai`
- `@speech-sdk/core/fish-audio`
- `@speech-sdk/core/google`
- `@speech-sdk/core/hume`
- `@speech-sdk/core/mistral`
- `@speech-sdk/core/murf`
- `@speech-sdk/core/openai`
- `@speech-sdk/core/resemble`
- `@speech-sdk/core/unreal-speech`
# Speech SDK
The Speech SDK is a lightweight, provider-agnostic TypeScript toolkit for building text-to-speech applications with popular providers such as OpenAI, ElevenLabs, Deepgram, Cartesia, Google, and more. It is cross-platform (Node.js, Edge, Browser) with minimal dependencies.
To learn more about the Speech SDK, check out https://speechsdk.dev/.
## Install

```bash
npm install @speech-sdk/core
```

## Using an AI Coding Assistant?

Add the speech-sdk skill to give your AI assistant full knowledge of this library:

```bash
npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk
```

## Quick Start
```ts
import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello from speech-sdk!',
  voice: 'alloy',
});

// Access the audio
result.audio.uint8Array; // Uint8Array
result.audio.base64;     // string (lazy-computed)
result.audio.mediaType;  // "audio/mpeg"
```

## Supported Providers
Use provider/model strings. Passing just the provider name uses its default model.
| Provider | String Prefix | Default Model | Env Var | Docs |
|---|---|---|---|---|
| OpenAI | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` | API Reference |
| ElevenLabs | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` | API Reference |
| Deepgram | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` | API Reference |
| Cartesia | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` | API Reference |
| Hume | `hume` | `octave-2` | `HUME_API_KEY` | API Reference |
| Google (Gemini TTS) | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` | API Reference |
| Fish Audio | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` | API Reference |
| Unreal Speech | `unreal-speech` | `default` | `UNREAL_SPEECH_API_KEY` | API Reference |
| Murf | `murf` | `GEN2` | `MURF_API_KEY` | API Reference |
| Resemble | `resemble` | `default` | `RESEMBLE_API_KEY` | API Reference |
| fal | `fal-ai` | (user-specified) | `FAL_API_KEY` | API Reference |
| Mistral | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` | API Reference |
```ts
generateSpeech({ model: 'openai/tts-1', text: '...', voice: 'alloy' });
generateSpeech({ model: 'elevenlabs/eleven_v3', text: '...', voice: 'voice-id' });
generateSpeech({ model: 'deepgram/aura-2', text: '...', voice: 'thalia-en' });
generateSpeech({ model: 'openai', text: '...', voice: 'alloy' }); // uses default model
```

Provider-specific API parameters can be passed via `providerOptions`; these are sent directly to the provider's API using the API's own field names.
## Custom Configuration
Use factory functions when you need custom API keys, base URLs, or fetch implementations:
```ts
import { generateSpeech } from '@speech-sdk/core';
import { createOpenAI } from '@speech-sdk/core/openai';

const myOpenAI = createOpenAI({
  apiKey: 'sk-...',
  baseURL: 'https://my-proxy.com/v1',
});

const result = await generateSpeech({
  model: myOpenAI('gpt-4o-mini-tts'),
  text: 'Hello!',
  voice: 'alloy',
});
```

## API Key Resolution
When using string models (e.g., `'openai/tts-1'`), API keys are resolved from environment variables (see table above). Factory functions accept an explicit `apiKey` option, which takes precedence.
## Audio Tags
Use bracket syntax `[tag]` to add expressive audio cues like laughter, sighs, or emotions. Provider support varies; unsupported tags are automatically stripped, with warnings returned in `result.warnings`.
```ts
const result = await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
  voice: 'voice-id',
});

console.log(result.warnings); // undefined: eleven_v3 supports all tags
```

### Provider behavior
| Provider | Behavior |
|---|---|
| ElevenLabs (`eleven_v3`) | All `[tag]`s passed through natively |
| Cartesia (`sonic-3`) | Emotion tags (`[happy]`, `[sad]`, `[angry]`, etc.) converted to SSML; `[laughter]` passed through; unknown tags stripped |
| All others | Tags stripped and warnings returned |
```ts
// Unsupported provider: tags are stripped with warnings
const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: '[laugh] Hello world',
  voice: 'alloy',
});

console.log(result.warnings);
// ["Audio tag [laugh] is not supported by openai/gpt-4o-mini-tts and was removed."]
```

## Voice Cloning
Some providers support voice cloning via reference audio. Pass a voice object instead of a string:
```ts
import { generateSpeech } from '@speech-sdk/core';
import { createMistral } from '@speech-sdk/core/mistral';

const mistral = createMistral();

// Clone from base64 audio
const result = await generateSpeech({
  model: mistral(),
  text: 'Hello!',
  voice: { audio: 'base64-encoded-audio...' },
});
```

Clone from a URL (fal):
```ts
import { generateSpeech } from '@speech-sdk/core';
import { createFal } from '@speech-sdk/core/fal-ai';

const fal = createFal();

const result = await generateSpeech({
  model: fal('fal-ai/chatterbox'),
  text: 'Hello!',
  voice: { url: 'https://example.com/reference.wav' },
});
```

## Options
```ts
generateSpeech({
  model: string | ResolvedModel,    // required
  text: string,                     // required
  voice: Voice,                     // required
  providerOptions?: object,         // provider-specific API params
  maxRetries?: number,              // default: 2 (retries on 5xx/network errors)
  abortSignal?: AbortSignal,        // cancel the request
  headers?: Record<string, string>, // additional HTTP headers
});
```

## Result
```ts
interface SpeechResult {
  audio: {
    uint8Array: Uint8Array; // raw audio bytes
    base64: string;         // base64 encoded (lazy)
    mediaType: string;      // e.g. "audio/mpeg"
  };
  warnings?: string[];      // e.g. stripped audio-tag messages (see Audio Tags)
  providerMetadata?: Record<string, unknown>;
}
```

## Error Handling
```ts
import { generateSpeech, ApiError, SpeechSDKError } from '@speech-sdk/core';

try {
  const result = await generateSpeech({ ... });
} catch (error) {
  if (error instanceof ApiError) {
    console.log(error.statusCode);   // e.g. 401
    console.log(error.model);        // "openai/gpt-4o-mini-tts"
    console.log(error.responseBody);
  }
}
```

| Error | When |
|---|---|
| `ApiError` | Provider API returns a non-2xx response |
| `NoSpeechGeneratedError` | Provider returned empty audio |
| `SpeechSDKError` | Base class for all errors |
## Retry
Built-in retry with exponential backoff via `p-retry`. Retries on 5xx and network errors; does not retry 4xx errors. Default: 2 retries.
## Development
```bash
pnpm install
pnpm test             # unit tests
pnpm run test:e2e     # e2e tests (requires API keys)
pnpm run typecheck    # type-check without emitting
```

E2E tests hit real provider APIs. Set the relevant API key environment variables in a `.env` file or export them in your shell.
## License
MIT