@voctiv/agent-sdk
TypeScript SDK for scripts executed by the ScriptEngine scripting runtime.
The package exports the defineScript() identity helper and the public ScriptEngine runtime types for voice channels, SIP calls, ASR, TTS, VAD, Smart Turn, LLM, dialog context, logging, and Voctiv legacy platform compatibility APIs.
The SDK itself does not open SIP calls, run ASR/TTS, or talk to platform services; it describes the objects injected into your script by ScriptEngine.
Installation
npm install @voctiv/agent-sdk rxjs
rxjs is a peer dependency because the runtime API exposes observables for ASR, SIP, channel events, queues, and LLM streams.
Basic Script
Scripts export a function created with defineScript(). The runtime loads the module and calls it with { channel, logger, context, platform }.
import { defineScript } from '@voctiv/agent-sdk';
import { filter, map, merge } from 'rxjs';
const TTS_QUEUE = 1;
export default defineScript(async ({ channel, logger, context }) => {
channel.sip.answer(); // No-op on WS/headless channels.
channel.sendMessage({ event: 'status', payload: 'ready' });
const asr = await channel.createAsr({
language: context.language || 'ru-RU',
vad: { preSpeechFrames: 20, postSpeechFrames: 4 },
smartTurn: { enabled: true, triggerFrames: 3, confirmMs: 50 },
});
// Barge-in: user speech stops only the agent TTS queue.
merge(asr.speechStart$, asr.partial$.pipe(filter((p) => !!p.text?.trim()))).subscribe(() => {
channel.audio.stop(TTS_QUEUE);
});
await channel.audio.say('Hi! Say anything and I will answer briefly.', {
queue: TTS_QUEUE,
alias: 'greeting',
});
asr.result$.pipe(filter((text) => !!text.trim())).subscribe((userText) => {
logger.log('User said', { userText });
const reply$ = channel.llm
.stream(`Reply briefly to the user: "${userText}"`, {
agentUuid: context.agentUuid,
dialogUuid: context.dialogUuid,
})
.pipe(map((chunk) => chunk.content));
void channel.audio.say(reply$, {
queue: TTS_QUEUE,
alias: 'reply',
ttsStrategy: 'sentence',
});
});
return new Promise((resolve) => {
channel.events.terminated$.subscribe(() => {
asr.destroy();
resolve({ output: { dialogUuid: context.dialogUuid } });
});
});
});
What The SDK Contains
defineScript(fn) marks the default export as the script entry point. It returns the same function and exists to give TypeScript the correct ScriptContext shape.
ScriptContext is the top-level object passed to a script:
- channel is the media channel for SIP, WS, ASR, TTS, audio playback, LLM, and structured data messages.
- logger writes structured script logs and can stream logs to a debug endpoint.
- context contains dialog identity, caller/called numbers, language, flags, params, entry point, persisted env, and runtime budget.
- platform exposes legacy platform operations: NLU, dialog state, outbound call scheduling, messaging, and phrase records.
MediaChannel is the main real-time API:
- channel.type is "sip" for telephony and "ws" for WebSocket/script-manager sessions. Headless sessions currently expose a synthetic "ws" channel; check context.headless to detect them.
- channel.params is the merged runtime parameter map. Treat unknown keys as host-specific.
- channel.createAsr() creates an ASR handle.
- channel.audio controls TTS, raw playback, pre-synthesis, and mixer queues.
- channel.sip controls SIP state, pre-answer media, DTMF, hold/mute/hangup, outbound calls, and bridging.
- channel.llm talks to the Omni LLM backend.
- channel.events exposes speech, interrupt, termination, and WS data message observables.
- channel.textInput injects synthetic ASR results for tests and debug clients.
SIP And Pre-Answer Media
SIP sessions expose call state through channel.sip.state, state$, progress$, early$, and answered$.
The important states are:
- ringing: INVITE is in progress, but no media is available yet.
- early: RTP is ready before the final 200 OK answer. ASR, TTS, playback, and DTMF work in this state.
- active: final 200 OK has been received or sent.
- terminated: the call ended and no more audio is possible.
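A script that only needs to react to these transitions can watch the observables listed above directly (a minimal sketch; the log messages are placeholders):
channel.sip.state$.subscribe((state) => {
  logger.log('SIP state changed', { state });
});
channel.sip.answered$.subscribe(() => {
  logger.log('Call reached active state');
});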
Outbound Pre-Answer
For outbound calls, early media starts when the remote side sends a provisional response with SDP, usually 183 Session Progress. This is useful for IVRs that speak before answering.
const bLeg = await channel.sip.makeCall({
sipUri: 'sip:+12025551234@trunk.example.com',
});
await bLeg.sip.waitForEarly();
const asr = await bLeg.createAsr({ language: 'en-US' });
asr.result$.subscribe((text) => {
if (/press one/i.test(text)) {
bLeg.sip.sendDtmf('1');
}
});
waitForEarly() resolves when the call reaches either early or active. If a carrier skips early media and answers directly, it resolves on the final answer.
Inbound Pre-Answer
For inbound calls, call channel.sip.sendProgress() to send 183 Session Progress with SDP. This enters early state and enables full-duplex audio before the final answer.
import { firstValueFrom } from 'rxjs';
channel.sip.sendProgress();
await channel.sip.waitForEarly();
const asr = await channel.createAsr({ language: 'en-US' });
await channel.audio.say('Please say your account number.');
const account = await firstValueFrom(asr.result$);
channel.sip.answer();
await channel.audio.say(`Thank you. Looking up account ${account}.`);
The API does not mark the call as answered until answer() sends final 200 OK. External billing still depends on carrier policy.
Audio Auto-Wait
On SIP channels, channel.audio.say() and channel.audio.play() automatically wait until RTP is ready (early or active). You only need explicit waitForEarly() / waitForAnswer() when your script logic depends on the state transition.
If the call terminates before media becomes available, deferred audio resolves as a no-op.
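For example (a minimal sketch; the prompt text and alias are placeholders), an inbound script can answer and start speaking back to back and let the runtime defer playback:
channel.sip.answer();
// No explicit waitForAnswer() needed: say() defers playback until RTP is ready.
await channel.audio.say('Connecting you now.', { alias: 'connect-prompt' });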
SIP Controls
channel.sip also supports:
- answer() for the inbound final answer.
- sendDtmf(digit, duration?) for IVR navigation.
- sendInfo(contentType, body) for SIP INFO messages.
- hold() / unhold() for SIP hold.
- mute() / unmute() for local outgoing audio suppression.
- hangup() to terminate the call.
- makeCall() to create an outbound SIP B-leg from the main SIP channel.
- bridge(other) to cross-connect two SIP channels.
makeCall() and bridge() are only supported by the main SIP channel. Worker-isolated, WS, and headless channels do not create nested SIP legs.
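A simple transfer-style flow could combine the two on the main SIP channel (a minimal sketch; the SIP URI and prompt are placeholders, and waitForAnswer() is assumed to behave on the B-leg as it does on the main channel):
const agentLeg = await channel.sip.makeCall({
  sipUri: 'sip:agent@pbx.example.com',
});
await agentLeg.sip.waitForAnswer();
await channel.audio.say('Connecting you to an agent.');
// Cross-connect the caller and the agent leg.
channel.sip.bridge(agentLeg);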
ASR, VAD, And Smart Turn
Create ASR with channel.createAsr(config?).
const asr = await channel.createAsr({
vendor: 'Y',
name: 'main-yandex-key',
language: 'ru-RU',
vad: {
positiveThreshold: 0.55,
negativeThreshold: 0.35,
preSpeechFrames: 12,
postSpeechFrames: 12,
},
smartTurn: {
enabled: true,
silenceTimeoutMs: 1200,
},
});
AsrHandle exposes:
- result$: finalized utterances.
- partial$: streaming partial hypotheses.
- speechStart$ / speechEnd$: VAD speech boundaries.
- interrupt$: barge-in / interrupt events where the host supports them.
- vadProbability$: normalized VAD probability when available.
- pause() / resume() to stop or resume forwarding new audio frames.
- finalize() to force the current utterance to flush.
- destroy() to close connector streams and subscriptions.
SIP sessions use the call-level telephony VAD when it is available. WS sessions create one VAD/SmartTurn instance for the socket session on the first createAsr() call. Headless sessions return an inert ASR handle with empty observables.
If ASR connector creation fails, SIP/WS return a degraded handle. VAD observables still mirror the channel where possible, but no real STT results are emitted.
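The control methods compose with playback; for example (a minimal sketch; the prompt text is a placeholder):
const asr = await channel.createAsr({ language: 'en-US' });
// Stop forwarding caller audio while a long prompt plays, then resume.
asr.pause();
await channel.audio.say('Please listen carefully, our menu options have changed.');
asr.resume();
// Force whatever has been captured so far to flush as a final result.
asr.finalize();
// Close connector streams when the handle is no longer needed.
asr.destroy();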
ASR Credentials And Vendors
AsrConfig.vendor is an engine hint, for example "Y", "D", "yandex", or "neuro_v3", resolved by the host vendor alias mapping.
In Voctiv legacy compatibility mode, ASR credentials can be selected by logic-executor key_storage.name:
const asr = await channel.createAsr({
name: 'main-asr-key',
language: 'ru-RU',
});
The runtime looks in channel.params.authentication_data.legacyAsrKeysByName[name] for the current dialog agent and company. If name is omitted, channel.params.defaultAsrName may be used. Vendor-specific overrides go into data; primitives are stringified and objects/arrays are JSON-serialized before connector config is built.
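For example (a minimal sketch; the keys inside data are hypothetical and depend on the vendor):
const asr = await channel.createAsr({
  name: 'main-asr-key',
  language: 'ru-RU',
  data: {
    // Hypothetical vendor-specific overrides; primitives are stringified,
    // objects and arrays are JSON-serialized into the connector config.
    model: 'general',
    hints: ['account', 'balance'],
  },
});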
TTS, Playback, And Mixer Queues
channel.audio.say(textOrObservable, options?) synthesizes text and plays it through the mixer.
await channel.audio.say('Please wait while I check that.', {
queue: 0,
alias: 'main-response',
ttsVendor: 'E',
ttsStrategy: 'sentence',
ttsConfig: {
voice_id: 'voice-id',
output_format: 'pcm_16000',
},
});
channel.audio.play(source, options?) plays raw audio from a URL/path or a LegacyPhraseRecord.
await channel.audio.play('/opt/prompts/welcome.wav', {
queue: 1,
alias: 'welcome-earcon',
});
channel.audio.presay(text, options?) pre-synthesizes TTS into the host TTS cache. If the cache is not available, the runtime logs a warning and resolves without throwing.
channel.audio.preload(source) decodes a raw audio source through the audio player. It does not synthesize TTS and does not populate the TTS cache used by presay().
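For example (a minimal sketch; the phrase and file path are placeholders), a script can warm both paths at startup:
// Pre-synthesize a phrase into the host TTS cache for later say() calls.
await channel.audio.presay('Please hold while I transfer you.');
// Pre-decode a static prompt; this does not populate the TTS cache.
await channel.audio.preload('/opt/prompts/transfer-earcon.wav');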
TTS Strategies
ttsStrategy controls how text is chunked:
- sentence: split on sentence boundaries and synthesize each sentence. This is the default.
- streaming: send chunks incrementally for streaming-capable vendors.
- full: accumulate the whole input and synthesize it as one segment after the input completes.
When using an Observable<string> input, WS clients also receive text progress events for streamed chunks.
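For example (a minimal sketch using the rxjs of() helper; the chunk text is a placeholder), the full strategy concatenates an Observable<string> input and synthesizes it as one segment once the stream completes:
import { of } from 'rxjs';
const chunks$ = of('Your order is confirmed. ', 'A courier will arrive tomorrow.');
await channel.audio.say(chunks$, {
  ttsStrategy: 'full',
  alias: 'order-summary',
});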
Mixer Queues
The mixer has queues 0 through 4. Use separate queues for main speech, earcons, hold music, or background audio.
const music = channel.audio.queue(2);
music.volume = 0.25;
await channel.audio.play('/opt/audio/hold.wav', {
queue: 2,
alias: 'hold-music',
loop: true,
});
channel.audio.stop(2);
PlayOptions.volume changes the whole queue volume, not just one item. stop(queue) clears a queue and aborts in-flight sentence TTS for that queue. stopAll() clears every queue.
For sentence-split TTS, queue item aliases are suffixed as alias-0, alias-1, and so on. Raw play() and direct streaming TTS use the alias exactly.
TTS Credentials And Saved Phrases
In Voctiv legacy compatibility mode, TTS credentials can be selected by PlayOptions.name or ttsConfig.name.
await channel.audio.say('Здравствуйте!', {
name: 'main-tts-key',
ttsConfig: {
voice: 'alena',
},
});
The runtime looks in channel.params.authentication_data.legacyTtsKeysByName[name]. If name is omitted, channel.params.defaultTtsName may be used.
legacySavePhrase stores synthesized audio under the Voctiv record phrase storage root and inserts phrase metadata so it can later be loaded with platform.getRecords().
await channel.audio.say('Welcome back.', {
legacySavePhrase: {
phraseName: 'welcome_back',
flag: context.flag,
language: context.language,
},
});
const records = await platform.getRecords?.({
phraseName: 'welcome_back',
flag: context.flag,
language: context.language,
});
if (records?.[0]) {
await channel.audio.play(records[0]);
}
This requires legacy compatibility mode, a trusted LE agent id/UUID, the TTS cache, and LEGACY_V3_RECORD_PHRASE_ROOT.
LLM API
channel.llm talks to the Omni LLM backend.
const answer = await channel.llm.ask('Summarize the user request', {
role: 'assistant',
hidden: true,
agentUuid: context.agentUuid,
});
await channel.audio.say(answer);
For streaming:
import { map } from 'rxjs';
const stream$ = channel.llm.stream('Answer briefly', {
role: 'assistant',
});
await channel.audio.say(
stream$.pipe(map((chunk) => chunk.content)),
{ ttsStrategy: 'streaming' },
);
channel.llm.extract(options?) runs structured extraction via Omni. makePersistentStream(options?) opens a long-lived Socket.IO stream and lets you send multiple turns without reconnecting.
Platform API
platform exposes Voctiv legacy-compatible operations.
platform.nlu.extract(utterance, options?) calls NLU v3 /infer. The runtime sends phrase, context, and agent_id. If options.context is omitted, current dialog params are serialized and used as NLU context.
const result = await platform.nlu.extract('I want to reschedule', {
intents: ['reschedule', 'cancel'],
entities: ['date', 'time'],
use_synonyms: true,
});
Legacy platform APIs require context.legacyV3Compat === true. This includes NLU, outbound calls, dialog writes, messaging sends, and phrase records.
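A script that may also run outside legacy mode can guard those calls up front (a minimal sketch; the utterance and intent names are placeholders):
if (!context.legacyV3Compat) {
  logger.warn('Legacy platform APIs are unavailable in this runtime');
} else {
  const result = await platform.nlu.extract('I want to reschedule', {
    intents: ['reschedule', 'cancel'],
  });
  logger.log('NLU result', { result });
}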
Dialog State
platform.dialog.entryPoint = 'on_recall';
platform.dialog.result = 'done';
Setters update the local value immediately and ask the platform DB to persist asynchronously. They are not awaitable and should not be used as transactional writes.
Outbound Calls
await platform.call('+12025551234', {
date: new Date(Date.now() + 60_000),
entryPoint: 'on_callback',
recallCount: 2,
recallDelay: 300,
priority: 10,
});
This creates a row in the legacy call table. The dialer picks it up and originates the SIP call.
Messaging
await platform.messaging.send({
src: 'bot',
destination: '+12025551234',
text: 'Your appointment is confirmed.',
});
Outbound messages are transported through legacy Redis streams. platform.messaging.message$ currently replays the inbound message that started a headless messaging script; it is not a live subscription to all future Redis messages.
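For a headless messaging script, reading the triggering message and replying could look like this (a minimal sketch; the message shape is host-specific and the reply text is a placeholder):
import { firstValueFrom } from 'rxjs';
// Replays the inbound message that started this headless messaging session.
const inbound = await firstValueFrom(platform.messaging.message$);
logger.log('Inbound message', { inbound });
await platform.messaging.send({
  src: 'bot',
  destination: context.msisdn,
  text: 'Thanks, we received your message.',
});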
Dialog Context And Persisted Env
context includes identity, telephony fields, params, routing metadata, and runtime helpers.
Important fields:
- context.dialogUuid: current dialog UUID.
- context.callerId / context.msisdn: caller identity.
- context.destinationNumber: called number.
- context.language / context.lang: language selected for the run.
- context.flag: business flag.
- context.initialData: shallow snapshot of params at script start.
- context.dialogParams: live param map for the run.
- context.entryPoint: current routing entry point.
- context.headless: true for offline/queue/messaging sessions without a real media channel.
- context.runTime: async execution budget helper.
- context.env$: persisted dialog environment as an RxJS BehaviorSubject.
Use env$ for persisted script state:
const current = context.env$?.getValue() ?? {};
context.env$?.next({
...current,
lastIntent: 'reschedule',
});
Do not return env from the script. The runtime snapshots context.env$ after completion and attaches it to the persisted result.
Logging And Debugging
Use logger.log(), warn(), error(), and debug() for structured logs.
logger.log('ASR result received', { text });
logger.warn('Low confidence intent', { confidence });
logger.enableDebug(endpoint) streams logs from the current script instance to a remote debug endpoint. logger.breakpoint(label, snapshot?) pauses only when an active debug session is connected; otherwise it resolves immediately.
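For example (a minimal sketch; the endpoint URL, label, and snapshot payload are placeholders):
// Stream this script instance's logs to a remote debug endpoint.
logger.enableDebug('wss://debug.example.internal/session');
// Pauses only while a debug session is attached; otherwise resolves immediately.
await logger.breakpoint('before-llm-call', { step: 'greeting' });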
WS And Headless Behavior
WS channels behave like active media channels:
- channel.sip.state is effectively active.
- sendDtmf() emits dtmf-send to the WS client.
- sendMessage() emits a structured data event.
- ASR reads socket audio frames or synthetic text input.
Headless channels are for offline, queue, or messaging sessions:
- Audio methods are no-ops that log warnings.
- SIP methods are mostly no-ops.
- createAsr() returns an inert handle.
- LLM and platform APIs still work.
Use context.headless to branch when a script must behave differently without a real media channel.
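A typical branch looks like this (a minimal sketch; the prompt and LLM instruction are placeholders):
if (context.headless) {
  // No real media channel: rely on LLM and platform APIs only.
  const summary = await channel.llm.ask('Summarize the pending request briefly');
  logger.log('Headless summary', { summary });
} else {
  await channel.audio.say('Let me check that for you.');
}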
Text Input For Tests
channel.textInput injects synthetic ASR output into a live ASR handle.
const asr = await channel.createAsr();
channel.textInput.pushPartial(asr.id, 'hello', false);
channel.textInput.pushResult(asr.id, 'hello world');
This is mainly for WS debug clients and automated tests. Unknown ASR ids are ignored.
Package Notes
The package is published as CommonJS with TypeScript declarations in dist.
Build locally with:
npm run build
The package exports only the public SDK entry point:
import { defineScript, type MediaChannel, type AsrHandle } from '@voctiv/agent-sdk';