SpeechFlow
Speech Processing Flow Graph
About
SpeechFlow is a command-line tool for establishing a directed data flow graph of audio and text processing nodes. This allows various speech processing tasks to be performed in a flexible way.
SpeechFlow comes with built-in graph nodes for local file I/O, local audio device I/O, remote WebSocket network I/O, remote MQTT network I/O, cloud-based Deepgram speech-to-text conversion, cloud-based ElevenLabs text-to-speech conversion, cloud-based DeepL text-to-text translation, local Gemma/Ollama text-to-text translation, local Gemma/Ollama text-to-text spelling correction, local OPUS/ONNX text-to-text translation, local FFmpeg speech-to-speech encoding, local WAV speech-to-speech encoding, local text-to-text formatting, local text-to-text subtitle generation, and local text or audio tracing.
Additional SpeechFlow graph nodes can be provided externally by NPM packages named speechflow-node-xxx which expose a class derived from the exported SpeechFlowNode class of the speechflow package.
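As a rough illustration, such an external package boils down to exporting a class derived from SpeechFlowNode. In the following minimal TypeScript sketch, the package name, class name, and export style are assumptions for illustration only; the concrete methods to implement are defined by the SpeechFlowNode base class of the speechflow package:

// hypothetical external package "speechflow-node-echo"
import { SpeechFlowNode } from "speechflow"

// the exposed class has to be derived from SpeechFlowNode;
// its concrete lifecycle and processing methods follow the
// base class contract of the speechflow package
export default class SpeechFlowNodeEcho extends SpeechFlowNode {
    /*  ...node-specific implementation...  */
}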
SpeechFlow is written in TypeScript and ships as an installable package for the Node Package Manager (NPM).
Installation
$ npm install -g speechflow
Usage
$ speechflow
[-h|--help]
[-V|--version]
[-v|--verbose <level>]
[-e|--expression <expression>]
[-f|--file <file>]
[-c|--config <id>@<yaml-config-file>]
[<argument> [...]]
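For example, a processing graph can be passed inline via the -e option or selected from a YAML configuration file via the -c option. The graph below is the capturing example from the next section; the id capture assumes a corresponding entry in sample.yaml:

$ speechflow -e 'device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | wav(mode: "encode") | file(path: "capture.wav", mode: "w", type: "audio")'
$ speechflow -c capture@sample.yaml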
Processing Graph Examples
The following are examples of SpeechFlow processing graphs. They can also be found in the sample.yaml file for easy consumption with speechflow -c <id>@sample.yaml.
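Such a YAML configuration file is presumably just a mapping of graph ids to expression strings. A hypothetical entry (the id and the exact layout are assumptions here, not the authoritative sample.yaml content) could look like:

capture: 'device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | wav(mode: "encode") | file(path: "capture.wav", mode: "w", type: "audio")'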
Capturing: Capture audio from microphone device into WAV audio file:
device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | wav(mode: "encode") | file(path: "capture.wav", mode: "w", type: "audio")
Pass-Through: Pass-through audio from microphone device to speaker device and in parallel record it to WAV audio file:
device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | { wav(mode: "encode") | file(path: "capture.wav", mode: "w", type: "audio"), device(device: "wasapi:VoiceMeeter VAIO3 Input", mode: "w") }
Narration: Generate text file with German narration of MP3 audio file:
file(path: argv.0, mode: "r", type: "audio") | ffmpeg(src: "mp3", dst: "pcm") | deepgram(language: "de", key: env.SPEECHFLOW_KEY_DEEPGRAM) | format(width: 80) | file(path: argv.1, mode: "w", type: "text")
Subtitling: Generate text file with German subtitles of MP3 audio file:
file(path: argv.0, mode: "r", type: "audio") | ffmpeg(src: "mp3", dst: "pcm") | deepgram(language: "de", key: env.SPEECHFLOW_KEY_DEEPGRAM) | subtitle(format: "vtt") | file(path: argv.1, mode: "w", type: "text")
Ad-Hoc Translation: Ad-Hoc text translation from German to English via stdin/stdout:
file(path: "-", mode: "r", type: "text") | deepl(src: "de", dst: "en") | file(path: "-", mode: "w", type: "text")
Studio Translation: Real-time studio translation from German to English, including the capturing of all involved inputs and outputs:
device(device: "coreaudio:Elgato Wave:3", mode: "r") | { wav(mode: "encode") | file(path: "program-de.wav", mode: "w", type: "audio"), deepgram(key: env.SPEECHFLOW_KEY_DEEPGRAM, language: "de") | { format(width: 80) | file(path: "program-de.txt", mode: "w", type: "text"), deepl(key: env.SPEECHFLOW_KEY_DEEPL, src: "de", dst: "en") | { format(width: 80) | file(path: "program-en.txt", mode: "w", type: "text"), subtitle(format: "vtt") | { file(path: "program-en.vtt", mode: "w", type: "text"), mqtt(url: "mqtt://10.1.0.10:1883", username: env.SPEECHFLOW_MQTT_USER, password: env.SPEECHFLOW_MQTT_PASS, topicWrite: "stream/studio/sender") }, subtitle(format: "srt") | file(path: "program-en.srt", mode: "w", type: "text"), elevenlabs(voice: "Mark", speed: 1.05, language: "en") | { wav(mode: "encode") | file(path: "program-en.wav", mode: "w", type: "audio"), device(device: "coreaudio:USBAudio2.0", mode: "w") } } } }
Processing Node Types
First, a short overview of the available processing nodes:
- Input/Output nodes: file, device, websocket, mqtt.
- Audio-to-Audio nodes: ffmpeg, wav.
- Audio-to-Text nodes: deepgram.
- Text-to-Text nodes: deepl, gemma, opus, subtitle, format.
- Text-to-Audio nodes: elevenlabs.
- Any-to-Any nodes: trace.
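As trace accepts both audio and text and presumably passes its input through unchanged, it can be spliced into any edge of a graph for debugging. For instance, this variation of the capturing example above taps the raw microphone audio:

device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | trace(type: "audio", name: "mic") | wav(mode: "encode") | file(path: "capture.wav", mode: "w", type: "audio")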
Input/Output Nodes:
Node: file
Purpose: File and StdIO source/sink
Example: file(path: "capture.pcm", mode: "w", type: "audio")
Ports: input (text, audio), output (text, audio)
Parameters:
- path (position: 0, default: none, requirement: none)
- mode (position: 1, default: "r", requirement: /^(?:r|w|rw)$/)
- type (position: 2, default: "audio", requirement: /^(?:audio|text)$/)
Node: device
Purpose: Microphone/speaker device source/sink
Example: device(device: "wasapi:VoiceMeeter Out B1", mode: "r")
Ports: input (audio), output (audio)
Parameters:
- device (position: 0, default: none, requirement: /^(.+?):(.+)$/)
- mode (position: 1, default: "rw", requirement: /^(?:r|w|rw)$/)
Node: websocket
Purpose: WebSocket source/sink
Example: websocket(connect: "ws://127.0.0.1:12345", type: "text")
Notice: this node requires a peer WebSocket service!
Ports: input (text, audio), output (text, audio)
Parameters:
- listen (position: none, default: none, requirement: /^(?:|ws:\/\/(.+?):(\d+))$/)
- connect (position: none, default: none, requirement: /^(?:|ws:\/\/(.+?):(\d+)(?:\/.*)?)$/)
- type (position: none, default: "audio", requirement: /^(?:audio|text)$/)
Node: mqtt
Purpose: MQTT sink
Example: mqtt(url: "mqtt://127.0.0.1:1883", username: "foo", password: "bar", topic: "quux")
Notice: this node requires a peer MQTT broker!
Ports: input (text), output (none)
Parameters:
- url (position: 0, default: none, requirement: /^(?:|(?:ws|mqtt):\/\/(.+?):(\d+))$/)
- username (position: 1, default: none, requirement: /^.+$/)
- password (position: 2, default: none, requirement: /^.+$/)
- topic (position: 3, default: none, requirement: /^.+$/)
Audio-to-Audio Nodes:
Node: ffmpeg
Purpose: FFmpeg audio format conversion
Example: ffmpeg(src: "pcm", dst: "mp3")
Ports: input (audio), output (audio)
Parameters:
- src (position: 0, default: "pcm", requirement: /^(?:pcm|wav|mp3|opus)$/)
- dst (position: 1, default: "wav", requirement: /^(?:pcm|wav|mp3|opus)$/)
Node: wav
Purpose: WAV audio format conversion
Example: wav(mode: "encode")
Ports: input (audio), output (audio)
Parameters:
- mode (position: 0, default: "encode", requirement: /^(?:encode|decode)$/)
Audio-to-Text Nodes:
Node: deepgram
Purpose: Deepgram Speech-to-Text conversion
Example: deepgram(language: "de")
Notice: this node requires an API key!
Ports: input (audio), output (text)
Parameters:
- key (position: none, default: env.SPEECHFLOW_KEY_DEEPGRAM, requirement: none)
- model (position: 0, default: "nova-3", requirement: none)
- version (position: 1, default: "latest", requirement: none)
- language (position: 2, default: "multi", requirement: none)
Text-to-Text Nodes:
Node: deepl
Purpose: DeepL Text-to-Text translation
Example: deepl(src: "de", dst: "en")
Notice: this node requires an API key!
Ports: input (text), output (text)
Parameters:
- key (position: none, default: env.SPEECHFLOW_KEY_DEEPL, requirement: none)
- src (position: 0, default: "de", requirement: /^(?:de|en)$/)
- dst (position: 1, default: "en", requirement: /^(?:de|en)$/)
Node: gemma
Purpose: Google Gemma Text-to-Text translation and spelling correction
Example: gemma(src: "de", dst: "en")
Notice: this node requires the Ollama API!
Ports: input (text), output (text)
Parameters:
- url (position: none, default: "http://127.0.0.1:11434", requirement: /^https?:\/\/.+?:\d+$/)
- src (position: 0, default: "de", requirement: /^(?:de|en)$/)
- dst (position: 1, default: "en", requirement: /^(?:de|en)$/)
Node: opus
Purpose: OPUS Text-to-Text translation
Example: opus(src: "de", dst: "en")
Ports: input (text), output (text)
Parameters:
- src (position: 0, default: "de", requirement: /^(?:de|en)$/)
- dst (position: 1, default: "en", requirement: /^(?:de|en)$/)
Node: subtitle
Purpose: SRT/VTT Subtitle Generation
Example: subtitle(format: "srt")
Ports: input (text), output (text)
Parameters:
- format (position: none, default: "srt", requirement: /^(?:srt|vtt)$/)
Node: format
Purpose: Text paragraph formatting
Example: format(width: 80)
Ports: input (text), output (text)
Parameters:
- width (position: 0, default: 80, requirement: none)
Text-to-Audio Nodes:
Node: elevenlabs
Purpose: ElevenLabs Text-to-Speech conversion
Example: elevenlabs(language: "en")
Notice: this node requires an API key!
Ports: input (text), output (audio)
Parameters:
- key (position: none, default: env.SPEECHFLOW_KEY_ELEVENLABS, requirement: none)
- voice (position: 0, default: "Brian", requirement: none)
- language (position: 1, default: "de", requirement: none)
Any-to-Any Nodes:
Node: trace
Purpose: data flow tracing
Example: trace(type: "audio")
Ports: input (text, audio), output (text, audio)
Parameters:
- type (position: 0, default: "audio", requirement: /^(?:audio|text)$/)
- name (position: 1, default: none, requirement: none)
Graph Expression Language
The SpeechFlow graph expression language is based on FlowLink, whose language follows this BNF-style grammar:
expr ::= parallel
| sequential
| node
| group
parallel ::= sequential ("," sequential)+
sequential ::= node ("|" node)+
node ::= id ("(" (param ("," param)*)? ")")?
param ::= array | object | variable | template | string | number | value
group ::= "{" expr "}"
id ::= /[a-zA-Z_][a-zA-Z0-9_-]*/
variable ::= id
array ::= "[" (param ("," param)*)? "]"
object ::= "{" (id ":" param ("," id ":" param)*)? "}"
template ::= "`" ("${" variable "}" / ("\\`"|.))* "`"
string ::= /"(\\"|.)*"/
| /'(\\'|.)*'/
number ::= /[+-]?/ number-value
number-value ::= "0b" /[01]+/
| "0o" /[0-7]+/
| "0x" /[0-9a-fA-F]+/
| /[0-9]*\.[0-9]+([eE][+-]?[0-9]+)?/
| /[0-9]+/
value ::= "true" | "false" | "null" | "NaN" | "undefined"
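To relate the grammar to practice: in the pass-through example above, "|" forms a sequential chain, "," separates parallel branches, and "{ ... }" groups the branches into a single sub-expression:

device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | { wav(mode: "encode") | file(path: "capture.wav", mode: "w", type: "audio"), device(device: "wasapi:VoiceMeeter VAIO3 Input", mode: "w") }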
SpeechFlow makes all SpeechFlow nodes available to FlowLink as nodes, the CLI arguments under the array variable named argv, and all environment variables under the object variable named env.
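For example, the ad-hoc translation example above can be generalized to take its input and output paths from the CLI arguments and its API key from the environment:

file(path: argv.0, mode: "r", type: "text") | deepl(key: env.SPEECHFLOW_KEY_DEEPL, src: "de", dst: "en") | file(path: argv.1, mode: "w", type: "text")

This could then be invoked, assuming the expression is stored in a (hypothetical) file translate.flow, as: speechflow -f translate.flow input-de.txt output-en.txt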
History
SpeechFlow was initially created as a technical cut-through in March 2024 for use in the msg Filmstudio context. In April 2025 it was refined into a more complete toolkit, which allowed it to be used in production for the first time. In July 2025 it was fully refactored in order to support timestamps in the stream processing.
Copyright & License
Copyright © 2024-2025 Dr. Ralf S. Engelschall
Licensed under GPL 3.0