SpeechFlow
Speech Processing Flow Graph
About
SpeechFlow is a command-line tool for macOS, Windows and Linux which establishes a directed data flow graph of audio and text processing nodes. This allows various speech processing tasks to be performed in a very flexible and configurable way. The usual supported tasks are capturing audio, generating narrations of text (aka text-to-speech), generating transcriptions or subtitles for audio (aka speech-to-text), and generating translations for audio (aka speech-to-speech).
SpeechFlow comes with built-in graph nodes for local file I/O, local audio device I/O, remote WebSocket network I/O, remote MQTT network I/O, local Voice Activity Detection (VAD), local voice gender recognition, local audio LUFS-S/RMS metering, remote-controllable local audio muting, cloud-based Deepgram speech-to-text conversion, cloud-based ElevenLabs text-to-speech conversion, local Kokoro text-to-speech conversion, cloud-based DeepL text-to-text translation, cloud-based OpenAI/GPT text-to-text translation (or spelling correction), local Ollama/Gemma text-to-text translation (or spelling correction), local OPUS/ONNX text-to-text translation, local FFmpeg speech-to-speech encoding, local WAV speech-to-speech encoding, local text-to-text formatting, local text-to-text sentence merging/splitting, local text-to-text subtitle generation, local text or audio filtering, and local text or audio tracing.
Additionally, SpeechFlow graph nodes can be provided externally by NPM packages named speechflow-node-xxx which expose a class derived from the exported SpeechFlowNode class of the speechflow package.
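As a rough sketch, such an external package could look like the following. The SpeechFlowNode base-class API is not documented in this README, so the member names below are assumptions for illustration only, not the actual interface:

/*  hypothetical package "speechflow-node-upper"  */
import { SpeechFlowNode } from "speechflow"

/*  a node which upper-cases all passing text chunks
    (the lifecycle method "receive" is an assumed name)  */
export default class SpeechFlowNodeUpper extends SpeechFlowNode {
    async receive (chunk: { type: string, payload: string }) {
        if (chunk.type === "text")
            chunk.payload = chunk.payload.toUpperCase()
        return chunk
    }
}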
SpeechFlow is written in TypeScript and ships as an installable package for the Node Package Manager (NPM).
Installation
$ npm install -g speechflow
Usage
$ speechflow
[-h|--help]
[-V|--version]
[-S|--status]
[-v|--verbose <level>]
[-a|--address <ip-address>]
[-p|--port <tcp-port>]
[-C|--cache <directory>]
[-e|--expression <expression>]
[-f|--file <file>]
[-c|--config <id>@<yaml-config-file>]
[<argument> [...]]
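For instance, a simple batch run can pass the processing graph directly on the command line with -e and hand over positional arguments, which the graph accesses as argv (see below):

$ speechflow -e 'file(path: argv.0, mode: "r", type: "text") | format(width: 80) | file(path: argv.1, mode: "w", type: "text")' input.txt output.txt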
Graph Expression Language
The SpeechFlow graph expression language is based on FlowLink, whose language follows this BNF-style grammar:
expr ::= parallel
| sequential
| node
| group
parallel ::= sequential ("," sequential)+
sequential ::= node ("|" node)+
node ::= id ("(" (param ("," param)*)? ")")?
param ::= array | object | variable | template | string | number | value
group ::= "{" expr "}"
id ::= /[a-zA-Z_][a-zA-Z0-9_-]*/
variable ::= id
array ::= "[" (param ("," param)*)? "]"
object ::= "{" (id ":" param ("," id ":" param)*)? "}"
template ::= "`" ("${" variable "}" | ("\\`"|.))* "`"
string ::= /"(\\"|.)*"/
| /'(\\'|.)*'/
number ::= /[+-]?/ number-value
number-value ::= "0b" /[01]+/
| "0o" /[0-7]+/
| "0x" /[0-9a-fA-F]+/
| /[0-9]*\.[0-9]+([eE][+-]?[0-9]+)?/
| /[0-9]+/
value ::= "true" | "false" | "null" | "NaN" | "undefined"
SpeechFlow makes all SpeechFlow nodes available to FlowLink as node, the CLI arguments under the array variable argv, and all environment variables under the object variable env.
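For illustration, the following expression (a condensed variant of the examples below) combines sequential linking, a parallel group, CLI arguments via argv, and an environment variable via env:

device(device: argv.0, mode: "r") | {
    wav(mode: "encode") | file(path: argv.1, mode: "w", type: "audio"),
    deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) | file(path: argv.2, mode: "w", type: "text")
}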
Processing Graph Examples
The following are examples of particular SpeechFlow processing graphs. They can also be found in the sample speechflow.yaml file.
Capturing: Capture audio from microphone device into WAV audio file:
device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | wav(mode: "encode") | file(path: "capture.wav", mode: "w", type: "audio")
Pass-Through: Pass-through audio from microphone device to speaker device and in parallel record it to WAV audio file:
device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | { wav(mode: "encode") | file(path: "capture.wav", mode: "w", type: "audio"), device(device: "wasapi:VoiceMeeter VAIO3 Input", mode: "w") }
Transcription: Generate text file with German transcription of MP3 audio file:
file(path: argv.0, mode: "r", type: "audio") | ffmpeg(src: "mp3", dst: "pcm") | deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) | format(width: 80) | file(path: argv.1, mode: "w", type: "text")
Subtitling: Generate text file with German subtitles of MP3 audio file:
file(path: argv.0, mode: "r", type: "audio") | ffmpeg(src: "mp3", dst: "pcm") | deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) | subtitle(format: "vtt") | file(path: argv.1, mode: "w", type: "text")
Speaking: Generate audio file with English voice for a text file:
file(path: argv.0, mode: "r", type: "text") | kokoro(language: "en") | wav(mode: "encode") | file(path: argv.1, mode: "w", type: "audio")
Ad-Hoc Translation: Ad-Hoc text translation from German to English via stdin/stdout:
file(path: "-", mode: "r", type: "text") | deepl(src: "de", dst: "en") | file(path: "-", mode: "w", type: "text")
Studio Translation: Real-time studio translation from German to English, including the capturing of all involved inputs and outputs:
device(device: "coreaudio:Elgato Wave:3", mode: "r") | { gender() | { meter(interval: 250) | wav(mode: "encode") | file(path: "program-de.wav", mode: "w", type: "audio"), deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) | { sentence() | { format(width: 80) | file(path: "program-de.txt", mode: "w", type: "text"), deepl(src: "de", dst: "en", key: env.SPEECHFLOW_DEEPL_KEY) | { trace(name: "text", type: "text") | { format(width: 80) | file(path: "program-en.txt", mode: "w", type: "text"), subtitle(format: "srt") | file(path: "program-en.srt", mode: "w", type: "text"), mqtt(url: "mqtt://10.1.0.10:1883", username: env.SPEECHFLOW_MQTT_USER, password: env.SPEECHFLOW_MQTT_PASS, topicWrite: "stream/studio/sender"), { filter(name: "S2T-male", type: "text", var: "meta:gender", op: "==", val: "male") | elevenlabs(voice: "Mark", optimize: "latency", speed: 1.05, language: "en"), filter(name: "S2T-female", type: "text", var: "meta:gender", op: "==", val: "female") | elevenlabs(voice: "Brittney", optimize: "latency", speed: 1.05, language: "en") } | { wav(mode: "encode") | file(path: "program-en.wav", mode: "w", type: "audio"), device(device: "coreaudio:USBAudio2.0", mode: "w") } } } } } } }
Processing Node Types
First, a short overview of the available processing nodes:
- Input/Output nodes: file, device, websocket, mqtt.
- Audio-to-Audio nodes: ffmpeg, wav, mute, meter, vad, gender.
- Audio-to-Text nodes: deepgram.
- Text-to-Text nodes: deepl, openai, ollama, transformers, sentence, subtitle, format.
- Text-to-Audio nodes: elevenlabs, kokoro.
- Any-to-Any nodes: filter, trace.
Input/Output Nodes
The following nodes are for external I/O, i.e., to read/write from external files, devices and network services.
Node: file
Purpose: File and StdIO source/sink
Example: file(path: "capture.pcm", mode: "w", type: "audio")
This node allows reading/writing from/to files or StdIO. It is intended to be used as a source or sink node in batch processing, and as a sink node in real-time processing.
Port     Payload
input    text, audio
output   text, audio

Parameter   Position   Default   Requirement
path        0          none      none
mode        1          "r"       /^(?:r|w|rw)$/
type        2          "audio"   /^(?:audio|text)$/
chunka      none       200       10 <= n <= 1000
chunkt      none       65536     1024 <= n <= 131072
Node: device
Purpose: Microphone/speaker device source/sink
Example: device(device: "wasapi:VoiceMeeter Out B1", mode: "r")
This node allows the reading/writing from/to audio devices. It is intended to be used as source nodes for microphone devices and as sink nodes for speaker devices.
Port     Payload
input    audio
output   audio

Parameter   Position   Default   Requirement
device      0          none      /^(.+?):(.+)$/
mode        1          "rw"      /^(?:r|w|rw)$/
chunk       2          200       10 <= n <= 1000
Node: websocket
Purpose: WebSocket source/sink
Example: websocket(connect: "ws://127.0.0.1:12345", type: "text")
Notice: this node requires a peer WebSocket service! This node allows reading/writing from/to WebSocket network services. It is primarily intended to be used for sending out the text of subtitles, but it can also be used for receiving the text to be processed.
Port     Payload
input    text, audio
output   text, audio

Parameter   Position   Default   Requirement
listen      none       none      /^(?:|ws:\/\/(.+?):(\d+))$/
connect     none       none      /^(?:|ws:\/\/(.+?):(\d+)(?:\/.*)?)$/
type        none       "audio"   /^(?:audio|text)$/
Node: mqtt
Purpose: MQTT sink
Example: mqtt(url: "mqtt://127.0.0.1:1883", username: "foo", password: "bar", topic: "quux")
Notice: this node requires a peer MQTT broker! This node allows reading/writing from/to MQTT broker topics. It is primarily intended to be used for sending out the text of subtitles, but it can also be used for receiving the text to be processed.
Port     Payload
input    text
output   none

Parameter   Position   Default   Requirement
url         0          none      /^(?:|(?:ws|mqtt):\/\/(.+?):(\d+))$/
username    1          none      /^.+$/
password    2          none      /^.+$/
topic       3          none      /^.+$/
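As an illustration, the subtitle text published by the studio translation example above could be received with the standard Mosquitto client:

$ mosquitto_sub -h 10.1.0.10 -p 1883 -u "$SPEECHFLOW_MQTT_USER" -P "$SPEECHFLOW_MQTT_PASS" -t "stream/studio/sender"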
Audio-to-Audio Nodes
The following nodes process audio chunks only.
Node: ffmpeg
Purpose: FFmpeg audio format conversion
Example: ffmpeg(src: "pcm", dst: "mp3")
This node allows converting between audio formats. It is primarily intended to support the reading/writing of external MP3 and Opus format files, although SpeechFlow internally uses PCM format only.
Port     Payload
input    audio
output   audio

Parameter   Position   Default   Requirement
src         0          "pcm"     /^(?:pcm|wav|mp3|opus)$/
dst         1          "wav"     /^(?:pcm|wav|mp3|opus)$/
Node: wav
Purpose: WAV audio format conversion
Example: wav(mode: "encode")
This node allows converting between PCM and WAV audio formats. It is primarily intended to support the reading/writing of external WAV format files, although SpeechFlow internally uses PCM format only.
Port     Payload
input    audio
output   audio

Parameter   Position   Default    Requirement
mode        0          "encode"   /^(?:encode|decode)$/
Node: mute
Purpose: Volume muting node
Example:mute()
Notice: this node has to be externally controlled via REST/WebSockets! This node allows muting the audio stream by either silencing or even unplugging it. It has to be externally controlled via REST/WebSocket (see below).
Port     Payload
input    audio
output   audio

Parameter   Position   Default   Requirement
(none)

Node: meter
Purpose: Loudness metering node
Example: meter(250)
This node allows measuring the loudness of the audio stream. The results are emitted to both the logfile of SpeechFlow and the WebSockets API (see below).
Port     Payload
input    audio
output   audio

Parameter   Position   Default   Requirement
interval    0          250       none

Node: vad
Purpose: Voice Activity Detection (VAD) node
Example:vad()
This node performs Voice Activity Detection (VAD), i.e., it detects voice in the audio stream and, when no voice is detected, either silences or unplugs the audio stream.
Port     Payload
input    audio
output   audio

Parameter            Position   Default       Requirement
mode                 none       "unplugged"   /^(?:silenced|unplugged)$/
posSpeechThreshold   none       0.50          none
negSpeechThreshold   none       0.35          none
minSpeechFrames      none       2             none
redemptionFrames     none       12            none
preSpeechPadFrames   none       1             none
postSpeechTail       none       1500          none

Node: gender
Purpose: Gender Detection node
Example: gender()
This node performs gender detection on the audio stream. It annotates the audio chunks with gender=male or gender=female meta information. Use this meta information with the "filter" node.

Port     Payload
input    audio
output   audio

Parameter   Position   Default   Requirement
window      0          500       none
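A minimal sketch of gender-dependent voice selection, condensed from the studio translation example above:

gender() | deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) | {
    filter(type: "text", var: "meta:gender", op: "==", val: "male") | elevenlabs(voice: "Mark", language: "en"),
    filter(type: "text", var: "meta:gender", op: "==", val: "female") | elevenlabs(voice: "Brittney", language: "en")
} | device(device: "coreaudio:USBAudio2.0", mode: "w")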
Audio-to-Text Nodes
The following nodes convert audio to text chunks.
Node: deepgram
Purpose: Deepgram Speech-to-Text conversion
Example: deepgram(language: "de")
Notice: this node requires an API key! This node performs Speech-to-Text (S2T) conversion, i.e., it recognizes speech in the input audio stream and outputs a corresponding text stream.
Port     Payload
input    audio
output   text

Parameter   Position   Default                           Requirement
key         none       env.SPEECHFLOW_DEEPGRAM_KEY       none
keyAdm      none       env.SPEECHFLOW_DEEPGRAM_KEY_ADM   none
model       0          "nova-3"                          none
version     1          "latest"                          none
language    2          "multi"                           none
Text-to-Text Nodes
The following nodes process text chunks only.
Node: deepl
Purpose: DeepL Text-to-Text translation
Example: deepl(src: "de", dst: "en")
Notice: this node requires an API key! This node performs translation between the English and German languages.
Port     Payload
input    text
output   text

Parameter   Position   Default                    Requirement
key         none       env.SPEECHFLOW_DEEPL_KEY   none
src         0          "de"                       /^(?:de|en)$/
dst         1          "en"                       /^(?:de|en)$/
Node: openai
Purpose: OpenAI/GPT Text-to-Text translation and spelling correction
Example: openai(src: "de", dst: "en")
Notice: this node requires an OpenAI API key! This node performs translation between the English and German languages in the text stream or (if the source and destination languages are the same) spelling correction of English or German in the text stream. It is based on the remote OpenAI cloud AI service and uses the GPT-4o-mini LLM.
Port Payload input text output text Parameter Position Default Requirement api none "https://api.openai.com" /^https?:\/\/.+?:\d+$/
src 0 "de" /^(?:de|en)$/
dst 1 "en" /^(?:de|en)$/
key none env.SPEECHFLOW_OPENAI_KEY none model none "gpt-4o-mini" none Node: ollama
Purpose: Ollama/Gemma Text-to-Text translation and spelling correction
Example: ollama(src: "de", dst: "en")
Notice: this node requires Ollama to be installed! This node performs translation between the English and German languages in the text stream or (if the source and destination languages are the same) spelling correction of English or German in the text stream. It is based on the local Ollama AI service and uses the Google Gemma 3 LLM.
Port Payload input text output text Parameter Position Default Requirement api none "http://127.0.0.1:11434" /^https?:\/\/.+?:\d+$/
model none "gemma3:4b-it-q4_K_M" none src 0 "de" /^(?:de|en)$/
dst 1 "en" /^(?:de|en)$/
Node: transformers
Purpose: Transformers Text-to-Text translation
Example: transformers(src: "de", dst: "en")
This node performs translation between the English and German languages in the text stream. It is based on local OPUS or SmolLM3 LLMs.
Port Payload input text output text Parameter Position Default Requirement model none "OPUS" /^(?:OPUS|SmolLM3)$/
src 0 "de" /^(?:de|en)$/
dst 1 "en" /^(?:de|en)$/
Node: sentence
Purpose: Sentence splitting/merging
Example: sentence()
This node allows you to ensure that a text stream is split or merged into complete sentences. It is primarily intended to be used after the "deepgram" node and before "deepl" or "elevenlabs" nodes in order to improve overall quality.
Port     Payload
input    text
output   text

Parameter   Position   Default   Requirement
(none)
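For example, the sentence node typically sits between recognition and translation, as in the studio translation example above:

deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) | sentence() | deepl(src: "de", dst: "en")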
Node: subtitle
Purpose: SRT/VTT Subtitle Generation
Example: subtitle(format: "srt")
This node generates subtitles from the text stream (and its embedded timestamps) in the formats SRT (SubRip) or VTT (WebVTT).
Port     Payload
input    text
output   text

Parameter   Position   Default   Requirement
format      none       "srt"     /^(?:srt|vtt)$/
words       none       false     none
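For reference, the generated SRT output consists of numbered cues with time ranges (illustrative content):

1
00:00:01,000 --> 00:00:03,500
Good morning and welcome.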
Node: format
Purpose: Text paragraph formatting
Example: format(width: 80)
This node formats the text stream into lines no longer than a certain width. It is primarily intended for use before writing text chunks to files.
Port     Payload
input    text
output   text

Parameter   Position   Default   Requirement
width       0          80        none
Text-to-Audio Nodes
The following nodes convert text chunks to audio chunks.
Node: elevenlabs
Purpose: ElevenLabs Text-to-Speech conversion
Example: elevenlabs(language: "en")
Notice: this node requires an ElevenLabs API key! This node performs Text-to-Speech (T2S) conversion, i.e., it converts the input text stream into an output audio stream. It is intended to generate speech.
Port     Payload
input    text
output   audio

Parameter    Position   Default                         Requirement
key          none       env.SPEECHFLOW_ELEVENLABS_KEY   none
voice        0          "Brian"                         /^(?:Brittney|Cassidy|Leonie|Mark|Brian)$/
language     1          "de"                            /^(?:de|en)$/
speed        2          1.00                            n >= 0.7 && n <= 1.2
stability    3          0.5                             n >= 0.0 && n <= 1.0
similarity   4          0.75                            n >= 0.0 && n <= 1.0
optimize     5          "latency"                       /^(?:latency|quality)$/
Node: kokoro
Purpose: Kokoro Text-to-Speech conversion
Example: kokoro(language: "en")
Notice: this node currently supports the English language only! This node performs Text-to-Speech (T2S) conversion, i.e., it converts the input text stream into an output audio stream. It is intended to generate speech.
Port     Payload
input    text
output   audio

Parameter   Position   Default   Requirement
voice       0          "Aoede"   /^(?:Aoede|Heart|Puck|Fenrir)$/
language    1          "en"      /^en$/
speed       2          1.25      1.0 <= n <= 1.30
Any-to-Any Nodes
The following nodes process any type of chunk, i.e., both audio and text chunks.
Node: filter
Purpose: Meta information based filter
Example: filter(type: "audio", var: "meta:gender", op: "==", val: "male")
This node allows you to filter chunks based on certain criteria. It is primarily intended to be used in conjunction with the "gender" node and in front of the elevenlabs or kokoro nodes in order to use a corresponding voice.

Port     Payload
input    text, audio
output   text, audio

Parameter   Position   Default    Requirement
type        0          "audio"    /^(?:audio|text)$/
name        1          "filter"   /^.+$/
var         2          ""         /^(?:meta:.+|payload:(?:length|text)|time:(?:start|end))$/
op          3          "=="       /^(?:<|<=|==|!=|~~|!~|>=|>)$/
val         4          ""         /^.*$/
Node: trace
Purpose: Data flow tracing
Example: trace(type: "audio")
This node allows you to trace the audio and text chunk flow through the SpeechFlow graph. It just passes through its chunks, but sends information about the chunks to the log.
Port     Payload
input    text, audio
output   text, audio

Parameter   Position   Default   Requirement
type        0          "audio"   /^(?:audio|text)$/
name        1          none      none
REST/WebSocket API
SpeechFlow has an externally exposed REST/WebSockets API which can be used to control the nodes and to receive information from nodes. For controlling a node you have three possibilities (illustrated by controlling the mode of the "mute" node):
# use HTTP/REST/GET:
$ curl http://127.0.0.1:8484/api/COMMAND/mute/mode/silenced
# use HTTP/REST/POST:
$ curl -H "Content-type: application/json" \
--data '{ "request": "COMMAND", "node": "mute", "args": [ "mode", "silenced" ] }' \
http://127.0.0.1:8484/api
# use WebSockets:
$ wscat -c ws://127.0.0.1:8484/api \
> { "request": "COMMAND", "node": "mute", "args": [ "mode", "silenced" ] }
For receiving emitted information from nodes, you have to use the WebSockets API (illustrated by the emitted information of the "meter" node):
# use WebSockets:
$ wscat -c ws://127.0.0.1:8484/api \
< { "response": "NOTIFY", "node": "meter", "args": [ "meter", "LUFS-S", -35.75127410888672 ] }
History
SpeechFlow, as a technical cut-through, was initially created in March 2024 for use in the msg Filmstudio context. In April 2025 it was refined into a more complete toolkit and thus could be used in production for the first time. In July 2025 it was fully refactored in order to support timestamps in the stream processing.
Copyright & License
Copyright © 2024-2025 Dr. Ralf S. Engelschall
Licensed under GPL 3.0