
SpeechFlow

Speech Processing Flow Graph


About

SpeechFlow is a command-line tool for establishing a directed data-flow graph of audio and text processing nodes. This allows various speech processing tasks to be performed in a flexible and configurable way. Typical supported tasks are capturing audio, generating narrations for text (aka text-to-speech), generating transcriptions or subtitles for audio (aka speech-to-text), and generating translations for audio (aka speech-to-speech).

SpeechFlow comes with built-in graph nodes for:

  • local file I/O and local audio device I/O,
  • remote WebSocket and MQTT network I/O,
  • cloud-based Deepgram speech-to-text conversion,
  • cloud-based ElevenLabs and local Kokoro text-to-speech conversion,
  • cloud-based DeepL text-to-text translation,
  • cloud-based OpenAI/GPT text-to-text translation (or spelling correction),
  • local Ollama/Gemma text-to-text translation (or spelling correction),
  • local OPUS/ONNX text-to-text translation,
  • local FFmpeg and local WAV speech-to-speech encoding,
  • local text-to-text formatting and subtitle generation, and
  • local text or audio tracing.

Additional SpeechFlow graph nodes can be provided externally by NPM packages named speechflow-node-xxx, each of which exposes a class derived from the SpeechFlowNode class exported by the speechflow package.

SpeechFlow is written in TypeScript and ships as an installable package for the Node Package Manager (NPM).

Installation

$ npm install -g speechflow

Usage

$ speechflow
  [-h|--help]
  [-V|--version]
  [-v|--verbose <level>]
  [-e|--expression <expression>]
  [-f|--file <file>]
  [-c|--config <id>@<yaml-config-file>]
  [<argument> [...]]
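
The -c option selects a named graph expression from a YAML configuration file via the <id>@<yaml-config-file> syntax. As a minimal sketch (the id name and file layout shown here are assumptions for illustration, not taken from the actual sample file), such a file could map ids to graph expressions:

    capture: |
      device(device: "wasapi:VoiceMeeter Out B1", mode: "r") |
          wav(mode: "encode") |
              file(path: "capture.wav", mode: "w", type: "audio")

This would then be run as "speechflow -c capture@speechflow.yaml".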

Processing Graph Examples

The following are examples of SpeechFlow processing graphs. They can also be found in the sample speechflow.yaml file.

  • Capturing: Capture audio from a microphone device into a WAV audio file:

    device(device: "wasapi:VoiceMeeter Out B1", mode: "r") |
        wav(mode: "encode") |
            file(path: "capture.wav", mode: "w", type: "audio")
  • Pass-Through: Pass audio through from a microphone device to a speaker device and, in parallel, record it to a WAV audio file:

    device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | {
        wav(mode: "encode") |
            file(path: "capture.wav", mode: "w", type: "audio"),
        device(device: "wasapi:VoiceMeeter VAIO3 Input", mode: "w")
    }
  • Transcription: Generate a text file with a German transcription of an MP3 audio file:

    file(path: argv.0, mode: "r", type: "audio") |
        ffmpeg(src: "mp3", dst: "pcm") |
            deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) |
                format(width: 80) |
                    file(path: argv.1, mode: "w", type: "text")
  • Subtitling: Generate a text file with German subtitles for an MP3 audio file:

    file(path: argv.0, mode: "r", type: "audio") |
        ffmpeg(src: "mp3", dst: "pcm") |
            deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) |
                subtitle(format: "vtt") |
                    file(path: argv.1, mode: "w", type: "text")
  • Speaking: Generate an audio file with an English voice reading of a text file:

    file(path: argv.0, mode: "r", type: "text") |
        kokoro(language: "en") |
            wav(mode: "encode") |
                file(path: argv.1, mode: "w", type: "audio")
  • Ad-Hoc Translation: Ad-hoc text translation from German to English via stdin/stdout:

    file(path: "-", mode: "r", type: "text") |
        deepl(src: "de", dst: "en") |
            file(path: "-", mode: "w", type: "text")
  • Studio Translation: Real-time studio translation from German to English, including capture of all involved inputs and outputs:

    device(device: "coreaudio:Elgato Wave:3", mode: "r") | {
        wav(mode: "encode") |
            file(path: "program-de.wav", mode: "w", type: "audio"),
        deepgram(key: env.SPEECHFLOW_DEEPGRAM_KEY, language: "de") | {
            format(width: 80) |
                file(path: "program-de.txt", mode: "w", type: "text"),
            deepl(key: env.SPEECHFLOW_DEEPL_KEY, src: "de", dst: "en") | {
                format(width: 80) |
                    file(path: "program-en.txt", mode: "w", type: "text"),
                subtitle(format: "vtt") | {
                    file(path: "program-en.vtt", mode: "w", type: "text"),
                    mqtt(url: "mqtt://10.1.0.10:1883",
                        username: env.SPEECHFLOW_MQTT_USER,
                        password: env.SPEECHFLOW_MQTT_PASS,
                        topicWrite: "stream/studio/sender")
                },
                subtitle(format: "srt") |
                    file(path: "program-en.srt", mode: "w", type: "text"),
                elevenlabs(voice: "Mark", speed: 1.05, language: "en") | {
                    wav(mode: "encode") |
                        file(path: "program-en.wav", mode: "w", type: "audio"),
                    device(device: "coreaudio:USBAudio2.0", mode: "w")
                }
            }
        }
    }
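
For instance, the Transcription example above takes two positional CLI arguments, which the expression reads as argv.0 and argv.1. A possible invocation (the file names here are hypothetical):

    $ speechflow \
        -e 'file(path: argv.0, mode: "r", type: "audio") |
            ffmpeg(src: "mp3", dst: "pcm") |
            deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) |
            format(width: 80) |
            file(path: argv.1, mode: "w", type: "text")' \
        talk.mp3 talk.txt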

Processing Node Types

First, a short overview of the available processing nodes:

  • Input/Output nodes: file, device, websocket, mqtt.
  • Audio-to-Audio nodes: ffmpeg, wav, mute, meter, vad, gender.
  • Audio-to-Text nodes: deepgram.
  • Text-to-Text nodes: deepl, openai, ollama, transformers, subtitle, format.
  • Text-to-Audio nodes: elevenlabs, kokoro.
  • Any-to-Any nodes: filter, trace.
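
These categories compose naturally. For example, a full speech-to-speech translation chain (the device names below are placeholders) strings an audio source through an audio-to-text node, a text-to-text node, and a text-to-audio node:

    device(device: "wasapi:Microphone", mode: "r") |
        deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) |
            deepl(src: "de", dst: "en") |
                elevenlabs(voice: "Brian", language: "en") |
                    device(device: "wasapi:Speakers", mode: "w")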

Input/Output Nodes:

  • Node: file
    Purpose: File and StdIO source/sink
    Example: file(path: "capture.pcm", mode: "w", type: "audio")

    Port     Payload
    input    text, audio
    output   text, audio

    Parameter   Position   Default   Requirement
    path        0          none      none
    mode        1          "r"       /^(?:r|w|rw)$/
    type        2          "audio"   /^(?:audio|text)$/
    chunka      none       200       10 <= n <= 1000
    chunkt      none       65536     1024 <= n <= 131072
  • Node: device
    Purpose: Microphone/speaker device source/sink
    Example: device(device: "wasapi:VoiceMeeter Out B1", mode: "r")

    Port     Payload
    input    audio
    output   audio

    Parameter   Position   Default   Requirement
    device      0          none      /^(.+?):(.+)$/
    mode        1          "rw"      /^(?:r|w|rw)$/
    chunk       2          200       10 <= n <= 1000
  • Node: websocket
    Purpose: WebSocket source/sink
    Example: websocket(connect: "ws://127.0.0.1:12345", type: "text")
    Notice: this node requires a peer WebSocket service!

    Port     Payload
    input    text, audio
    output   text, audio

    Parameter   Position   Default   Requirement
    listen      none       none      /^(?:|ws:\/\/(.+?):(\d+))$/
    connect     none       none      /^(?:|ws:\/\/(.+?):(\d+)(?:\/.*)?)$/
    type        none       "audio"   /^(?:audio|text)$/
  • Node: mqtt
    Purpose: MQTT sink
    Example: mqtt(url: "mqtt://127.0.0.1:1883", username: "foo", password: "bar", topic: "quux")
    Notice: this node requires a peer MQTT broker!

    Port     Payload
    input    text
    output   none

    Parameter   Position   Default   Requirement
    url         0          none      /^(?:|(?:ws
    username    1          none      /^.+$/
    password    2          none      /^.+$/
    topic       3          none      /^.+$/

Audio-to-Audio Nodes:

  • Node: ffmpeg
    Purpose: FFmpeg audio format conversion
    Example: ffmpeg(src: "pcm", dst: "mp3")

    Port     Payload
    input    audio
    output   audio

    Parameter   Position   Default   Requirement
    src         0          "pcm"     /^(?:pcm|wav|mp3|opus)$/
    dst         1          "wav"     /^(?:pcm|wav|mp3|opus)$/
  • Node: wav
    Purpose: WAV audio format conversion
    Example: wav(mode: "encode")

    Port     Payload
    input    audio
    output   audio

    Parameter   Position   Default   Requirement
    mode        0          "encode"  /^(?:encode|decode)$/
  • Node: mute
    Purpose: Volume muting node
    Example: mute()
    Notice: this node has to be externally controlled via REST/WebSockets!

    Port     Payload
    input    audio
    output   audio

    Parameters: none
  • Node: meter
    Purpose: Loudness metering node
    Example: meter(250)

    Port     Payload
    input    audio
    output   audio

    Parameter   Position   Default   Requirement
    interval    0          250       none
  • Node: vad
    Purpose: Voice Activity Detection (VAD) node
    Example: vad()

    Port     Payload
    input    audio
    output   audio

    Parameter            Position   Default       Requirement
    mode                 none       "unplugged"   /^(?:silenced
    posSpeechThreshold   none       0.50          none
    negSpeechThreshold   none       0.35          none
    minSpeechFrames      none       2             none
    redemptionFrames     none       12            none
    preSpeechPadFrames   none       1             none
  • Node: gender
    Purpose: Gender Detection node
    Example: gender()

    Port     Payload
    input    audio
    output   audio

    Parameter   Position   Default   Requirement
    window      0          500       none
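
As a sketch of how the detection result can be used downstream (the device names are placeholders), the gender node can feed the meta-information-based filter node:

    device(device: "wasapi:Microphone", mode: "r") |
        gender() |
            filter(type: "audio", var: "meta:gender", op: "==", val: "male") |
                device(device: "wasapi:Speakers", mode: "w")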

Audio-to-Text Nodes:

  • Node: deepgram
    Purpose: Deepgram Speech-to-Text conversion
    Example: deepgram(language: "de")
    Notice: this node requires an API key!

    Port     Payload
    input    audio
    output   text

    Parameter   Position   Default                           Requirement
    key         none       env.SPEECHFLOW_DEEPGRAM_KEY       none
    keyAdm      none       env.SPEECHFLOW_DEEPGRAM_KEY_ADM   none
    model       0          "nova-3"                          none
    version     1          "latest"                          none
    language    2          "multi"                           none

Text-to-Text Nodes:

  • Node: deepl
    Purpose: DeepL Text-to-Text translation
    Example: deepl(src: "de", dst: "en")
    Notice: this node requires an API key!

    Port     Payload
    input    text
    output   text

    Parameter   Position   Default                    Requirement
    key         none       env.SPEECHFLOW_DEEPL_KEY   none
    src         0          "de"                       /^(?:de|en)$/
    dst         1          "en"                       /^(?:de|en)$/
  • Node: openai
    Purpose: OpenAI/GPT Text-to-Text translation and spelling correction
    Example: openai(src: "de", dst: "en")
    Notice: this node requires an OpenAI API key!

    Port     Payload
    input    text
    output   text

    Parameter   Position   Default                     Requirement
    api         none       "https://api.openai.com"    /^https?:\/\/.+?:\d+$/
    src         0          "de"                        /^(?:de|en)$/
    dst         1          "en"                        /^(?:de|en)$/
    key         none       env.SPEECHFLOW_OPENAI_KEY   none
    model       none       "gpt-4o-mini"               none
  • Node: ollama
    Purpose: Ollama/Gemma Text-to-Text translation and spelling correction
    Example: ollama(src: "de", dst: "en")
    Notice: this node requires the Ollama API!

    Port     Payload
    input    text
    output   text

    Parameter   Position   Default                    Requirement
    api         none       "http://127.0.0.1:11434"   /^https?:\/\/.+?:\d+$/
    model       none       "gemma3:4b-it-q4_K_M"      none
    src         0          "de"                       /^(?:de|en)$/
    dst         1          "en"                       /^(?:de|en)$/
  • Node: transformers
    Purpose: Transformers Text-to-Text translation
    Example: transformers(src: "de", dst: "en")

    Port     Payload
    input    text
    output   text

    Parameter   Position   Default   Requirement
    model       none       "OPUS"    /^(?:OPUS
    src         0          "de"      /^(?:de|en)$/
    dst         1          "en"      /^(?:de|en)$/
  • Node: subtitle
    Purpose: SRT/VTT Subtitle Generation
    Example: subtitle(format: "srt")

    Port     Payload
    input    text
    output   text

    Parameter   Position   Default   Requirement
    format      none       "srt"     /^(?:srt|vtt)$/
  • Node: format
    Purpose: Text paragraph formatting
    Example: format(width: 80)

    Port     Payload
    input    text
    output   text

    Parameter   Position   Default   Requirement
    width       0          80        none

Text-to-Audio Nodes:

  • Node: elevenlabs
    Purpose: ElevenLabs Text-to-Speech conversion
    Example: elevenlabs(language: "en")
    Notice: this node requires an API key!

    Port     Payload
    input    text
    output   audio

    Parameter   Position   Default                         Requirement
    key         none       env.SPEECHFLOW_ELEVENLABS_KEY   none
    voice       0          "Brian"                         none
    language    1          "de"                            none
  • Node: kokoro
    Purpose: Kokoro Text-to-Speech conversion
    Example: kokoro(language: "en")
    Notice: this node currently supports the English language only!

    Port     Payload
    input    text
    output   audio

    Parameter   Position   Default   Requirement
    voice       0          "Aoede"   /^(?:Aoede
    language    1          "en"      /^en$/
    speed       2          1.25      1.0 <= n <= 1.30

Any-to-Any Nodes:

  • Node: filter
    Purpose: Meta-information-based filter
    Example: filter(type: "audio", var: "meta:gender", op: "==", val: "male")

    Port     Payload
    input    text, audio
    output   text, audio

    Parameter   Position   Default   Requirement
    type        0          "audio"   /^(?:audio|text)$/
    var         1          ""        /^(?:meta:.+
    op          2          "=="      /^(?:<
    val         3          ""        /^.*$/
  • Node: trace
    Purpose: Data flow tracing
    Example: trace(type: "audio")

    Port     Payload
    input    text, audio
    output   text, audio

    Parameter   Position   Default   Requirement
    type        0          "audio"   /^(?:audio|text)$/
    name        1          none      none
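
Since trace accepts and emits both payload types, it can be sandwiched between any two nodes to observe the data flowing through that edge. A sketch (the file and trace names are hypothetical):

    file(path: "input.txt", mode: "r", type: "text") |
        trace(type: "text", name: "raw") |
            format(width: 80) |
                trace(type: "text", name: "formatted") |
                    file(path: "-", mode: "w", type: "text")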

Graph Expression Language

The SpeechFlow graph expression language is based on FlowLink, whose language follows this BNF-style grammar:

expr             ::= parallel
                   | sequential
                   | node
                   | group
parallel         ::= sequential ("," sequential)+
sequential       ::= node ("|" node)+
node             ::= id ("(" (param ("," param)*)? ")")?
param            ::= array | object | variable | template | string | number | value
group            ::= "{" expr "}"
id               ::= /[a-zA-Z_][a-zA-Z0-9_-]*/
variable         ::= id
array            ::= "[" (param ("," param)*)? "]"
object           ::= "{" (id ":" param ("," id ":" param)*)? "}"
template         ::= "`" ("${" variable "}" | ("\\`"|.))* "`"
string           ::= /"(\\"|.)*"/
                   | /'(\\'|.)*'/
number           ::= /[+-]?/ number-value
number-value     ::= "0b" /[01]+/
                   | "0o" /[0-7]+/
                   | "0x" /[0-9a-fA-F]+/
                   | /[0-9]*\.[0-9]+([eE][+-]?[0-9]+)?/
                   | /[0-9]+/
value            ::= "true" | "false" | "null" | "NaN" | "undefined"

SpeechFlow makes all of its processing nodes available to FlowLink as node constructs, exposes the CLI arguments under the array variable named argv, and exposes all environment variables under the object variable named env.
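
For instance, the template production allows variable interpolation into strings, so a CLI argument can be embedded into a node parameter (a sketch; the path suffix is arbitrary):

    file(path: `${argv.0}.txt`, mode: "r", type: "text") |
        file(path: "-", mode: "w", type: "text")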

History

SpeechFlow, as a technical cut-through, was initially created in March 2024 for use in the msg Filmstudio context. In April 2025 it was refined into a more complete toolkit and thereby used in production for the first time. In July 2025 it was fully refactored to support timestamps in stream processing.

Copyright © 2024-2025 Dr. Ralf S. Engelschall
Licensed under GPL-3.0-only