
    SpeechFlow

    Speech Processing Flow Graph


    About

    SpeechFlow is a command-line tool for macOS, Windows, and Linux that establishes a directed data flow graph of audio and text processing nodes. This allows various speech processing tasks to be performed in a very flexible and configurable way. Typical supported tasks are capturing audio, generating narrations of text (aka text-to-speech), generating transcriptions or subtitles for audio (aka speech-to-text), and generating translations for audio (aka speech-to-speech).

    SpeechFlow comes with built-in graph nodes for local file I/O, local audio device I/O, remote WebSocket network I/O, remote MQTT network I/O, local Voice Activity Detection (VAD), local voice gender recognition, local audio LUFS-S/RMS metering, remote-controllable local audio muting, cloud-based Deepgram speech-to-text conversion, cloud-based ElevenLabs text-to-speech conversion, local Kokoro text-to-speech conversion, cloud-based DeepL text-to-text translation, cloud-based OpenAI/GPT text-to-text translation (or spelling correction), local Ollama/Gemma text-to-text translation (or spelling correction), local OPUS/ONNX text-to-text translation, local FFmpeg speech-to-speech encoding, local WAV speech-to-speech encoding, local text-to-text formatting, local text-to-text sentence merging/splitting, local text-to-text subtitle generation, local text or audio filtering, and local text or audio tracing.

    Additionally, SpeechFlow graph nodes can be provided externally by NPM packages named speechflow-node-xxx, which expose a class derived from the SpeechFlowNode class exported by the speechflow package.

    SpeechFlow is written in TypeScript and ships as an installable package for the Node Package Manager (NPM).

    Installation

    $ npm install -g speechflow

    Usage

    $ speechflow
      [-h|--help]
      [-V|--version]
      [-S|--status]
      [-v|--verbose <level>]
      [-a|--address <ip-address>]
      [-p|--port <tcp-port>]
      [-C|--cache <directory>]
      [-e|--expression <expression>]
      [-f|--file <file>]
      [-c|--config <id>@<yaml-config-file>]
      [<argument> [...]]
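
    For example, the transcription graph shown later in this document can be run directly with an inline expression. The following invocation is only a sketch: the file names are placeholders, and a Deepgram API key is assumed to be present in the environment.

    $ speechflow \
      -e 'file(path: argv.0, mode: "r", type: "audio") |
          ffmpeg(src: "mp3", dst: "pcm") |
          deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) |
          format(width: 80) |
          file(path: argv.1, mode: "w", type: "text")' \
      recording.mp3 transcript.txt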

    Graph Expression Language

    The SpeechFlow graph expression language is based on FlowLink, whose language follows this BNF-style grammar:

    expr             ::= parallel
                       | sequential
                       | node
                       | group
    parallel         ::= sequential ("," sequential)+
    sequential       ::= node ("|" node)+
    node             ::= id ("(" (param ("," param)*)? ")")?
    param            ::= array | object | variable | template | string | number | value
    group            ::= "{" expr "}"
    id               ::= /[a-zA-Z_][a-zA-Z0-9_-]*/
    variable         ::= id
    array            ::= "[" (param ("," param)*)? "]"
    object           ::= "{" (id ":" param ("," id ":" param)*)? "}"
    template         ::= "`" ("${" variable "}" | ("\\`"|.))* "`"
    string           ::= /"(\\"|.)*"/
                       | /'(\\'|.)*'/
    number           ::= /[+-]?/ number-value
    number-value     ::= "0b" /[01]+/
                       | "0o" /[0-7]+/
                       | "0x" /[0-9a-fA-F]+/
                       | /[0-9]*\.[0-9]+([eE][+-]?[0-9]+)?/
                       | /[0-9]+/
    value            ::= "true" | "false" | "null" | "NaN" | "undefined"

    SpeechFlow exposes all SpeechFlow nodes to FlowLink as node identifiers, the CLI arguments under the array variable named argv, and all environment variables under the object variable named env.
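
    As a minimal illustration of these constructs: "|" chains nodes sequentially, "," splits the flow into parallel branches, "{ ... }" groups a sub-graph, and template strings interpolate variables. The device names below are placeholders and the template-based file path is only illustrative.

      device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | {
          wav(mode: "encode") |
              file(path: `${argv.0}.wav`, mode: "w", type: "audio"),
          device(device: "wasapi:VoiceMeeter VAIO3 Input", mode: "w")
      }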

    Processing Graph Examples

    The following are examples of particular SpeechFlow processing graphs. They can also be found in the sample speechflow.yaml file. A sketch of how to launch such graphs follows the examples.

    • Capturing: Capture audio from microphone device into WAV audio file:

      device(device: "wasapi:VoiceMeeter Out B1", mode: "r") |
          wav(mode: "encode") |
              file(path: "capture.wav", mode: "w", type: "audio")
    • Pass-Through: Pass-through audio from microphone device to speaker device and in parallel record it to WAV audio file:

      device(device: "wasapi:VoiceMeeter Out B1", mode: "r") | {
          wav(mode: "encode") |
              file(path: "capture.wav", mode: "w", type: "audio"),
          device(device: "wasapi:VoiceMeeter VAIO3 Input", mode: "w")
      }
    • Transcription: Generate text file with German transcription of MP3 audio file:

      file(path: argv.0, mode: "r", type: "audio") |
          ffmpeg(src: "mp3", dst: "pcm") |
              deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) |
                  format(width: 80) |
                      file(path: argv.1, mode: "w", type: "text")
    • Subtitling: Generate text file with German subtitles of MP3 audio file:

      file(path: argv.0, mode: "r", type: "audio") |
          ffmpeg(src: "mp3", dst: "pcm") |
              deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) |
                  subtitle(format: "vtt") |
                      file(path: argv.1, mode: "w", type: "text")
    • Speaking: Generate audio file with English voice for a text file:

      file(path: argv.0, mode: "r", type: "text") |
          kokoro(language: "en") |
              wav(mode: "encode") |
                  file(path: argv.1, mode: "w", type: "audio")
    • Ad-Hoc Translation: Ad-Hoc text translation from German to English via stdin/stdout:

      file(path: "-", mode: "r", type: "text") |
          deepl(src: "de", dst: "en") |
              file(path: "-", mode: "w", type: "text")
    • Studio Translation: Real-time studio translation from German to English, including the capturing of all involved inputs and outputs:

      device(device: "coreaudio:Elgato Wave:3", mode: "r") | {
          gender() | {
              meter(interval: 250) |
                  wav(mode: "encode") |
                      file(path: "program-de.wav", mode: "w", type: "audio"),
              deepgram(language: "de", key: env.SPEECHFLOW_DEEPGRAM_KEY) | {
                  sentence() | {
                      format(width: 80) |
                          file(path: "program-de.txt", mode: "w", type: "text"),
                      deepl(src: "de", dst: "en", key: env.SPEECHFLOW_DEEPL_KEY) | {
                          trace(name: "text", type: "text") | {
                              format(width: 80) |
                                  file(path: "program-en.txt", mode: "w", type: "text"),
                              subtitle(format: "srt") |
                                  file(path: "program-en.srt", mode: "w", type: "text"),
                              mqtt(url: "mqtt://10.1.0.10:1883",
                                  username: env.SPEECHFLOW_MQTT_USER,
                                  password: env.SPEECHFLOW_MQTT_PASS,
                                  topicWrite: "stream/studio/sender"),
                              {
                                  filter(name: "S2T-male", type: "text", var: "meta:gender", op: "==", val: "male") |
                                      elevenlabs(voice: "Mark", optimize: "latency", speed: 1.05, language: "en"),
                                  filter(name: "S2T-female", type: "text", var: "meta:gender", op: "==", val: "female") |
                                      elevenlabs(voice: "Brittney", optimize: "latency", speed: 1.05, language: "en")
                              } | {
                                  wav(mode: "encode") |
                                      file(path: "program-en.wav", mode: "w", type: "audio"),
                                  device(device: "coreaudio:USBAudio2.0", mode: "w")
                              }
                          }
                      }
                  }
              }
          }
      }
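
    To actually launch such graphs, pass the expression inline via the -e option (as sketched under Usage), via a file with the -f option, or reference an entry of a YAML configuration file via the -c option. In the following sketch, the configuration id transcription is hypothetical and has to match an entry in the sample speechflow.yaml file:

      $ speechflow -c transcription@speechflow.yaml input.mp3 output.txt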

    Processing Node Types

    First, a short overview of the available processing nodes:

    • Input/Output nodes: file, device, websocket, mqtt.
    • Audio-to-Audio nodes: ffmpeg, wav, mute, meter, vad, gender.
    • Audio-to-Text nodes: deepgram.
    • Text-to-Text nodes: deepl, openai, ollama, transformers, sentence, subtitle, format.
    • Text-to-Audio nodes: elevenlabs, kokoro.
    • Any-to-Any nodes: filter, trace.

    Input/Output Nodes

    The following nodes are for external I/O, i.e., for reading/writing from/to external files, devices and network services. A note on positional parameters follows this list.

    • Node: file
      Purpose: File and StdIO source/sink
      Example: file(path: "capture.pcm", mode: "w", type: "audio")

      This node allows reading/writing from/to files or StdIO. It is intended to be used as a source or sink node in batch processing, and as a sink node in real-time processing.

      Port Payload
      input text, audio
      output text, audio
      Parameter Position Default Requirement
      path 0 none none
      mode 1 "r" /^(?:r|w|rw)$/
      type 2 "audio" /^(?:audio|text)$/
      chunka none 200 10 <= n <= 1000
      chunkt none 65536 1024 <= n <= 131072
    • Node: device
      Purpose: Microphone/speaker device source/sink
      Example: device(device: "wasapi:VoiceMeeter Out B1", mode: "r")

      This node allows reading/writing from/to audio devices. It is intended to be used as a source node for microphone devices and as a sink node for speaker devices.

      Port Payload
      input audio
      output audio
      Parameter Position Default Requirement
      device 0 none /^(.+?):(.+)$/
      mode 1 "rw" /^(?:r|w|rw)$/
      chunk 2 200 10 <= n <= 1000
    • Node: websocket
      Purpose: WebSocket source/sink
      Example: websocket(connect: "ws://127.0.0.1:12345", type: "text")
      Notice: this node requires a peer WebSocket service!

      This node allows reading/writing from/to WebSocket network services. It is primarily intended to be used for sending out the text of subtitles, but can also be used for receiving the text to be processed.

      Port Payload
      input text, audio
      output text, audio
      Parameter Position Default Requirement
      listen none none /^(?:|ws:\/\/(.+?):(\d+))$/
      connect none none /^(?:|ws:\/\/(.+?):(\d+)(?:\/.*)?)$/
      type none "audio" /^(?:audio|text)$/
    • Node: mqtt
      Purpose: MQTT sink
      Example: mqtt(url: "mqtt://127.0.0.1:1883", username: "foo", password: "bar", topic: "quux")
      Notice: this node requires a peer MQTT broker!

      This node allows reading/writing from/to MQTT broker topics. It is primarily intended to be used for sending out the text of subtitles, but can also be used for receiving the text to be processed.

      Port Payload
      input text
      output none
      Parameter Position Default Requirement
      url 0 none /^(?:|(?:ws|mqtt):\/\/(.+?):(\d+))$/
      username 1 none /^.+$/
      password 2 none /^.+$/
      topic 3 none /^.+$/
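
    Note on parameters: the Position column in the parameter tables indicates the position under which a parameter can also be passed positionally instead of by name (compare the positional meter(250) example below). Under this assumption, the following two forms of the file node should be equivalent:

      file(path: "capture.wav", mode: "w", type: "audio")
      file("capture.wav", "w", "audio")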

    Audio-to-Audio Nodes

    The following nodes process audio chunks only.

    • Node: ffmpeg
      Purpose: FFmpeg audio format conversion
      Example: ffmpeg(src: "pcm", dst: "mp3")

      This node allows converting between audio formats. It is primarily intended to support the reading/writing of external MP3 and Opus format files, although SpeechFlow internally uses PCM format only.

      Port Payload
      input audio
      output audio
      Parameter Position Default Requirement
      src 0 "pcm" /^(?:pcm|wav|mp3|opus)$/
      dst 1 "wav" /^(?:pcm|wav|mp3|opus)$/
    • Node: wav
      Purpose: WAV audio format conversion
      Example: wav(mode: "encode")

      This node allows converting between PCM and WAV audio formats. It is primarily intended to support the reading/writing of external WAV format files, although SpeechFlow internally uses PCM format only.

      Port Payload
      input audio
      output audio
      Parameter Position Default Requirement
      mode 0 "encode" /^(?:encode|decode)$/
    • Node: mute
      Purpose: volume muting node
      Example: mute()
      Notice: this node has to be externally controlled via REST/WebSockets!

      This node allows muting the audio stream, either by silencing it or by unplugging it entirely. It has to be externally controlled via the REST/WebSocket API (see below).

      Port Payload
      input audio
      output audio
      Parameter Position Default Requirement
    • Node: meter
      Purpose: Loudness metering node
      Example: meter(250)

      This node allows measuring the loudness of the audio stream. The results are emitted to both the logfile of SpeechFlow and the WebSockets API (see below).

      Port Payload
      input audio
      output audio
      Parameter Position Default Requirement
      interval 0 250 none
    • Node: vad
      Purpose: Voice Activity Detection (VAD) node
      Example: vad()

      This node performs Voice Activity Detection (VAD), i.e., it detects voice in the audio stream and, if no voice is detected, either silences or unplugs the audio stream.

      Port Payload
      input audio
      output audio
      Parameter Position Default Requirement
      mode none "unplugged" /^(?:silenced|unplugged)$/
      posSpeechThreshold none 0.50 none
      negSpeechThreshold none 0.35 none
      minSpeechFrames none 2 none
      redemptionFrames none 12 none
      preSpeechPadFrames none 1 none
      postSpeechTail none 1500 none
    • Node: gender
      Purpose: Gender Detection node
      Example: gender()

      This node performs gender detection on the audio stream. It annotates the audio chunks with gender=male or gender=female meta information. Use this meta information with the "filter" node (see the sketch after this list).

      Port Payload
      input audio
      output audio
      Parameter Position Default Requirement
      window 0 500 none
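
    As referenced in the gender node description above, the gender meta information is typically consumed by downstream filter nodes. The following condensed sketch is derived from the studio translation example above:

      gender() |
          deepgram(language: "de") | {
              filter(name: "male", type: "text", var: "meta:gender", op: "==", val: "male") |
                  elevenlabs(voice: "Mark", language: "en"),
              filter(name: "female", type: "text", var: "meta:gender", op: "==", val: "female") |
                  elevenlabs(voice: "Brittney", language: "en")
          }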

    Audio-to-Text Nodes

    The following nodes convert audio to text chunks.

    • Node: deepgram
      Purpose: Deepgram Speech-to-Text conversion
      Example: deepgram(language: "de")
      Notice: this node requires an API key!

      This node performs Speech-to-Text (S2T) conversion, i.e., it recognizes speech in the input audio stream and outputs a corresponding text stream.

      Port Payload
      input audio
      output text
      Parameter Position Default Requirement
      key none env.SPEECHFLOW_DEEPGRAM_KEY none
      keyAdm none env.SPEECHFLOW_DEEPGRAM_KEY_ADM none
      model 0 "nova-3" none
      version 1 "latest" none
      language 2 "multi" none

    Text-to-Text Nodes

    The following nodes process text chunks only.

    • Node: deepl
      Purpose: DeepL Text-to-Text translation
      Example: deepl(src: "de", dst: "en")
      Notice: this node requires an API key!

      This node performs translation between the English and German languages.

      Port Payload
      input text
      output text
      Parameter Position Default Requirement
      key none env.SPEECHFLOW_DEEPL_KEY none
      src 0 "de" /^(?:de|en)$/
      dst 1 "en" /^(?:de|en)$/
    • Node: openai
      Purpose: OpenAI/GPT Text-to-Text translation and spelling correction
      Example: openai(src: "de", dst: "en")
      Notice: this node requires an OpenAI API key!

      This node performs translation between English and German in the text stream, or (if the source and destination languages are the same) spelling correction of English or German text. It is based on the remote OpenAI cloud AI service and uses the GPT-4o-mini LLM.

      Port Payload
      input text
      output text
      Parameter Position Default Requirement
      api none "https://api.openai.com" /^https?:\/\/.+?:\d+$/
      src 0 "de" /^(?:de|en)$/
      dst 1 "en" /^(?:de|en)$/
      key none env.SPEECHFLOW_OPENAI_KEY none
      model none "gpt-4o-mini" none
    • Node: ollama
      Purpose: Ollama/Gemma Text-to-Text translation and spelling correction
      Example: ollama(src: "de", dst: "en")
      Notice: this node requires Ollama to be installed!

      This node performs translation between English and German in the text stream, or (if the source and destination languages are the same) spelling correction of English or German text. It is based on the local Ollama AI service and uses the Google Gemma 3 LLM.

      Port Payload
      input text
      output text
      Parameter Position Default Requirement
      api none "http://127.0.0.1:11434" /^https?:\/\/.+?:\d+$/
      model none "gemma3:4b-it-q4_K_M" none
      src 0 "de" /^(?:de|en)$/
      dst 1 "en" /^(?:de|en)$/
    • Node: transformers
      Purpose: Transformers Text-to-Text translation
      Example: transformers(src: "de", dst: "en")

      This node performs translation between English and German in the text stream. It is based on local OPUS or SmolLM3 LLMs.

      Port Payload
      input text
      output text
      Parameter Position Default Requirement
      model none "OPUS" /^(?:OPUS|SmolLM3)$/
      src 0 "de" /^(?:de|en)$/
      dst 1 "en" /^(?:de|en)$/
    • Node: sentence
      Purpose: sentence splitting/merging
      Example: sentence()

      This node ensures that a text stream is split or merged into complete sentences. It is primarily intended to be used after the "deepgram" node and before the "deepl" or "elevenlabs" nodes in order to improve overall quality.

      Port Payload
      input text
      output text
      Parameter Position Default Requirement
    • Node: subtitle
      Purpose: SRT/VTT Subtitle Generation
      Example: subtitle(format: "srt")

      This node generates subtitles from the text stream (and its embedded timestamps) in the formats SRT (SubRip) or VTT (WebVTT).

      Port Payload
      input text
      output text
      Parameter Position Default Requirement
      format none "srt" /^(?:srt|vtt)$/
      words none false none
    • Node: format
      Purpose: text paragraph formatting
      Example: format(width: 80)

      This node formats the text stream into lines no longer than a certain width. It is primarily intended for use before writing text chunks to files.

      Port Payload
      input text
      output text
      Parameter Position Default Requirement
      width 0 80 none

    Text-to-Audio Nodes

    The following nodes convert text chunks to audio chunks.

    • Node: elevenlabs
      Purpose: ElevenLabs Text-to-Speech conversion
      Example: elevenlabs(language: "en")
      Notice: this node requires an ElevenLabs API key!

      This node performs Text-to-Speech (T2S) conversion, i.e., it converts the input text stream into an output audio stream of generated speech.

      Port Payload
      input text
      output audio
      Parameter Position Default Requirement
      key none env.SPEECHFLOW_ELEVENLABS_KEY none
      voice 0 "Brian" /^(?:Brittney|Cassidy|Leonie|Mark|Brian)$/
      language 1 "de" /^(?:de|en)$/
      speed 2 1.00 n >= 0.7 && n <= 1.2
      stability 3 0.5 n >= 0.0 && n <= 1.0
      similarity 4 0.75 n >= 0.0 && n <= 1.0
      optimize 5 "latency" /^(?:latency|quality)$/
    • Node: kokoro
      Purpose: Kokoro Text-to-Speech conversion
      Example: kokoro(language: "en")
      Notice: this node currently supports the English language only!

      This node performs Text-to-Speech (T2S) conversion, i.e., it converts the input text stream into an output audio stream of generated speech.

      Port Payload
      input text
      output audio
      Parameter Position Default Requirement
      voice 0 "Aoede" /^(?:Aoede|Heart|Puck|Fenrir)$/
      language 1 "en" /^en$/
      speed 2 1.25 n >= 1.0 && n <= 1.3

    Any-to-Any Nodes

    The following nodes process any type of chunk, i.e., both audio and text chunks.

    • Node: filter
      Purpose: meta information based filter
      Example: filter(type: "audio", var: "meta:gender", op: "==", val: "male")

      This node allows you to filter chunks based on certain criteria. It is primarily intended to be used in conjunction with the "gender" node and in front of the "elevenlabs" or "kokoro" nodes, in order to use a voice matching the detected gender. A further filter sketch follows this list.

      Port Payload
      input text, audio
      output text, audio
      Parameter Position Default Requirement
      type 0 "audio" /^(?:audio|text)$/
      name 1 "filter" /^.+$/
      var 2 "" /^(?:meta:.+|payload:(?:length|text)|time:(?:start|end))$/
      op 3 "==" /^(?:<|<=|==|!=|~~|!~|>=|>)$/
      val 4 "" /^.*$/
    • Node: trace
      Purpose: data flow tracing
      Example: trace(type: "audio")

      This node allows you to trace the audio and text chunk flow through the SpeechFlow graph. It just passes through its chunks, but sends information about the chunks to the log.

      Port Payload
      input text, audio
      output text, audio
      Parameter Position Default Requirement
      type 0 "audio" /^(?:audio|text)$/
      name 1 none none
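
    As mentioned in the filter node description above, the var parameter can reference not only meta information but also the payload (payload:length, payload:text) and the chunk timing (time:start, time:end). The following sketch, which passes only text chunks of at least 10 characters, is hypothetical and assumes that val is compared numerically in this case:

      filter(name: "drop-short", type: "text", var: "payload:length", op: ">=", val: 10)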

    REST/WebSocket API

    SpeechFlow exposes an external REST/WebSockets API which can be used to control nodes and to receive information emitted by nodes. For controlling a node, there are three possibilities (illustrated here by controlling the mode of the "mute" node):

    # use HTTP/REST/GET:
    $ curl http://127.0.0.1:8484/api/COMMAND/mute/mode/silenced
    # use HTTP/REST/POST:
    $ curl -H "Content-type: application/json" \
      --data '{ "request": "COMMAND", "node": "mute", "args": [ "mode", "silenced" ] }' \
      http://127.0.0.1:8484/api
    # use WebSockets:
    $ wscat -c ws://127.0.0.1:8484/api \
    > { "request": "COMMAND", "node": "mute", "args": [ "mode", "silenced" ] }

    For receiving information emitted by nodes, you have to use the WebSockets API (illustrated here by the information emitted by the "meter" node):

    # use WebSockets:
    $ wscat -c ws://127.0.0.1:8484/api \
    < { "response": "NOTIFY", "node": "meter", "args": [ "meter", "LUFS-S", -35.75127410888672 ] }

    History

    SpeechFlow was initially created as a technical cut-through in March 2024 for use in the msg Filmstudio context. It was refined into a more complete toolkit in April 2025, when it was used in production for the first time. In July 2025 it was fully refactored in order to support timestamps in the stream processing.

    Copyright © 2024-2025 Dr. Ralf S. Engelschall
    Licensed under GPL 3.0