embedeer

Embedeer Logo: a deer with vector numbers between antlers. Logo generated by ChatGPT. Public Domain.

A Node.js Embedding Tool


A Node.js tool for generating text embeddings using transformers.js with ONNX models from Hugging Face.

Supports batched input, parallel execution, isolated child-process workers (default) or in-process threads, quantization, optional GPU acceleration, and Hugging Face auth.


Features

  • Downloads any Hugging Face feature-extraction model on first use (cached in ~/.embedeer/models)
  • Isolated processes (default) — a worker crash cannot bring down the caller
  • In-process threads — opt-in via mode: 'thread' for lower overhead
  • Sequential execution when concurrency: 1
  • Configurable batch size and concurrency
  • GPU acceleration — optional CUDA (Linux x64) and DirectML (Windows x64), no extra packages needed
  • Hugging Face API token support (--token / HF_TOKEN env var)
  • Quantization via dtype (fp32 · fp16 · q8 · q4 · q4f16 · auto)
  • Rich CLI: pull model, embed from file, dump output as JSON / TXT / SQL

Installation

npm install @jsilvanus/embedeer

GPU acceleration (CUDA on Linux x64, DirectML on Windows x64) is built into onnxruntime-node which ships as a transitive dependency. No additional packages are required.

For CUDA on Linux x64 you also need the CUDA 12 system libraries:

# Ubuntu / Debian
sudo apt install cuda-toolkit-12-6 libcudnn9-cuda-12

Programmatic API

Model management

Embedeer supports pre-caching and managing downloaded models.

  • Pull (pre-cache) a model via the CLI:

npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2

  • Programmatic pre-cache using loadModel():

import { loadModel } from '@jsilvanus/embedeer';

const { modelName, cacheDir } = await loadModel('Xenova/all-MiniLM-L6-v2', {
  token: 'hf_...',       // optional HF token
  dtype: 'q8',           // optional quantization
  cacheDir: '/my/cache', // optional override
});

  • Cache location: default is ~/.embedeer/models. Override with the CLI --cache-dir option or the cacheDir argument to loadModel().

  • Removing cached models: delete the model directory from the cache. Example:

# Unix
rm -rf ~/.embedeer/models/Xenova-all-MiniLM-L6-v2

# PowerShell (Windows)
Remove-Item -Recurse -Force $env:USERPROFILE\.embedeer\models\Xenova-all-MiniLM-L6-v2
  • Advanced: see src/model-management.js for low-level cache helpers.

Model compatibility (ONNX)

Embedeer runs models via onnxruntime-node. Models chosen from Hugging Face must provide an ONNX export compatible with ONNX Runtime, or be convertible to ONNX (see Optimum). If a model does not include an ONNX build, export it and place the ONNX files in your cache or publish them to the model repository so embedeer can load them.

Programmatic runtime & cache helpers

Two small runtime/cache helpers are available from the public API:

  • getLoadedModels() — returns an array of model names currently loaded by active worker pools.
  • deleteModel(modelName, { cacheDir? }) — remove cached model directories matching modelName.

Example:

import { getLoadedModels, deleteModel } from '@jsilvanus/embedeer';

// Synchronous list of models currently loaded by any running WorkerPool
console.log(getLoadedModels()); // e.g. ['Xenova/all-MiniLM-L6-v2']

// Remove a cached model from disk (async)
const removed = await deleteModel('Xenova/all-MiniLM-L6-v2');
console.log('removed?', removed);

Explainer — deterministic LLM interface

This feature was deprecated in v1.3.0 and moved to the npm package @jsilvanus/chattydeer.

Usage

Embed texts (CPU — default)

import { Embedder } from '@jsilvanus/embedeer';

const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  batchSize:   32,          // texts per worker task   (default: 32)
  concurrency: 2,           // parallel workers        (default: 2)
  mode:       'process',    // 'process' | 'thread'    (default: 'process')
  pooling:    'mean',       // 'mean' | 'cls' | 'none' (default: 'mean')
  normalize:   true,        // L2-normalise vectors    (default: true)
  token:      'hf_...',     // HF API token (optional; also reads HF_TOKEN env)
  dtype:      'q8',         // quantization dtype      (optional)
  cacheDir:   '/my/cache',  // override model cache    (default: ~/.embedeer/models)
});

const vectors = await embedder.embed(['Hello world', 'Foo bar baz']);
// → number[][]  (one 384-dim vector per text for all-MiniLM-L6-v2)

await embedder.destroy(); // shut down worker processes
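Because normalize defaults to true, the returned vectors are L2-normalised, so cosine similarity reduces to a plain dot product. A minimal, dependency-free helper for working with the returned vectors (illustrative only, not part of embedeer's API):

```javascript
// Cosine similarity between two vectors.
// For L2-normalised vectors (embedeer's default) this equals the dot product.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. rank corpus vectors against a query vector:
// const scores = vectors.map((v) => cosineSimilarity(queryVec, v));
```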

TypeScript example

The package includes TypeScript declarations so imports are typed automatically.

import { Embedder } from '@jsilvanus/embedeer';

async function main() {
  const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', { batchSize: 32, concurrency: 2 });
  const vectors = await embedder.embed(['Hello world', 'Foo bar baz']);
  // vectors: number[][]
  await embedder.destroy();
}

main().catch(console.error);

Programmatic profile generation (optional)

You can generate and save a per-user performance profile which Embedder.create() will automatically apply. This is useful to pick the best batchSize / concurrency for your machine without manual tuning.

import { Embedder } from '@jsilvanus/embedeer';

// Quick profile generation (writes ~/.embedeer/perf-profile.json)
await Embedder.generateAndSaveProfile({ mode: 'quick', device: 'cpu', sampleSize: 100 });
// Subsequent calls to Embedder.create() will auto-apply the saved profile by default.

Embed texts with GPU

import { Embedder } from '@jsilvanus/embedeer';

// Auto-detect GPU (falls back to CPU if no provider is installed)
const autoEmbedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  device: 'auto',
});

// Require GPU (throws if no provider is available)
const gpuEmbedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  device: 'gpu',
});

// Explicitly select an execution provider
const cudaEmbedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  provider: 'cuda',  // 'cuda' | 'dml'
});

CLI

npx @jsilvanus/embedeer [options]

Model management (pull / cache model):
  npx @jsilvanus/embedeer --model <name>

Embed texts (batch):
  npx @jsilvanus/embedeer --model <name> --data "text1" "text2" ...
  npx @jsilvanus/embedeer --model <name> --data '["text1","text2"]'
  npx @jsilvanus/embedeer --model <name> --file texts.txt
  echo '["t1","t2"]' | npx @jsilvanus/embedeer --model <name>
  printf 'a\0b\0c' | npx @jsilvanus/embedeer --model <name> --delimiter '\0'

Interactive / streaming line-reader:
  npx @jsilvanus/embedeer --model <name> --interactive --dump out.jsonl
  cat big.txt | npx @jsilvanus/embedeer --model <name> -i --output csv --dump out.csv

Options:
  -m, --model <name>           Hugging Face model (default: Xenova/all-MiniLM-L6-v2)
  -d, --data <text...>         Text(s) or JSON array to embed
      --file <path>            Input file: JSON array or delimited texts
  -D, --delimiter <str>        Record separator for stdin/file (default: \n)
                               Escape sequences supported: \0 \n \t \r
  -i, --interactive            Interactive line-reader (see below)
      --dump <path>            Write output to file instead of stdout
      --output <format>        Output: json|jsonl|csv|txt|sql (default: json)
      --with-text              Include source text alongside each embedding
  -b, --batch-size <n>         Texts per worker batch (default: 32)
  -c, --concurrency <n>        Parallel workers (default: 2)
      --mode process|thread    Worker mode (default: process)
  -p, --pooling <mode>         mean|cls|none (default: mean)
      --no-normalize           Disable L2 normalisation
      --dtype <type>           Quantization: fp32|fp16|q8|q4|q4f16|auto
      --token <tok>            Hugging Face API token (or set HF_TOKEN env)
      --cache-dir <path>       Model cache directory (default: ~/.embedeer/models)
      --device <mode>          Compute device: auto|cpu|gpu (default: cpu)
      --provider <name>        Execution provider override: cpu|cuda|dml
  -h, --help                   Show this help

Input Sources

Texts can be provided in any of these ways (checked in order):

Source How
Inline args --data "text1" "text2" "text3"
Inline JSON --data '["text1","text2"]'
File --file texts.txt (JSON array or one record per line)
Stdin Pipe or redirect — auto-detected; TTY is skipped
Interactive --interactive / -i — line-reader, embeds as you type

Stdin auto-detection: when stdin is not a TTY (i.e. data is piped or redirected), embedeer reads it before deciding what to do. JSON arrays are accepted directly; otherwise records are split on the delimiter.
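That detection step can be sketched as follows (a hypothetical helper for illustration; embedeer's actual internals may differ):

```javascript
// Parse piped input: accept a JSON array directly,
// otherwise split records on the configured delimiter.
function parseInput(raw, delimiter = '\n') {
  const trimmed = raw.trim();
  if (trimmed.startsWith('[')) {
    try {
      const parsed = JSON.parse(trimmed);
      if (Array.isArray(parsed)) return parsed.map(String);
    } catch {
      // not valid JSON — fall through to delimiter splitting
    }
  }
  return raw.split(delimiter).filter((rec) => rec.length > 0);
}
```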


Interactive Line-Reader Mode (-i / --interactive)

The interactive mode opens a line-by-line reader that starts embedding as records arrive — ideal for pasting large datasets into a terminal or streaming data from another process.

# Open an interactive session (paste lines, Ctrl+D when done)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --interactive --dump embeddings.jsonl

# Stream a large file through interactive mode with CSV output
cat big.txt | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 \
  --interactive --output csv --dump embeddings.csv

# Interactive with GPU, custom batch size, txt output
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 \
  --interactive --device auto --batch-size 16 --output txt --dump vecs.txt

How it works:

Event What happens
Type a line, press Enter Record is buffered
Buffer reaches --batch-size Auto-flush: embed + append to output
Type an empty line Manual flush: embed whatever is buffered
Ctrl+D (EOF) Flush remaining records and exit
Ctrl+C Flush remaining records and exit

Behaviour notes:

  • Progress messages (Batch N: M record(s) → file) always go to stderr — they never pollute piped output.
  • When stdin is a TTY, a > prompt is shown on stderr.
  • Output defaults to stdout if --dump is omitted; a tip is printed when running in TTY mode.
  • --output json and --output sql are automatically promoted to jsonl since they produce complete documents that cannot be appended to incrementally.
  • --output csv writes the dimension header (text,dim_0,dim_1,...) on the first batch only; subsequent batches append data rows.
  • Each interactive session clears the --dump file on start so you always get a fresh output file.
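The flush behaviour in the table above can be sketched as follows (a simplified, hypothetical shape, not the actual implementation):

```javascript
// Simplified sketch of the interactive buffer/flush cycle.
// `embedBatch` is a stand-in for the real embed-and-append step.
function createLineBuffer(batchSize, embedBatch) {
  let buffer = [];
  const flush = () => {
    if (buffer.length === 0) return;
    embedBatch(buffer);                       // embed + append to output
    buffer = [];
  };
  return {
    push(line) {
      if (line === '') { flush(); return; }   // empty line → manual flush
      buffer.push(line);
      if (buffer.length >= batchSize) flush(); // auto-flush at --batch-size
    },
    end: flush,                               // EOF / Ctrl+C → flush remainder
  };
}
```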

Configurable delimiter (-D / --delimiter)

By default records in stdin and files are split on newline (\n). Use --delimiter to change it:

# Newline-delimited (default)
printf 'Hello\nWorld\n' | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2

# Null-byte delimited — safe with filenames/texts that contain newlines
printf 'Hello\0World\0' | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --delimiter '\0'

# Tab-delimited
printf 'Hello\tWorld' | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --delimiter '\t'

# Custom multi-character delimiter
printf 'Hello|||World|||Foo' | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --delimiter '|||'

# File with null-byte delimiter
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --file records.bin --delimiter '\0'

# Integrate with find -print0 (handles filenames with spaces / newlines)
find ./docs -name '*.txt' -print0 | \
  xargs -0 cat | \
  npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --delimiter '\0'

Supported escape sequences in --delimiter:

Sequence Character
\0 Null byte (U+0000)
\n Newline (U+000A)
\t Tab (U+0009)
\r Carriage return (U+000D)
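Decoding these sequences amounts to a small lookup table; a sketch (illustrative, not embedeer's internal code):

```javascript
// Decode the escape sequences supported by --delimiter.
function decodeDelimiter(str) {
  const escapes = { '\\0': '\0', '\\n': '\n', '\\t': '\t', '\\r': '\r' };
  return str.replace(/\\[0ntr]/g, (seq) => escapes[seq]);
}
```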

Output Formats

Format Description
json (default) JSON array of float arrays: [[0.1,0.2,...],[...]]
json --with-text JSON array of objects: [{"text":"...","embedding":[...]}]
jsonl Newline-delimited JSON, one object per line: {"text":"...","embedding":[...]}
csv CSV with header: text,dim_0,dim_1,...,dim_N
txt Space-separated floats, one vector per line
txt --with-text Tab-separated: <original text>\t<float float ...>
sql INSERT INTO embeddings (text, vector) VALUES ...;

Use --dump <path> to write the output to a file instead of stdout. Progress messages always go to stderr so they never interfere with piped output.
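To illustrate the table above, a single (text, embedding) row in a few of these formats could be built like this (a hypothetical formatter for illustration only; the real output may differ in detail):

```javascript
// Format one (text, embedding) pair in the styles described above.
function formatRow(format, text, embedding) {
  switch (format) {
    case 'jsonl':
      return JSON.stringify({ text, embedding });
    case 'csv': {
      const quoted = `"${text.replace(/"/g, '""')}"`; // CSV-escape the text column
      return [quoted, ...embedding].join(',');
    }
    case 'txt':
      return embedding.join(' ');
    default:
      throw new Error(`unknown format: ${format}`);
  }
}
```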

Piping examples

MODEL=Xenova/all-MiniLM-L6-v2

# --- json (default) ---
# Embed and pretty-print with jq
echo '["Hello","World"]' | npx @jsilvanus/embedeer --model $MODEL | jq '.[0] | length'

# --- jsonl ---
# One object per line — pipe to jq, grep, awk, etc.
npx @jsilvanus/embedeer --model $MODEL --data "foo" "bar" --output jsonl

# Filter by similarity: extract embedding for downstream processing
npx @jsilvanus/embedeer --model $MODEL --data "query text" --output jsonl \
  | jq -c '.embedding'

# Stream a large file and store as JSONL
npx @jsilvanus/embedeer --model $MODEL --file big.txt --output jsonl --dump out.jsonl

# --- json --with-text ---
# Keep the source text next to each vector (useful for building a search index)
npx @jsilvanus/embedeer --model $MODEL --output json --with-text \
  --data "cat" "dog" "fish" \
  | jq '.[] | {text, dims: (.embedding | length)}'

# --- csv ---
# Embed then open in Python/pandas
npx @jsilvanus/embedeer --model $MODEL --file texts.txt --output csv --dump vectors.csv
python3 -c "import pandas as pd; df = pd.read_csv('vectors.csv'); print(df.shape)"

# --- txt ---
# Raw floats — useful for awk/paste/numpy text loading
npx @jsilvanus/embedeer --model $MODEL --data "Hello" "World" --output txt \
  | awk '{print NF, "dimensions"}'

# txt --with-text: original text + tab + floats, easy to parse
npx @jsilvanus/embedeer --model $MODEL --file texts.txt --output txt --with-text \
  | while IFS=$'\t' read -r text vec; do echo "TEXT: $text"; done

# --- sql ---
# Generate INSERT statements for a vector DB or SQLite
npx @jsilvanus/embedeer --model $MODEL --file texts.txt --output sql --dump inserts.sql
sqlite3 mydb.sqlite < inserts.sql

# --- Chaining with other tools ---
# Embed stdin from another command
cat docs/*.txt | npx @jsilvanus/embedeer --model $MODEL --output jsonl > embeddings.jsonl

# Null-byte input from find (handles any filename or text with newlines)
find ./corpus -name '*.txt' -print0 \
  | xargs -0 cat \
  | npx @jsilvanus/embedeer --model $MODEL --delimiter '\0' --output jsonl

CLI Examples

# Pull a model (like ollama pull)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2

# Embed a few strings, output JSON (CPU)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --data "Hello" "World"

# Auto-detect GPU, fall back to CPU if unavailable
# (uses CUDA on Linux, DirectML on Windows, CPU everywhere else)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device auto --data "Hello"

# Require GPU (throws with install instructions if no provider found)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device gpu --data "Hello GPU"

# Explicit CUDA (Linux x64 — requires CUDA 12 system libraries)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --provider cuda --data "Hello CUDA"

# Explicit DirectML (Windows x64)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --provider dml --data "Hello DML"

# Embed from a file, dump SQL to disk
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 \
  --file texts.txt --output sql --dump out.sql

# Use quantized model, in-process threads, private model with token
npx @jsilvanus/embedeer --model my-org/private-model \
  --token hf_xxx --dtype q8 --mode thread \
  --data "embed me"

Using GPU

No additional packages are needed — onnxruntime-node (installed with @jsilvanus/embedeer) already bundles the CUDA provider on Linux x64 and DirectML on Windows x64.

Linux x64 — NVIDIA CUDA:

# One-time: install CUDA 12 system libraries (Ubuntu/Debian)
sudo apt install cuda-toolkit-12-6 libcudnn9-cuda-12

# Auto-detect: uses CUDA here, CPU fallback on any other machine
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device auto --data "Hello"

# Hard-require CUDA (throws with diagnostic error if unavailable):
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device gpu --data "Hello GPU"

# Explicit CUDA provider:
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --provider cuda --data "Hello CUDA"

Windows x64 — DirectML (any GPU: NVIDIA / AMD / Intel):

npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device auto --data "Hello"
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device gpu  --data "Hello GPU"
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --provider dml --data "Hello DML"

GPU Acceleration

GPU support is built into onnxruntime-node (a dependency of @huggingface/transformers):

Platform Provider Requirement
Linux x64 CUDA NVIDIA GPU + driver ≥ 525, CUDA 12 toolkit, cuDNN 9
Windows x64 DirectML Any DirectX 12 GPU (most GPUs since 2016), Windows 10+

Provider selection logic

device provider Behavior
cpu (default) Always CPU
auto Try GPU providers for the platform in order; silent CPU fallback
gpu Try GPU providers; throw if none available
any cuda Load CUDA provider; throw if not available or not supported
any dml Load DirectML provider; throw if not available or not supported
any cpu Always CPU

On Linux x64: GPU order is cuda.
On Windows x64: GPU order is cuda → dml.


Testing

Run the project's tests locally:

# install deps
pnpm install

# run tests
pnpm test

# run tests with coverage
pnpm run coverage

CI is enabled via GitHub Actions (.github/workflows/ci.yml) which runs tests and collects coverage on push and pull requests.


Performance Optimizations

Embedeer exposes runtime knobs and helper scripts to tune throughput for your host.

  • Pre-load models: run loadModel(model, { dtype, cacheDir }) or use the bench scripts so workers start instantly without re-downloading models.
  • Reuse Embedder instances: create a single Embedder and call embed() repeatedly instead of creating and destroying instances per batch.
  • Batch size vs concurrency:
    • CPU: moderate batch sizes (16–64) with multiple workers (concurrency ≥ 2) usually give best throughput.
    • GPU: larger batches (64–256) with low concurrency (1–2) are typically fastest.
  • BLAS threading: avoid oversubscription by setting OMP_NUM_THREADS and MKL_NUM_THREADS to Math.floor(cpu_cores / concurrency) before starting workers.
  • Device/provider: use cuda on Linux and dml (DirectML) on Windows when available; device: 'auto' will try providers and fall back to CPU.
  • Automatic tuning: use bench/grid-search.js to sweep batchSize, concurrency, and dtype for your host and save results. You can generate and persist a per-user profile and apply it automatically via the Embedder APIs.
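The thread-count rule above can be computed with a one-line helper (a sketch; the environment variables must be exported before the workers start):

```javascript
// Avoid BLAS oversubscription: give each worker an even share of CPU cores.
function blasThreads(cores, concurrency) {
  return Math.max(1, Math.floor(cores / concurrency));
}

// Usage before spawning workers (cores via os.cpus().length):
//   import os from 'node:os';
//   const threads = blasThreads(os.cpus().length, 2);
//   process.env.OMP_NUM_THREADS = String(threads);
//   process.env.MKL_NUM_THREADS = String(threads);
```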

Examples:

# CPU quick grid
node bench/grid-search.js --device cpu --sample-size 200 --out bench/grid-results-cpu.json

# GPU quick grid
node bench/grid-search.js --device gpu --sample-size 100 --out bench/grid-results-gpu.json

Programmatic profile generation (writes ~/.embedeer/perf-profile.json):

import { Embedder } from '@jsilvanus/embedeer';

await Embedder.generateAndSaveProfile({ mode: 'quick', device: 'cpu', sampleSize: 100 });
// Embedder.create() will auto-apply a saved per-user profile by default

How it works

embed(texts)
  │
  ├─ split into batches of batchSize
  │
  └─ Promise.all(batches) ──► WorkerPool
                                 │
                                 ├─ [process mode] ChildProcessWorker 0
                                 │   resolveProvider(device, provider)
                                 │   → pipeline('feature-extraction', model, { device: 'cuda' })
                                 │   → embed batch A
                                 │
                                 └─ [process mode] ChildProcessWorker 1
                                     resolveProvider(device, provider)
                                     → pipeline(...) → embed batch B

Workers load the model once at startup and reuse it for all batches.
Provider activation happens per-worker before the pipeline is created.


E2E testing

Note: HF authentication has not been tested.


Collaboration

You are welcome to suggest additions or open a PR, especially for performance-related improvements. Issues are also gratefully received.

License

MIT