@localmode/transformers

HuggingFace Transformers.js provider for LocalMode — run ML models locally in the browser.

Features

Browser-Native - Run ML models directly in the browser with WebGPU/WASM
Privacy-First - All processing happens locally, no data leaves the device
Model Caching - Models are cached in IndexedDB for instant subsequent loads
Optimized - Uses quantized models for smaller size and faster inference

Installation

pnpm install @localmode/transformers @localmode/core

Dependencies

Package	Purpose
`@huggingface/transformers` (^4.2.0)	ML inference via ONNX Runtime (WebGPU/WASM)
`phonemizer`	eSpeak-NG WASM for Kokoro TTS text-to-phoneme conversion

Overview

@localmode/transformers provides model implementations for the interfaces defined in @localmode/core. It wraps HuggingFace Transformers.js to enable local ML inference in the browser.

Provider API

All models are created via the transformers provider object. Each factory method returns a model implementing a @localmode/core interface.

Embeddings — Docs

import { embed, embedMany } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');

const { embedding } = await embed({ model: embeddingModel, value: 'Hello world' });
const { embeddings } = await embedMany({ model: embeddingModel, values: ['Hello', 'World'] });

Method	Interface	Description
`transformers.embedding(modelId)`	`EmbeddingModel`	Text embeddings

Recommended Models:

Xenova/all-MiniLM-L6-v2 - Fast, general-purpose (~22MB)
Xenova/paraphrase-multilingual-MiniLM-L12-v2 - 50+ languages

Multimodal Embeddings (CLIP/SigLIP) — Docs

Embed both text and images into the same vector space for cross-modal search.

import { embed, embedImage, cosineSimilarity } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

// Text embedding
const { embedding: textVec } = await embed({ model, value: 'a photo of a cat' });

// Image embedding (same vector space)
const { embedding: imgVec } = await embedImage({ model, image: catImageBlob });

// Cross-modal similarity
const similarity = cosineSimilarity(textVec, imgVec);

Method	Interface	Description
`transformers.multimodalEmbedding(modelId)`	`MultimodalEmbeddingModel`	Text + image embeddings

Recommended Models:

Xenova/clip-vit-base-patch32 - Fast, 512 dimensions
Xenova/clip-vit-base-patch16 - Better accuracy, 512 dimensions
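
For cross-modal search, the text vector is compared against every stored image vector and the best matches are kept. The sketch below shows that ranking step with a hand-written cosine function and toy 3-dimensional vectors. `cosine`, `searchImages`, and the sample data are illustrative names, not part of @localmode/core, and real CLIP vectors would have 512 dimensions.

```typescript
// Illustrative only: a hand-rolled cosine similarity and ranking helper.
// The vectors below are tiny stand-ins, not real CLIP embeddings.
type Vector = ArrayLike<number>;

function cosine(a: Vector, b: Vector): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as "no similarity"
}

interface IndexedImage {
  id: string;
  vector: Vector;
}

interface SearchHit {
  id: string;
  score: number;
}

// Score every image against the query vector and keep the topK best matches.
function searchImages(query: Vector, images: IndexedImage[], topK: number): SearchHit[] {
  return images
    .map((img) => ({ id: img.id, score: cosine(query, img.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const imageIndex: IndexedImage[] = [
  { id: 'cat.jpg', vector: [0.9, 0.1, 0.0] },
  { id: 'dog.jpg', vector: [0.1, 0.9, 0.0] },
  { id: 'car.jpg', vector: [0.0, 0.1, 0.9] },
];
const textQuery: Vector = [1, 0, 0]; // stand-in for the embedding of a text query

const hits = searchImages(textQuery, imageIndex, 2);
console.log(hits.map((h) => h.id));
```

In an application, the stored vectors would come from embedImage and the query vector from embed, both using the same multimodal model.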

Reranking — Docs

import { rerank } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');

const { results } = await rerank({
  model: rerankerModel,
  query: 'What is machine learning?',
  documents: ['ML is a subset of AI...', 'Python is a language...'],
  topK: 5,
});

Method	Interface	Description
`transformers.reranker(modelId)`	`RerankerModel`	Document reranking

Classification & NLP — Docs

import { classify, extractEntities } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const sentiment = await classify({
  model: transformers.classifier('Xenova/distilbert-base-uncased-finetuned-sst-2-english'),
  text: 'I love this product!',
});

const entities = await extractEntities({
  model: transformers.ner('Xenova/bert-base-NER'),
  text: 'John works at Microsoft in Seattle',
});

Method	Interface	Description
`transformers.classifier(modelId)`	`ClassificationModel`	Text classification
`transformers.zeroShot(modelId)`	`ZeroShotClassificationModel`	Zero-shot text classification
`transformers.ner(modelId)`	`NERModel`	Named Entity Recognition

Translation, Summarization, Fill-Mask & Question Answering

Method	Interface	Description	Docs
`transformers.translator(modelId)`	`TranslationModel`	Text translation	Docs
`transformers.summarizer(modelId)`	`SummarizationModel`	Text summarization	Docs
`transformers.fillMask(modelId)`	`FillMaskModel`	Masked token prediction	Docs
`transformers.questionAnswering(modelId)`	`QuestionAnsweringModel`	Extractive QA	Docs

Audio

import { transcribe, synthesizeSpeech } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const transcription = await transcribe({
  model: transformers.speechToText('onnx-community/moonshine-tiny-ONNX'),
  audio: audioBlob,
  returnTimestamps: true,
});

const { audio, sampleRate } = await synthesizeSpeech({
  model: transformers.textToSpeech('onnx-community/Kokoro-82M-v1.0-ONNX'),
  text: 'Hello, how are you?',
  voice: 'af_heart', // 29 English voices (see KOKORO_VOICES)
  speed: 1.0,        // 0.5 – 2.0
});

Method	Interface	Description	Docs
`transformers.speechToText(modelId)`	`SpeechToTextModel`	Speech-to-text transcription	Docs
`transformers.textToSpeech(modelId)`	`TextToSpeechModel`	Text-to-speech synthesis	Docs
`transformers.audioClassifier(modelId)`	`AudioClassificationModel`	Audio classification
`transformers.zeroShotAudioClassifier(modelId)`	`ZeroShotAudioClassificationModel`	Zero-shot audio classification
`transformers.vad(modelId)`	`VADProvider`	Voice Activity Detection (Silero)

Voice Activity Detection (VAD)

Detect speech segments in real-time audio streams. Pair the VAD with createLiveTranscriber() for open-mic and push-to-talk transcription.

import { createLiveTranscriber } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const vad = transformers.vad('onnx-community/silero-vad');
const transcriber = await createLiveTranscriber({
  model: transformers.speechToText('onnx-community/moonshine-tiny-ONNX'),
  mode: 'open-mic',
  vad,
});

Method	Interface	Description
`transformers.vad(modelId)`	`VADProvider`	Voice Activity Detection (Silero VAD)

Recommended Models:

Model	Description
`onnx-community/silero-vad`	Silero VAD v5 — recommended browser VAD (~1.8MB, 512-sample frames at 16 kHz)

Options: threshold (speech probability, default 0.5), silenceTimeoutMs (end-of-utterance timeout, default 700).
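
To make the two options concrete, here is one way a threshold and a silence timeout could turn per-frame speech probabilities into utterances. This is an illustrative sketch, not the package's implementation. `segmentSpeech` and its types are hypothetical, and the 32 ms frame duration follows from 512 samples at 16 kHz.

```typescript
interface SpeechSegment {
  startMs: number;
  endMs: number;
}

interface SegmenterOptions {
  threshold: number;        // probability at or above which a frame counts as speech
  silenceTimeoutMs: number; // silence required before an utterance is closed
  frameMs: number;          // duration of one frame
}

// Defaults mirror the documented values; 512 samples at 16 kHz is 32 ms.
const DEFAULTS: SegmenterOptions = {
  threshold: 0.5,
  silenceTimeoutMs: 700,
  frameMs: (512 * 1000) / 16000,
};

function segmentSpeech(
  probabilities: number[],
  options: Partial<SegmenterOptions> = {},
): SpeechSegment[] {
  const { threshold, silenceTimeoutMs, frameMs } = { ...DEFAULTS, ...options };
  const segments: SpeechSegment[] = [];
  let startMs: number | null = null; // null while no utterance is open
  let lastSpeechEndMs = 0;

  for (let i = 0; i < probabilities.length; i++) {
    const frameEndMs = (i + 1) * frameMs;
    if (probabilities[i] >= threshold) {
      if (startMs === null) startMs = i * frameMs; // utterance begins
      lastSpeechEndMs = frameEndMs;
    } else if (startMs !== null && frameEndMs - lastSpeechEndMs >= silenceTimeoutMs) {
      // Enough trailing silence: close the utterance at the last speech frame.
      segments.push({ startMs, endMs: lastSpeechEndMs });
      startMs = null;
    }
  }
  // Close an utterance that is still open when the input ends.
  if (startMs !== null) segments.push({ startMs, endMs: lastSpeechEndMs });
  return segments;
}

const repeat = (value: number, count: number): number[] =>
  new Array<number>(count).fill(value);

// 160 ms of speech, 960 ms of silence, then 96 ms of speech.
const probs = [...repeat(0.9, 5), ...repeat(0.1, 30), ...repeat(0.9, 3)];
const utterances = segmentSpeech(probs);
console.log(utterances);
// [{ startMs: 0, endMs: 160 }, { startMs: 1120, endMs: 1216 }]
```

A lower silenceTimeoutMs splits speech at shorter pauses, while a higher threshold ignores quieter or noisier frames.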

Vision

import { classifyImage, captionImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const classification = await classifyImage({
  model: transformers.imageClassifier('Xenova/vit-base-patch16-224'),
  image: imageBlob,
});

const caption = await captionImage({
  model: transformers.captioner('onnx-community/Florence-2-base-ft'),
  image: imageBlob,
});

Method	Interface	Description	Docs
`transformers.imageClassifier(modelId)`	`ImageClassificationModel`	Image classification	Docs
`transformers.zeroShotImageClassifier(modelId)`	`ZeroShotImageClassificationModel`	Zero-shot image classification	Docs
`transformers.captioner(modelId)`	`ImageCaptionModel`	Image captioning	Docs
`transformers.segmenter(modelId)`	`SegmentationModel`	Image segmentation	Docs
`transformers.objectDetector(modelId)`	`ObjectDetectionModel`	Object detection	Docs
`transformers.imageFeatures(modelId)`	`ImageFeatureModel`	Image feature extraction	Docs
`transformers.imageToImage(modelId)`	`ImageToImageModel`	Image super resolution	Docs
`transformers.depthEstimator(modelId)`	`DepthEstimationModel`	Monocular depth estimation

OCR & Document QA

Method	Interface	Description	Docs
`transformers.ocr(modelId)`	`OCRModel`	OCR (TrOCR, GLM-OCR, LightOnOCR-2)	Docs
`transformers.documentQA(modelId)`	`DocumentQAModel`	Document/Table question answering	Docs

Text Generation — Docs

Run ONNX-format language models in the browser with WebGPU acceleration:

import { generateText, streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

// Single-shot generation
const { text } = await generateText({ model, prompt: 'What is 2+2?' });

// Streaming generation
const result = await streamText({ model, prompt: 'Write a haiku' });
let output = '';
for await (const chunk of result.stream) {
  output += chunk.text; // append each chunk as it arrives, e.g. to update the UI
}

Method	Interface	Description
`transformers.languageModel(modelId)`	`LanguageModel`	Text generation (ONNX, WebGPU/WASM)

Recommended ONNX LLMs (16 curated models):

Model	Size	Context	Vision
`onnx-community/granite-4.0-350m-ONNX-web`	~120MB	4K	No
`onnx-community/Qwen3-0.6B-ONNX`	~570MB	4K	No
`onnx-community/Qwen3.5-0.8B-ONNX`	~500MB	32K	Yes
`onnx-community/granite-4.0-1b-ONNX-web`	~350MB	4K	No
`onnx-community/Llama-3.2-1B-Instruct-ONNX`	~380MB	8K	No
`onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX`	~350MB	2K	No
`onnx-community/Qwen2.5-Coder-1.5B-Instruct`	~450MB	4K	No
`onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX`	~500MB	4K	No
`onnx-community/Llama-3.2-3B-Instruct-ONNX`	~900MB	8K	No
`onnx-community/Qwen3-4B-ONNX`	~1.2GB	4K	No
`microsoft/Phi-3-mini-4k-instruct-onnx-web`	~1.2GB	4K	No
`onnx-community/Qwen3.5-2B-ONNX`	~1.5GB	32K	Yes
`onnx-community/gemma-4-E2B-it-ONNX`	~1.5GB	128K	Yes
`onnx-community/Phi-4-mini-instruct-web-q4f16`	~2.3GB	4K	No
`onnx-community/Qwen3.5-4B-ONNX`	~2.5GB	32K	Yes
`onnx-community/gemma-4-E4B-it-ONNX`	~3GB	128K	Yes

Vision support: Qwen3.5, Qwen2.5-VL, Qwen3-VL, and Gemma 4 models support image input via their built-in vision encoder. Check model.supportsVision for feature detection. See Vision docs for usage.
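
Choosing from the table usually comes down to download budget, context length, and whether image input is needed. The sketch below encodes a few rows of the table as data and filters them. `CuratedModel`, `Requirements`, and `pickModel` are hypothetical helpers, the sizes are the approximate values listed above (1.5GB written as 1500MB), and "prefer the largest model that fits" is only a heuristic.

```typescript
interface CuratedModel {
  id: string;
  sizeMB: number;   // approximate download size from the table
  contextK: number; // context length in thousands of tokens
  vision: boolean;
}

// A subset of the curated table, copied by hand for illustration.
const CATALOG: CuratedModel[] = [
  { id: 'onnx-community/granite-4.0-350m-ONNX-web', sizeMB: 120, contextK: 4, vision: false },
  { id: 'onnx-community/Qwen3.5-0.8B-ONNX', sizeMB: 500, contextK: 32, vision: true },
  { id: 'onnx-community/Llama-3.2-1B-Instruct-ONNX', sizeMB: 380, contextK: 8, vision: false },
  { id: 'onnx-community/Qwen3.5-2B-ONNX', sizeMB: 1500, contextK: 32, vision: true },
  { id: 'onnx-community/gemma-4-E2B-it-ONNX', sizeMB: 1500, contextK: 128, vision: true },
];

interface Requirements {
  maxSizeMB: number;
  needsVision?: boolean;
  minContextK?: number;
}

function pickModel(catalog: CuratedModel[], req: Requirements): CuratedModel | undefined {
  const fits = catalog.filter(
    (m) =>
      m.sizeMB <= req.maxSizeMB &&
      (!req.needsVision || m.vision) &&
      m.contextK >= (req.minContextK ?? 0),
  );
  // Heuristic: among models that fit, prefer the largest download.
  return [...fits].sort((a, b) => b.sizeMB - a.sizeMB)[0];
}

const choice = pickModel(CATALOG, { maxSizeMB: 1000, needsVision: true });
console.log(choice?.id); // onnx-community/Qwen3.5-0.8B-ONNX
```

The selected id can then be passed to transformers.languageModel(), with model.supportsVision checked at runtime before sending image input.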

Model Utilities

import { preloadModel, isModelCached, getModelStorageUsage } from '@localmode/transformers';

const cached = await isModelCached('Xenova/bge-small-en-v1.5');

await preloadModel('Xenova/bge-small-en-v1.5', {
  onProgress: (p) => console.log(`${p.progress}% loaded`),
});

const usage = await getModelStorageUsage();

Recommended Models

Embeddings

Model	Description
`Xenova/bge-small-en-v1.5`	Fast, general-purpose (~22MB, 384d)
`Xenova/paraphrase-multilingual-MiniLM-L12-v2`	50+ languages (~120MB, 384d)
`Xenova/all-mpnet-base-v2`	Higher quality (~420MB, 768d)
`Snowflake/snowflake-arctic-embed-xs`	Tiny retrieval embeddings (~23MB, 384d)

Reranking

Model	Description
`Xenova/ms-marco-MiniLM-L-6-v2`	Fast, small (~23MB, recommended)

Text Classification

Model	Description
`Xenova/distilbert-base-uncased-finetuned-sst-2-english`	Sentiment analysis
`Xenova/twitter-roberta-base-sentiment-latest`	Twitter sentiment

Zero-Shot Classification

Model	Description
`Xenova/mobilebert-uncased-mnli`	Fast, mobile-friendly (~21MB)
`Xenova/nli-deberta-v3-xsmall`	Mid-tier accuracy (~90MB)

Named Entity Recognition

Model	Description
`Xenova/bert-base-NER`	Standard NER (PER, ORG, LOC, MISC)

Translation

Model	Description
`Xenova/opus-mt-en-de`	English to German
`Xenova/opus-mt-en-fr`	English to French
`Xenova/opus-mt-en-es`	English to Spanish

Summarization

Model	Description
`Xenova/distilbart-cnn-6-6`	Best quality browser summarizer (~284MB)

Fill-Mask

Model	Description
`onnx-community/ModernBERT-base-ONNX`	General purpose (mask: `[MASK]`)

Question Answering

Model	Description
`Xenova/distilbert-base-cased-distilled-squad`	SQuAD trained (~65MB)

Speech-to-Text

Model	Description
`onnx-community/moonshine-tiny-ONNX`	Fast, edge-optimized (~50MB)
`onnx-community/moonshine-base-ONNX`	Best quality/size ratio (~237MB)

Text-to-Speech

Model	Description
`onnx-community/Kokoro-82M-v1.0-ONNX`	Natural speech, 29 English voices (~86MB)

Image Classification

Model	Description
`Xenova/vit-base-patch16-224`	General image classification
`Xenova/siglip-base-patch16-224`	Zero-shot image classification (~400MB)

Image Captioning

Model	Description
`onnx-community/Florence-2-base-ft`	High-quality captions (~223MB)

Image Segmentation

Model	Description
`Xenova/segformer-b0-finetuned-ade-512-512`	Semantic segmentation (ADE20K)

Object Detection

Model	Description
`onnx-community/dfine_n_coco-ONNX`	State-of-the-art, tiny (~4.5MB)
`Xenova/detr-resnet-50`	Classic transformer-based detection

Image Features

Model	Description
`Xenova/siglip-base-patch16-224`	Image embeddings (768d)
`onnx-community/dinov2-base-ONNX`	Self-supervised features

Image Super Resolution

Model	Description
`Xenova/swin2SR-lightweight-x2-64`	2x upscale, fast
`Xenova/swin2SR-classical-sr-x4-64`	4x upscale

OCR

Model	Description
`Xenova/trocr-small-printed`	Printed text, line-level (~120MB)
`Xenova/trocr-small-handwritten`	Handwritten text, line-level (~120MB)
`onnx-community/GLM-OCR-ONNX`	Document-level OCR with table/formula recognition (~652MB)
`onnx-community/LightOnOCR-2-1B-ONNX`	Fast document OCR, 11 languages (~700MB)

Document QA

Model	Description
`onnx-community/Florence-2-base-ft`	Document QA (~223MB)
`Xenova/donut-base-finetuned-docvqa`	Donut (~218MB)

Model Constants

All recommended models are exported as constants for easy reference:

import {
  transformers,
  MODELS,                      // All models organized by task
  EMBEDDING_MODELS,
  CLASSIFICATION_MODELS,
  ZERO_SHOT_MODELS,
  NER_MODELS,
  RERANKER_MODELS,
  SPEECH_TO_TEXT_MODELS,
  TEXT_TO_SPEECH_MODELS,
  IMAGE_CLASSIFICATION_MODELS,
  ZERO_SHOT_IMAGE_MODELS,
  IMAGE_CAPTION_MODELS,
  TRANSLATION_MODELS,
  SUMMARIZATION_MODELS,
  FILL_MASK_MODELS,
  QUESTION_ANSWERING_MODELS,
  OBJECT_DETECTION_MODELS,
  SEGMENTATION_MODELS,
  OCR_MODELS,
  DOCUMENT_QA_MODELS,
  IMAGE_TO_IMAGE_MODELS,
  IMAGE_FEATURE_MODELS,
  VAD_MODELS,
  TRANSFORMERS_LLM_MODELS,
  MULTIMODAL_EMBEDDING_MODELS,
  KOKORO_LANG_MAP,
} from '@localmode/transformers';

// Use with provider
const model = transformers.embedding(EMBEDDING_MODELS.BGE_SMALL_EN);

Kokoro Voice Catalog

The KOKORO_VOICES export provides a catalog of 29 English voices with metadata for UI display:

import { KOKORO_VOICES, KOKORO_DEFAULT_VOICE } from '@localmode/transformers';
import type { KokoroVoice } from '@localmode/transformers';

// Each voice has: id, name, language, languageLabel, gender
const english = KOKORO_VOICES.filter((v) => v.language === 'en-US');
const females = KOKORO_VOICES.filter((v) => v.gender === 'female');

console.log(KOKORO_DEFAULT_VOICE); // 'af_heart'

Languages: American English, British English.
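
For a voice picker, the metadata fields can be used to group voices by language label. The sketch below uses a local stand-in type and three sample entries rather than the real catalog. `VoiceEntry` and `groupByLanguage` are hypothetical, and the display names and the 'en-GB' language code are assumptions made for illustration.

```typescript
// Local stand-in for the voice shape described above:
// id, name, language, languageLabel, gender.
interface VoiceEntry {
  id: string;
  name: string;
  language: string;
  languageLabel: string;
  gender: string;
}

// Sample entries for illustration only; the real catalog lives in KOKORO_VOICES.
const sampleVoices: VoiceEntry[] = [
  { id: 'af_heart', name: 'Heart', language: 'en-US', languageLabel: 'American English', gender: 'female' },
  { id: 'am_adam', name: 'Adam', language: 'en-US', languageLabel: 'American English', gender: 'male' },
  { id: 'bf_emma', name: 'Emma', language: 'en-GB', languageLabel: 'British English', gender: 'female' },
];

// Group voices under their human-readable language label, preserving input order.
function groupByLanguage(voices: VoiceEntry[]): Record<string, VoiceEntry[]> {
  const groups: Record<string, VoiceEntry[]> = {};
  for (const voice of voices) {
    if (groups[voice.languageLabel] === undefined) groups[voice.languageLabel] = [];
    groups[voice.languageLabel].push(voice);
  }
  return groups;
}

const grouped = groupByLanguage(sampleVoices);
console.log(Object.keys(grouped)); // ['American English', 'British English']
```

Each group maps naturally onto an option group in a select element, with the voice id passed as the voice option to synthesizeSpeech.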

Advanced Usage

Custom Model Options

const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  quantized: true, // Use quantized model (smaller, faster)
  device: 'webgpu', // Use WebGPU for acceleration (falls back to WASM)
});
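
If the app needs to know the backend up front (for example, to pick a smaller model on WASM-only browsers), WebGPU support can be feature-detected before creating the model. A minimal sketch, assuming the standard `navigator.gpu` entry point; `pickDevice` and `NavigatorLike` are hypothetical helpers, not exports of this package.

```typescript
type Device = 'webgpu' | 'wasm';

// Minimal structural type so the sketch compiles without WebGPU typings.
interface NavigatorLike {
  gpu?: unknown;
}

// Returns 'webgpu' when the WebGPU entry point is exposed, otherwise 'wasm'.
// Note: navigator.gpu being present does not guarantee a usable adapter;
// navigator.gpu.requestAdapter() can still resolve to null at runtime.
function pickDevice(
  nav: NavigatorLike | undefined = (globalThis as unknown as { navigator?: NavigatorLike })
    .navigator,
): Device {
  return nav !== undefined && nav.gpu !== undefined ? 'webgpu' : 'wasm';
}

const device = pickDevice();
console.log(`Selected device: ${device}`);
```

The result can be passed as the device option shown above, or used to choose between a larger and a smaller model id.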

Language Model Options

Language models accept additional settings via LanguageModelSettings:

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  contextLength: 32768,
  maxTokens: 1024,
  temperature: 0.7,
  device: 'webgpu',
  // dtype accepts a string or a per-component config object
  dtype: 'q4f16',
  // For multimodal models, use per-component dtype:
  // dtype: { embed_tokens: 'q4', vision_encoder: 'q4', decoder_model_merged: 'q4' },
});
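
Because dtype can be either a single string or a per-component object, it may help to build the value from one place. The sketch below is one hedged way to do that. `DtypeConfig` and `resolveDtype` are hypothetical local names (the package's own typings may differ), and the component names are the three shown in the comment above.

```typescript
// Hypothetical local types mirroring the two shapes shown above.
type Dtype = string;
type DtypeConfig = Dtype | Record<string, Dtype>;

// Component names taken from the per-component example above.
const MULTIMODAL_COMPONENTS = ['embed_tokens', 'vision_encoder', 'decoder_model_merged'] as const;

// Text-only models get a single dtype string;
// multimodal models get one entry per component.
function resolveDtype(
  supportsVision: boolean,
  textDtype: Dtype = 'q4f16',
  componentDtype: Dtype = 'q4',
): DtypeConfig {
  if (!supportsVision) return textDtype;
  const perComponent: Record<string, Dtype> = {};
  for (const name of MULTIMODAL_COMPONENTS) {
    perComponent[name] = componentDtype;
  }
  return perComponent;
}

console.log(resolveDtype(false)); // 'q4f16'
console.log(resolveDtype(true));
// { embed_tokens: 'q4', vision_encoder: 'q4', decoder_model_merged: 'q4' }
```

The returned value would be passed as the dtype setting when calling transformers.languageModel().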

Provider Options

Pass provider-specific options to core functions:

const { embedding } = await embed({
  model: transformers.embedding('Xenova/bge-small-en-v1.5'),
  value: 'Hello world',
  providerOptions: {
    transformers: {
      // Any Transformers.js specific options
    },
  },
});

Preloading Models

For better UX, preload models before use:

import { transformers, preloadModel, isModelCached } from '@localmode/transformers';
import { embed } from '@localmode/core';

if (!(await isModelCached('Xenova/bge-small-en-v1.5'))) {
  await preloadModel('Xenova/bge-small-en-v1.5', {
    onProgress: (p) => console.log(`Loading: ${p.progress}%`),
  });
}

// Subsequent calls are instant (loaded from cache)
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
const { embedding } = await embed({ model: embeddingModel, value: 'Hello' });

Exported Implementation Classes

For advanced use cases, implementation classes are available:

import {
  TransformersEmbeddingModel,
  TransformersClassificationModel,
  TransformersZeroShotModel,
  TransformersNERModel,
  TransformersRerankerModel,
  TransformersSpeechToTextModel,
  TransformersImageClassificationModel,
  TransformersZeroShotImageModel,
  TransformersCaptionModel,
  TransformersCLIPEmbeddingModel,
  TransformersLanguageModel,
  TransformersGenerativeOCRModel,
  isGenerativeOCRModel,
  TransformersSileroVAD,
  createSileroVAD,
} from '@localmode/transformers';

Browser Compatibility

Browser	WebGPU	WASM	Notes
Chrome 113+	✅	✅	Best performance with WebGPU
Edge 113+	✅	✅	Same as Chrome
Firefox	❌	✅	WASM only
Safari 26+	✅	✅	WebGPU available
iOS Safari	✅	✅	WebGPU available (iOS 26+)

Performance Tips

Use quantized models - Smaller and faster with minimal quality loss
Preload models - Load during app init for instant inference
Use WebGPU when available - 3-5x faster than WASM
Batch operations - Process multiple inputs together
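
For the batching tip, a large input list can be split into fixed-size batches before embedding, which keeps memory use bounded while still processing several inputs per call. A minimal sketch; `chunk` is a hypothetical helper and the batch size of 2 is arbitrary.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError(`Batch size must be a positive integer, got ${size}`);
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const docs = ['doc a', 'doc b', 'doc c', 'doc d', 'doc e'];
const batches = chunk(docs, 2);
console.log(batches);
// [['doc a', 'doc b'], ['doc c', 'doc d'], ['doc e']]
```

Each batch can then be passed as values to embedMany, one call per batch.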

Acknowledgments

This package is built on Transformers.js by HuggingFace — state-of-the-art ML models running in the browser via ONNX Runtime.

License

MIT