JSPM

  • ESM via JSPM
  • ES Module Entrypoint
  • Export Map
  • Keywords
  • License
  • Repository URL
  • TypeScript Types
  • README
  • Created
  • Published
  • Downloads 336
  • Score
    100M100P100Q93913F
  • License MIT

Browser-native audio transcription powered by WebGPU Whisper — zero server, fully local.

Package Exports

  • browser-whisper

Readme

browser-whisper

npm version license bundle size

Browser-native audio transcription powered by WebCodecs + WebGPU. No server. No API keys.

browser-whisper runs OpenAI's Whisper model entirely in the browser. It uses WebCodecs to decode audio from any file format with hardware acceleration, and WebGPU to run ONNX inference — falling back to WASM automatically when either is unavailable.

Live Demo → · Vite example → · Next.js example →


Features

  • WebGPU inference — runs ONNX Whisper models on the GPU via @huggingface/transformers; falls back to WASM automatically
  • WebCodecs audio decoding — hardware-accelerated decode of any audio/video format via mediabunny; falls back to AudioContext on older browsers
  • Concurrent pipeline — model loading and audio decoding run in parallel across two Web Workers
  • Zero-copy PCM transfer — audio frames move from the decoder worker to the inference worker via MessageChannel with ArrayBuffer transfer, no copying
  • Streaming API — results are yielded as an async iterator, segment by segment
  • Model caching — weights are cached in the browser Cache API after the first download
  • TypeScript-first — full type definitions included

How it works

File
 │
 ▼
[Decoder Worker]
  mediabunny (demux) → WebCodecs AudioDecoder → mono 16 kHz PCM → 30 s chunks
         │                        (fallback: AudioContext.decodeAudioData)
         │  MessageChannel (zero-copy ArrayBuffer transfer)
         ▼
[Whisper Worker]
  @huggingface/transformers → ONNX Runtime → WebGPU  ──► TranscriptSegments
                                                (fallback: WASM)
         │
         ▼
    Main Thread (async iterator / callbacks)

Both workers are started concurrently: the Whisper worker begins downloading and compiling the model while the decoder worker demuxes and decodes the audio file. Chunks queued before the model is ready are buffered and processed in order.


Install

npm install browser-whisper
# or
bun add browser-whisper

Peer dependencies: none. @huggingface/transformers and mediabunny are bundled into the library's worker blobs at build time and resolved automatically at runtime.


Setup

The ONNX inference engine uses SharedArrayBuffer for threading, which requires two HTTP headers on every page that loads the library:

Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin

Vite

// vite.config.ts
export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Embedder-Policy': 'require-corp',
      'Cross-Origin-Opener-Policy': 'same-origin',
    },
  },
  preview: {
    headers: {
      'Cross-Origin-Embedder-Policy': 'require-corp',
      'Cross-Origin-Opener-Policy': 'same-origin',
    },
  },
})

Next.js

See NEXTJS.md for the full guide, including SSR-safe dynamic imports and header configuration.


Quick start

import { BrowserWhisper } from 'browser-whisper'

const whisper = new BrowserWhisper()

// file from <input type="file"> or drag-and-drop
const file = event.target.files[0]

// Stream segments as they arrive
for await (const segment of whisper.transcribe(file)) {
  console.log(`[${segment.start.toFixed(1)}s] ${segment.text}`)
}

Collect all segments at once

const segments = await whisper.transcribe(file).collect()
console.log(segments.map(s => s.text).join(' '))

With callbacks and options

const whisper = new BrowserWhisper({
  model: 'whisper-small',
  language: 'en',
})

whisper.transcribe(file, {
  onSegment: (seg) => appendToUI(seg),
  onProgress: (evt) => {
    console.log(evt.stage)    // 'loading' | 'decoding' | 'transcribing'
    console.log(evt.progress) // 0 – 1
  },
})

API

new BrowserWhisper(options?)

Creates a reusable transcriber instance. The loaded model is cached in the worker between calls.

Option Type Default Description
model WhisperModel 'whisper-base' Which Whisper model to use
language string auto-detect BCP-47 language code, e.g. 'en', 'fr', 'ja'
quantization QuantizationType 'hybrid' Model precision

whisper.transcribe(file, options?)

Returns a TranscribeStream. Options passed here override constructor options for this call only.

Option Type Description
model WhisperModel Override model for this file
language string Override language for this file
quantization QuantizationType Override quantization for this file
onSegment (seg: TranscriptSegment) => void Called for each transcribed segment
onProgress (evt: TranscribeProgress) => void Called with stage and 0–1 progress

TranscribeStream

Returned by whisper.transcribe(). Implements the async iterator protocol and has one helper:

const segments = await stream.collect() // resolves with TranscriptSegment[]

WhisperModel

Sizes below are for the default 'hybrid' quantization (encoder fp32 + decoder q4).

Value Download size Notes
'whisper-tiny' ~64 MB Fastest
'whisper-base' ~136 MB Default
'whisper-small' ~510 MB Better accuracy
'whisper-large' ~3 GB whisper-large-v3-turbo; best accuracy

Other quantizations will differ. Models are downloaded from Hugging Face Hub (onnx-community namespace) and cached in the browser after the first run.

QuantizationType

Value Description
'hybrid' Encoder fp32 + decoder q4 — default, best speed/accuracy balance
'fp32' Full precision
'fp16' Half precision
'q8' 8-bit quantized
'q4' 4-bit quantized

TranscriptSegment

interface TranscriptSegment {
  text: string
  start: number  // seconds from start of file
  end: number    // seconds from start of file
}

TranscribeProgress

interface TranscribeProgress {
  stage: 'loading' | 'decoding' | 'transcribing'
  progress: number  // 0 – 1
}

Errors

All errors extend BrowserWhisperError.

Class When thrown
WebCodecsNotSupportedError AudioDecoder is unavailable and the AudioContext fallback also fails
CodecNotSupportedError The file's audio codec is not decodable in this browser
NoAudioTrackError The file has no audio track
ModelLoadError The Whisper model failed to download or initialise
DecoderError The WebCodecs AudioDecoder emitted a fatal error

Browser support

WebGPU and WebCodecs are the primary paths. Both have automatic fallbacks so the library works on a broader range of browsers.

Browser WebGPU inference WASM inference fallback
Chrome 113+ 94+
Firefox 141+ 130+
Safari 18+ 16.4+
Browser WebCodecs decoding AudioContext fallback
Chrome 94+ all
Firefox 130+ all
Safari 16.4+ all

The library detects both features at runtime and falls back silently — no configuration needed.

Network required on first run: WASM binaries (~1 MB) are loaded from jsDelivr CDN, and model weights (64 MB – 3 GB depending on model) are streamed from Hugging Face Hub. Both are cached in the browser after the first run; subsequent calls work offline.


Contributing

Contributions are welcome. Please open an issue before submitting a large PR so we can discuss the approach.

# Install dependencies
bun install

# Start dev server (runs the demo app)
bun run dev:site

# Type-check
bun run typecheck

# Build the library
bun run build

The library is built with Vite. Workers are bundled as self-contained inline blobs using the ?worker&inline query — see vite.config.ts for details.


License

MIT — Tanpreet Singh Jolly