Phonemize

Fast phonemizer with rule-based G2P (Grapheme-to-Phoneme) prediction. Pure JavaScript implementation with no native dependencies.

Inspired by ttstokenizer

Features

⚡ Lightning fast - Pure rule-based processing, no ML overhead
🎯 Intelligent compound word support - Automatic decomposition of complex words
📚 Comprehensive dictionary - 125,000+ word pronunciations
🧠 Smart rule-based G2P - Advanced phonetic rules for unknown words
🌍 Multiple formats - IPA and ARPABET output
🌐 Multilingual support - Chinese, Japanese, Korean and more via anyAscii
💻 Pure JavaScript - No native dependencies, works everywhere
🔧 Simple API - Easy to integrate and use

Installation

npm install phonemize

Quick Start

import { phonemize, toIPA, toARPABET } from 'phonemize'

// Default IPA output
console.log(phonemize('Hello world!'))
// Output: həˈɫoʊ ˈwɝɫd!

// ARPABET format
console.log(toARPABET('Hello world!'))
// Output: HH AX L OW1 W ER1 L D!

Smart Word Processing

Compound Word Decomposition

Automatically detects and decomposes compound words:

phonemize('supercar')    // → ˈsupɝˈkɑɹ (super + car)
phonemize('playground')  // → ˈpɫeɪˌɡɹaʊn (play + ground)  
phonemize('superman')    // → ˈsupɝˌmæn (super + man)
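
The README does not describe how the split point is chosen. As a rough illustration only, the sketch below tries every split of a word into two dictionary entries and prefers the one closest to the middle. `splitCompound`, `DICT`, and `minPart` are hypothetical names, not part of the phonemize API, and the tiny dictionary is made up for the example.

```javascript
// Hypothetical mini-dictionary; the real package ships 125,000+ entries.
const DICT = new Map([
  ['super', 'ˈsupɝ'],
  ['car', 'ˈkɑɹ'],
  ['man', 'ˈmæn'],
  ['play', 'ˈpɫeɪ'],
  ['ground', 'ˈɡɹaʊnd'],
]);

// Split `word` into two dictionary words, preferring the split point
// closest to the middle. Returns null when no valid split exists.
function splitCompound(word, dict = DICT, minPart = 3) {
  const lower = word.toLowerCase();
  const mid = Math.floor(lower.length / 2);
  let best = null;
  for (let i = minPart; i <= lower.length - minPart; i++) {
    const head = lower.slice(0, i);
    const tail = lower.slice(i);
    if (dict.has(head) && dict.has(tail)) {
      const distance = Math.abs(i - mid);
      if (best === null || distance < best.distance) {
        best = { parts: [head, tail], distance };
      }
    }
  }
  return best ? best.parts : null;
}

console.log(splitCompound('supercar'));   // [ 'super', 'car' ]
console.log(splitCompound('playground')); // [ 'play', 'ground' ]
console.log(splitCompound('xyzzy'));      // null
```

The `minPart` guard keeps very short fragments such as "a" or "in" from producing spurious splits. Whether the package uses a similar threshold is not stated.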

Multi-Compound Words

Handles extremely long compound words intelligently:

phonemize('supercalifragilisticexpialidocious')
phonemize('antidisestablishmentarianism')
phonemize('pneumonoultramicroscopicsilicovolcanoconiosis')

Multilingual Support

Supports multiple languages through anyAscii transliteration:

// Chinese (direct processing with tone marks)
phonemize('你好世界')  // → ni˧˥ xɑʊ˨˩˦ ʂɻ̩˥˩ tɕiɛ˥˩
phonemize('北京')      // → peɪ˧˩˧ tɕiŋ˥˥

// Japanese (with anyAscii and rule-based processing)
phonemize('こんにちは', { anyAscii: true }) // → konnitɕiwa
phonemize('東京', { anyAscii: true })      // → tʊŋ˥˥ tɕiŋ˥˥

// Korean (with anyAscii and rule-based processing)
phonemize('안녕하세요', { anyAscii: true }) // → ʔannjʌŋhaseyo
phonemize('서울', { anyAscii: true })      // → ˈsoʊɫ

// Other languages fall back to English G2P after anyAscii
phonemize('Привет', { anyAscii: true })    // → ˈpɹaɪvɛt

Note: anyAscii transliteration yields only an approximation, so the result is likely not the correct pronunciation.
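
Choosing between these code paths requires knowing which script the input is written in. The sketch below shows one way to detect the script with Unicode property escapes. It is an assumption about how such a check could look, not the package's actual detection code, and `detectScript` and `SCRIPTS` are hypothetical names.

```javascript
// Ordered checks: kana is tested before Han so that mixed Japanese text
// (kanji plus kana) is classified as Japanese rather than Chinese.
const SCRIPTS = [
  ['kana', /[\p{Script=Hiragana}\p{Script=Katakana}]/u],
  ['hangul', /\p{Script=Hangul}/u],
  ['han', /\p{Script=Han}/u],
  ['cyrillic', /\p{Script=Cyrillic}/u],
  ['latin', /\p{Script=Latin}/u],
];

// Return the name of the first script found in `text`, or 'unknown'.
function detectScript(text) {
  for (const [name, pattern] of SCRIPTS) {
    if (pattern.test(text)) return name;
  }
  return 'unknown';
}

console.log(detectScript('こんにちは')); // kana
console.log(detectScript('東京'));       // han
console.log(detectScript('안녕하세요')); // hangul
console.log(detectScript('Привет'));     // cyrillic
```

Text written only in Han characters, such as 東京, cannot be told apart from Chinese by script alone, which is consistent with the Mandarin-style output shown for it above.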

API Reference

Basic Functions

`phonemize(text, options?)`

Convert text to phonemes.

phonemize('Hello world!')                    // IPA string
phonemize('Hello world!', { returnArray: true })  // IPA array

Options:

returnArray (boolean): Return array instead of string
format ('ipa' | 'arpabet'): Output format
stripStress (boolean): Remove stress markers
separator (string): Phoneme separator (default: ' ')
anyAscii (boolean): Enable multilingual support via anyAscii transliteration
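
The stripStress option is described only as removing stress markers. As a minimal sketch of what that could mean for each output format, assuming IPA stress is written with ˈ and ˌ and ARPABET stress with a trailing digit, the helpers below do the same thing on plain strings. `stripIpaStress` and `stripArpabetStress` are hypothetical names and are not exported by the package.

```javascript
// IPA primary (ˈ, U+02C8) and secondary (ˌ, U+02CC) stress marks.
const IPA_STRESS = /[\u02C8\u02CC]/g;

// Remove stress marks from an IPA string.
function stripIpaStress(ipa) {
  return ipa.replace(IPA_STRESS, '');
}

// Remove the trailing stress digit (0, 1, or 2) from ARPABET symbols.
function stripArpabetStress(arpabet) {
  return arpabet.replace(/([A-Z]+)[012]\b/g, '$1');
}

console.log(stripIpaStress('həˈɫoʊ ˈwɝɫd!'));              // həɫoʊ wɝɫd!
console.log(stripArpabetStress('HH AX L OW1 W ER1 L D!')); // HH AX L OW W ER L D!
```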

`toIPA(text, options?)`

Convert text to IPA phonemes.

toIPA('Hello world!')  // "həˈɫoʊ ˈwɝɫd!"

`toARPABET(text, options?)`

Convert text to ARPABET phonemes.

toARPABET('Hello world!')  // "HH AX L OW1 W ER1 L D!"

`toZhuyin(text, options?)`

Convert text to Zhuyin (Bopomofo / 注音) format.

This function is specifically designed for Chinese text. Non-Chinese text will be phonemized to IPA as a fallback.

Note: The output format is Zhuyin + tone number (e.g., ㄓㄨㄥ1 ㄨㄣ2), which is optimized for Kokoro.

import { toZhuyin } from 'phonemize';

toZhuyin('中文'); // "ㄓㄨㄥ1 ㄨㄣ2"
toZhuyin('你好世界'); // "ㄋㄧ3 ㄏㄠ3 ㄕ4 ㄐㄧㄝ4"
toZhuyin('中文 and English'); // "ㄓㄨㄥ1 ㄨㄣ2 ænd ˈɪŋɡlɪʃ"

Custom Pronunciations

import { addPronunciation } from 'phonemize'

// Add custom word pronunciation
addPronunciation('myword', 'ˈmaɪwərd') // Can be IPA or ARPABET
console.log(phonemize('myword'))  // "ˈmaɪwərd"

Advanced Tokenization

import { Tokenizer, createTokenizer } from 'phonemize'

// Create custom tokenizer
const tokenizer = createTokenizer({
  format: 'ipa',
  stripStress: true,
  separator: '-'
})

// Tokenize with detailed info
const tokens = tokenizer.tokenizeToTokens('Hello world!')
// [
//   { phoneme: "həɫoʊ", word: "Hello", position: 0 },
//   { phoneme: "wɝɫd", word: "world", position: 6 }
// ]

Text Processing Features

Number Expansion

Numbers are automatically converted to words:

phonemize('I have 123 apples')
// "ˈaɪ ˈhæv ˈwən ˈhəndɝd ˈtwɛni ˈθɹi ˈæpəɫz"
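
To make the expansion step concrete, here is a small sketch that spells out integers from 0 to 999 before phonemization. It is an illustration under stated assumptions, not the package's implementation: `numberToWords` and `expandNumbers` are hypothetical names, and larger numbers, decimals, and ordinals are left untouched.

```javascript
const ONES = [
  'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine',
  'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen',
  'seventeen', 'eighteen', 'nineteen',
];
const TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety'];

// Spell out an integer in the range 0..999.
function numberToWords(n) {
  if (!Number.isInteger(n) || n < 0 || n > 999) {
    throw new RangeError('this sketch only handles integers from 0 to 999');
  }
  if (n < 20) return ONES[n];
  if (n < 100) {
    const rest = n % 10;
    return TENS[Math.floor(n / 10)] + (rest ? ' ' + ONES[rest] : '');
  }
  const rest = n % 100;
  return ONES[Math.floor(n / 100)] + ' hundred' + (rest ? ' ' + numberToWords(rest) : '');
}

// Replace each run of digits with its spelled-out form.
// Numbers above 999 are left unchanged in this sketch.
function expandNumbers(text) {
  return text.replace(/\d+/g, (digits) => {
    const n = Number(digits);
    return n <= 999 ? numberToWords(n) : digits;
  });
}

console.log(expandNumbers('I have 123 apples'));
// I have one hundred twenty three apples
```

The spelled-out words can then be looked up in the dictionary like any other word, which matches the word-by-word IPA output shown above.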

Abbreviation Expansion

Common abbreviations are expanded:

phonemize('Dr. Smith and Mr. Johnson')
// "ˈdɑktɝ ˈsmɪθ ˈænd ˈmɪstɝ ˈdʒɑnsən"
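
A table-driven replacement is one simple way to get this behavior. The sketch below assumes a small abbreviation table and a trailing period as the trigger. The README does not show the package's actual list or matching rules, so `ABBREVIATIONS` and `expandAbbreviations` are hypothetical.

```javascript
// Hypothetical abbreviation table, keyed by the lowercase form without the period.
const ABBREVIATIONS = new Map([
  ['dr', 'doctor'],
  ['mr', 'mister'],
  ['mrs', 'missus'],
]);

// Replace "Dr.", "Mr.", and similar tokens with their expansions.
// Words followed by a period that are not in the table are left as they are.
function expandAbbreviations(text) {
  return text.replace(/\b([A-Za-z]+)\./g, (match, abbr) => {
    const expansion = ABBREVIATIONS.get(abbr.toLowerCase());
    return expansion === undefined ? match : expansion;
  });
}

console.log(expandAbbreviations('Dr. Smith and Mr. Johnson'));
// doctor Smith and mister Johnson
```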

Currency and Dates

Special handling for currency and dates:

phonemize('15 dollars in 2023')
// "ˈfɪfˈtin ˈdɑɫɝz ˈɪn ˈtwɛni ˈtwɛni ˈθɹi"

Performance

Dictionary lookup: O(1) - Instant for known words
Rule-based processing: Extremely fast, no model loading
Compound decomposition: Efficient balanced search algorithm
Memory efficient: Compressed JSON dictionaries only
Zero startup time: No model initialization required

Typical performance: more than 10,000 words/second on modern hardware.
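
Throughput depends on the machine and the input, so it is worth measuring locally. The helper below is a minimal timing sketch that works with any text-to-phoneme function. `measureWordsPerSecond` and `fakePhonemize` are hypothetical names, and the stand-in function only exists so the example runs without the package installed.

```javascript
// Measure throughput of any text -> phonemes function, in words per second.
function measureWordsPerSecond(fn, words, rounds = 5) {
  const text = words.join(' ');
  fn(text); // warm-up call so one-time initialization is not timed
  const start = performance.now();
  for (let i = 0; i < rounds; i++) {
    fn(text);
  }
  const seconds = (performance.now() - start) / 1000;
  // Guard against a zero elapsed time on very small inputs.
  return (words.length * rounds) / Math.max(seconds, 1e-9);
}

// Stand-in for phonemize(); pass the real function here to benchmark it.
const fakePhonemize = (text) => text.toLowerCase();

const sample = Array.from({ length: 10000 }, (_, i) => `word${i}`);
console.log(Math.round(measureWordsPerSecond(fakePhonemize, sample)), 'words/second');
```

For a realistic number, use text that mixes dictionary words with unknown words, since the two take different code paths.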

Processing Pipeline

Language Detection - Detect language before anyAscii conversion (if enabled)
anyAscii Transliteration - Convert non-Latin scripts to ASCII (if enabled)
Dictionary Lookup - Check for exact word match
Multilingual Processing - Handle Chinese, Japanese, Korean, etc.
Compound Detection - Intelligent decomposition of compound words
Multi-Compound Handling - Special processing for very long compounds
Rule-Based G2P - Apply phonetic rules for unknown words
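
The stages above behave like a fallback chain: each stage either produces a pronunciation or defers to the next one. The sketch below shows that control flow with three of the stages. It assumes nothing about the package's internals: `createPipeline`, the tiny dictionary, and the placeholder rule stage are all hypothetical.

```javascript
// Each stage returns a pronunciation string, or null to defer to the next stage.
function createPipeline(stages) {
  return function pronounce(word) {
    const key = word.toLowerCase();
    for (const [name, stage] of stages) {
      const result = stage(key);
      if (result != null) return { stage: name, result };
    }
    return { stage: 'none', result: null };
  };
}

// Hypothetical three-word dictionary for the example.
const dictionary = new Map([
  ['hello', 'həˈɫoʊ'],
  ['super', 'ˈsupɝ'],
  ['car', 'ˈkɑɹ'],
]);

const pronounce = createPipeline([
  ['dictionary', (w) => dictionary.get(w) ?? null],
  ['compound', (w) => {
    // Try every split into two dictionary words and join their pronunciations.
    for (let i = 1; i < w.length; i++) {
      const head = dictionary.get(w.slice(0, i));
      const tail = dictionary.get(w.slice(i));
      if (head && tail) return head + tail;
    }
    return null;
  }],
  // Placeholder for rule-based G2P: it tags the word instead of applying real rules.
  ['rules', (w) => `<${w}>`],
]);

console.log(pronounce('Hello').stage);    // dictionary
console.log(pronounce('supercar').stage); // compound
console.log(pronounce('blorb').stage);    // rules
```

Keeping the stages in an ordered list makes it easy to add a custom-pronunciation lookup ahead of the dictionary stage.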

Note: The rule-based G2P rules were generated by an LLM and may produce errors. For unknown words, the best practice is to add a custom pronunciation.

Supported Phoneme Sets

IPA (International Phonetic Alphabet)

Standard IPA symbols for English phonemes with stress marks.

ARPABET

CMU ARPABET phoneme set with stress numbers (0,1,2).
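
In the CMU convention, the digit attached to a vowel symbol encodes stress: 0 for none, 1 for primary, 2 for secondary. The sketch below parses an ARPABET string into symbols and stress levels, which can be handy when post-processing `toARPABET` output. `parseArpabet` is a hypothetical helper and not part of the package.

```javascript
const STRESS_NAMES = { 0: 'none', 1: 'primary', 2: 'secondary' };

// Parse a space-separated ARPABET string into { symbol, stress } objects.
// Consonants carry no digit, so their stress is null.
function parseArpabet(sequence) {
  return sequence.trim().split(/\s+/).map((token) => {
    const match = /^([A-Z]+)([012])?$/.exec(token);
    if (!match) throw new SyntaxError(`not an ARPABET symbol: ${token}`);
    return {
      symbol: match[1],
      stress: match[2] === undefined ? null : STRESS_NAMES[match[2]],
    };
  });
}

console.log(parseArpabet('HH AX L OW1'));
```

Punctuation attached to a symbol, as in `D!`, is rejected by this sketch, so strip it before parsing.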

Building from Source

# Install dependencies
yarn

# Compile TypeScript and dictionaries
yarn build

# Run tests
yarn test

License

MIT