Package Exports
- multilingual-tokenizer
- multilingual-tokenizer/multilingual-tokenizer.js
This package does not declare an "exports" field, so the exports above have been automatically detected and optimized by JSPM. If a package subpath is missing, the recommended fix is to open an issue on the original package (multilingual-tokenizer) requesting support for the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
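For reference, a minimal "exports" field in package.json covering the detected subpaths might look like the following sketch (illustrative only; it assumes multilingual-tokenizer.js is the package's main file, which is not confirmed by the published package):

{
  "exports": {
    ".": "./multilingual-tokenizer.js",
    "./multilingual-tokenizer.js": "./multilingual-tokenizer.js"
  }
}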
Readme
Multilingual Tokenizer
A Node.js library for tokenizing text in multiple languages (Thai, English, Japanese, and Korean) using regex-based approaches.
Features
- Support for tokenizing text in:
- English
- Thai
- Japanese
- Korean
- Automatic language detection
- Token classification (word, number, punctuation, etc.)
- Normalization options
- Whitespace preservation options
- Simple, lightweight implementation using regular expressions
Installation
npm install multilingual-tokenizer

Usage
const {
  MultilingualTokenizer,
  TOKEN_TYPES,
} = require("multilingual-tokenizer");

// Create a new tokenizer instance
const tokenizer = new MultilingualTokenizer({
  preserveWhitespace: true, // Keep whitespace tokens
  normalizeText: true, // Apply Unicode normalization
});
// Tokenize English text
const englishText = "Hello, world!";
const englishTokens = tokenizer.tokenize(englishText);
console.log(englishTokens);
// Tokenize Thai text
const thaiText = "สวัสดีครับ";
const thaiTokens = tokenizer.tokenize(thaiText);
console.log(thaiTokens);
// Tokenize Japanese text
const japaneseText = "こんにちは、世界!";
const japaneseTokens = tokenizer.tokenize(japaneseText);
console.log(japaneseTokens);
// Tokenize Korean text
const koreanText = "안녕하세요, 세계!";
const koreanTokens = tokenizer.tokenize(koreanText);
console.log(koreanTokens);
// Force language selection
const forcedTokens = tokenizer.tokenize(englishText, "japanese");
// Extract only word tokens
const words = tokenizer.extractWords(englishTokens);

Token Structure
Each token is represented as an object with two properties:
{
  type: 'WORD',  // One of the values from TOKEN_TYPES
  value: 'Hello' // The actual token text
}

The available token types are:
- WORD - Words and word-like constructs
- NUMBER - Numeric values
- SPACE - Whitespace (spaces, tabs, newlines)
- PUNCTUATION - Punctuation marks
- SYMBOL - Symbols (#, $, %, etc.)
- OTHER - Unclassified characters
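For illustration, tokenizing a short mixed string with whitespace preserved might produce output like this (hypothetical result; the exact classification depends on the library's regexes):

const { MultilingualTokenizer } = require("multilingual-tokenizer");
const tokenizer = new MultilingualTokenizer({ preserveWhitespace: true });

// Hypothetical output for a mixed string (assumed, not verified against the library)
console.log(tokenizer.tokenize("Price: $5!"));
// [
//   { type: 'WORD', value: 'Price' },
//   { type: 'PUNCTUATION', value: ':' },
//   { type: 'SPACE', value: ' ' },
//   { type: 'SYMBOL', value: '$' },
//   { type: 'NUMBER', value: '5' },
//   { type: 'PUNCTUATION', value: '!' }
// ]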
API Reference
Constructor
const tokenizer = new MultilingualTokenizer(options);

Options:
- preserveWhitespace (default: false): Whether to include whitespace tokens in the output
- normalizeText (default: true): Whether to apply Unicode normalization before tokenization
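A quick sketch of the effect of preserveWhitespace, based on the descriptions above (expected behavior, not verified output):

const { MultilingualTokenizer } = require("multilingual-tokenizer");

// Default options: whitespace tokens are dropped from the output
const compact = new MultilingualTokenizer();
console.log(compact.tokenize("Hello world").length); // expected: 2 (two WORD tokens)

// With preserveWhitespace: true, SPACE tokens are kept
const verbose = new MultilingualTokenizer({ preserveWhitespace: true });
console.log(verbose.tokenize("Hello world").length); // expected: 3 (WORD, SPACE, WORD)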
Methods
tokenize(text, language = null)
Tokenizes the input text. If language is not provided, it will be automatically detected.
- text (string): The text to tokenize
- language (string, optional): Force a specific language tokenizer ('english', 'thai', 'japanese', 'korean')
- Returns: Array of token objects
detectLanguage(text)
Detects the dominant language in the text.
- text (string): The text to analyze
- Returns: String with the language code ('english', 'thai', 'japanese', 'korean')
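For example (expected results, given the documented language codes):

const { MultilingualTokenizer } = require("multilingual-tokenizer");
const tokenizer = new MultilingualTokenizer();

console.log(tokenizer.detectLanguage("Hello, world!")); // expected: 'english'
console.log(tokenizer.detectLanguage("สวัสดีครับ")); // expected: 'thai'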
extractWords(tokens)
Extracts only the word tokens from an array of tokens.
- tokens (array): Array of token objects
- Returns: Array of strings (word values)
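For example, combining tokenize and extractWords (expected output, assuming the comma and exclamation mark are classified as PUNCTUATION rather than WORD):

const { MultilingualTokenizer } = require("multilingual-tokenizer");
const tokenizer = new MultilingualTokenizer();

const tokens = tokenizer.tokenize("Hello, world!");
console.log(tokenizer.extractWords(tokens)); // expected: ['Hello', 'world']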
detokenize(tokens)
Converts tokens back to text.
- tokens (array): Array of token objects
- Returns: String of reconstructed text
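A round-trip sketch (this assumes preserveWhitespace: true; with whitespace tokens dropped, the original spacing presumably cannot be fully reconstructed):

const { MultilingualTokenizer } = require("multilingual-tokenizer");
const tokenizer = new MultilingualTokenizer({ preserveWhitespace: true });

const tokens = tokenizer.tokenize("Hello, world!");
console.log(tokenizer.detokenize(tokens)); // expected: 'Hello, world!'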
Important Notes
This library uses regex-based tokenization, which is a simplified approach. For production applications that require high accuracy in a specific language, consider the alternatives below:
- Thai: Consider using dictionary-based approaches (e.g., thai-tokenizer)
- Japanese: Consider using morphological analyzers (e.g., kuromoji)
- Korean: Consider using more sophisticated tokenizers (e.g., node-mecab-ya)
This library is intended for basic tokenization needs or cases where a lightweight solution is required.
License
MIT