multilingual-tokenizer

1.0.0

A Node.js library for tokenizing text in Thai, English, Japanese, and Korean using regex

Package Exports

  • multilingual-tokenizer
  • multilingual-tokenizer/multilingual-tokenizer.js

This package does not declare an "exports" field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (multilingual-tokenizer) asking it to add an "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

Multilingual Tokenizer

A Node.js library for tokenizing text in multiple languages (Thai, English, Japanese, and Korean) using regex-based approaches.

Features

  • Support for tokenizing text in:
    • English
    • Thai
    • Japanese
    • Korean
  • Automatic language detection
  • Token classification (word, number, punctuation, etc.)
  • Normalization options
  • Whitespace preservation options
  • Simple, lightweight implementation using regular expressions

Installation

npm install multilingual-tokenizer

Usage

const {
  MultilingualTokenizer,
  TOKEN_TYPES,
} = require("multilingual-tokenizer");

// Create a new tokenizer instance
const tokenizer = new MultilingualTokenizer({
  preserveWhitespace: true, // Keep whitespace tokens
  normalizeText: true, // Apply Unicode normalization
});

// Tokenize English text
const englishText = "Hello, world!";
const englishTokens = tokenizer.tokenize(englishText);
console.log(englishTokens);

// Tokenize Thai text
const thaiText = "สวัสดีครับ";
const thaiTokens = tokenizer.tokenize(thaiText);
console.log(thaiTokens);

// Tokenize Japanese text
const japaneseText = "こんにちは、世界!";
const japaneseTokens = tokenizer.tokenize(japaneseText);
console.log(japaneseTokens);

// Tokenize Korean text
const koreanText = "안녕하세요, 세계!";
const koreanTokens = tokenizer.tokenize(koreanText);
console.log(koreanTokens);

// Force language selection
const forcedTokens = tokenizer.tokenize(englishText, "japanese");

// Extract only word tokens
const words = tokenizer.extractWords(englishTokens);

Token Structure

Each token is represented as an object with two properties:

{
  type: 'WORD',  // One of the values from TOKEN_TYPES
  value: 'Hello' // The actual token text
}

The available token types are:

  • WORD - Words and word-like constructs
  • NUMBER - Numeric values
  • SPACE - Whitespace (spaces, tabs, newlines)
  • PUNCTUATION - Punctuation marks
  • SYMBOL - Symbols (#, $, %, etc.)
  • OTHER - Unclassified characters
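
For example, content words and numbers can be pulled out of mixed text by filtering on the token type. The exact token boundaries depend on the library's regex rules, so the output in the comment is illustrative; the sketch assumes TOKEN_TYPES exposes the names listed above (e.g. TOKEN_TYPES.WORD === 'WORD').

const {
  MultilingualTokenizer,
  TOKEN_TYPES,
} = require("multilingual-tokenizer");

const tokenizer = new MultilingualTokenizer();
const tokens = tokenizer.tokenize("Room 42, please!");

// Keep only word and number tokens, dropping punctuation and symbols
const content = tokens.filter(
  (t) => t.type === TOKEN_TYPES.WORD || t.type === TOKEN_TYPES.NUMBER
);
console.log(content.map((t) => t.value)); // e.g. [ 'Room', '42', 'please' ]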

API Reference

Constructor

const tokenizer = new MultilingualTokenizer(options);

Options:

  • preserveWhitespace (default: false): Whether to include whitespace tokens in the output
  • normalizeText (default: true): Whether to apply Unicode normalization before tokenization
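
As a sketch of how the options affect the output (token boundaries are determined by the regex rules, so the exact tokens are illustrative): with the default settings whitespace is dropped, while preserveWhitespace keeps SPACE tokens in the result.

const {
  MultilingualTokenizer,
  TOKEN_TYPES,
} = require("multilingual-tokenizer");

// Default settings: whitespace tokens are dropped
const compact = new MultilingualTokenizer();

// Keep SPACE tokens so the original spacing survives tokenization
const verbose = new MultilingualTokenizer({ preserveWhitespace: true });

const text = "Hello world";
console.log(compact.tokenize(text).some((t) => t.type === TOKEN_TYPES.SPACE)); // false
console.log(verbose.tokenize(text).some((t) => t.type === TOKEN_TYPES.SPACE)); // true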

Methods

tokenize(text, language = null)

Tokenizes the input text. If language is not provided, it will be automatically detected.

  • text (string): The text to tokenize
  • language (string, optional): Force a specific language tokenizer ('english', 'thai', 'japanese', 'korean')
  • Returns: Array of token objects

detectLanguage(text)

Detects the dominant language in the text.

  • text (string): The text to analyze
  • Returns: String with the detected language name ('english', 'thai', 'japanese', 'korean')
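
For unambiguous single-script input, the detected value should match the script of the text. A minimal sketch using the documented return values:

const { MultilingualTokenizer } = require("multilingual-tokenizer");
const tokenizer = new MultilingualTokenizer();

console.log(tokenizer.detectLanguage("Hello, world!")); // 'english'
console.log(tokenizer.detectLanguage("สวัสดีครับ")); // 'thai'
console.log(tokenizer.detectLanguage("こんにちは、世界!")); // 'japanese'
console.log(tokenizer.detectLanguage("안녕하세요, 세계!")); // 'korean'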

extractWords(tokens)

Extracts only the word tokens from an array of tokens.

  • tokens (array): Array of token objects
  • Returns: Array of strings (word values)

detokenize(tokens)

Converts tokens back to text.

  • tokens (array): Array of token objects
  • Returns: String of reconstructed text
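
A round trip through tokenize and detokenize can only reproduce the original spacing when whitespace tokens are kept, so the sketch below enables preserveWhitespace. With the default normalizeText, the result is the Unicode-normalized form of the input, which is identical for plain ASCII.

const { MultilingualTokenizer } = require("multilingual-tokenizer");

const tokenizer = new MultilingualTokenizer({ preserveWhitespace: true });
const text = "Hello, world!";
const tokens = tokenizer.tokenize(text);

// Joining the token values back together should reconstruct the input
console.log(tokenizer.detokenize(tokens)); // 'Hello, world!'
console.log(tokenizer.detokenize(tokens) === text); // true (for ASCII input)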

Important Notes

This library uses regex-based tokenization, which is a simplified approach. For production use in applications requiring high accuracy in specific languages:

  • Thai: Consider using dictionary-based approaches (e.g., thai-tokenizer)
  • Japanese: Consider using morphological analyzers (e.g., kuromoji)
  • Korean: Consider using more sophisticated tokenizers (e.g., node-mecab-ya)

This library is intended for basic tokenization needs or cases where a lightweight solution is required.

License

MIT