Package Exports
- multilingual-tokenizer
- multilingual-tokenizer/multilingual-tokenizer.js
This package does not declare an "exports" field, so the exports above have been automatically detected and optimized by JSPM. If a package subpath is missing, the recommended fix is to open an issue on the original package (multilingual-tokenizer) requesting support for the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
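For reference, a minimal "exports" field in package.json covering the detected subpaths might look like the following sketch (illustrative only; it assumes multilingual-tokenizer.js is the package's main file, which is not confirmed by the published package):

{
  "exports": {
    ".": "./multilingual-tokenizer.js",
    "./multilingual-tokenizer.js": "./multilingual-tokenizer.js"
  }
}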
Readme
Multilingual Tokenizer
A Node.js library for tokenizing text in multiple languages (Thai, English, Japanese, and Korean) using regex-based approaches.
Features
- Support for tokenizing text in:
- English
- Thai
- Japanese
- Korean
- Automatic language detection
- Token classification (word, number, punctuation, etc.)
- Normalization options
- Whitespace preservation options
- Simple, lightweight implementation using regular expressions
Installation
npm install multilingual-tokenizer

Usage
const {
  MultilingualTokenizer,
  TOKEN_TYPES,
} = require("multilingual-tokenizer");

// Create a new tokenizer instance
const tokenizer = new MultilingualTokenizer({
  preserveWhitespace: true, // Keep whitespace tokens
  normalizeText: true, // Apply Unicode normalization
});
// Tokenize English text
const englishText = "Hello, world!";
const englishTokens = tokenizer.tokenize(englishText);
console.log(englishTokens);
// Tokenize Thai text
const thaiText = "สวัสดีครับ";
const thaiTokens = tokenizer.tokenize(thaiText);
console.log(thaiTokens);
// Tokenize Japanese text
const japaneseText = "こんにちは、世界!";
const japaneseTokens = tokenizer.tokenize(japaneseText);
console.log(japaneseTokens);
// Tokenize Korean text
const koreanText = "안녕하세요, 세계!";
const koreanTokens = tokenizer.tokenize(koreanText);
console.log(koreanTokens);
// Force language selection
const forcedTokens = tokenizer.tokenize(englishText, "japanese");
// Extract only word tokens
const words = tokenizer.extractWords(englishTokens);

Token Structure
Each token is represented as an object with two properties:
{
  type: 'WORD',  // One of the values from TOKEN_TYPES
  value: 'Hello' // The actual token text
}

The available token types are:
- WORD - Words and word-like constructs
- NUMBER - Numeric values
- SPACE - Whitespace (spaces, tabs, newlines)
- PUNCTUATION - Punctuation marks
- SYMBOL - Symbols (#, $, %, etc.)
- OTHER - Unclassified characters
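For illustration, tokenizing a short mixed string with whitespace preserved might produce output like this (hypothetical result; the exact classification depends on the library's regexes):

const { MultilingualTokenizer } = require("multilingual-tokenizer");
const tokenizer = new MultilingualTokenizer({ preserveWhitespace: true });

// Hypothetical output for a mixed string (assumed, not verified against the library)
console.log(tokenizer.tokenize("Price: $5!"));
// [
//   { type: 'WORD', value: 'Price' },
//   { type: 'PUNCTUATION', value: ':' },
//   { type: 'SPACE', value: ' ' },
//   { type: 'SYMBOL', value: '$' },
//   { type: 'NUMBER', value: '5' },
//   { type: 'PUNCTUATION', value: '!' }
// ]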
API Reference
Constructor
const tokenizer = new MultilingualTokenizer(options);

Options:
- preserveWhitespace (default: false): Whether to include whitespace tokens in the output
- normalizeText (default: true): Whether to apply Unicode normalization before tokenization
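A quick sketch of the effect of preserveWhitespace, based on the descriptions above (expected behavior, not verified output):

const { MultilingualTokenizer } = require("multilingual-tokenizer");

// Default options: whitespace tokens are dropped from the output
const compact = new MultilingualTokenizer();
console.log(compact.tokenize("Hello world").length); // expected: 2 (two WORD tokens)

// With preserveWhitespace: true, SPACE tokens are kept
const verbose = new MultilingualTokenizer({ preserveWhitespace: true });
console.log(verbose.tokenize("Hello world").length); // expected: 3 (WORD, SPACE, WORD)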
Methods
tokenize(text, language = null)
Tokenizes the input text. If language is not provided, it will be automatically detected.
- text (string): The text to tokenize
- language (string, optional): Force a specific language tokenizer ('english', 'thai', 'japanese', 'korean')
- Returns: Array of token objects
detectLanguage(text)
Detects the dominant language in the text.
- text (string): The text to analyze
- Returns: String with the language code ('english', 'thai', 'japanese', 'korean')
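For example (expected results, given the documented language codes):

const { MultilingualTokenizer } = require("multilingual-tokenizer");
const tokenizer = new MultilingualTokenizer();

console.log(tokenizer.detectLanguage("Hello, world!")); // expected: 'english'
console.log(tokenizer.detectLanguage("สวัสดีครับ")); // expected: 'thai'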
extractWords(tokens)
Extracts only the word tokens from an array of tokens.
- tokens (array): Array of token objects
- Returns: Array of strings (word values)
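For example, combining tokenize and extractWords (expected output, assuming the comma and exclamation mark are classified as PUNCTUATION rather than WORD):

const { MultilingualTokenizer } = require("multilingual-tokenizer");
const tokenizer = new MultilingualTokenizer();

const tokens = tokenizer.tokenize("Hello, world!");
console.log(tokenizer.extractWords(tokens)); // expected: ['Hello', 'world']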
detokenize(tokens)
Converts tokens back to text.
- tokens (array): Array of token objects
- Returns: String of reconstructed text
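A round-trip sketch (this assumes preserveWhitespace: true; with whitespace tokens dropped, the original spacing presumably cannot be fully reconstructed):

const { MultilingualTokenizer } = require("multilingual-tokenizer");
const tokenizer = new MultilingualTokenizer({ preserveWhitespace: true });

const tokens = tokenizer.tokenize("Hello, world!");
console.log(tokenizer.detokenize(tokens)); // expected: 'Hello, world!'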
Important Notes
This library uses regex-based tokenization, which is a simplified approach. For production applications that require high accuracy in a specific language, consider the alternatives below:
- Thai: Consider using dictionary-based approaches (e.g., thai-tokenizer)
- Japanese: Consider using morphological analyzers (e.g., kuromoji)
- Korean: Consider using more sophisticated tokenizers (e.g., node-mecab-ya)
This library is intended for basic tokenization needs or cases where a lightweight solution is required.
License
MIT