Package Exports

romanize-string

Readme

romanize-string

See the Changelog for details on recent updates.

Introduction
About
Installation
Usage
- Registering Plugins
- Language Codes
TypeScript Support
Modular Imports
- Script-Based Transliteration Functions
- Type Guards
Dependencies and Attribution
Technical Notes

Introduction

Romanize-string is a library for transliterating strings unidirectionally from non-Latin to Latin script. It unifies 10 different transliteration and parsing libraries—expanding upon some of them significantly in order to increase coverage—to create a single utility that can generate basic transliterations for 30 written languages.

Supported languages include Arabic, Belarusian*, Bulgarian*, Bengali, Cantonese, Chinese (Traditional and Simplified), Persian*(Farsi), Greek, Gujarati, Hindi, Japanese, Kazakh*, Kannada, Korean, Kyrgyz*, Macedonian*, Mongolian*, Marathi, Nepali, Punjabi, Russian, Sanskrit, Serbian*, Tamil, Telugu, Tajik*, Thai, Ukrainian, and Urdu*.

* Support for these languages is limited, as it was implemented without native fluency in those languages. They exist as custom extensions of the capabilities of the libraries arabic-transliterate and cyrillic-to-translit-js. Contributions from community members with deeper knowledge of these languages are welcome. For information on the implementation of these expansions, see the Technical Notes section.

About

I created this library in the process of working on a closed-source project. I was in need of a utility that could handle transliterating media titles from multiple languages into Latin script so that their romanized forms could be used for display and for creating searchable slugs. Unfortunately, not only did no such library exist (at least not that covered all the languages I had to work with), but some of the languages had no direct transliteration libraries at all. I found that transliterating some languages required me to construct multi-step processes drawing on multiple libraries, while others (Farsi and Urdu, in particular) required a significant amount of custom code in order to produce something usable. Here I've condensed all of that into a single, unidirectional transliteration engine.

Installation

npm install romanize-string

Requires Node.js 16+

Supports both ESM and CommonJS

Additional Installation for Enabling Thai Transliteration

Because no suitable JavaScript library exists for Thai transliteration, this library relies on the external Python project PyThaiNLP. You can enable Thai in one of two ways:

Install the thai-engine plugin (preferred): @romanize-string/thai-engine
Install Python libraries directly in your runtime environment (advanced)

If neither is set up, Thai romanization will fail and the function will return the original (untransliterated) string.

Installing the thai-engine plugin (preferred)

Install the plugin:

npm install @romanize-string/thai-engine

NOTE: This plugin requires Node.js 18+ (the core romanize-string library supports Node.js 16+). The plugin downloads a platform-specific helper binary (~50 MB) during install; no Python is required when using the plugin.

Installing Python libraries directly (advanced)

Ensure Python 3 is installed:
- Download Python if needed
Install the required libraries:

pip install pythainlp onnxruntime numpy

If you're unsure which Python installation you're using:

python3 -m pip install pythainlp onnxruntime numpy

Usage

The romanizeString utility is capable of transliterating a string written in any of the supported languages. It cannot transliterate from multiple languages at once. For scripts without native capitalization (all except Cyrillic and Greek), the output romanized strings will be lowercase.

Because one of the underlying libraries is asynchronous, you must await calls to romanizeString.

Example:

// Using ESM
import romanizeString from "romanize-string"

const translit = await romanizeString("নমস্তে, আপনি কেমন আছেন?", "bn", false) // namaste, āpani kemana āchena?

// Using CommonJS
const {default: romanizeString} = require("romanize-string")

const translit = await romanizeString("নমস্তে, আপনি কেমন আছেন?", "bn", false) // namaste, āpani kemana āchena?

Arguments:

input - A string in a supported script/language.

languageCode - A supported language code of type ConvertibleLanguage

omitDiacritics (optional) - A boolean indicating whether to omit diacritics from the output by controlling the transliteration scheme (defaults to false)

Returns:

A promise resolving to a string in Latin script

NOTE: The parameter omitDiacritics only applies to Mandarin, Greek, Cyrillic, and Indic languages. (For Mandarin, diacritics are used to indicate tones.) When transliterating from a language other than these, passing a value for omitDiacritics in your function call has no effect

Registering Plugins

Plugins can be registered once at app startup using either the global romanizeString.register(...) or the .register(...) method on any exported transliterator (e.g., romanizeThai.register(...), romanizeArabic.register(...)). Simply pass the default export of an official plugin directly to one of these methods. After registration, the plugin is available to romanizeString and to all script-specific transliterator functions. You do not need to register a plugin more than once.

// ESM
import romanizeString, { romanizeThai } from "romanize-string";
import thaiEngine from "@romanize-string/thai-engine";

// Register once at startup (pick ONE)
romanizeString.register(thaiEngine);
// or:
romanizeThai.register(thaiEngine);

// CommonJS
const { default: romanizeString, romanizeThai } = require("romanize-string");
const thaiEngine = require("@romanize-string/thai-engine");

// Register once at startup (pick ONE)
romanizeString.register(thaiEngine);
// or:
romanizeThai.register(thaiEngine);

Language Codes

Arabic Script

Code	Language
ar	Arabic
fa	Persian (Farsi)
ur	Urdu

Cyrillic Script

Code	Language
be	Belarusian
bg	Bulgarian
kk	Kazakh
ky	Kyrgyz
mk	Macedonian
mn	Mongolian
ru	Russian
sr	Serbian
tg	Tajik
uk	Ukrainian

Devanagari / Other Indic Scripts

Code	Language
bn	Bengali
gu	Gujarati
hi	Hindi
kn	Kannada
mr	Marathi
ne	Nepali
pa	Punjabi
sa	Sanskrit
ta	Tamil
te	Telugu

Greek Script

Code	Language
el	Greek

East and Southeast Asian Scripts

Code	Language
ja	Japanese
ko	Korean
th	Thai ¹
yue	Cantonese
zh-CN	Chinese (Simplified)
zh-Hant	Chinese (Traditional)

¹ Thai transliteration requires either the @romanize-string/thai-engine plugin (preferred) or Python + PyThaiNLP + ONNX Runtime + NumPy in your runtime. See Additional Installation for Enabling Thai Transliteration.

Examples

const translitFromJapanese = await romanizeString("ありがとう", "ja"); // "arigatō"
const translitFromRussian = await romanizeString("Привет", "ru");     // "privet"
const translitFromBengali = await romanizeString("বাংলা", "bn");       // "vāṃlā"
const translitFromBengaliAscii = await romanizeString("বাংলা", "bn", true);       // "vaamlaa"

This library also supports modular imports. For usage of each individual function, see Modular Imports.

TypeScript Support

The romanize-string library is fully typed and includes type exports for user-supplied arguments.

import {
    ConvertibleLanguage,
    CyrillicLanguageCode,
    IndicLanguageCode
} from "romanize-string"

Modular Imports

In addition to the default romanizeString function, this library also supports named imports for individual transliteration functions and type guard utilities. These can be imported individually to reduce bundle size or to access specialized functionality.

Method	Description	Args	Returns
`romanizeArabic()`	Transliterate Arabic script to Latin script	`input`	string
`romanizeCantonese()`	Transliterate Hanzi script to Latin script with Cantonese pronunciation	`input`	string
`romanizeCyrillic()`	Transliterate Cyrillic script to Latin script	`input`, `language`, `omitDiacritics?`	string
`romanizeGreek()`	Transliterate an Greek script to Latin script	`input`, `omitDiacritics?`	string
`romanizeIndic()`	Transliterate an Indic script to Latin script	`input`, `language`, `omitDiacritics?`	string
`romanizeJapanese()`	Transliterate Kanji, Hiragana, or Katakana script to Latin script	`input`	Promise<String>
`romanizeKorean()`	Transliterate Hangul script to Latin script	`input`	string
`romanizeMandarin()`	Transliterate Hanzi script to Latin script using Mandarin pronunciation	`input`, `omitTones?`	string
`romanizeThai()`	Transliterate Thai script to Latin script	`input`	string
`isConvertibleLanguage()`	Check whether language code is included in the `ConvertibleLanguage` type	`languageCode`	boolean
`isCyrillicLanguageCode()`	Check whether language code is included in the `CyrillicLanguageCode` type	`languageCode`	boolean
`isIndicLanguageCode()`	Check whether language code is included in the `IndicLanguageCode` type	`languageCode`	boolean

Import only what you need:

// ESM

import {romanizeArabic, isConvertibleLanguage} from "romanize-string"

// CommonJS

const {romanizeArabic, isConvertibleLanguage} = require("romanize-string")

Script-Based Transliteration Functions

These functions handle transliteration for specific script families.

Note
All script-based transliteration functions (e.g. romanizeThai) include an optional .register() method for adding plugins.
See Registering Plugins for details.

`romanizeArabic()`

Transliterates Arabic script.

Supported Languages: ar, fa, ur

const translit = romanizeArabic("مرحبا، كيف حالك؟") // maraḥabā, a kayafa ḥāl-k?

Arguments:

input - A string in Arabic script

Returns:

A string in Latin script

`romanizeCantonese()`

Transliterates Hanzi using Cantonese pronunciation.

Supported Language: yue

const translit = romanizeCantonese(你好，今日點呀) // lee ho, gam yat dim ah?

Arguments:

input - A string in Hanzi script

Returns:

A string in Latin script

`romanizeCyrillic()`

Transliterates Cyrillic.

Supported Languages: be, bg, kk, ky, mk, mn, ru, sr, tg, uk

const translit = romanizeCyrillic("Салам, кандайсың?", "ky") // Salam, kandaisyñ?

Arguments:

input - A string in Cyrillic script

language - A language code of type CyrillicLanguageCode

omitDiacritics (optional) - A boolean indicating whether to exclude diacritics in the output (defaults to false)

Returns:

A string in Latin script

NOTE: When omitDiacritics is false, romanizeCyrillic follows the BGN/PCGN romanization system for the selected language.

`romanizeGreek()`

Transliterates Greek script.

Supported Languages: el

const translit = romanizeGreek("Γειά σου, τι κάνεις", false) // Yeiá sou, ti káneis
const translitNoDia = romanizeGreek("Γειά σου, τι κάνεις", true) // Yeia sou, ti kaneis

Arguments:

input - A string in Greek script

omitDiacritics (optional) - A boolean indicating whether to exclude diacritics in the output (defaults to false)

Returns:

A string in Latin script

`romanizeIndic()`

Transliterates Devanagari and other Indic scripts.

Supported Languages: bn, gu, hi, kn, mr, ne, pa, sa, ta, te

const translit = romanizeIndic("नमस्ते, आप कैसे हैं?", "hi", false) // namaste, āpa kaise haiṃ?
const translitNoDia = romanizeIndic("नमस्ते, आप कैसे हैं?", "hi", true) // namaste, aapa kaise haim?

Arguments:

input - A string in an Indic script

omitDiacritics (optional) - A boolean indicating whether to exclude diacritics in the output (defaults to false)

Returns:

A string in Latin script

`romanizeJapanese()`

Transliterates Kanji, Hiragana, or Katakana. Tolerates a mix of these scripts within a single input.

Supported Language: ja

const translit = await romanizeJapanese("こんにちは、お元気ですか？") // konnichiwa, o genkidesu ka?
const translitMixed = await romanizeJapanese("今日のディナーはカレーです。") // kyō no dinā wa karē desu.

Arguments:

input - A string in any of the Japanese scripts

Returns:

A promise resolving to a string in Latin script

NOTE: The supporting library responsible for Japanese transliteration ( Kuroshiro ) operates asynchronously. All calls to romanizeJapanese must therefore be awaited.

`romanizeKorean()`

Transliterates Hangul script.

Supported Language: ko

const translit = romanizeKorean("안녕하세요, 잘 지내세요?") // annyeonghaseyo, jal jinaeseyo?

Arguments:

input - A string in Hangul script

Returns:

A string in Latin script

`romanizeMandarin()`

Transliterates both Traditional and Simplified Hanzi using Mandarin pronunciation.

Supported Languages: zh-CN, zh-Hant

const translitTrad = romanizeMandarin("你好，最近好嗎？", false) // nǐ hǎo, zuì jìn hǎo má？
const translitTradNoDia = romanizeMandarin("你好，最近好嗎？", true) // ni hao, zui jin hao ma？
const translitSimplified = romanizeMandarin("你好，最近好吗？", false) // nǐ hǎo, zuì jìn hǎo ma?

Arguments:

input - A string in Hanzi script, simplified or traditional

omitTones (optional) - A boolean indicating whether to exclude diacritics that indicate tones from the output (defaults to false)

Returns:

A string in Latin script

`romanizeThai()`

Transliterates Thai script.

Supported Language: th

const translit = romanizeThai("สวัสดีครับ/ค่ะ สบายดีไหม?") // sawatdi khrap/kha sabaidi haimai?

Arguments:

input - A string in Thai script

Returns:

A string in Latin script

NOTE: Thai transliteration requires either the @romanize-string/thai-engine plugin (preferred) or Python + PyThaiNLP + ONNX Runtime + NumPy in your runtime. See Additional Installation for Enabling Thai Transliteration.

You can enable the @romanize-string/thai-engine plugin by calling either romanizeThai.register(thaiEngine) or romanizeString.register(thaiEngine). See Registering Plugins

Type Guards

These utilities help with validating language codes at runtime — useful for functions that require language code input.

`isConvertibleLanguage()`

Returns true if the given string is a supported language code from type ConvertibleLanguage.

isConvertibleLanguage("ja") // true

Arguments:

input - a language code

Returns:

A boolean indicating whether the given language code is of type ConvertibleLanguage

`isCyrillicLanguageCode()`

isCyrillicLanguageCode("ru") // true

Arguments:

input - a language code

Returns:

A boolean indicating whether the given language code is of type CyrillicLanguageCode (a subset of ConvertibleLanguageCode)

`isIndicLanguageCode()`

isIndicLanguageCode("hi") // true

Arguments:

input - a language code

Returns:

A boolean indicating whether the given language code is of type IndicLanguageCode (a subset of ConvertibleLanguageCode)

Dependencies and Attribution

This library draws on the capabilities of several existing libraries, many of which have been extended or combined to support broader functionality:

arabic-transliterate – used as the foundation for Arabic, Persian, and Urdu transliteration, with significant customizations, details of which are provided in the Technical Notes section.
@indic-transliteration/sanscript – provides base functionality for Devanagari and other Indic scripts.
kuroshiro – used for Japanese transliteration; includes async processing.
kuroshiro-analyzer-kuromoji – Japanese morphological analyzer required by Kuroshiro.
pinyin-pro – used for Mandarin transliteration from Simplified and Traditional Hanzi.
cantonese-romanisation – provides base mappings for Cantonese transliteration.
oktjs – used to tokenize and normalize Korean input before transliteration.
tnthai – used to segment Thai script into individual words before submitting them to the transliteration pipeline.
pythainlp – external Python library used for Thai transliteration. Note: This is not a direct JavaScript dependency. It must be installed manually (alongside Python 3) in the runtime environment for romanizeThai to function.

This project includes modified and vendored code from the following libraries:

cyrillic-to-translit-js by Aleksandr Filatov – MIT Licensed. Logic adapted and restructured to support additional Cyrillic languages. Not used as a dependency; see Technical Notes.
@romanize/korean by Kenneth Tang – MIT Licensed. Used for Hangul transliteration. Vendored and modified for structural compatibility. See src/vendor/romanize/korean/LICENSE.

Technical Notes

As of the time of this writing, the cyrillic-to-translit-js library only has presets for Russian, Mongolian, and Ukrainian. In order to expand upon the coverage it offered, its original code was integrated into this project with significant modifications. The support for reverse transliteration (Latin -> Cyrillic) was dropped, and new LLM-generated character maps were added for Belarusian, Bulgarian, Kazakh, Kyrgyz, Macedonian, Serbian, and Tajik.

Persian and Urdu posed a particular challenge, as the omission of short vowels in their written scripts makes straightforward character-mapping approaches insufficient for producing usable transliterations. This likely explains why no transliteration libraries currently support these languages. The imperfect approach taken in this library involves standardizing the Arabic script and then running it through the arabic-transliterate library. This standardization is done in three steps:

Common Persian and Urdu words are replaced with approximate LLM-generated phonetic forms (still in Arabic script), using lookup maps built from word frequency data for Persian and Urdu provided by Projekt Deutscher Wortschatz. The data is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Source: Wortschatz Corpora (https://wortschatz.uni-leipzig.de/en/download)
Remaining Persian- or Urdu-specific characters are replaced with their Arabic equivalents.
Short vowels are added to any remaining unvowelized words using a basic heuristic process.

JSPM

romanize-string

Package Exports

Readme

romanize-string

Table of Contents

Introduction

About

Installation

Additional Installation for Enabling Thai Transliteration

Installing the thai-engine plugin (preferred)

Installing Python libraries directly (advanced)

Usage

Registering Plugins

Language Codes

Arabic Script

Cyrillic Script

Devanagari / Other Indic Scripts

Greek Script

East and Southeast Asian Scripts

Examples

TypeScript Support

Modular Imports

Script-Based Transliteration Functions

`romanizeArabic()`

`romanizeCantonese()`

`romanizeCyrillic()`

`romanizeGreek()`

`romanizeIndic()`

`romanizeJapanese()`

`romanizeKorean()`

`romanizeMandarin()`

`romanizeThai()`

Type Guards

`isConvertibleLanguage()`

`isCyrillicLanguageCode()`

`isIndicLanguageCode()`

Dependencies and Attribution

Technical Notes