Package Exports

romanize-string

Readme

romanize-string

Introduction

Romanize-string is a library for transliterating strings unidirectionally from non-Latin to Latin script. It unifies 10 different transliteration and parsing libraries—expanding upon some of them significantly in order to increase coverage—to create a single utility that can generate basic transliterations for 30 written languages.

Supported languages include Arabic, Belarusian*, Bulgarian*, Bengali, Cantonese, Chinese (Traditional and Simplified), Persian* (Farsi), Greek, Gujarati, Hindi, Japanese, Kazakh*, Kannada, Korean, Kyrgyz*, Macedonian*, Mongolian*, Marathi, Nepali, Punjabi, Russian, Sanskrit, Serbian*, Tamil, Telugu, Tajik*, Thai, Ukrainian, and Urdu*.

* Support for these languages is limited, as it was implemented without native fluency in those languages. They exist as custom extensions of the capabilities of the libraries arabic-transliterate and cyrillic-to-translit-js. Contributions from community members with deeper knowledge of these languages are welcome. For information on the implementation of these expansions, see the Technical Notes section.

About

I created this library in the process of working on a closed-source project. I was in need of a utility that could handle transliterating media titles from multiple languages into Latin script so that their romanized forms could be used for display and for creating searchable slugs. Unfortunately, not only did no such library exist (at least not that covered all the languages I had to work with), but some of the languages had no direct transliteration libraries at all. I found that transliterating some languages required me to construct multi-step processes drawing on multiple libraries, while others (Farsi and Urdu, in particular) required a significant amount of custom code in order to produce something usable. Here I've condensed all of that into a single, unidirectional transliteration engine.

Install

$ npm install romanize-string

Requires Node.js 16+ Supports both ESM and CommonJS

Usage

The romanizeString utility is capable of transliterating a string written in any of the supported languages. It cannot transliterate from multiple languages at once. For scripts without native capitalization (all except Cyrillic and Greek), the out romanized strings will be lowercase.

Because one of the underlying libraries is asynchronous, you must await calls to romanizeString.

Example:

import romanizeString from "romanize-string"

const output = await romanizeString("নমস্তে, আপনি কেমন আছেন?", "bn", false) // namaste, āpani kemana āchena?

Arguments:

input - A string in a supported script/language. languageCode - A supported language code of type ConvertibleLanguage omitDiacritics (optional) - A boolean indicating whether to omit diacritics from the output by controlling the transliteration scheme (defaults to false)

Returns:

A string in Latin script

NOTE: The parameter omitDiacritics only applies to Mandarin, Greek, and Indic languages. (For Mandarin, diacritics are used to indicate tones.) When transliterating from a language other than these, passing a value for omitDiacritics in your function call has no effect

Language Codes

Arabic Script

Code	Language
ar	Arabic
fa	Persian (Farsi)
ur	Urdu

Cyrillic Script

Code	Language
be	Belarusian
bg	Bulgarian
kk	Kazakh
ky	Kyrgyz
mk	Macedonian
mn	Mongolian
ru	Russian
sr	Serbian
tg	Tajik
uk	Ukrainian

Devanagari / Other Indic Scripts

Code	Language
bn	Bengali
gu	Gujarati
hi	Hindi
kn	Kannada
mr	Marathi
ne	Nepali
pa	Punjabi
sa	Sanskrit
ta	Tamil
te	Telugu

Greek Script

Code	Language
el	Greek

East and Southeast Asian Scripts

Code	Language
ja	Japanese
ko	Korean
th	Thai ¹
yue	Cantonese
zh-CN	Chinese (Simplified)
zh-Hant	Chinese (Traditional)

¹ Thai transliteration requires the presence of Python and the Python library pythainlp in the environment where the code is run. See the romanizeThai entry in Modular Imports for more details.

Examples

const translitFromJapanese = await romanizeString("ありがとう", "ja"); // "arigatō"
const translitFromRussian = await romanizeString("Привет", "ru");     // "privet"
const translitFromBengali = await romanizeString("বাংলা", "bn");       // "vāṃlā"
const translitFromBengaliAscii = await romanizeString("বাংলা", "bn", true);       // "vaamlaa"

This library also supports modular imports. For usage of each individual function, see Modular Imports.

TypeScript Support

The romanize-string library is fully typed and includes type exports for user-supplied arguments.

import {
    ConvertibleLanguage,
    CyrillicLanguageCode,
    IndicLanguageCode
} from "romanize-string"

Modular Imports

In addition to the default romanizeString function, this library also supports named imports for individual transliteration functions and type guard utilities. These can be imported directly to reduce bundle size or to access specialized functionality.

Method	Description	Args	Returns
`romanizeArabic()`	Transliterate Arabic script to Latin script	`input`	string
`romanizeCantonese()`	Transliterate Hanzi script to Latin script with Cantonese pronunciation	`input`	string
`romanizeCyrillic()`	Transliterate Cyrillic script to Latin script	`input`, `language`	string
`romanizeIndic()`	Transliterate an Indic script to Latin script	`input`, `language`, `omitDiacritics?`	string
`romanizeJapanese()`	Transliterate Kanji, Hiragana, or Katakana script to Latin script	`input`	Promise<String>
`romanizeKorean()`	Transliterate Hangul script to Latin script	`input`	string
`romanizeMandarin()`	Transliterate Hanzi script to Latin script using Mandarin pronunciation	`input`, `omitTones?`	string
`romanizeThai()`	Transliterate Thai script to Latin script	`input`	string
`isConvertibleLanguage()`	Check whether language code is included in the `ConvertibleLanguage` type	`languageCode`	boolean
`isCyrillicLanguageCode()`	Check whether language code is included in the `CyrillicLanguageCode` type	`languageCode`	boolean
`isIndicLanguageCode()`	Check whether language code is included in the `IndicLanguageCode` type	`languageCode`	boolean

Script-Based Transliteration Functions

These functions handle transliteration for specific script families.

import {
    romanizeArabic, 
    romanizeCantonese, 
    romanizeCyrillic, 
    romanizeIndic, 
    romanizeJapanese, 
    romanizeKorean, 
    romanizeMandarin, 
    romanizeThai
} from "romanize-string"

romanizeArabic

Transliterates from Arabic script.

Supported Languages: ar, fa, ur

const translit = romanizeArabic("مرحبا، كيف حالك؟") // maraḥabā,a kayafa ḥāl-k?

Arguments:

input - A string in Arabic script

Returns:

A string in Latin script

romanizeCantonese

Transliterates from Hanzi using Cantonese pronunciation.

Supported Language: yue

const translit = romanizeCantonese(你好，今日點呀) // lee ho, gam yat dim ah?

Arguments:

input - A string in Hanzi script

Returns:

A string in Latin script

romanizeCyrillic

Transliterates from Cyrillic.

Supported Languages: be, bg, kk, ky, mk, mn, ru, sr, tg, uk

const translit = romanizeCyrillic("Салам, кандайсың?" language: "ky") // Salam, kandaisyñ?

Arguments:

input - A string in Cyrillic script language - A language code of type CyrillicLanguageCode

Returns:

A string in Latin script

romanizeGreek

Transliterates from Greek script.

Supported Languages: el

const translit = romanizeGreek("Γειά σου, τι κάνεις", false) // Yeiá sou, ti káneis
const translitNoDia = romanizeGreek("Γειά σου, τι κάνεις", true) // Yeia sou, ti kaneis

Arguments:

input - string omitDiacritics (optional) - A boolean indicating whether to exclude diacritics in the output (defaults to false)

Returns:

A string in Latin script

romanizeIndic

Transliterates from Devanagari and other Indic scripts.

Supported Languages: bn, gu, hi, kn, mr, ne, pa, sa, ta, te

const translit = romanizeIndic("नमस्ते, आप कैसे हैं?", "hi", false) // namaste, āpa kaise haiṃ?
const translitNoDia = romanizeIndic("नमस्ते, आप कैसे हैं?", "hi", true) // namaste, aapa kaise haim?

Arguments:

input - string omitDiacritics (optional) - A boolean indicating whether to exclude diacritics in the output (defaults to false)

Returns:

A string in Latin script

romanizeJapanese

Transliterates from Kanji, Hiragana, or Katakana.

Supported Language: ja

const translit = await romanizeJapanese("こんにちは、お元気ですか？") // konnichiwa, o genkidesu ka?
const translitMixed = await romanizeJapanese("今日のディナーはカレーです。") // kyō no dinā wa karē desu.

Arguments:

input - string

Returns:

A promise resolving to a string in Latin script

NOTE: The supporting library responsible for Japanese transliteration ( Kuroshiro ) operates asynchronously. All calls to romanizeJapanese must therefore be awaited.

romanizeKorean

Transliterates from Hangul script.

Supported Language: ko

const translit = romanizeKorean("안녕하세요, 잘 지내세요?") // annyeonghaseyo, jal jinaeseyo?

Arguments:

input - string

Returns:

A string in Latin script

romanizeMandarin

Transliterates from both Traditional and Simplified Hanzi using Mandarin pronunciation.

Supported Languages: zh-CN, zh-Hant

const translitTrad = romanizeMandarin("你好，最近好嗎？", false) // nǐ hǎo, zuì jìn hǎo má ？
const translitTradNoDia = romanizeMandarin("你好，最近好嗎？", true) // ni hao, zui jin hao ma ？
const translitSimplified = romanizeMandarin("你好，最近好吗？", false) // nǐ hǎo, zuì jìn hǎo ma?

If not specified, omitTones defaults to "false". Arguments:

input - string omitTones (optional) - A boolean indicating whether to exclude diacritics that indicate tones from the output (defaults to false)

Returns:

A string in Latin script

romanizeThai

Transliterates from Thai script.

Supported Language: th

const translit = romanizeThai("สวัสดีครับ/ค่ะ สบายดีไหม?") // satti khnap/kha spaiti mai?

Arguments:

input - string

Returns:

A string in Latin script

NOTE: romanizeThai uses an external Python library ( pythainlp ) for the transliteration, since no suitable JavaScript library currently exists. As such, the function will only work if the environment in which it is run has both Python 3 and pythainlp installed. Attempts to use this function without one or both of them installed will return an untransliterated string and generate console errors explaining the problem.

To use romanizeThai, Python 3 and the pythainlp library must be available in your environment.

Make sure Python 3 is installed:
- Download Python if needed
Install the required library:
```
pip install pythainlp
```

If you're unsure which Python installation you're using:

python3 -m pip install pythainlp

Type Guards

These utilities help with validating language codes at runtime — useful for functions that require language code input.

import {
    isConvertibleLanguage,
    isCyrillicLanguageCode,
    isIndicLanguageCode
} from "romanize-string"

isConvertibleLanguage

Returns true if the given string is a supported language code from type ConvertibleLanguage.

isConvertibleLanguage("ja") // true

Arguments:

input - a language code

Returns:

A boolean indicating whether the given language code is of type ConvertibleLanguage

isCyrillicLanguageCode

isCyrillicLanguageCode("ru") // true

Arguments:

input - a language code

Returns:

A boolean indicating whether the given language code is of type CyrillicLanguageCode (a subset of ConvertibleLanguageCode)

isIndicLanguageCode

isIndicLanguageCode("hi") // true

Arguments:

input - a language code

Returns:

A boolean indicating whether the given language code is of type IndicLanguageCode (a subset of ConvertibleLanguageCode)

Dependencies and Attribution

This library draws on the capabilities of several existing libraries, many of which have been extended or combined to support broader functionality:

arabic-transliterate – used as the foundation for Arabic, Persian, and Urdu transliteration, with significant customizations, details of which are provided in the Technical Notes section.
@indic-transliteration/sanscript – provides base functionality for Devanagari and other Indic scripts.
kuroshiro – used for Japanese transliteration; includes async processing.
kuroshiro-analyzer-kuromoji – Japanese morphological analyzer required by Kuroshiro.
pinyin-pro – used for Mandarin transliteration from Simplified and Traditional Hanzi.
cantonese-romanisation – provides base mappings for Cantonese transliteration.
oktjs – used to tokenize and normalize Korean input before transliteration.
tnthai – used to segment Thai script into individual words before submitting them to the transliteration pipeline.
pythainlp – external Python library used for Thai transliteration. Note: This is not a direct JavaScript dependency. It must be installed manually (alongside Python 3) in the runtime environment for romanizeThai to function.

This project includes modified and vendored code from the following libraries:

cyrillic-to-translit-js by Aleksandr Filatov = MIT Licensed. Logic adapted and restructured to support additional Cyrillic languages. Not used as a dependency; see Technical Notes.
@romanize/korean by Kenneth Tang – MIT Licensed. Used for Hangul transliteration. Vendored and modified for structural compatibility. See src/vendor/romanize/korean/LICENSE.

Technical Notes

As of the time of this writing, the cyrillic-to-translit-js library only has presets for Russian, Mongolian, and Ukrainian. In order to expand upon the coverage it offered, its original code was integrated into this project with significant modifications. The support for reverse transliteration (Latin -> Cyrillic) was dropped, and new LLM-generated character maps were added for Belarusian, Bulgarian, Kazakh, Kyrgyz, Macedonian, Serbian, and Tajik.

Persian and Urdu posed a particular challenge, as the omission of short vowels in their written scripts makes straightforward character-mapping approaches insufficient for producing usable transliterations. This likely explains why no transliteration libraries currently support these languages. The imperfect approach taken in this library involves standardizing the Arabic script and then running it through the arabic-transliterate library. This standardization is done in three steps:

Common Persian and Urdu words are replaced with approximate LLM-generated phonetic forms (still in Arabic script), using lookup maps built from the Center for Language Engineering’s word frequency data for Persian and Urdu.
Remaining Persian- or Urdu-specific characters are replaced with their Arabic equivalents.
Short vowels are added to any remaining unvowelized words using a basic heuristic process.