License: UNLICENSED

4-byte-width (UTF-32) characters and unsigned integers for working with strings

UTF32Char

A minimalist, dependency-free implementation of immutable 4-byte-width (UTF-32) characters for easy manipulation of characters and glyphs, including simple emoji.

Also includes an immutable unsigned 4-byte-width integer data type, UInt32, and easy conversions to and from UTF32Char.

Motivation

If you want to allow a single "character" of input, but consider emoji to be single characters, you'll have some difficulty using basic JavaScript strings, which use UTF-16 encoding by default. While ASCII characters all have length-1...

console.log("?".length) // 1

...many emoji have length > 1

console.log("💩".length) // 2

...and with modifiers and accents, that number can get much larger

console.log("!\u033F\u030B\u0365\u0365\u0302\u0363\u0301\u030D\u0310\u035E\u035C\u0356\u032C\u0330\u0319\u0317".length) // 17

Because every Unicode code point can be expressed in a fixed-width UTF-32 encoding, this package mitigates the problem somewhat, but it does not solve it. The package accepts any group of one to four bytes as a "single UTF-32 character", whether or not that group renders as a single grapheme. See this package if you want to split text into graphemes, regardless of the number of bytes required to render each grapheme.
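
To make the distinction above concrete, the following sketch counts UTF-16 code units and Unicode code points for the same strings. It uses only built-in string methods. countCodePoints is a hypothetical helper written for illustration and is not part of this package.

```typescript
// Hypothetical helper (not part of utf32char): counts Unicode code points
// by treating each valid surrogate pair as a single unit.
function countCodePoints(s: string): number {
  let count = 0
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i)
    const isHighSurrogate = unit >= 0xd800 && unit <= 0xdbff
    if (isHighSurrogate && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1)
      if (next >= 0xdc00 && next <= 0xdfff) i++ // skip the low surrogate
    }
    count++
  }
  return count
}

const pileOfPoo = "\uD83D\uDCA9" // U+1F4A9, stored as a surrogate pair
const accentedE = "e\u0301"      // "e" followed by a combining acute accent

console.log(pileOfPoo.length, countCodePoints(pileOfPoo)) // 2 1
console.log(accentedE.length, countCodePoints(accentedE)) // 2 2
```

The second string shows why counting code points only mitigates the problem: the accented letter renders as one grapheme but still consists of two code points. In ES2015+ environments, Array.from(s).length gives the same count as this helper.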

If you just want a simple, dependency-free API to deal with 4-byte strings, then this package is for you.

This package provides an implementation of 4-byte, UTF-32 "characters" (UTF32Char) and corresponding unsigned integers (UInt32). The unsigned integers have the added benefit of being usable as safe array indices.

Installation

Install from npm with

$ npm i utf32char

Or try it online at npm.runkit.com

var lib = require("utf32char")

let char = new lib.UTF32Char("😮")

Use

Create new UTF32Chars and UInt32s like so

let index: UInt32 = new UInt32(42)
let char: UTF32Char = new UTF32Char("😮")

You can convert to basic JavaScript types

console.log(index.toNumber()) // 42
console.log(char.toString())  // 😮

Easily convert between characters and integers

let indexAsChar: UTF32Char = index.toUTF32Char()
let charAsUInt: UInt32 = char.toUInt32()

console.log(indexAsChar.toString()) // *
console.log(charAsUInt.toNumber())  // 3627933230

...or skip the middleman and convert integers directly to strings, or strings directly to integers:

console.log(index.toString()) // *
console.log(char.toNumber())  // 3627933230
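
The numbers in these examples are consistent with packing the string's UTF-16 code units into one 32-bit value: 3627933230 is 0xD83DDE2E, that is, the surrogate pair 0xD83D 0xDE2E. The sketch below reproduces that arithmetic. It is an inference from the values shown here, not the package's actual source, and packCodeUnits and unpackCodeUnits are hypothetical names.

```typescript
// Hypothetical helpers, inferred from the README's numbers; not the package's code.

// Pack one or two UTF-16 code units into an unsigned 32-bit number.
function packCodeUnits(s: string): number {
  if (s.length < 1 || s.length > 2) {
    throw new RangeError("expected a string of one or two UTF-16 code units")
  }
  let packed = 0
  for (let i = 0; i < s.length; i++) {
    // Multiply instead of using << so the result never goes negative.
    packed = packed * 0x10000 + s.charCodeAt(i)
  }
  return packed
}

// Reverse the packing; a zero high half is treated as "only one code unit".
function unpackCodeUnits(n: number): string {
  const hi = Math.floor(n / 0x10000)
  const lo = n % 0x10000
  return hi === 0 ? String.fromCharCode(lo) : String.fromCharCode(hi, lo)
}

console.log(packCodeUnits("\uD83D\uDE2E")) // 3627933230
console.log(packCodeUnits("*"))            // 42
console.log(unpackCodeUnits(42))           // *
```

In this sketch, a single code unit and the same unit preceded by U+0000 pack to the same number, so the round trip is only unambiguous for strings that do not start with a NUL character.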

Edge Cases

UInt32 and UTF32Char ranges are enforced upon object creation, so you never have to worry about bounds checking:

let tooLow: UInt32 = UInt32.fromNumber(-1)
// range error: UInt32 has MIN_VALUE 0, received -1

let tooHigh: UInt32 = UInt32.fromNumber(2**32)
// range error: UInt32 has MAX_VALUE 4294967295 (2^32 - 1), received 4294967296

let tooShort: UTF32Char = UTF32Char.fromString("")
// invalid argument: cannot convert empty string to UTF32Char

let tooLong: UTF32Char = UTF32Char.fromString("hey!")
// invalid argument: lossy compression of length-3+ string to UTF32Char

Because the implementation accepts any 4-byte string as a "character", the following are allowed

let char: UTF32Char = UTF32Char.fromString("hi")
let num: number = char.toNumber()

console.log(num) // 6815849
console.log(char.toString()) // hi
console.log(UTF32Char.fromNumber(num).toString()) // hi

Floating-point values are truncated to integers when creating UInt32s, like in many other languages:

let pi: UInt32 = UInt32.fromNumber(3.141592654)
console.log(pi.toNumber()) // 3

let squeeze: UInt32 = UInt32.fromNumber(UInt32.MAX_VALUE + 0.9)
console.log(squeeze.toNumber()) // 4294967295
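
The combination of truncation and bounds checking described above can be sketched as a plain function. This is a hypothetical stand-in, not the package's implementation; toUInt32Value and MAX_UINT32 are illustrative names, and rejecting NaN and mapping values between -1 and 0 to 0 are assumptions of this sketch.

```typescript
// Hypothetical stand-in for the checks described above; not the package's code.
const MAX_UINT32 = 4294967295 // 2^32 - 1

function toUInt32Value(n: number): number {
  // Assumption of this sketch: NaN is rejected outright.
  if (isNaN(n)) {
    throw new RangeError("UInt32 cannot be created from NaN")
  }
  // Drop the fractional part (truncate toward zero).
  const truncated = n < 0 ? Math.ceil(n) : Math.floor(n)
  if (truncated < 0) {
    throw new RangeError("UInt32 has MIN_VALUE 0, received " + n)
  }
  if (truncated > MAX_UINT32) {
    throw new RangeError("UInt32 has MAX_VALUE " + MAX_UINT32 + ", received " + n)
  }
  // Adding 0 turns a possible -0 (from inputs such as -0.5) into 0.
  return truncated + 0
}

console.log(toUInt32Value(3.141592654))      // 3
console.log(toUInt32Value(MAX_UINT32 + 0.9)) // 4294967295
```

Truncating before the range check is what lets MAX_VALUE + 0.9 through while 2^32 is still rejected.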

Compound emoji -- created using variation selectors and joiners -- are often larger than 4 bytes wide and will therefore throw errors when used to construct UTF32Chars:

let smooch: UTF32Char = UTF32Char.fromString("👩‍❤️‍💋‍👩")
// invalid argument: lossy compression of length-3+ string to UTF32Char

console.log("👩‍❤️‍💋‍👩".length) // 11

...but many basic emoji are fine:

// emojiTest.ts
let emoji: Array<string> = [ "😂", "😭", "🥺", "🤣", "❤️", "✨", "😍", "🙏", "😊", "🥰", "👍", "💕", "🤔", "👩‍❤️‍💋‍👩" ]

for (const e of emoji) {
  try {
    UTF32Char.fromString(e)
    console.log(`✅: ${e}`)
  } catch (_) {
    console.log(`❌: ${e}`)
  }
}

$ npx ts-node emojiTest.ts
✅: 😂
✅: 😭
✅: 🥺
✅: 🤣
✅: ❤️
✅: ✨
✅: 😍
✅: 🙏
✅: 😊
✅: 🥰
✅: 👍
✅: 💕
✅: 🤔
❌: 👩‍❤️‍💋‍👩
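
The error messages above suggest that acceptance depends on the string's UTF-16 length rather than on how it renders. Under that assumption, a small pre-check like the following could screen input before constructing a UTF32Char. fitsInUTF32Char is a hypothetical helper, not part of the package.

```typescript
// Hypothetical pre-check, assuming UTF32Char accepts strings of exactly
// one or two UTF-16 code units (inferred from the error messages above).
function fitsInUTF32Char(s: string): boolean {
  return s.length >= 1 && s.length <= 2
}

// Red heart plus variation selector: 2 code units.
const heart = "\u2764\uFE0F"
// Kiss ZWJ sequence (woman, heart, kiss mark, woman): 11 code units.
const kiss = "\uD83D\uDC69\u200D\u2764\uFE0F\u200D\uD83D\uDC8B\u200D\uD83D\uDC69"

console.log(fitsInUTF32Char(heart)) // true
console.log(fitsInUTF32Char(kiss))  // false
console.log(fitsInUTF32Char(""))    // false
```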

Arithmetic, Comparison, and Immutability

UInt32 provides basic arithmetic and comparison operators

let increased: UInt32 = index.plus(19)
console.log(increased.toNumber()) // 61

let comp: boolean = increased.greaterThan(index)
console.log(comp) // true

Verbose versions and shortened aliases of comparison functions are available

  • lt and lessThan
  • gt and greaterThan
  • le and lessThanOrEqualTo
  • ge and greaterThanOrEqualTo

Since UInt32s are immutable, plus() and minus() return new objects, which are of course bounds-checked upon creation:

let whoops: UInt32 = increased.minus(100)
// range error: UInt32 has MIN_VALUE 0, received -39
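
As a rough illustration of the immutability pattern described in this section, here is a minimal sketch of a bounds-checked integer whose arithmetic returns new objects. CheckedUInt32 is a hypothetical class written for this example. It mirrors the method names listed above but is not the package's implementation.

```typescript
// Hypothetical sketch of an immutable, bounds-checked integer; not the package's code.
class CheckedUInt32 {
  static readonly MAX = 4294967295 // 2^32 - 1
  private readonly value: number

  constructor(n: number) {
    const truncated = n < 0 ? Math.ceil(n) : Math.floor(n)
    // Written as a negated "in range" test so that NaN is rejected too.
    if (!(truncated >= 0 && truncated <= CheckedUInt32.MAX)) {
      throw new RangeError("value out of UInt32 range: " + n)
    }
    this.value = truncated + 0 // "+ 0" turns -0 into 0
  }

  toNumber(): number { return this.value }

  // Arithmetic never mutates `this`; it builds a new, re-validated object.
  plus(n: number): CheckedUInt32 { return new CheckedUInt32(this.value + n) }
  minus(n: number): CheckedUInt32 { return new CheckedUInt32(this.value - n) }

  greaterThan(other: CheckedUInt32): boolean { return this.value > other.value }
  lessThan(other: CheckedUInt32): boolean { return this.value < other.value }

  // Short aliases delegate to the verbose versions.
  gt(other: CheckedUInt32): boolean { return this.greaterThan(other) }
  lt(other: CheckedUInt32): boolean { return this.lessThan(other) }
}

const index = new CheckedUInt32(42)
const increased = index.plus(19)
console.log(index.toNumber(), increased.toNumber()) // 42 61
console.log(increased.gt(index))                    // true
```

Routing plus() and minus() through the constructor means there is exactly one place where the range is enforced, so no result can skip the check.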

Contact

Feel free to open an issue with any bug fixes or a PR with any performance improvements.

Support me @ Ko-fi!

Check out my DEV.to blog!