tokenize-is

0.2.0
  • License MIT

TypeScript tokenizer for Icelandic text

Package Exports

  • tokenize-is
  • tokenize-is/package.json

Readme

tokenize-is

TypeScript tokenizer for Icelandic text. A port of Miðeind's Tokenizer.

Installation

npm install tokenize-is
# or
pnpm add tokenize-is

Usage

import { tokenize, splitIntoSentences } from "tokenize-is";

// Tokenize text
const tokens = tokenize("Kl. 14:30 komu 100 gestir.");
for (const token of tokens) {
  if (token.kind === "word") {
    console.log(token.text);
  } else if (token.kind === "number") {
    console.log(token.value); // parsed number
  }
}

// Split into sentences
const sentences = splitIntoSentences("Fyrst. Síðan.");
// → ["Fyrst.", "Síðan."]

Token Types

All tokens have a kind discriminator for TypeScript narrowing:

| Kind | Description | Parsed Fields |
| --- | --- | --- |
| word | Words | text |
| number | Numbers (Icelandic/English formats) | value |
| ordinal | Ordinal numbers (1., XVII.) | value |
| time | Time (14:30, kl. tvö) | hour, minute, second |
| date | ISO dates | year, month, day |
| dateabs | Absolute dates (17. júní 1944) | year, month, day |
| daterel | Relative dates (3. janúar) | month, day |
| year | Four-digit years | value |
| amount | Currency amounts (100 kr.) | value, currency |
| currency | Currency codes/symbols | iso |
| measurement | Values with units (5km, 220V) | value, unit |
| percent | Percentages | value |
| url | URLs | text |
| domain | Domain names | text |
| email | Email addresses | text |
| hashtag | Hashtags (#iceland) | text |
| username | @mentions | username |
| numwletter | Number+letter (14b, 33C) | value, letter |
| telno | Phone numbers | cc, number |
| molecule | Chemical formulas (H2O) | text |
| ssn | Icelandic kennitala | value |
| serialnumber | Serial numbers | text |
| timestamp | Date+time combined | year..second |
| punctuation | Punctuation | normalized, position |
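
As a rough illustration of what the kind discriminator buys you, the sketch below narrows a small union with a switch and adds an exhaustiveness check. The token shapes (WordToken, NumberToken, TimeToken, AmountToken, SketchToken) are hypothetical and only mirror a few rows of the table; the types the library actually exports may differ.

```typescript
// Hypothetical, simplified token shapes based on the table above.
// They are NOT the library's real type definitions.
type WordToken = { kind: "word"; text: string };
type NumberToken = { kind: "number"; value: number };
type TimeToken = { kind: "time"; hour: number; minute: number; second: number };
type AmountToken = { kind: "amount"; value: number; currency: string };
type SketchToken = WordToken | NumberToken | TimeToken | AmountToken;

// Left-pad a non-negative integer to two digits.
function pad2(n: number): string {
  return n < 10 ? "0" + String(n) : String(n);
}

function describeToken(token: SketchToken): string {
  // Inside each case, TypeScript knows exactly which fields exist.
  switch (token.kind) {
    case "word":
      return `word: ${token.text}`;
    case "number":
      return `number: ${token.value}`;
    case "time":
      return `time: ${pad2(token.hour)}:${pad2(token.minute)}`;
    case "amount":
      return `amount: ${token.value} ${token.currency}`;
    default: {
      // Exhaustiveness check: this assignment stops compiling
      // if a new kind is added to SketchToken but not handled above.
      const unreachable: never = token;
      return unreachable;
    }
  }
}
```

The never assignment in the default branch is a general TypeScript idiom, not something specific to this package.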

Options

tokenize(text, {
  replaceCompositeGlyphs: true, // Normalize Unicode (a + ́ → á)
  includeSentenceMarkers: false, // Add s_begin/s_end tokens
  includeOffsets: false, // Add span.start/end character offsets
});
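
To make the composite-glyph case concrete, here is a minimal sketch using the standard String.prototype.normalize. It only shows what "a + combining accent → á" means at the code-point level; composeGlyphs is a hypothetical helper, and the library may implement replaceCompositeGlyphs differently (for example with its own replacement table).

```typescript
// "a" followed by U+0301 COMBINING ACUTE ACCENT: two code units, one visible glyph.
const decomposed = "a\u0301";

// NFC composes the pair into the single code point U+00E1 ("á").
const composed = decomposed.normalize("NFC");

// Hypothetical helper, standing in for whatever the library does internally.
function composeGlyphs(text: string): string {
  return text.normalize("NFC");
}

// A decomposed "Halló" is 6 code units long; the composed form is 5.
const before = "Hallo\u0301".length;
const after = composeGlyphs("Hallo\u0301").length;
```

Normalizing before tokenizing matters because the two spellings render identically but compare as different strings.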

Token Offsets

When includeOffsets is true, each token includes a span with its start (inclusive) and end (exclusive) character positions in the input:

const tokens = tokenize("Halló heimur", { includeOffsets: true });
// tokens[0] = { kind: "word", text: "Halló", span: { start: 0, end: 5 } }
// tokens[1] = { kind: "word", text: "heimur", span: { start: 6, end: 12 } }

// Extract original text from spans
const text = "Halló heimur";
text.slice(tokens[0].span.start, tokens[0].span.end); // "Halló"
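
Spans also make it possible to annotate the original string without losing the text between tokens. The sketch below is one way to do that under a few assumptions: SpannedToken is a hypothetical minimal shape, tokens arrive sorted and non-overlapping, and the offsets index into the same string that is passed to slice, as in the example above.

```typescript
// Hypothetical minimal shape: only the fields this sketch relies on.
type SpannedToken = { kind: string; span: { start: number; end: number } };

// Wrap each token's source text in [brackets] and copy everything
// between tokens (whitespace, untokenized text) through unchanged.
function bracketTokens(source: string, tokens: SpannedToken[]): string {
  let out = "";
  let cursor = 0;
  for (const { span } of tokens) {
    out += source.slice(cursor, span.start); // gap before this token
    out += `[${source.slice(span.start, span.end)}]`;
    cursor = span.end;
  }
  return out + source.slice(cursor); // trailing text after the last token
}

// Hand-written tokens matching the spans shown above.
const sampleSource = "Halló heimur";
const sampleTokens: SpannedToken[] = [
  { kind: "word", span: { start: 0, end: 5 } },
  { kind: "word", span: { start: 6, end: 12 } },
];
const bracketed = bracketTokens(sampleSource, sampleTokens); // "[Halló] [heimur]"
```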

Port Fidelity

This is a TypeScript port of Miðeind's Tokenizer (MIT licensed).

Supported

  • All 30 token types from the original
  • Sentence boundary detection with abbreviation awareness
  • Unicode normalization (composite glyphs)
  • Icelandic number formats (1.234,56)
  • Spelled-out time expressions (hálftvö → 1:30)
  • ~100 Icelandic abbreviations
  • 70+ SI units, 18+ currencies
  • Kennitala (SSN) validation with checksum
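
Two of the items above are easy to illustrate in isolation. The following is a sketch, not the library's code: parseIcelandicNumber and isValidKennitala are hypothetical names, the number parser handles only the "1.234,56" style, and the kennitala check covers only the publicly documented modulus-11 check digit (not the birth date or the century digit).

```typescript
// Parse an Icelandic-formatted number: "." groups thousands, "," marks decimals.
// Returns null for anything that does not fit that pattern.
function parseIcelandicNumber(text: string): number | null {
  if (!/^(\d{1,3}(\.\d{3})+|\d+)(,\d+)?$/.test(text)) return null;
  return Number(text.replace(/\./g, "").replace(",", "."));
}

// Check digit rule: weight the first eight digits by 3,2,7,6,5,4,3,2,
// sum them, and compare 11 - (sum mod 11) with the ninth digit.
function isValidKennitala(input: string): boolean {
  const digits = input.replace("-", "");
  if (!/^\d{10}$/.test(digits)) return false;
  const weights = [3, 2, 7, 6, 5, 4, 3, 2];
  const sum = weights.reduce(
    (acc, weight, i) => acc + weight * Number(digits.charAt(i)),
    0,
  );
  const remainder = sum % 11;
  // A remainder of 0 means check digit 0. A remainder of 1 would require
  // check digit 10, which no single digit can match, so it is rejected.
  const expected = remainder === 0 ? 0 : 11 - remainder;
  return expected === Number(digits.charAt(8));
}

const parsed = parseIcelandicNumber("1.234,56"); // 1234.56
const checksumOk = isValidKennitala("120174-3399"); // true
```

For "120174-3399" the weighted sum is 79, 79 mod 11 is 2, and 11 - 2 = 9, which matches the ninth digit.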

Not Yet Implemented

  • detokenize() - reconstruct text from tokens
  • correct_spaces() - fix spacing between tokens
  • paragraphs() / mark_paragraphs() - paragraph handling
  • HTML entity unescaping (&aacute; → á)
  • Full abbreviation list (300+ in original vs ~100 here)

Design Differences

  • ESM-only (no CommonJS)
  • Returns arrays instead of generators
  • Discriminated unions instead of numeric token codes
  • Zero runtime dependencies

Development

pnpm install
pnpm test        # Run tests
pnpm build       # Build with tsdown
pnpm check       # Lint + format + typecheck

License

MIT - same as the original Tokenizer.