tokenize-is

TypeScript tokenizer for Icelandic text. A port of Miðeind's Tokenizer.

Installation

npm install tokenize-is
# or
pnpm add tokenize-is

Usage

import { tokenize, splitIntoSentences } from "tokenize-is";

// Tokenize text
const tokens = tokenize("Kl. 14:30 komu 100 gestir.");
for (const token of tokens) {
  if (token.kind === "word") {
    console.log(token.text);
  } else if (token.kind === "number") {
    console.log(token.value); // parsed number
  }
}

// Split into sentences
const sentences = splitIntoSentences("Fyrst. Síðan.");
// → ["Fyrst.", "Síðan."]

Token Types

All tokens have a `kind` discriminator for TypeScript narrowing:

Kind	Description	Parsed Fields
`word`	Words	`text`
`number`	Numbers (Icelandic/English formats)	`value`
`ordinal`	Ordinal numbers (1., XVII.)	`value`
`time`	Time (14:30, kl. tvö)	`hour`, `minute`, `second`
`date`	ISO dates	`year`, `month`, `day`
`dateabs`	Absolute dates (17. júní 1944)	`year`, `month`, `day`
`daterel`	Relative dates (3. janúar)	`month`, `day`
`year`	Four-digit years	`value`
`amount`	Currency amounts (100 kr.)	`value`, `currency`
`currency`	Currency codes/symbols	`iso`
`measurement`	Values with units (5km, 220V)	`value`, `unit`
`percent`	Percentages	`value`
`url`	URLs	`text`
`domain`	Domain names	`text`
`email`	Email addresses	`text`
`hashtag`	Hashtags (#iceland)	`text`
`username`	@mentions	`username`
`numwletter`	Number+letter (14b, 33C)	`value`, `letter`
`telno`	Phone numbers	`cc`, `number`
`molecule`	Chemical formulas (H2O)	`text`
`ssn`	Icelandic kennitala	`value`
`serialnumber`	Serial numbers	`text`
`timestamp`	Date+time combined	`year`..`second`
`punctuation`	Punctuation	`normalized`, `position`
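
As a rough illustration of how the `kind` discriminator enables narrowing, the sketch below switches over a small union. The types are local stand-ins that copy a few rows of the table; they are assumptions for this example and not the library's exported types, and `describeToken` is a hypothetical helper. The `"ISK"` currency value is also only illustrative.

```typescript
// Local stand-in types mirroring a few rows of the table above.
// Assumption: these are NOT the types exported by tokenize-is.
type WordToken = { kind: "word"; text: string };
type NumberToken = { kind: "number"; value: number };
type AmountToken = { kind: "amount"; value: number; currency: string };
type PunctuationToken = { kind: "punctuation"; normalized: string };
type SketchToken = WordToken | NumberToken | AmountToken | PunctuationToken;

// Switching on `kind` narrows the union, so each branch can only
// touch the fields that belong to that token type.
function describeToken(token: SketchToken): string {
  switch (token.kind) {
    case "word":
      return `word "${token.text}"`;
    case "number":
      return `number ${token.value}`;
    case "amount":
      return `amount ${token.value} ${token.currency}`;
    case "punctuation":
      return `punctuation "${token.normalized}"`;
  }
}

// Hand-written sample data standing in for tokenizer output.
const sampleTokens: SketchToken[] = [
  { kind: "word", text: "Verð" },
  { kind: "amount", value: 100, currency: "ISK" },
  { kind: "punctuation", normalized: "." },
];
const described = sampleTokens.map(describeToken);
```

Because the `switch` covers every member of the union, the compiler can flag a missing case if a new `kind` is added to the union later.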

Options

tokenize(text, {
  replaceCompositeGlyphs: true, // Normalize Unicode (a + ́ → á)
  includeSentenceMarkers: false, // Add s_begin/s_end tokens
  includeOffsets: false, // Add span.start/end character offsets
});

Token Offsets

When `includeOffsets: true` is set, each token includes a `span` with character positions:

const tokens = tokenize("Halló heimur", { includeOffsets: true });
// tokens[0] = { kind: "word", text: "Halló", span: { start: 0, end: 5 } }
// tokens[1] = { kind: "word", text: "heimur", span: { start: 6, end: 12 } }

// Extract original text from spans
const text = "Halló heimur";
text.slice(tokens[0].span.start, tokens[0].span.end); // "Halló"
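
Spans can also be used to annotate the original text without losing the whitespace between tokens. The following is a minimal sketch: `SpannedToken` and `bracketTokens` are hypothetical names, the token data is written by hand to match the example above, and the code assumes spans are ordered, non-overlapping, and expressed in the same units as `String.prototype.slice`.

```typescript
// Minimal local shape for a token with offsets. Assumption: modeled on the
// example above, not the library's exported type.
type SpannedToken = { kind: string; span: { start: number; end: number } };

// Wrap each token's source text in brackets and copy the text between
// tokens (whitespace, for instance) through unchanged.
function bracketTokens(source: string, tokens: SpannedToken[]): string {
  let out = "";
  let cursor = 0;
  for (const { span } of tokens) {
    out += source.slice(cursor, span.start); // gap before this token
    out += `[${source.slice(span.start, span.end)}]`;
    cursor = span.end;
  }
  return out + source.slice(cursor); // trailing text after the last token
}

const sourceText = "Halló heimur";
// Hand-written spans copied from the example above.
const spannedTokens: SpannedToken[] = [
  { kind: "word", span: { start: 0, end: 5 } },
  { kind: "word", span: { start: 6, end: 12 } },
];
const bracketed = bracketTokens(sourceText, spannedTokens);
// bracketed === "[Halló] [heimur]"
```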

Port Fidelity

This is a TypeScript port of Miðeind's Tokenizer (MIT licensed).

Supported

All 30 token types from the original
Sentence boundary detection with abbreviation awareness
Unicode normalization (composite glyphs)
Icelandic number formats (1.234,56)
Spelled-out time expressions (hálftvö → 1:30)
~100 Icelandic abbreviations
70+ SI units, 18+ currencies
Kennitala (SSN) validation with checksum
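
For readers unfamiliar with the kennitala checksum, here is a minimal sketch of the commonly documented mod-11 scheme. It is not taken from this package's source: `hasValidKennitalaChecksum` is a hypothetical helper, and it checks only the check digit, not whether the date part or century digit is plausible.

```typescript
// Weights applied to the first eight digits of a kennitala.
const KENNITALA_WEIGHTS = [3, 2, 7, 6, 5, 4, 3, 2];

// Returns true when the ninth digit matches the mod-11 check digit.
// Accepts "DDMMYY-NNNN" or the same ten digits without the hyphen.
function hasValidKennitalaChecksum(input: string): boolean {
  const trimmed = input.trim();
  if (!/^\d{6}-?\d{4}$/.test(trimmed)) return false;
  const digits = trimmed.replace("-", "").split("").map(Number);
  const sum = KENNITALA_WEIGHTS.reduce(
    (acc, weight, i) => acc + weight * (digits[i] ?? 0),
    0,
  );
  const remainder = sum % 11;
  const expected = remainder === 0 ? 0 : 11 - remainder;
  // A computed value of 10 cannot be a single digit, so such numbers are rejected.
  return expected !== 10 && expected === digits[8];
}

const checksumOk = hasValidKennitalaChecksum("120174-3399"); // true
const checksumBad = hasValidKennitalaChecksum("120174-3389"); // false
```

For `120174-3399` the weighted sum of the first eight digits is 79, 79 mod 11 is 2, and 11 − 2 = 9, which matches the ninth digit.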

Not Yet Implemented

detokenize() - reconstruct text from tokens
correct_spaces() - fix spacing between tokens
paragraphs() / mark_paragraphs() - paragraph handling
HTML entity unescaping (&aacute; → á)
Full abbreviation list (300+ in original vs ~100 here)

Design Differences

ESM-only (no CommonJS)
Returns arrays instead of generators
Discriminated unions instead of numeric token codes
Zero runtime dependencies

Development

pnpm install
pnpm test        # Run tests
pnpm build       # Build with tsdown
pnpm check       # Lint + format + typecheck

License

MIT - same as the original Tokenizer.