tokenize-is
TypeScript tokenizer for Icelandic text. A port of Miðeind's Tokenizer.
Installation
npm install tokenize-is
# or
pnpm add tokenize-is
Usage
import { tokenize, splitIntoSentences } from "tokenize-is";
// Tokenize text
const tokens = tokenize("Kl. 14:30 komu 100 gestir.");
for (const token of tokens) {
  if (token.kind === "word") {
    console.log(token.text);
  } else if (token.kind === "number") {
    console.log(token.value); // parsed number
  }
}
// Split into sentences
const sentences = splitIntoSentences("Fyrst. Síðan.");
// → ["Fyrst.", "Síðan."]
Token Types
All tokens have a kind discriminator for TypeScript narrowing:
| Kind | Description | Parsed Fields |
|---|---|---|
| `word` | Words | text |
| `number` | Numbers (Icelandic/English formats) | value |
| `ordinal` | Ordinal numbers (1., XVII.) | value |
| `time` | Time (14:30, kl. tvö) | hour, minute, second |
| `date` | ISO dates | year, month, day |
| `dateabs` | Absolute dates (17. júní 1944) | year, month, day |
| `daterel` | Relative dates (3. janúar) | month, day |
| `year` | Four-digit years | value |
| `amount` | Currency amounts (100 kr.) | value, currency |
| `currency` | Currency codes/symbols | iso |
| `measurement` | Values with units (5km, 220V) | value, unit |
| `percent` | Percentages | value |
| `url` | URLs | text |
| `domain` | Domain names | text |
| `email` | Email addresses | text |
| `hashtag` | Hashtags (#iceland) | text |
| `username` | @mentions | username |
| `numwletter` | Number+letter (14b, 33C) | value, letter |
| `telno` | Phone numbers | cc, number |
| `molecule` | Chemical formulas (H2O) | text |
| `ssn` | Icelandic kennitala | value |
| `serialnumber` | Serial numbers | text |
| `timestamp` | Date+time combined | year..second |
| `punctuation` | Punctuation | normalized, position |
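The narrowing described above can be sketched with a few local stand-in types. The type definitions below are assumptions modelled on the table's Kind and Parsed Fields columns, not the package's exported types, and `describe` is a hypothetical helper:

```typescript
// Local stand-ins for a few token shapes from the table above.
// Assumption: field names follow the "Parsed Fields" column; the real
// exported types may carry more fields.
type WordToken = { kind: "word"; text: string };
type NumberToken = { kind: "number"; value: number };
type TimeToken = { kind: "time"; hour: number; minute: number; second: number };
type AmountToken = { kind: "amount"; value: number; currency: string };
type Token = WordToken | NumberToken | TimeToken | AmountToken;

// Zero-pad a clock component to two digits.
const pad = (n: number): string => (n < 10 ? "0" : "") + String(n);

// Switching on `kind` narrows `token` to the matching member in each branch.
function describe(token: Token): string {
  switch (token.kind) {
    case "word":
      return `word:${token.text}`;
    case "number":
      return `number:${token.value}`;
    case "time":
      return `time:${pad(token.hour)}:${pad(token.minute)}:${pad(token.second)}`;
    case "amount":
      return `amount:${token.value} ${token.currency}`;
    default: {
      // Exhaustiveness check: this fails to compile if a kind is unhandled.
      const unreachable: never = token;
      return unreachable;
    }
  }
}

// Hand-written sample data, not actual tokenizer output.
const sample: Token[] = [
  { kind: "time", hour: 14, minute: 30, second: 0 },
  { kind: "word", text: "komu" },
  { kind: "number", value: 100 },
  { kind: "word", text: "gestir" },
];
const described = sample.map(describe);
```

The `never` assignment in the default branch is a common TypeScript idiom: adding a new member to the union without handling it becomes a compile-time error rather than a silent fallthrough.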
Options
tokenize(text, {
  replaceCompositeGlyphs: true, // Normalize Unicode (a + ́ → á)
  includeSentenceMarkers: false, // Add s_begin/s_end tokens
  includeOffsets: false, // Add span.start/end character offsets
});
Token Offsets
When includeOffsets: true, each token includes a span with character positions:
const tokens = tokenize("Halló heimur", { includeOffsets: true });
// tokens[0] = { kind: "word", text: "Halló", span: { start: 0, end: 5 } }
// tokens[1] = { kind: "word", text: "heimur", span: { start: 6, end: 12 } }
// Extract original text from spans
const text = "Halló heimur";
text.slice(tokens[0].span.start, tokens[0].span.end); // "Halló"
Port Fidelity
This is a TypeScript port of Miðeind's Tokenizer (MIT licensed).
Supported
- All 30 token types from the original
- Sentence boundary detection with abbreviation awareness
- Unicode normalization (composite glyphs)
- Icelandic number formats (1.234,56)
- Spelled-out time expressions (hálftvö → 1:30)
- ~100 Icelandic abbreviations
- 70+ SI units, 18+ currencies
- Kennitala (SSN) validation with checksum
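Two of the items above, Icelandic number formats and the kennitala checksum, can be illustrated independently of the package. This is a minimal sketch and not the package's implementation: `parseIcelandicNumber` and `hasValidKennitalaChecksum` are hypothetical names, and the checksum function covers only the check digit (weights 3, 2, 7, 6, 5, 4, 3, 2 over the first eight digits, modulo 11), not date validity.

```typescript
// Parse an Icelandic-formatted number: "." groups thousands, "," marks decimals.
// Returns null when the input does not fit that format.
function parseIcelandicNumber(raw: string): number | null {
  // Either properly grouped thousands (1.234.567) or a plain digit run (1234).
  if (!/^(\d{1,3}(\.\d{3})+|\d+)(,\d+)?$/.test(raw)) return null;
  return Number(raw.replace(/\./g, "").replace(",", "."));
}

// Weights applied to the first eight digits of a kennitala.
const KENNITALA_WEIGHTS = [3, 2, 7, 6, 5, 4, 3, 2];

// Check the ninth (check) digit of a ten-digit kennitala, with or without "-".
function hasValidKennitalaChecksum(raw: string): boolean {
  const digits = raw.replace("-", "");
  if (!/^\d{10}$/.test(digits)) return false;
  const sum = KENNITALA_WEIGHTS.reduce(
    (acc, weight, i) => acc + weight * Number(digits[i]),
    0,
  );
  const remainder = sum % 11;
  const expected = remainder === 0 ? 0 : 11 - remainder;
  // An expected value of 10 can never equal a single digit, so it is rejected.
  return expected === Number(digits[8]);
}

const parsed = parseIcelandicNumber("1.234,56"); // 1234.56
const checksumOk = hasValidKennitalaChecksum("120174-3399"); // true
```

The sample kennitala is chosen only because its digits satisfy the checksum: the weighted sum is 79, 79 mod 11 is 2, and 11 - 2 gives the check digit 9.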
Not Yet Implemented
- detokenize() - reconstruct text from tokens
- correct_spaces() - fix spacing between tokens
- paragraphs() / mark_paragraphs() - paragraph handling
- HTML entity unescaping (&aacute; → á)
- Full abbreviation list (300+ in original vs ~100 here)
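While detokenize() is unavailable, one possible workaround is to recover text from the spans produced with includeOffsets: true, provided the source string is still at hand. The sketch below assumes only the documented `span.start` / `span.end` fields, and that tokens arrive in order without overlapping; `SpannedToken`, `originalTexts`, and `joinWithOriginalSpacing` are hypothetical names, not part of the package.

```typescript
// Assumed minimal shape: only the `span` field documented under Token Offsets.
type Span = { start: number; end: number };
type SpannedToken = { kind: string; span: Span };

// Recover each token's original substring from the source text.
function originalTexts(source: string, tokens: SpannedToken[]): string[] {
  return tokens.map((t) => source.slice(t.span.start, t.span.end));
}

// Rebuild the stretch of text covered by the tokens, copying whatever sat
// between consecutive tokens (typically whitespace) straight from the source.
function joinWithOriginalSpacing(source: string, tokens: SpannedToken[]): string {
  let out = "";
  let cursor = tokens[0]?.span.start ?? 0;
  for (const t of tokens) {
    out += source.slice(cursor, t.span.start); // gap before this token
    out += source.slice(t.span.start, t.span.end); // the token itself
    cursor = t.span.end;
  }
  return out;
}

// Hand-written tokens mirroring the Token Offsets example.
const source = "Halló heimur";
const spanned: SpannedToken[] = [
  { kind: "word", span: { start: 0, end: 5 } },
  { kind: "word", span: { start: 6, end: 12 } },
];
const pieces = originalTexts(source, spanned);
const rebuilt = joinWithOriginalSpacing(source, spanned);
```

This only replays the source text; it does not normalize spacing the way the original correct_spaces() does.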
Design Differences
- ESM-only (no CommonJS)
- Returns arrays instead of generators
- Discriminated unions instead of numeric token codes
- Zero runtime dependencies
Development
pnpm install
pnpm test # Run tests
pnpm build # Build with tsdown
pnpm check # Lint + format + typecheck
License
MIT - same as the original Tokenizer.