Package Exports
- @nlptools/tokenizer
- @nlptools/tokenizer/dist/index.mjs
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@nlptools/tokenizer) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
@nlptools/tokenizer
Tokenization utilities - HuggingFace tokenizers wrapper for NLPTools
This package provides convenient access to HuggingFace tokenization utilities through the NLPTools ecosystem. It includes fast, client-side tokenization for various LLM models and supports both browser and Node.js environments.
Installation
# Install with npm
npm install @nlptools/tokenizer
# Install with yarn
yarn add @nlptools/tokenizer
# Install with pnpm
pnpm add @nlptools/tokenizer
Usage
Basic Setup
import { Tokenizer } from "@nlptools/tokenizer";
Available Functions
- Tokenizer - Main tokenizer class for encoding and decoding text
- encode() - Convert text to token IDs and tokens
- decode() - Convert token IDs back to text
- tokenize() - Split text into token strings (see the sketch after this list)
- AddedToken - Custom token configuration class
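tokenize() is not exercised in the example below, so here is a minimal sketch of how it might be called, assuming it takes a string and returns the same token strings that encode() produces (the exact signature, and the AddedToken constructor, are not documented here):
import { Tokenizer } from "@nlptools/tokenizer";
// Load the tokenizer files exactly as in the Example Usage section below.
const modelId = "HuggingFaceTB/SmolLM3-3B";
const base = `https://huggingface.co/${modelId}/resolve/main`;
const tokenizerJson = await fetch(`${base}/tokenizer.json`).then((r) => r.json());
const tokenizerConfig = await fetch(`${base}/tokenizer_config.json`).then((r) => r.json());
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
// Split text into token strings without producing IDs or an attention mask.
const tokens = tokenizer.tokenize("Hello World");
console.log(tokens); // e.g. ['Hello', 'ĠWorld']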
Example Usage
import { Tokenizer } from "@nlptools/tokenizer";
// Load tokenizer from HuggingFace Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(
`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`,
).then((res) => res.json());
const tokenizerConfig = await fetch(
`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`,
).then((res) => res.json());
// Create tokenizer instance
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
// Encode text
const encoded = tokenizer.encode("Hello World");
console.log(encoded.ids); // [9906, 4435]
console.log(encoded.tokens); // ['Hello', 'ĠWorld']
console.log(encoded.attention_mask); // [1, 1]
// Decode back to text
const decoded = tokenizer.decode(encoded.ids);
console.log(decoded); // 'Hello World'
// Get token count
const tokenCount = tokenizer.encode("This is a sentence.").ids.length;
console.log(`Token count: ${tokenCount}`);
Features
- Fast & Lightweight: Zero-dependency implementation for client-side use
- Model Compatible: Works with HuggingFace model tokenizers
- Cross-Platform: Supports both browser and Node.js environments (see the Node.js sketch below)
- TypeScript First: Full type safety with comprehensive API
- HuggingFace Hub: Direct integration with model repositories
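The example above loads tokenizer files with fetch(), which works in the browser and in modern Node.js. For Node.js with tokenizer files already downloaded to disk, here is a minimal sketch assuming the same two-argument constructor shown above; the ./models path is illustrative:
import { readFile } from "node:fs/promises";
import { Tokenizer } from "@nlptools/tokenizer";
// Read tokenizer files previously downloaded from the HuggingFace Hub
// (the ./models path below is illustrative).
const tokenizerJson = JSON.parse(
  await readFile("./models/SmolLM3-3B/tokenizer.json", "utf8"),
);
const tokenizerConfig = JSON.parse(
  await readFile("./models/SmolLM3-3B/tokenizer_config.json", "utf8"),
);
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
console.log(tokenizer.encode("Hello World").ids); // same IDs as in the browser example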
References
This package incorporates and builds upon the following excellent open source projects:
- HuggingFace Tokenizers.js - Core tokenization implementations via @huggingface/tokenizers