Package Exports
- @nlptools/nlptools
- @nlptools/nlptools/dist/index.mjs
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to file an issue with the original package (@nlptools/nlptools) requesting support for the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
@nlptools/nlptools
Main NLPTools package - Complete suite of NLP algorithms and utilities
This is the main NLPTools package (@nlptools/nlptools), which exports every algorithm and utility in the toolkit. It provides a single entry point to all string distance and similarity algorithms, as well as the text splitting and tokenization utilities.
Features
- 🎯 All-in-One: Complete access to all NLPTools algorithms
- 📦 Convenient: Single import for all functionality
- ✂️ Text Splitting: Document chunking and text processing utilities
- 🪙 Tokenization: Fast text encoding and decoding for LLMs
- 📏 Distance & Similarity: Comprehensive string comparison algorithms
- 🚀 Performance Optimized: Automatically uses the fastest implementations available
- 📝 TypeScript First: Full type safety with a comprehensive API
- 🔧 Easy to Use: Consistent API across all algorithms
Installation
# Install with npm
npm install @nlptools/nlptools
# Install with yarn
yarn add @nlptools/nlptools
# Install with pnpm
pnpm add @nlptools/nlptools
Usage
Basic Setup
import * as nlptools from "@nlptools/nlptools";
// All algorithms are available as named functions
console.log(nlptools.levenshtein("kitten", "sitting")); // 3
console.log(nlptools.jaro("hello", "hallo")); // 0.8666666666666667
console.log(nlptools.cosine("abc", "bcd")); // 0.6666666666666666
Distance vs Similarity
Most algorithms have both distance and normalized versions:
// Distance algorithms (lower is more similar)
const distance = nlptools.levenshtein("cat", "bat"); // 1
// Similarity algorithms (higher is more similar, 0-1 range)
const similarity = nlptools.levenshtein_normalized("cat", "bat"); // 0.6666666666666666
Text Splitting
This package includes text splitters from @nlptools/splitter:
import { RecursiveCharacterTextSplitter } from "@nlptools/nlptools";
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
});
const text = "Your long document text here...";
const chunks = await splitter.splitText(text);
console.log(chunks);
Tokenization
This package includes tokenization utilities from @nlptools/tokenizer:
import { Tokenizer } from "@nlptools/nlptools";
// Load tokenizer from HuggingFace Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(
`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`,
).then((res) => res.json());
const tokenizerConfig = await fetch(
`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`,
).then((res) => res.json());
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
// Encode text
const encoded = tokenizer.encode("Hello World");
console.log(encoded.ids); // [9906, 4435]
console.log(encoded.tokens); // ['Hello', 'ĠWorld']
// Get token count
const tokenCount = tokenizer.encode("This is a sentence.").ids.length;
console.log(`Token count: ${tokenCount}`);
Available Algorithm Categories
This package includes all algorithms from @nlptools/distance, @nlptools/splitter, and @nlptools/tokenizer:
Edit Distance Algorithms
- levenshtein - Classic edit distance
- fastest_levenshtein - High-performance Levenshtein distance
- damerau_levenshtein - Edit distance with transpositions
- myers_levenshtein - Myers bit-parallel algorithm
- jaro - Jaro similarity
- jarowinkler - Jaro-Winkler similarity
- hamming - Hamming distance for equal-length strings
- sift4_simple - SIFT4 algorithm
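For orientation, here is a minimal sketch calling a few of the functions listed above on small example strings; the function names come from the list, while the example inputs (and the assumption that each function takes two strings, as in the Basic Setup example) are illustrative:
import * as nlptools from "@nlptools/nlptools";
// Raw edit distances: lower values mean more similar strings
console.log(nlptools.levenshtein("kitten", "sitting")); // 3
console.log(nlptools.damerau_levenshtein("acb", "abc")); // counts the transposition as a single edit
console.log(nlptools.hamming("karolin", "kathrin")); // differing positions in equal-length strings
// Jaro-Winkler: a similarity score that rewards a shared prefix
console.log(nlptools.jarowinkler("hello", "hallo"));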
Sequence-based Algorithms
- lcs_seq - Longest common subsequence
- lcs_str - Longest common substring
- ratcliff_obershelp - Gestalt pattern matching
- smith_waterman - Local sequence alignment
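As a hedged sketch, the sequence-based functions can be called the same way; the inputs below are illustrative, and whether each call returns a length or a normalized score is not documented here:
import * as nlptools from "@nlptools/nlptools";
// Subsequence vs. substring: "ace" is a subsequence of "abcde" but not a substring
console.log(nlptools.lcs_seq("abcde", "ace"));
console.log(nlptools.lcs_str("abcde", "abfde"));
// Gestalt pattern matching and local sequence alignment
console.log(nlptools.ratcliff_obershelp("night", "nacht"));
console.log(nlptools.smith_waterman("GATTACA", "GCATGCU"));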
Token-based Algorithms
- jaccard - Jaccard similarity
- cosine - Cosine similarity
- sorensen - Sørensen-Dice coefficient
- tversky - Tversky index
- overlap - Overlap coefficient
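A small sketch of the set-based measures on one input pair, assuming the two-string call pattern shown in Basic Setup (tversky, which conventionally also takes weighting parameters, is omitted here because its signature is not documented):
import * as nlptools from "@nlptools/nlptools";
// Set-based similarity scores in the 0-1 range: higher means more similar
console.log(nlptools.jaccard("night", "nacht"));
console.log(nlptools.cosine("night", "nacht"));
console.log(nlptools.sorensen("night", "nacht"));
console.log(nlptools.overlap("night", "nacht"));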
Bigram Algorithms
- jaccard_bigram - Jaccard similarity on character bigrams
- cosine_bigram - Cosine similarity on character bigrams
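The bigram variants work on character pairs instead of single characters, so they are more sensitive to character order. A hedged, illustrative comparison:
import * as nlptools from "@nlptools/nlptools";
// "abc" and "cba" share every character but no character bigrams
console.log(nlptools.jaccard("abc", "cba")); // the character sets are identical
console.log(nlptools.jaccard_bigram("abc", "cba")); // bigram sets ("ab","bc" vs "cb","ba") do not overlap
console.log(nlptools.cosine_bigram("abc", "cba"));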
Naive Algorithms
- prefix - Prefix similarity
- suffix - Suffix similarity
- length - Length-based similarity
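These are simple baselines rather than full comparison algorithms; a minimal, illustrative sketch:
import * as nlptools from "@nlptools/nlptools";
// Cheap baselines: shared prefix, shared suffix, and relative length
console.log(nlptools.prefix("prefix", "prefab")); // both start with "pref"
console.log(nlptools.suffix("walking", "running")); // both end with "ing"
console.log(nlptools.length("short", "a much longer string"));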
Text Splitters
- RecursiveCharacterTextSplitter - Splits text recursively using different separators
- CharacterTextSplitter - Splits text by character count
- MarkdownTextSplitter - Specialized splitter for Markdown documents
- TokenTextSplitter - Splits text by token count
- LatexTextSplitter - Specialized splitter for LaTeX documents
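As an illustrative sketch, the specialized splitters can be used like RecursiveCharacterTextSplitter in the Text Splitting example above; it is assumed here that MarkdownTextSplitter accepts the same chunkSize/chunkOverlap options:
import { MarkdownTextSplitter } from "@nlptools/nlptools";
// Assumes the same constructor options as RecursiveCharacterTextSplitter
const markdownSplitter = new MarkdownTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
});
const markdownText = "# Title\n\nIntro paragraph.\n\n## Section\n\nMore details here...";
const markdownChunks = await markdownSplitter.splitText(markdownText);
console.log(markdownChunks.length);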
Tokenization Utilities
- Tokenizer - Main tokenizer class for encoding and decoding text
- encode() - Convert text to token IDs and tokens
- decode() - Convert token IDs back to text
- tokenize() - Split text into token strings
- AddedToken - Custom token configuration class
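Building on the Tokenization example above, a hedged round-trip sketch using decode() and tokenize(); any details beyond the descriptions listed here are assumptions:
// Reuses the `tokenizer` instance constructed in the Tokenization example
const roundTrip = tokenizer.encode("Hello World");
console.log(tokenizer.decode(roundTrip.ids)); // token IDs back to text
console.log(tokenizer.tokenize("Hello World")); // token strings without IDs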
Universal Compare Function
const result = nlptools.compare("hello", "hallo", "jaro");
console.log(result); // 0.8666666666666667
Performance
The package automatically selects the fastest implementation available:
- WebAssembly algorithms: 10-100x faster than pure JavaScript
- High-performance implementations: Including fastest-levenshtein for optimal speed