MemberJunction: AI Vectors Module (v2.28.0, ISC)

    @memberjunction/ai-vectors

    Core foundation package for vector operations in MemberJunction. Provides text processing utilities (chunking, extraction), base classes for vectorization pipelines, and interfaces for embedding providers and vector databases.

    Installation

    npm install @memberjunction/ai-vectors

    What's Included

    | Export | Type | Purpose |
    | --- | --- | --- |
    | TextChunker | Class | Token-aware text splitting with sentence, paragraph, and fixed strategies |
    | TextExtractor | Class | HTML stripping, entity decoding, MIME-type routing, token truncation |
    | VectorBase | Class | Base class providing RunView, Metadata, and AIEngine integration for subclasses |
    | IEmbedding | Interface | Contract for single and batch text embedding generation |
    | IVectorDatabase | Interface | Contract for vector database management (create/delete/list indexes) |
    | IVectorIndex | Interface | Contract for CRUD operations on vector records within an index |
    | ChunkTextParams | Type | Configuration for TextChunker.ChunkText() |
    | TextChunk | Type | Output chunk with text, offsets, token count, and index |
    | PageRecordsParams | Type | Configuration for paginated entity record retrieval |

    Architecture

    graph TD
        subgraph Core["@memberjunction/ai-vectors"]
            TC["TextChunker"]
            TE["TextExtractor"]
            VB["VectorBase"]
            IE["IEmbedding"]
            IVD["IVectorDatabase"]
            IVI["IVectorIndex"]
        end
    
        subgraph MJCore["MemberJunction Core"]
            MD["Metadata"]
            RV["RunView"]
            BE["BaseEntity"]
        end
    
        subgraph AIEngine["AI Engine"]
            AIM["AIEngine.Instance"]
            MOD["Embedding Models"]
            VDB["Vector Databases"]
        end
    
        subgraph Consumers["Consumer Packages"]
            SYNC["ai-vector-sync"]
            DUPE["ai-vector-dupe"]
        end
    
        VB --> MD
        VB --> RV
        VB --> BE
        VB --> AIM
        AIM --> MOD
        AIM --> VDB
        SYNC --> VB
        SYNC --> TC
        SYNC --> TE
        DUPE --> VB
    
        style Core fill:#2d6a9f,stroke:#1a4971,color:#fff
        style MJCore fill:#2d8659,stroke:#1a5c3a,color:#fff
        style AIEngine fill:#b8762f,stroke:#8a5722,color:#fff
        style Consumers fill:#7c5295,stroke:#563a6b,color:#fff

    TextChunker

    Token-aware text splitting that respects natural language boundaries. All methods are static.

    Strategies

    | Strategy | Splits On | Best For |
    | --- | --- | --- |
    | sentence | Sentence-ending punctuation (. ! ?) | Prose, articles, descriptions |
    | paragraph | Double newlines (\n\n) | Structured documents, Markdown, reports |
    | fixed | Whitespace boundaries at the character limit | Logs, code, unstructured data |
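
To make the sentence strategy concrete, here is a minimal, illustrative sketch of packing sentences into chunks under a token budget. This is not the package's implementation: `chunkBySentence` is a made-up name, and the ~4 chars/token estimate simply mirrors the heuristic described under Token Estimation below.

```typescript
// Illustrative sketch of a sentence-strategy chunker (NOT the library's code):
// split on sentence-ending punctuation, then greedily pack sentences into
// chunks that stay under an estimated token budget.

const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function chunkBySentence(text: string, maxChunkTokens: number): string[] {
  // Split after ., ! or ? followed by whitespace; keep the punctuation.
  const sentences = text.split(/(?<=[.!?])\s+/).filter(s => s.length > 0);
  const chunks: string[] = [];
  let current = '';

  for (const sentence of sentences) {
    const candidate = current ? current + ' ' + sentence : sentence;
    if (estimateTokens(candidate) > maxChunkTokens && current) {
      chunks.push(current); // current chunk is full; start a new one
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const text = 'One sentence here. Another sentence follows. A third one ends it.';
const chunks = chunkBySentence(text, 10); // roughly a 40-character budget
```

The real TextChunker additionally tracks offsets, token counts, and overlap; this sketch only shows the packing logic.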

    Basic Usage

    import { TextChunker, ChunkTextParams, TextChunk } from '@memberjunction/ai-vectors';
    
    const article = `Machine learning models require training data.
    The quality of training data directly impacts model performance.
    Data preprocessing is a critical step in any ML pipeline.
    
    Feature engineering transforms raw data into meaningful representations.
    Good features can dramatically improve model accuracy.`;
    
    // Sentence strategy (default)
    const chunks: TextChunk[] = TextChunker.ChunkText({
        Text: article,
        MaxChunkTokens: 128,
        Strategy: 'sentence'
    });
    
    for (const chunk of chunks) {
        console.log(`Chunk ${chunk.Index}: ${chunk.TokenCount} tokens, offset ${chunk.StartOffset}-${chunk.EndOffset}`);
        console.log(chunk.Text);
    }

    Paragraph Strategy

    const markdownDoc = `## Introduction
    
    This document covers the architecture of our data pipeline.
    It handles ingestion, transformation, and storage.
    
    ## Processing
    
    Records are validated against schema constraints.
    Invalid records are routed to a dead-letter queue.
    
    ## Storage
    
    Processed data is stored in both relational and vector databases.
    Vector embeddings enable semantic search across all records.`;
    
    const chunks = TextChunker.ChunkText({
        Text: markdownDoc,
        MaxChunkTokens: 256,
        Strategy: 'paragraph'
    });
    // Each paragraph becomes a chunk (or paragraphs merge if they fit together)

    Fixed Strategy

    const logData = `2024-01-15T10:00:00Z INFO Server started on port 4000
    2024-01-15T10:00:01Z INFO Connected to database
    2024-01-15T10:00:02Z WARN High memory usage detected: 85%
    2024-01-15T10:00:03Z ERROR Connection timeout after 30000ms`;
    
    const chunks = TextChunker.ChunkText({
        Text: logData,
        MaxChunkTokens: 64,
        Strategy: 'fixed'
    });

    Configuring Overlap

    Overlap repeats trailing content from the previous chunk at the start of the next chunk, preserving context across chunk boundaries. Defaults to 10% of MaxChunkTokens.

    // Explicit overlap: 50 tokens of shared context between chunks
    const chunks = TextChunker.ChunkText({
        Text: longDocument,
        MaxChunkTokens: 512,
        OverlapTokens: 50,
        Strategy: 'sentence'
    });
    
    // No overlap
    const chunks = TextChunker.ChunkText({
        Text: longDocument,
        MaxChunkTokens: 512,
        OverlapTokens: 0,
        Strategy: 'sentence'
    });

    Token Estimation

    EstimateTokenCount provides a fast approximation using the ~4 characters per token heuristic for English text. This is suitable for chunking where exact counts are not critical.

    const tokens = TextChunker.EstimateTokenCount('This is a sample sentence.');
    // Returns: 7 (26 characters / 4, rounded up)
    
    // For production accuracy with specific models, use tiktoken directly
    // and pass the result to MaxChunkTokens for precise control

    TextChunk Output Shape

    Each chunk includes full position metadata for traceability back to the source:

    interface TextChunk {
        Text: string;        // The chunk text content
        StartOffset: number; // Start character offset in original text
        EndOffset: number;   // End character offset (exclusive)
        TokenCount: number;  // Approximate token count
        Index: number;       // 0-based chunk index
    }

    TextExtractor

    Static utilities for extracting clean plain text from various content formats. Dependency-light (regex-based, no DOM parser required).

    HTML Extraction

    import { TextExtractor } from '@memberjunction/ai-vectors';
    
    const html = `
    <html>
    <head><style>body { color: red; }</style></head>
    <body>
      <h1>Welcome</h1>
      <p>This is a <strong>formatted</strong> paragraph with &amp; entities.</p>
      <script>alert('removed');</script>
      <ul>
        <li>Item one</li>
        <li>Item two</li>
      </ul>
    </body>
    </html>`;
    
    const text = TextExtractor.ExtractFromHTML(html);
    // "Welcome\nThis is a formatted paragraph with & entities.\nItem one\nItem two"

    What it does:

    • Removes <script> and <style> elements entirely
    • Converts block-level elements (<p>, <div>, <h1>-<h6>, <li>, <br>, etc.) to newlines
    • Strips all remaining HTML tags
    • Decodes named entities (&amp;, &lt;, &gt;, &quot;, &nbsp;, &mdash;, &hellip;, etc.)
    • Decodes numeric entities (decimal &#169; and hex &#xA9;)
    • Normalizes whitespace (collapses runs of spaces, limits consecutive newlines to 2)
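
As a rough illustration, the steps above can be sketched with regexes alone. This is not the package's code: `extractFromHTML` is a stand-in name, and the entity table here is abbreviated to a few named entities.

```typescript
// Illustrative regex-based HTML-to-text pipeline (NOT the package's code).
// Mirrors the bullets above: drop script/style, break on block elements,
// strip remaining tags, decode entities, normalize whitespace.

const NAMED_ENTITIES: Record<string, string> = {
  '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&nbsp;': ' ',
};

function extractFromHTML(html: string): string {
  let text = html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, '')            // remove script/style blocks
    .replace(/<\/?(p|div|h[1-6]|li|br|ul|ol)\b[^>]*>/gi, '\n') // block elements -> newlines
    .replace(/<[^>]+>/g, '');                                  // strip remaining tags
  text = text
    .replace(/&[a-z]+;/gi, e => NAMED_ENTITIES[e.toLowerCase()] ?? e)                  // named
    .replace(/&#(\d+);/g, (_, d) => String.fromCharCode(Number(d)))                    // decimal
    .replace(/&#x([0-9a-f]+);/gi, (_, h) => String.fromCharCode(parseInt(h, 16)));     // hex
  return text
    .replace(/[ \t]+/g, ' ')          // collapse runs of spaces/tabs
    .replace(/[ \t]*\n[ \t]*/g, '\n') // trim spaces around newlines
    .replace(/\n{3,}/g, '\n\n')       // at most two consecutive newlines
    .trim();
}

const text = extractFromHTML(
  '<p>Fish &amp; chips</p><script>alert(1)</script><h1>Done</h1>'
);
```

Note the ordering: whitespace around newlines is trimmed with a space-only character class before capping newline runs, so paragraph breaks survive as double newlines.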

    Plain Text Normalization

    const raw = "  Some text\x00with\x07control\x1Fcharacters\n\n\n\n\nand  extra   spaces  ";
    const clean = TextExtractor.ExtractFromPlainText(raw);
    // "Some textwithcontrolcharacters\n\nand extra spaces"

    Removes control characters (\x00-\x1F except \n and \t), normalizes whitespace, trims.
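
That behavior can be sketched as follows (illustrative only; `extractFromPlainText` here is a stand-in, not the library function):

```typescript
// Illustrative sketch of plain-text normalization (NOT the package's code):
// drop control characters except \n (0x0A) and \t (0x09), collapse runs of
// spaces/tabs, cap consecutive newlines at two, then trim.
function extractFromPlainText(text: string): string {
  return text
    .replace(/[\x00-\x08\x0B-\x1F]/g, '') // control chars, keeping \n and \t
    .replace(/[ \t]+/g, ' ')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}

const clean = extractFromPlainText('  a\x00b\n\n\n\nc  d  ');
```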

    MIME-Type Routing

    // Automatically selects the right extraction method
    const fromHTML = TextExtractor.ExtractByMimeType(htmlContent, 'text/html');
    const fromPlain = TextExtractor.ExtractByMimeType(plainContent, 'text/plain');
    const fromCSV = TextExtractor.ExtractByMimeType(csvContent, 'text/csv');  // Falls back to plain text
    
    // For binary formats (PDF, DOCX), extract text with a dedicated library first,
    // then pass through ExtractFromPlainText for normalization:
    // const pdfText = await pdfParse(buffer);
    // const clean = TextExtractor.ExtractFromPlainText(pdfText);

    Token Truncation

    // Truncate text to fit within a model's context window
    const truncated = TextExtractor.TruncateToTokenLimit(veryLongText, 8192);
    // Truncates at the last whitespace boundary before the estimated character limit
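
A hedged sketch of the truncation logic, using the same ~4 chars/token heuristic (`truncateToTokenLimit` is an illustrative stand-in, not the package's implementation):

```typescript
// Illustrative truncation at a whitespace boundary (NOT the package's code):
// convert the token budget to a character budget, cut there, then back up
// to the last space so no word is split mid-way.
function truncateToTokenLimit(text: string, maxTokens: number): string {
  const maxChars = maxTokens * 4; // ~4 characters per token
  if (text.length <= maxChars) return text;
  const slice = text.slice(0, maxChars);
  const lastSpace = slice.lastIndexOf(' ');
  return lastSpace > 0 ? slice.slice(0, lastSpace) : slice;
}

const short = truncateToTokenLimit('alpha beta gamma delta', 4); // 16-char budget
```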

    VectorBase

    Abstract base class that downstream vector packages extend. Provides integrated access to MemberJunction's Metadata, RunView, and AIEngine systems.

    Class Diagram

    classDiagram
        class VectorBase {
            +Metadata : Metadata
            +RunView : RunView
            +CurrentUser : UserInfo
            #GetRecordsByEntityID(entityID, recordIDs?) BaseEntity[]
            #PageRecordsByEntityID~T~(params) T[]
            #GetAIModel(id?) MJAIModelEntityExtended
            #GetVectorDatabase(id?) MJVectorDatabaseEntity
            #RunViewForSingleValue~T~(entityName, filter) T | null
            #SaveEntity(entity) boolean
            #BuildExtraFilter(compositeKeys) string
        }

    Extending VectorBase

    import { VectorBase, PageRecordsParams } from '@memberjunction/ai-vectors';
    import { BaseEntity } from '@memberjunction/core';
    
    export class MyVectorProcessor extends VectorBase {
        async ProcessEntity(entityId: string): Promise<void> {
            // Load all records for an entity
            const records = await this.GetRecordsByEntityID(entityId);
    
            // Access configured AI models and vector databases
            const model = this.GetAIModel();       // First available embedding model
            const vectorDb = this.GetVectorDatabase(); // First available vector DB
    
            for (const record of records) {
                // Generate embeddings, upsert into vector DB
            }
        }
    
        async ProcessInPages(entityId: string): Promise<void> {
            let page = 1;
            let hasMore = true;
    
            while (hasMore) {
                const records = await this.PageRecordsByEntityID<Record<string, unknown>>({
                    EntityID: entityId,
                    PageNumber: page,
                    PageSize: 100,
                    ResultType: 'simple',
                    Filter: "Status = 'Active'"
                });
                hasMore = records.length === 100;
                page++;
            }
        }
    }

    Filtering with Composite Keys

    import { VectorBase } from '@memberjunction/ai-vectors';
    import { CompositeKey } from '@memberjunction/core';
    
    class FilteredProcessor extends VectorBase {
        async GetSpecificRecords(entityId: string): Promise<void> {
            const keys: CompositeKey[] = [
                { KeyValuePairs: [{ FieldName: 'ID', Value: 'abc-123' }] },
                { KeyValuePairs: [{ FieldName: 'ID', Value: 'def-456' }] }
            ];
    
            // Generates: (ID = 'abc-123') OR (ID = 'def-456')
            const records = await this.GetRecordsByEntityID(entityId, keys);
        }
    }
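
The OR-of-ANDs shape that BuildExtraFilter produces can be sketched like this (an illustrative reimplementation with local stand-in types, not the package's code):

```typescript
// Illustrative sketch of composite-key-to-SQL conversion (NOT the package's
// BuildExtraFilter): each composite key becomes an AND group, and the groups
// are OR'd together.
type KeyValuePair = { FieldName: string; Value: string };
type CompositeKeyLike = { KeyValuePairs: KeyValuePair[] };

function buildExtraFilter(keys: CompositeKeyLike[]): string {
  return keys
    .map(k => '(' + k.KeyValuePairs
      .map(kv => `${kv.FieldName} = '${kv.Value}'`)
      .join(' AND ') + ')')
    .join(' OR ');
}

const filter = buildExtraFilter([
  { KeyValuePairs: [{ FieldName: 'ID', Value: 'abc-123' }] },
  { KeyValuePairs: [{ FieldName: 'ID', Value: 'def-456' }] },
]);
```

Real code must escape quotes in values to guard against SQL injection; this sketch omits that for brevity.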

    API Reference

    TextChunker (Static Methods)

    | Method | Parameters | Returns | Description |
    | --- | --- | --- | --- |
    | ChunkText | params: ChunkTextParams | TextChunk[] | Split text into token-bounded chunks using the specified strategy |
    | EstimateTokenCount | text: string | number | Fast token count approximation (~4 chars/token) |

    TextExtractor (Static Methods)

    | Method | Parameters | Returns | Description |
    | --- | --- | --- | --- |
    | ExtractFromHTML | html: string | string | Strip tags, decode entities, normalize whitespace |
    | ExtractFromPlainText | text: string | string | Remove control characters, normalize whitespace |
    | ExtractByMimeType | content: string, mimeType: string | string | Route to the appropriate extraction method by MIME type |
    | TruncateToTokenLimit | text: string, maxTokens: number | string | Truncate at whitespace boundary within the token budget |

    VectorBase (Protected Methods for Subclasses)

    | Method | Returns | Description |
    | --- | --- | --- |
    | GetRecordsByEntityID(entityID, recordIDs?) | Promise<BaseEntity[]> | Load entity records, optionally filtered by composite keys |
    | PageRecordsByEntityID<T>(params) | Promise<T[]> | Paginated retrieval with configurable page size and filter |
    | GetAIModel(id?) | MJAIModelEntityExtended | Locate an embedding model by ID or get the first available |
    | GetVectorDatabase(id?) | MJVectorDatabaseEntity | Locate a vector database by ID or get the first available |
    | RunViewForSingleValue<T>(entityName, filter) | Promise<T \| null> | Query for a single entity record matching a filter |
    | SaveEntity(entity) | Promise<boolean> | Save a BaseEntity with CurrentUser context applied |
    | BuildExtraFilter(compositeKeys) | string | Convert a CompositeKey array to a SQL filter string |

    Interfaces

    | Interface | Methods | Purpose |
    | --- | --- | --- |
    | IEmbedding | createEmbedding, createBatchEmbedding | Text embedding generation |
    | IVectorDatabase | listIndexes, createIndex, deleteIndex, editIndex | Vector database management |
    | IVectorIndex | createRecord(s), getRecord(s), updateRecord(s), deleteRecord(s) | Vector record CRUD |
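
For unit testing, an embedding provider can be stubbed out. The sketch below uses the method names from the table above; the parameter and return shapes are assumptions, since the package's actual signatures are not reproduced here, and `EmbeddingLike`/`FakeEmbedding` are hypothetical names.

```typescript
// Hypothetical stand-in for the IEmbedding contract (method names from the
// interface table; shapes are assumptions, not the package's signatures).
interface EmbeddingLike {
  createEmbedding(text: string): Promise<number[]>;
  createBatchEmbedding(texts: string[]): Promise<number[][]>;
}

// Deterministic fake: hashes character codes into a fixed-size vector, which
// is enough to exercise pipeline code without calling a real model.
class FakeEmbedding implements EmbeddingLike {
  constructor(private readonly dimensions = 8) {}

  async createEmbedding(text: string): Promise<number[]> {
    const vector = new Array<number>(this.dimensions).fill(0);
    for (let i = 0; i < text.length; i++) {
      vector[i % this.dimensions] += text.charCodeAt(i);
    }
    return vector;
  }

  async createBatchEmbedding(texts: string[]): Promise<number[][]> {
    return Promise.all(texts.map(t => this.createEmbedding(t)));
  }
}

const embedder: EmbeddingLike = new FakeEmbedding(4);
```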

    Package Ecosystem

    | Package | Depends On Core | Purpose |
    | --- | --- | --- |
    | @memberjunction/ai-vectordb | No (peer) | Abstract vector database interface |
    | @memberjunction/ai-vector-sync | Yes | Entity-to-vector synchronization |
    | @memberjunction/ai-vector-dupe | Yes | Duplicate detection via vector similarity |
    | @memberjunction/ai-vectors-memory | No | In-memory vector search and clustering |
    | @memberjunction/ai-vectors-pinecone | No | Pinecone implementation of VectorDBBase |

    Further Reading

    • Text Processing Guide -- in-depth guide on chunking strategies, overlap tuning, HTML edge cases, and integration with vectorization/autotagging pipelines

    Development

    # Build
    npm run build
    
    # Run tests
    npm run test
    
    # Watch mode
    npm run test:watch

    License

    ISC