JSPM

@memberjunction/ai-vectors 0.9.4

License: ISC

MemberJunction: AI Vectors Module

Package Exports

  • @memberjunction/ai-vectors
  • @memberjunction/ai-vectors/dist/index.js

This package does not declare an "exports" field, so the exports above were automatically detected and optimized by JSPM. If a package subpath is missing, consider filing an issue with the original package (@memberjunction/ai-vectors) to add "exports" field support. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

@memberjunction/ai-vectors

Core foundation package for vector operations in MemberJunction. Provides text processing utilities (chunking, extraction), base classes for vectorization pipelines, and interfaces for embedding providers and vector databases.

Installation

npm install @memberjunction/ai-vectors

What's Included

Export Type Purpose
TextChunker Class Token-aware text splitting with sentence, paragraph, and fixed strategies
TextExtractor Class HTML stripping, entity decoding, MIME-type routing, token truncation
VectorBase Class Base class providing RunView, Metadata, AIEngine integration for subclasses
IEmbedding Interface Contract for single and batch text embedding generation
IVectorDatabase Interface Contract for vector database management (create/delete/list indexes)
IVectorIndex Interface Contract for CRUD operations on vector records within an index
ChunkTextParams Type Configuration for TextChunker.ChunkText()
TextChunk Type Output chunk with text, offsets, token count, and index
PageRecordsParams Type Paginated entity record retrieval configuration

Architecture

graph TD
    subgraph Core["@memberjunction/ai-vectors"]
        TC["TextChunker"]
        TE["TextExtractor"]
        VB["VectorBase"]
        IE["IEmbedding"]
        IVD["IVectorDatabase"]
        IVI["IVectorIndex"]
    end

    subgraph MJCore["MemberJunction Core"]
        MD["Metadata"]
        RV["RunView"]
        BE["BaseEntity"]
    end

    subgraph AIEngine["AI Engine"]
        AIM["AIEngine.Instance"]
        MOD["Embedding Models"]
        VDB["Vector Databases"]
    end

    subgraph Consumers["Consumer Packages"]
        SYNC["ai-vector-sync"]
        DUPE["ai-vector-dupe"]
    end

    VB --> MD
    VB --> RV
    VB --> BE
    VB --> AIM
    AIM --> MOD
    AIM --> VDB
    SYNC --> VB
    SYNC --> TC
    SYNC --> TE
    DUPE --> VB

    style Core fill:#2d6a9f,stroke:#1a4971,color:#fff
    style MJCore fill:#2d8659,stroke:#1a5c3a,color:#fff
    style AIEngine fill:#b8762f,stroke:#8a5722,color:#fff
    style Consumers fill:#7c5295,stroke:#563a6b,color:#fff

TextChunker

Token-aware text splitting that respects natural language boundaries. All methods are static.

Strategies

Strategy Splits On Best For
sentence Sentence-ending punctuation (. ! ?) Prose, articles, descriptions
paragraph Double newlines (\n\n) Structured documents, Markdown, reports
fixed Whitespace boundaries at the character limit Logs, code, unstructured data

Basic Usage

import { TextChunker, ChunkTextParams, TextChunk } from '@memberjunction/ai-vectors';

const article = `Machine learning models require training data.
The quality of training data directly impacts model performance.
Data preprocessing is a critical step in any ML pipeline.

Feature engineering transforms raw data into meaningful representations.
Good features can dramatically improve model accuracy.`;

// Sentence strategy (default)
const chunks: TextChunk[] = TextChunker.ChunkText({
    Text: article,
    MaxChunkTokens: 128,
    Strategy: 'sentence'
});

for (const chunk of chunks) {
    console.log(`Chunk ${chunk.Index}: ${chunk.TokenCount} tokens, offset ${chunk.StartOffset}-${chunk.EndOffset}`);
    console.log(chunk.Text);
}

Paragraph Strategy

const markdownDoc = `## Introduction

This document covers the architecture of our data pipeline.
It handles ingestion, transformation, and storage.

## Processing

Records are validated against schema constraints.
Invalid records are routed to a dead-letter queue.

## Storage

Processed data is stored in both relational and vector databases.
Vector embeddings enable semantic search across all records.`;

const chunks = TextChunker.ChunkText({
    Text: markdownDoc,
    MaxChunkTokens: 256,
    Strategy: 'paragraph'
});
// Each paragraph becomes a chunk (or paragraphs merge if they fit together)

Fixed Strategy

const logData = `2024-01-15T10:00:00Z INFO Server started on port 4000
2024-01-15T10:00:01Z INFO Connected to database
2024-01-15T10:00:02Z WARN High memory usage detected: 85%
2024-01-15T10:00:03Z ERROR Connection timeout after 30000ms`;

const chunks = TextChunker.ChunkText({
    Text: logData,
    MaxChunkTokens: 64,
    Strategy: 'fixed'
});

Configuring Overlap

Overlap repeats trailing content from the previous chunk at the start of the next chunk, preserving context across chunk boundaries. Defaults to 10% of MaxChunkTokens.

// Explicit overlap: 50 tokens of shared context between chunks
const chunks = TextChunker.ChunkText({
    Text: longDocument,
    MaxChunkTokens: 512,
    OverlapTokens: 50,
    Strategy: 'sentence'
});

// No overlap
const chunksNoOverlap = TextChunker.ChunkText({
    Text: longDocument,
    MaxChunkTokens: 512,
    OverlapTokens: 0,
    Strategy: 'sentence'
});
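The default described above can be written out explicitly. A one-line sketch of the assumed behavior, where an explicit OverlapTokens (including 0) wins and the floor rounding is a guess, not confirmed package behavior:

```typescript
// Assumed behavior: explicit OverlapTokens (including 0) takes precedence;
// otherwise the default is 10% of MaxChunkTokens. Floor rounding is a guess.
const effectiveOverlap = (maxChunkTokens: number, overlapTokens?: number): number =>
    overlapTokens ?? Math.floor(maxChunkTokens * 0.1);

console.log(effectiveOverlap(512));    // 51 (default: 10% of 512)
console.log(effectiveOverlap(512, 0)); // 0  (overlap disabled)
```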

Token Estimation

EstimateTokenCount provides a fast approximation using the ~4 characters per token heuristic for English text. This is suitable for chunking where exact counts are not critical.

const tokens = TextChunker.EstimateTokenCount('This is a sample sentence.');
// Returns: 7 (26 characters / 4, rounded up)

// For production accuracy with specific models, use tiktoken directly
// and pass the result to MaxChunkTokens for precise control

TextChunk Output Shape

Each chunk includes full position metadata for traceability back to the source:

interface TextChunk {
    Text: string;        // The chunk text content
    StartOffset: number; // Start character offset in original text
    EndOffset: number;   // End character offset (exclusive)
    TokenCount: number;  // Approximate token count
    Index: number;       // 0-based chunk index
}
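Assuming offsets index into the original string and EndOffset is exclusive, as documented, `original.slice(StartOffset, EndOffset)` recovers the chunk text exactly. A self-contained illustration using a hand-built chunk rather than real ChunkText output:

```typescript
// A hand-built chunk in the TextChunk shape; real chunks come from
// TextChunker.ChunkText. Offsets are assumed to index the original string.
const source = 'Machine learning models require training data. Quality matters.';
const chunk = {
    Text: 'Machine learning models require training data.',
    StartOffset: 0,
    EndOffset: 46,   // exclusive
    TokenCount: 12,  // ~46 chars / 4, rounded up
    Index: 0,
};

// The traceability invariant downstream code can rely on:
console.log(source.slice(chunk.StartOffset, chunk.EndOffset) === chunk.Text); // true
```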

TextExtractor

Static utilities for extracting clean plain text from various content formats. Dependency-light (regex-based, no DOM parser required).

HTML Extraction

import { TextExtractor } from '@memberjunction/ai-vectors';

const html = `
<html>
<head><style>body { color: red; }</style></head>
<body>
  <h1>Welcome</h1>
  <p>This is a <strong>formatted</strong> paragraph with &amp; entities.</p>
  <script>alert('removed');</script>
  <ul>
    <li>Item one</li>
    <li>Item two</li>
  </ul>
</body>
</html>`;

const text = TextExtractor.ExtractFromHTML(html);
// "Welcome\nThis is a formatted paragraph with & entities.\nItem one\nItem two"

What it does:

  • Removes <script> and <style> elements entirely
  • Converts block-level elements (<p>, <div>, <h1>-<h6>, <li>, <br>, etc.) to newlines
  • Strips all remaining HTML tags
  • Decodes named entities (&amp;, &lt;, &gt;, &quot;, &nbsp;, &mdash;, &hellip;, etc.)
  • Decodes numeric entities (decimal &#169; and hex &#xA9;)
  • Normalizes whitespace (collapses runs of spaces, limits consecutive newlines to 2)
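The steps above can be sketched with a minimal regex pipeline. This is an illustrative re-implementation, not the package's actual code; its regexes and entity table are more complete than this:

```typescript
// Minimal re-implementation of the documented steps, for illustration only.
function stripHtml(html: string): string {
    return html
        // 1. Remove <script> and <style> elements entirely
        .replace(/<(script|style)[\s\S]*?<\/\1>/gi, '')
        // 2. Convert common block-level tags to newlines
        .replace(/<\/?(p|div|h[1-6]|li|br|ul|ol)\b[^>]*>/gi, '\n')
        // 3. Strip all remaining tags
        .replace(/<[^>]+>/g, '')
        // 4. Decode a few named and numeric (hex, then decimal) entities
        .replace(/&amp;/g, '&').replace(/&lt;/g, '<').replace(/&gt;/g, '>')
        .replace(/&quot;/g, '"').replace(/&nbsp;/g, ' ')
        .replace(/&#x([0-9a-fA-F]+);/g, (_, h) => String.fromCharCode(parseInt(h, 16)))
        .replace(/&#(\d+);/g, (_, d) => String.fromCharCode(Number(d)))
        // 5. Normalize whitespace: collapse space runs, cap newlines at 2
        .replace(/[ \t]+/g, ' ')
        .replace(/ ?\n ?/g, '\n')
        .replace(/\n{3,}/g, '\n\n')
        .trim();
}

console.log(stripHtml('<p>Fish &amp; chips</p><script>x()</script><p>Done</p>'));
// "Fish & chips\n\nDone" (the </p><p> boundary yields two newlines, capped at 2)
```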

Plain Text Normalization

const raw = "  Some text\x00with\x07control\x1Fcharacters\n\n\n\n\nand  extra   spaces  ";
const clean = TextExtractor.ExtractFromPlainText(raw);
// "Some textwithcontrolcharacters\n\nand extra spaces"

Removes control characters (\x00-\x1F except \n and \t), normalizes whitespace, trims.
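A re-implementation of these steps reproduces the output above. Illustrative only; the package's exact regexes are an assumption:

```typescript
// Remove control characters (keeping \t 0x09 and \n 0x0A), collapse space
// runs, cap consecutive newlines at 2, and trim. Not the package's code.
function normalizePlainText(text: string): string {
    return text
        .replace(/[\x00-\x08\x0B-\x1F]/g, '')
        .replace(/[ \t]+/g, ' ')
        .replace(/ ?\n ?/g, '\n')
        .replace(/\n{3,}/g, '\n\n')
        .trim();
}

const raw = '  Some text\x00with\x07control\x1Fcharacters\n\n\n\n\nand  extra   spaces  ';
console.log(normalizePlainText(raw));
// "Some textwithcontrolcharacters\n\nand extra spaces"
```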

MIME-Type Routing

// Automatically selects the right extraction method
const fromHTML = TextExtractor.ExtractByMimeType(htmlContent, 'text/html');
const fromPlain = TextExtractor.ExtractByMimeType(plainContent, 'text/plain');
const fromCSV = TextExtractor.ExtractByMimeType(csvContent, 'text/csv');  // Falls back to plain text

// For binary formats (PDF, DOCX), extract text with a dedicated library first,
// then pass through ExtractFromPlainText for normalization:
// const pdfText = await pdfParse(buffer);
// const clean = TextExtractor.ExtractFromPlainText(pdfText);

Token Truncation

// Truncate text to fit within a model's context window
const truncated = TextExtractor.TruncateToTokenLimit(veryLongText, 8192);
// Truncates at the last whitespace boundary before the estimated character limit
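Under the ~4 chars/token heuristic, truncation amounts to a character budget with a whitespace-boundary cut. A sketch of that logic; the package's exact boundary handling is an assumption:

```typescript
// maxTokens * 4 gives the estimated character budget; cut at the last
// whitespace before the budget so no word is split. Illustrative only.
function truncateToTokenLimit(text: string, maxTokens: number): string {
    const maxChars = maxTokens * 4;
    if (text.length <= maxChars) return text;
    const slice = text.slice(0, maxChars);
    const lastWhitespace = slice.search(/\s\S*$/);  // start of final partial word
    return lastWhitespace > 0 ? slice.slice(0, lastWhitespace) : slice;
}

console.log(truncateToTokenLimit('alpha beta gamma delta', 3)); // "alpha beta"
console.log(truncateToTokenLimit('short', 10));                 // "short" (unchanged)
```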

VectorBase

Abstract base class that downstream vector packages extend. Provides integrated access to MemberJunction's Metadata, RunView, and AIEngine systems.

Class Diagram

classDiagram
    class VectorBase {
        +Metadata : Metadata
        +RunView : RunView
        +CurrentUser : UserInfo
        #GetRecordsByEntityID(entityID, recordIDs?) BaseEntity[]
        #PageRecordsByEntityID~T~(params) T[]
        #GetAIModel(id?) MJAIModelEntityExtended
        #GetVectorDatabase(id?) MJVectorDatabaseEntity
        #RunViewForSingleValue~T~(entityName, filter) T | null
        #SaveEntity(entity) boolean
        #BuildExtraFilter(compositeKeys) string
    }

Extending VectorBase

import { VectorBase, PageRecordsParams } from '@memberjunction/ai-vectors';
import { BaseEntity } from '@memberjunction/core';

export class MyVectorProcessor extends VectorBase {
    async ProcessEntity(entityId: string): Promise<void> {
        // Load all records for an entity
        const records: BaseEntity[] = await this.GetRecordsByEntityID(entityId);

        // Access configured AI models and vector databases
        const model = this.GetAIModel();       // First available embedding model
        const vectorDb = this.GetVectorDatabase(); // First available vector DB

        for (const record of records) {
            // Generate embeddings, upsert into vector DB
        }
    }

    async ProcessInPages(entityId: string): Promise<void> {
        let page = 1;
        let hasMore = true;

        while (hasMore) {
            const records = await this.PageRecordsByEntityID<Record<string, unknown>>({
                EntityID: entityId,
                PageNumber: page,
                PageSize: 100,
                ResultType: 'simple',
                Filter: "Status = 'Active'"
            });
            hasMore = records.length === 100;
            page++;
        }
    }
}

Filtering with Composite Keys

import { VectorBase } from '@memberjunction/ai-vectors';
import { CompositeKey } from '@memberjunction/core';

class FilteredProcessor extends VectorBase {
    async GetSpecificRecords(entityId: string): Promise<void> {
        const keys: CompositeKey[] = [
            { KeyValuePairs: [{ FieldName: 'ID', Value: 'abc-123' }] },
            { KeyValuePairs: [{ FieldName: 'ID', Value: 'def-456' }] }
        ];

        // Generates: (ID = 'abc-123') OR (ID = 'def-456')
        const records = await this.GetRecordsByEntityID(entityId, keys);
    }
}
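The comment in the example shows the filter shape BuildExtraFilter produces for single-field keys. A hypothetical re-implementation of that OR-of-ANDs shape, with the types re-declared locally to keep the snippet self-contained; the AND joining of multi-field keys is an assumption, and real code must also escape values against SQL injection:

```typescript
// Hypothetical sketch of the OR-of-ANDs filter shape; no SQL escaping.
interface KeyValuePair { FieldName: string; Value: string }
interface CompositeKeyShape { KeyValuePairs: KeyValuePair[] }

function buildExtraFilter(keys: CompositeKeyShape[]): string {
    return keys
        .map(key =>
            '(' +
            key.KeyValuePairs.map(p => `${p.FieldName} = '${p.Value}'`).join(' AND ') +
            ')'
        )
        .join(' OR ');
}

console.log(buildExtraFilter([
    { KeyValuePairs: [{ FieldName: 'ID', Value: 'abc-123' }] },
    { KeyValuePairs: [{ FieldName: 'ID', Value: 'def-456' }] },
]));
// "(ID = 'abc-123') OR (ID = 'def-456')"
```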

API Reference

TextChunker (Static Methods)

Method Parameters Returns Description
ChunkText params: ChunkTextParams TextChunk[] Split text into token-bounded chunks using the specified strategy
EstimateTokenCount text: string number Fast token count approximation (~4 chars/token)

TextExtractor (Static Methods)

Method Parameters Returns Description
ExtractFromHTML html: string string Strip tags, decode entities, normalize whitespace
ExtractFromPlainText text: string string Remove control characters, normalize whitespace
ExtractByMimeType content: string, mimeType: string string Route to the appropriate extraction method by MIME type
TruncateToTokenLimit text: string, maxTokens: number string Truncate at whitespace boundary within the token budget

VectorBase (Protected Methods for Subclasses)

Method Returns Description
GetRecordsByEntityID(entityID, recordIDs?) Promise<BaseEntity[]> Load entity records, optionally filtered by composite keys
PageRecordsByEntityID<T>(params) Promise<T[]> Paginated retrieval with configurable page size and filter
GetAIModel(id?) MJAIModelEntityExtended Locate an embedding model by ID or get the first available
GetVectorDatabase(id?) MJVectorDatabaseEntity Locate a vector database by ID or get the first available
RunViewForSingleValue<T>(entityName, filter) Promise<T | null> Query for a single entity record matching a filter
SaveEntity(entity) Promise<boolean> Save a BaseEntity with CurrentUser context applied
BuildExtraFilter(compositeKeys) string Convert CompositeKey array to a SQL filter string

Interfaces

Interface Methods Purpose
IEmbedding createEmbedding, createBatchEmbedding Text embedding generation
IVectorDatabase listIndexes, createIndex, deleteIndex, editIndex Vector database management
IVectorIndex createRecord(s), getRecord(s), updateRecord(s), deleteRecord(s) Vector record CRUD

Package Ecosystem

Package Depends on ai-vectors Core Purpose
@memberjunction/ai-vectordb No (peer) Abstract vector database interface
@memberjunction/ai-vector-sync Yes Entity-to-vector synchronization
@memberjunction/ai-vector-dupe Yes Duplicate detection via vector similarity
@memberjunction/ai-vectors-memory No In-memory vector search and clustering
@memberjunction/ai-vectors-pinecone No Pinecone implementation of VectorDBBase

Further Reading

  • Text Processing Guide: an in-depth guide on chunking strategies, overlap tuning, HTML edge cases, and integration with vectorization/autotagging pipelines

Development

# Build
npm run build

# Run tests
npm run test

# Watch mode
npm run test:watch

License

ISC