Package Exports
- @schemaloom/core
- @schemaloom/core/server
SchemaLoom
A TypeScript library for AI-powered data extraction and schema validation with integrated web server capabilities.
Overview
SchemaLoom provides a robust framework for extracting structured data from unstructured text using AI models. The library combines the power of LangChain and Google Gemini with flexible schema validation through Zod, offering both programmatic and web-based interfaces.
Features
- AI-Powered Extraction: Leverage Google Gemini models for intelligent data extraction
- Schema Validation: Comprehensive schema definition and validation using Zod
- Flexible Chunking: Configurable text chunking with overlap control
- Multiple Interfaces: Use as a library or deploy as a web service
- Type Safety: Full TypeScript support with comprehensive type definitions
- Extensible Architecture: Support for custom schemas and extraction logic
- Production Ready: Built-in error handling, validation, and monitoring
Installation
npm install schemaloom
Quick Start
Library Usage
import { SchemaLoomExtractor, GeminiProvider, EventListSchema } from 'schemaloom';
// Initialize extractor with custom configuration
const extractor = new SchemaLoomExtractor({
  chunkSize: 100000,
  chunkOverlap: 0,
  temperature: 0
});
// Configure AI provider
const provider = new GeminiProvider({
  model: "gemini-2.5-flash",
  temperature: 0
});
// Extract structured data
const result = await extractor.extract(
  "Your text content here...",
  EventListSchema,
  provider.getLLM()
);
console.log(result.data); // Extracted data
console.log(result.metadata); // Processing information
Web Server Deployment
import { startServer } from 'schemaloom/server';
// Start server with custom configuration
startServer({
  port: 3000,
  host: 'localhost'
});
Core Components
SchemaLoomExtractor
The primary extraction engine that handles text processing, chunking, and AI model interaction.
Configuration Options
interface ExtractionOptions {
  chunkSize?: number; // Default: 100000
  chunkOverlap?: number; // Default: 0
  temperature?: number; // Default: 0
  model?: string; // Default: "gemini-2.5-flash"
}
Methods
extract<T>(text: string, schema: z.ZodSchema<T>, provider: any): Promise<ExtractionResult<T>>
extractBatch<T>(chunks: TextChunk[], schema: z.ZodSchema<T>, provider: any): Promise<ExtractionResult<T[]>>
getOptions(): ExtractionOptions
updateOptions(newOptions: Partial<ExtractionOptions>): void
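Because extractBatch resolves to a single ExtractionResult<T[]>, per-chunk outputs have to be folded into one result at some point. The README does not say how the library does this, so the following is only a minimal sketch of one plausible merge step. ChunkOutcome, MergedResult, and mergeOutcomes are hypothetical names, not part of SchemaLoom's API; the result shape mirrors the ExtractionResult fields documented under Error Handling.

```typescript
// Hypothetical sketch: none of these names come from SchemaLoom itself.
// A per-chunk outcome is either a parsed value or an error message.
type ChunkOutcome<T> =
  | { ok: true; index: number; value: T }
  | { ok: false; index: number; error: string };

// Mirrors the documented ExtractionResult fields, with data as an array.
interface MergedResult<T> {
  data: T[];
  chunks: number;
  processingTime: number;
  errors?: string[];
}

function mergeOutcomes<T>(
  outcomes: ChunkOutcome<T>[],
  processingTime: number
): MergedResult<T> {
  // Restore chunk order in case outcomes arrived out of order.
  const ordered = outcomes.slice().sort((a, b) => a.index - b.index);
  const data: T[] = [];
  const errors: string[] = [];
  for (const outcome of ordered) {
    if (outcome.ok) {
      data.push(outcome.value);
    } else {
      errors.push(`chunk ${outcome.index}: ${outcome.error}`);
    }
  }
  // Only attach `errors` when at least one chunk failed.
  return errors.length > 0
    ? { data, chunks: ordered.length, processingTime, errors }
    : { data, chunks: ordered.length, processingTime };
}

const merged = mergeOutcomes<{ name: string }>(
  [
    { ok: false, index: 1, error: "model returned invalid JSON" },
    { ok: true, index: 0, value: { name: "Launch party" } }
  ],
  42
);
console.log(merged);
```

Keeping failed chunks as error strings rather than throwing lets a caller use partial data and still see which chunks need a retry.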
GeminiProvider
Manages Google Gemini AI model interactions with configurable parameters.
Configuration
interface GeminiOptions {
  model?: string; // Default: "gemini-2.5-flash"
  temperature?: number; // Default: 0
}
Methods
getLLM(): ChatGoogleGenerativeAI
updateConfig(options: Partial<GeminiOptions>): void
Predefined Schemas
The library includes several pre-built schemas for common use cases:
Event Schema
const EventSchema = z.object({
  name: z.string(),
  date: z.string(),
  place: z.string()
});
Product Schema
const ProductSchema = z.object({
  name: z.string(),
  price: z.number(),
  category: z.string(),
  description: z.string().optional(),
  brand: z.string().optional(),
  sku: z.string().optional()
});
Article Schema
const ArticleSchema = z.object({
  title: z.string(),
  author: z.string().optional(),
  publishDate: z.string().optional(),
  content: z.string().optional(),
  tags: z.array(z.string()).optional(),
  summary: z.string().optional()
});
Contact Schema
const ContactSchema = z.object({
  name: z.string(),
  email: z.string().optional(),
  phone: z.string().optional(),
  company: z.string().optional(),
  position: z.string().optional(),
  address: z.string().optional()
});
Invoice Schema
const InvoiceSchema = z.object({
  invoiceNumber: z.string(),
  date: z.string(),
  dueDate: z.string().optional(),
  customer: z.string(),
  items: z.array(z.object({
    description: z.string(),
    quantity: z.number(),
    unitPrice: z.number(),
    total: z.number()
  })),
  subtotal: z.number(),
  tax: z.number().optional(),
  total: z.number()
});
Web Server API
Endpoints
- GET /health - Health check endpoint
- GET /extract - Schema and parameter documentation
- POST /extract - Data extraction using predefined schemas
- POST /extract/custom - Data extraction using custom schema definitions
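POST /extract takes its schema name and tuning options (schema, chunkSize, chunkOverlap, temperature) in the query string. As a rough illustration of assembling such a URL on the client side, here is a small sketch. buildExtractUrl and ExtractQuery are hypothetical helpers, not something the package exports, and the base URL assumes the default host and port from ServerOptions.

```typescript
// Hypothetical client-side helper; not part of SchemaLoom.
// Field names mirror the query parameters documented for POST /extract.
interface ExtractQuery {
  schema: string;
  chunkSize?: number;
  chunkOverlap?: number;
  temperature?: number;
}

function buildExtractUrl(baseUrl: string, query: ExtractQuery): string {
  const entries: [string, string | number | undefined][] = [
    ["schema", query.schema],
    ["chunkSize", query.chunkSize],
    ["chunkOverlap", query.chunkOverlap],
    ["temperature", query.temperature]
  ];
  const pairs: string[] = [];
  for (const [key, value] of entries) {
    // Skip options the caller left unset so server defaults apply.
    if (value !== undefined) {
      pairs.push(`${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`);
    }
  }
  // Trim trailing slashes so the path is not doubled.
  return `${baseUrl.replace(/\/+$/, "")}/extract?${pairs.join("&")}`;
}

// Assumed base URL: the documented default host and port.
const extractUrl = buildExtractUrl("http://localhost:3000/", {
  schema: "product",
  chunkSize: 50000,
  chunkOverlap: 1000,
  temperature: 0.1
});
console.log(extractUrl);
```

The resulting string can be passed to any HTTP client together with a multipart/form-data body carrying the file field.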
Predefined Schema Extraction
POST /extract?schema=product&chunkSize=50000&chunkOverlap=1000&temperature=0.1
Content-Type: multipart/form-data
file: [your-file]
Custom Schema Extraction
POST /extract/custom
Content-Type: application/json
{
  "content": "Your text content here...",
  "schemaDefinition": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "value": {"type": "number"},
      "active": {"type": "boolean"}
    }
  },
  "chunkSize": 50000,
  "temperature": 0
}
Advanced Usage
Custom Pipeline Creation
import { createPipeline } from 'schemaloom/server';
import { Hono } from 'hono';
// Create base pipeline
const baseApp = createPipeline();
// Extend with custom functionality
const customApp = new Hono();
// Add middleware first: Hono runs middleware and handlers in registration
// order, so middleware registered after a route does not run for that route
customApp.use('*', async (c, next) => {
  console.log(`${c.req.method} ${c.req.url}`);
  await next();
});
// Add custom routes
customApp.get('/status', (c) => c.json({ status: 'healthy' }));
// Mount extraction routes
customApp.route('/api', baseApp);
Batch Processing
const chunks = [
  { content: "Text chunk 1", index: 0 },
  { content: "Text chunk 2", index: 1 }
];
// CustomSchema stands for any Zod schema defined by the caller
const result = await extractor.extractBatch(
  chunks,
  CustomSchema,
  provider.getLLM()
);
Schema Registry Access
import { SchemaRegistry } from 'schemaloom';
// Access predefined schemas
const productSchema = SchemaRegistry.product;
const articleSchema = SchemaRegistry.article;
// Use in extraction
const result = await extractor.extract(
  content,
  productSchema,
  provider.getLLM()
);
Configuration
Environment Variables
GOOGLE_API_KEY=your_google_api_key_here
Server Options
interface ServerOptions {
  port?: number; // Default: 3000
  host?: string; // Default: 'localhost'
  cors?: boolean; // Default: false
}
Error Handling
Each extraction returns a result object that carries the extracted data, processing metadata (chunk count and processing time), and any error messages:
interface ExtractionResult<T> {
  data: T;
  chunks: number;
  processingTime: number;
  errors?: string[];
}
Performance Considerations
- Chunk Size: Larger chunks reduce API calls but may impact accuracy
- Chunk Overlap: Overlap preserves context between chunks
- Temperature: Lower values (0-0.3) for factual extraction, higher (0.7-1.0) for creative tasks
- Model Selection: Choose models based on accuracy vs. speed requirements
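To make the chunk size and overlap trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not SchemaLoom's actual chunker, whose behavior the README does not specify. chunkText and Chunk are hypothetical names, and the chunk shape simply mirrors the { content, index } objects used in the batch-processing example.

```typescript
// Hypothetical sketch of character-based chunking; not SchemaLoom's implementation.
interface Chunk {
  content: string;
  index: number;
}

function chunkText(text: string, chunkSize: number, chunkOverlap: number): Chunk[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in [0, chunkSize)");
  }
  // Each new window starts (chunkSize - chunkOverlap) characters after the last,
  // so consecutive chunks share chunkOverlap characters of context.
  const step = chunkSize - chunkOverlap;
  const chunks: Chunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push({
      content: text.slice(start, start + chunkSize),
      index: chunks.length
    });
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const demoChunks = chunkText("abcdefghij", 4, 1);
console.log(demoChunks.map((c) => c.content)); // [ 'abcd', 'defg', 'ghij' ]
```

With a chunk size of 4 and an overlap of 1, the ten-character input yields three chunks, and each boundary character ("d", "g") appears in both neighbors. A larger overlap preserves more context at the cost of more chunks and therefore more model calls.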
Development
# Install dependencies
npm install
# Build library
npm run build
# Development mode with watch
npm run dev
# Start server
npm start
TypeScript Support
Full TypeScript support with comprehensive type definitions:
import type {
  ExtractionOptions,
  ExtractionResult,
  TextChunk,
  ExtractionProvider,
  ServerOptions
} from 'schemaloom';
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
MIT License - see LICENSE file for details.
Support
For issues and questions, please use the GitHub issue tracker or refer to the documentation.