Package Exports

@dotdo/pg-search
@dotdo/pg-search/blob
@dotdo/pg-search/do
@dotdo/pg-search/rpc
@dotdo/pg-search/sql
@dotdo/pg-search/sync

Readme

@dotdo/pg-search

BM25 full-text search with Orama + PGlite and Cap'n Web RPC for hibernated WebSocket connections

import { OramaBlobStore } from '@dotdo/pg-search/blob'

const store = new OramaBlobStore({
  schema: { title: 'string', content: 'string', tags: 'string[]' }
})

await store.init()
await store.insert({ title: 'Hello World', content: 'Full-text search at the edge', tags: ['search'] })

const results = await store.search({ term: 'edge' })
console.log(results.documents) // [{ id: '...', title: 'Hello World', ... }]

Installation

npm install @dotdo/pg-search @dotdo/pglite @orama/orama
# or
pnpm add @dotdo/pg-search @dotdo/pglite @orama/orama

Features

BM25 Full-Text Search - Industry-standard ranking algorithm via Orama
Three Storage Patterns - Blob, Sync, and SQL patterns for different use cases
Cap'n Web RPC - Promise pipelining over hibernated WebSockets (95% cost savings)
Durable Object Integration - SearchDO class for Cloudflare Workers
Type-Safe Client - Full TypeScript support with SearchRpcClient
PGlite Persistence - Documents stored in PostgreSQL, index in Orama

Storage Patterns

This package provides three storage patterns, each optimized for different use cases:

1. Blob Pattern (`OramaBlobStore`)

Stores the entire Orama index as a serialized blob in PGlite. Best for:

Cold start scenarios where index needs to be restored
Periodic checkpointing for durability
Simple persistence without complex sync logic

import { OramaBlobStore } from '@dotdo/pg-search/blob'

const store = new OramaBlobStore({
  schema: {
    title: 'string',
    content: 'string',
    tags: 'string[]',
  },
  indexName: 'articles',
})

await store.init()

// Insert documents
const doc = await store.insert({
  title: 'TypeScript Guide',
  content: 'Learn TypeScript from scratch with practical examples...',
  tags: ['typescript', 'javascript', 'tutorial'],
})

// Search with BM25 ranking
const results = await store.search({
  term: 'TypeScript tutorial',
  limit: 10,
  hydrate: true,
})

console.log(`Found ${results.count} matches in ${results.elapsed.formatted}`)
for (const hit of results.hits) {
  console.log(`${hit.score.toFixed(2)}: ${hit.document.title}`)
}

// Checkpoint index to PGlite for persistence
await store.checkpoint()

// Close when done
await store.close()

Trade-offs:

Checkpoint is O(n) - must serialize entire index
Space efficient for small-medium indices
No partial updates - full serialization on checkpoint

2. Sync Pattern (`OramaSyncStore`)

Maintains Orama index in sync with PGlite on every write operation. Best for:

Real-time search requirements
Strong consistency guarantees
Applications where search results must reflect latest data

import { OramaSyncStore } from '@dotdo/pg-search/sync'

const store = new OramaSyncStore({
  schema: {
    title: 'string',
    body: 'string',
    author: 'string',
  },
})

await store.init()

// Every write syncs to both PGlite and Orama
await store.insert({
  title: 'Breaking News',
  body: 'Important update about...',
  author: 'John Doe',
})

// Search immediately reflects the new document
const results = await store.search({ term: 'breaking' })

// Verify consistency between index and database
const { consistent, mismatches } = await store.verify()
if (!consistent) {
  console.log('Mismatches found:', mismatches)
  await store.rebuildIndex() // Rebuild from PGlite data
}

await store.close()

Trade-offs:

Higher write latency (dual-write to PGlite and Orama)
More complex error handling with rollback support
Built-in consistency verification

3. SQL Pattern (`OramaSqlStore`)

Registers Orama search as a virtual SQL function in PGlite. Best for:

Complex queries combining full-text search with SQL joins
Existing SQL-based applications
Hybrid queries (search + filter + aggregate)

import { OramaSqlStore } from '@dotdo/pg-search/sql'

const store = new OramaSqlStore({
  schema: {
    name: 'string',
    description: 'string',
    price: 'number',
  },
  tableName: 'products',
})

await store.init()

// Insert products
await store.insertMany([
  { name: 'Laptop', description: 'Powerful laptop for developers', price: 1299 },
  { name: 'Keyboard', description: 'Mechanical keyboard with RGB', price: 149 },
  { name: 'Monitor', description: 'Ultra-wide developer monitor', price: 499 },
])

// Execute search and store results for SQL joins
const searchId = await store.executeSearch('developer', { limit: 100 })

// Join search results with other SQL operations
const results = await store.query(`
  SELECT d.data, s.score, s.rank
  FROM products d
  JOIN _orama_search_results s ON d.id = s.doc_id
  WHERE s.search_id = $1
    AND (d.data->>'price')::numeric < 1000
  ORDER BY s.rank
`, [searchId])

// Or use the convenience method
const combined = await store.searchWithSql('developer', {
  additionalSql: `AND (d.data->>'price')::numeric < 1000`,
})

await store.close()

Trade-offs:

More complex implementation
Function registration overhead
Results must be serializable to SQL types

Durable Object Integration

The SearchDO class provides a complete Durable Object implementation with both REST API and Cap'n Web RPC support.

Setting Up SearchDO

// src/index.ts
import { SearchDO } from '@dotdo/pg-search/do'

export { SearchDO }

export default {
  async fetch(request: Request, env: Env) {
    const id = env.SEARCH.idFromName('main')
    const stub = env.SEARCH.get(id)
    return stub.fetch(request)
  }
}

# wrangler.toml
[[durable_objects.bindings]]
name = "SEARCH"
class_name = "SearchDO"

[[migrations]]
tag = "v1"
new_classes = ["SearchDO"]

REST API

SearchDO exposes the following REST endpoints:

Method	Endpoint	Description
GET	`/search?q=term&limit=10`	Search documents
GET	`/documents/:id`	Get a document by ID
POST	`/documents`	Index a new document
POST	`/documents/batch`	Index multiple documents
DELETE	`/documents/:id`	Delete a document
GET	`/stats`	Get index statistics
POST	`/indexes`	Create a new search index
GET	`/health`	Health check

Cap'n Web RPC Client

For high-performance scenarios, use the typed RPC client with promise pipelining:

import { SearchRpcClient } from '@dotdo/pg-search/rpc'

interface Article {
  id: string
  title: string
  content: string
  author: string
}

const client = new SearchRpcClient<Article>('wss://search.example.com/rpc', {
  autoReconnect: true,
  maxReconnectAttempts: 5,
})

// Individual calls
const results = await client.search({ term: 'typescript' })

// Get a document
const doc = await client.get('article-123')

// Update a document
await client.update('article-123', { title: 'Updated Title' })

// Close connection when done
client.close()

Promise Pipelining (Batched Operations)

Promise pipelining dramatically reduces latency by batching multiple operations into a single round-trip:

// Without pipelining: 3 round-trips (~150ms at 50ms latency)
const doc1 = await client.index({ title: 'First', content: '...' })
const doc2 = await client.index({ title: 'Second', content: '...' })
const results = await client.search({ term: 'first' })

// With pipelining: 1 round-trip (~50ms)
const batch = client.batch()
const doc1Promise = batch.index({ title: 'First', content: '...' })
const doc2Promise = batch.index({ title: 'Second', content: '...' })
const resultsPromise = batch.search({ term: 'first' })
await batch.execute()

// All promises are now resolved
const doc1 = await doc1Promise
const doc2 = await doc2Promise
const results = await resultsPromise

Search Options

All store implementations support the same search options:

interface SearchOptions {
  /** Search term - required */
  term: string

  /** Fields to search in (default: all indexed fields) */
  properties?: string[]

  /** Maximum results to return (default: 10) */
  limit?: number

  /** Offset for pagination (default: 0) */
  offset?: number

  /** Whether to hydrate full documents from PGlite (default: true) */
  hydrate?: boolean

  /** BM25 algorithm parameters */
  bm25?: {
    k?: number   // Term frequency saturation (default: 1.2)
    b?: number   // Length normalization (default: 0.75)
    d?: number   // TF normalization (default: 0.5)
  }
}

Search Examples

// Basic search
const results = await store.search({ term: 'javascript' })

// Search specific fields
const results = await store.search({
  term: 'tutorial',
  properties: ['title', 'tags'],
})

// Pagination
const page2 = await store.search({
  term: 'programming',
  limit: 20,
  offset: 20,
})

// Just get IDs and scores (skip hydration for performance)
const results = await store.search({
  term: 'react',
  hydrate: false,
})

// Custom BM25 parameters
const results = await store.search({
  term: 'typescript',
  bm25: {
    k: 1.5,   // Increase term frequency importance
    b: 0.5,   // Reduce document length normalization
  },
})

Understanding Full-Text Search: BM25 vs tsvector/tsquery

PostgreSQL Native: tsvector/tsquery

PostgreSQL provides built-in full-text search using tsvector (document representation) and tsquery (query representation):

-- Create a tsvector from text
SELECT to_tsvector('english', 'The quick brown fox jumps over the lazy dog');
-- Result: 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

-- Search using tsquery
SELECT * FROM documents
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'quick & fox');

How tsvector works:

Normalizes text (lowercasing, stemming)
Removes stop words ("the", "a", "is")
Creates position-aware tokens
Supports language-specific configurations

tsquery operators:

& (AND): 'cat & dog' - both terms required
| (OR): 'cat | dog' - either term
! (NOT): 'cat & !dog' - cat but not dog
<-> (FOLLOWED BY): 'quick <-> fox' - phrase search

-- Create a GIN index for fast searches
CREATE INDEX idx_content_fts ON documents USING GIN(to_tsvector('english', content));

-- Rank results using ts_rank
SELECT title, ts_rank(to_tsvector('english', content), query) AS rank
FROM documents, to_tsquery('english', 'typescript & tutorial') AS query
WHERE to_tsvector('english', content) @@ query
ORDER BY rank DESC;

Orama BM25 (This Package)

BM25 (Best Matching 25) is a probabilistic ranking algorithm used by search engines like Elasticsearch and Lucene:

// BM25 search with Orama
const results = await store.search({
  term: 'typescript tutorial',
  bm25: {
    k: 1.2,   // Term frequency saturation
    b: 0.75,  // Document length normalization
  },
})

BM25 formula components:

TF (Term Frequency): How often the term appears, with saturation (diminishing returns)
IDF (Inverse Document Frequency): Rare terms weighted more heavily
Document Length: Normalized against average document length

Comparison: tsvector/tsquery vs BM25

Feature	tsvector/tsquery	BM25 (Orama)
Ranking	Basic `ts_rank`	Probabilistic, tunable
Setup	Built into PostgreSQL	Requires Orama library
Boolean queries	Native (`&`, `	`,` !`)
Phrase search	Native (`<->`)	Supported
Relevance tuning	Weight classes (A-D)	k, b, d parameters
Stemming	Language-specific dictionaries	Automatic
Index storage	In database (GIN/GiST)	In-memory + checkpoint
Query flexibility	SQL integration	JavaScript API

Why This Package Uses BM25

Better Relevance Ranking: BM25's probabilistic model typically produces more relevant results than ts_rank

Tunable Parameters: Easily adjust ranking behavior for your use case:

// Favor exact matches over frequency
bm25: { k: 0.5, b: 0.3 }

// Favor comprehensive documents
bm25: { k: 1.5, b: 0.9 }

In-Memory Speed: Orama searches are typically sub-millisecond, while SQL queries have overhead
JavaScript Integration: Native TypeScript/JavaScript API without SQL string building

When to Use tsvector/tsquery Instead

Use PostgreSQL's native FTS when you need:

SQL Integration: Complex JOINs with search results
No Additional Dependencies: Pure PostgreSQL solution
Stored Procedures: Database-side search logic
Large Datasets: When Orama's in-memory index becomes too large

You can use the SQL pattern (OramaSqlStore) to get both:

// Combine Orama BM25 ranking with PostgreSQL queries
const searchId = await store.executeSearch('typescript', { limit: 100 })

const results = await store.query(`
  SELECT d.data, s.score
  FROM documents d
  JOIN _orama_search_results s ON d.id = s.doc_id
  WHERE s.search_id = $1
    AND d.data @> '{"category": "tutorials"}'
  ORDER BY s.rank
`, [searchId])

BM25 Full-Text Search vs Vector Similarity Search

Understanding when to use full-text search (FTS) versus vector similarity search is crucial for building effective search systems.

Full-Text Search (BM25)

BM25 (Best Matching 25) is a probabilistic ranking function that scores documents based on:

Term Frequency (TF): How often search terms appear in the document
Inverse Document Frequency (IDF): How rare the term is across all documents
Document Length Normalization: Prevents long documents from being unfairly favored

Strengths:

Exact keyword matching - "TypeScript" matches "TypeScript"
Interpretable results - you can explain why a document matched
Fast indexing - no ML model required
Low latency - typically <1ms for search
Great for structured queries - "error AND production"

Use Cases:

Documentation search
E-commerce product search with specific terms
Log search
Code search
When users search for exact terms

// BM25 search - exact keyword matching
const results = await store.search({ term: 'TypeScript React' })
// Matches documents containing "TypeScript" and/or "React"

Vector Similarity Search (pgvector)

Vector search uses embeddings (dense numerical representations) to find semantically similar content.

Strengths:

Semantic understanding - "automobile" matches "car"
Handles typos and variations naturally
Multilingual search without translation
Great for natural language queries
Finds conceptually similar content

Use Cases:

Semantic search / Q&A
Recommendation systems
Image/audio similarity
When users describe what they want in natural language

-- Vector search - semantic matching
SELECT * FROM items
ORDER BY embedding <=> query_embedding
LIMIT 10;
-- Finds conceptually similar items even without keyword overlap

Comparison Table

Aspect	BM25 (FTS)	Vector Search
Matching	Exact keywords	Semantic similarity
Query type	Keywords, boolean operators	Natural language
Speed	<1ms	1-10ms (depends on index)
Index size	Small (~10% of data)	Large (embeddings ~3KB per doc)
Interpretability	High (keyword hits)	Low (similarity scores)
Setup	Simple	Requires embedding model
Typo tolerance	Low (needs stemming)	High
Cost	CPU only	CPU + GPU for embeddings

Hybrid Search

For many applications, combining both approaches yields the best results:

interface HybridSearchOptions {
  term: string
  vector?: number[]
  bm25Weight?: number   // 0-1, default: 0.5
  vectorWeight?: number // 0-1, default: 0.5
}

// Hybrid search combines keyword relevance with semantic understanding
const results = await store.hybridSearch({
  term: 'react state management',
  vector: await embed('react state management'),
  bm25Weight: 0.4,
  vectorWeight: 0.6,
})

When to Use Each

Scenario	Recommendation
User searches "error 404 not found"	BM25 - exact error codes matter
User searches "how to fix page not loading"	Vector - semantic intent
Product search "blue nike shoes size 10"	BM25 - specific attributes
"Something similar to this article"	Vector - conceptual similarity
Code search "useEffect cleanup"	BM25 - exact function names
Documentation "how does auth work"	Hybrid - both keyword and concept

Performance Tips

Indexing

// Use insertMany for bulk operations
await store.insertMany(documents) // 10x faster than individual inserts

// Checkpoint periodically, not after every write
let writeCount = 0
for (const doc of documents) {
  await store.insert(doc)
  if (++writeCount % 100 === 0) {
    await store.checkpoint()
  }
}

Schema Design

// Index only searchable fields
const store = new OramaBlobStore({
  schema: {
    title: 'string',      // Searchable
    content: 'string',    // Searchable
    // Don't index: id, timestamps, relations
  }
})

// Store full document data in PGlite, search fields in Orama

Search Optimization

// Skip hydration when you only need IDs/scores
const results = await store.search({
  term: 'query',
  hydrate: false,  // 2-3x faster
})

// Limit fields to search
const results = await store.search({
  term: 'query',
  properties: ['title'],  // Don't search large content field
})

// Use pagination instead of large limits
const page1 = await store.search({ term: 'query', limit: 20 })
const page2 = await store.search({ term: 'query', limit: 20, offset: 20 })

Memory Management

// Close stores when done to free memory
await store.close()

// For Durable Objects, checkpoint before hibernation
export class MySearchDO {
  async alarm() {
    await this.store.checkpoint()
  }
}

API Reference

OramaBlobStore

Method	Description
`init()`	Initialize database and index
`insert(doc)`	Insert a document
`insertMany(docs)`	Insert multiple documents
`update(id, partial)`	Update a document
`delete(id)`	Delete a document
`search(options)`	Search documents
`get(id)`	Get document by ID
`getAll()`	Get all documents
`getStats()`	Get index statistics
`checkpoint(options?)`	Save index to PGlite
`restore(options?)`	Restore index from PGlite
`close()`	Close connections

OramaSyncStore

Same as OramaBlobStore plus:

Method	Description
`rebuildIndex()`	Rebuild index from PGlite data
`verify()`	Verify index/database consistency

OramaSqlStore

Same as OramaBlobStore plus:

Method	Description
`executeSearch(term, options?)`	Execute search and return ID for SQL joins
`searchWithSql(term, options?)`	Search and join in one call
`query(sql, params?)`	Execute raw SQL query
`registerFunction(name, fn)`	Register custom SQL function

SearchRpcClient

Method	Description
`search(options)`	Search documents
`index(doc)`	Index a document
`indexMany(docs)`	Index multiple documents
`get(id)`	Get document by ID
`delete(id)`	Delete a document
`update(id, partial)`	Update a document
`stats()`	Get index statistics
`createIndex(config)`	Create new index
`getAll()`	Get all documents
`batch()`	Create batch for pipelining
`close()`	Close WebSocket connection

Troubleshooting

"OramaStore not initialized"

Always call init() before using the store:

const store = new OramaBlobStore({ schema })
await store.init() // Required!
await store.search({ term: 'query' })

"No checkpoint found for index"

The index hasn't been checkpointed yet. Use failSilently option:

await store.restore({ failSilently: true })

Search returns no results

Check if documents are indexed:

const stats = await store.getStats()
console.log('Document count:', stats.documentCount)

Verify schema matches document structure:

// Schema must include all searchable fields
const store = new OramaBlobStore({
  schema: {
    title: 'string',  // Must match document property names
    content: 'string',
  }
})

Check search term:

// Orama uses word tokenization - single characters may not match
await store.search({ term: 'a' }) // May not work
await store.search({ term: 'apple' }) // Works

WebSocket connection fails

const client = new SearchRpcClient('wss://...', {
  autoReconnect: true,
  maxReconnectAttempts: 5,
  reconnectDelay: 1000,
})

Memory issues in Workers

The Orama index lives in memory. For large indices:

Use pagination instead of large result sets
Consider sharding across multiple Durable Objects
Use the SQL pattern for very large datasets

@dotdo/postgres - PostgreSQL in Cloudflare Durable Objects
@dotdo/pglite - PGlite fork optimized for Cloudflare Workers
@orama/orama - Full-text search engine

License

MIT

@dotdo/pg-search

Package Exports

Readme

@dotdo/pg-search

Installation

Features

Storage Patterns

1. Blob Pattern (OramaBlobStore)

2. Sync Pattern (OramaSyncStore)

3. SQL Pattern (OramaSqlStore)

Durable Object Integration

Setting Up SearchDO

REST API

Cap'n Web RPC Client

Promise Pipelining (Batched Operations)

Search Options

Search Examples

Understanding Full-Text Search: BM25 vs tsvector/tsquery

PostgreSQL Native: tsvector/tsquery

Orama BM25 (This Package)

Comparison: tsvector/tsquery vs BM25

Why This Package Uses BM25

When to Use tsvector/tsquery Instead

BM25 Full-Text Search vs Vector Similarity Search

Full-Text Search (BM25)

Vector Similarity Search (pgvector)

Comparison Table

Hybrid Search

When to Use Each

Performance Tips

Indexing

Schema Design

Search Optimization

Memory Management

API Reference

OramaBlobStore

OramaSyncStore

OramaSqlStore

SearchRpcClient

Troubleshooting

"OramaStore not initialized"

"No checkpoint found for index"

Search returns no results

WebSocket connection fails

Memory issues in Workers

Related Packages

Links

License

1. Blob Pattern (`OramaBlobStore`)

2. Sync Pattern (`OramaSyncStore`)

3. SQL Pattern (`OramaSqlStore`)