Hybrid database adapter for Payload CMS - MDX files + ClickHouse with full-text and vector search

Package Exports

  • @mdxdb/payload
  • @mdxdb/payload/embedding/processor
  • @mdxdb/payload/git/metadata
  • @mdxdb/payload/node
  • @mdxdb/payload/search/queries
  • @mdxdb/payload/utilities/isFileCollection

Readme

@mdxdb/payload

A hybrid database adapter for Payload CMS that combines human-readable MDX files with powerful ClickHouse querying. Version your content with Git while enjoying full-text search, vector similarity, and SQL-level query performance.

Why mdxdb?

| Traditional DB | mdxdb |
|----------------|-------|
| Opaque binary storage | Human-readable .mdx files |
| Vendor lock-in | Plain files, zero lock-in |
| Complex backup/restore | Git push/pull |
| Limited history | Full Git history with blame |
| Review in custom UI | Review in GitHub PRs |

Best of both worlds: Content lives as files you can read, edit, and version with Git. Queries run on ClickHouse for SQL-level performance with full-text and vector search.

Features

Storage & Versioning

  • Human-readable MDX - Documents stored as .mdx files with YAML frontmatter
  • Git-native - Full history, branching, blame, and PR workflows
  • GitHub-friendly - Edit content directly on GitHub, review changes in PRs

Query Performance

  • ClickHouse-powered - Automatic local server for fast SQL queries
  • Full-text search - Inverted indexes with relevance scoring
  • Vector similarity - HNSW indexes for semantic search
  • Smart chunking - Markdown-aware content splitting for search

AI-Ready

  • Embedding pipeline - Generate embeddings with Workers AI (@cf/baai/bge-m3)
  • 1024-dimension vectors - State-of-the-art semantic search
  • Background processing - Non-blocking embedding generation

Architecture

  • Dual storage - Content collections: files + ClickHouse
  • Database-only mode - Auth & internal collections skip files entirely
  • Namespace support - Multi-tenant and cross-app search
  • Git metadata - Track author, commit hash, and message per document

Installation

npm install @mdxdb/payload
# or
pnpm add @mdxdb/payload
# or
yarn add @mdxdb/payload

ClickHouse is downloaded automatically on first run (50MB, cached in `~/.mdxdb`).

Quick Start

// payload.config.ts
import { buildConfig } from 'payload'
import { mdxdbAdapter } from '@mdxdb/payload'

export default buildConfig({
  db: mdxdbAdapter({
    basePath: './content',
  }),
  collections: [
    {
      slug: 'posts',
      fields: [
        { name: 'title', type: 'text', required: true },
        { name: 'status', type: 'select', options: ['draft', 'published'] },
        { name: 'content', type: 'richText' },
      ],
    },
  ],
})

This creates documents like:

content/
  posts/
    my-first-post.mdx
    another-post.mdx

Configuration

mdxdbAdapter({
  // Required: Base directory for content files
  basePath: './content',

  // Optional: ClickHouse data directory (default: ~/.mdxdb/data)
  clickhousePath: '~/.mdxdb/data',

  // Optional: ClickHouse server port (default: 9000)
  clickhousePort: 9000,

  // Optional: Database name (default: 'mdxdb')
  database: 'mdxdb',

  // Optional: Namespace for multi-tenancy (default: 'default')
  ns: 'my-app',

  // Optional: Table name prefix (default: 'mdxdb')
  tablePrefix: 'mdxdb',

  // Optional: Embedding configuration
  embedding: {
    // Workers AI account ID
    accountId: process.env.CLOUDFLARE_ACCOUNT_ID,
    // Workers AI API token
    apiToken: process.env.CLOUDFLARE_API_TOKEN,
    // Model (default: '@cf/baai/bge-m3')
    model: '@cf/baai/bge-m3',
    // Batch size (default: 100)
    batchSize: 100,
  },
})
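
The defaults noted in the comments above can be pictured as a simple merge step. The sketch below is illustrative only: `resolveAdapterOptions` and the `AdapterOptions` shape are hypothetical names, not part of the package's API, and the default values are copied from the comments in the configuration example.

```typescript
// Hypothetical option shape mirroring the configuration example (embedding omitted).
interface AdapterOptions {
  basePath: string
  clickhousePath?: string
  clickhousePort?: number
  database?: string
  ns?: string
  tablePrefix?: string
}

// Illustrative helper, not the adapter's real code: fill in every unset option
// with the default documented in the configuration comments.
function resolveAdapterOptions(options: AdapterOptions): Required<AdapterOptions> {
  if (!options.basePath) {
    throw new Error('basePath is required')
  }
  return {
    basePath: options.basePath,
    clickhousePath: options.clickhousePath ?? '~/.mdxdb/data',
    clickhousePort: options.clickhousePort ?? 9000,
    database: options.database ?? 'mdxdb',
    ns: options.ns ?? 'default',
    tablePrefix: options.tablePrefix ?? 'mdxdb',
  }
}
```

Under these assumptions, passing only `basePath` and `ns` would leave the port, database, and table prefix at their documented defaults.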

Storage Model

Content Collections (Default)

Content collections use dual storage:

  • MDX files - Source of truth, human-readable, Git-versioned
  • ClickHouse - Query index for fast search and filtering

Database-Only Collections

For collections that don't need file storage (e.g., analytics, logs, sessions), use dbOnly: true:

mdxdbAdapter({
  basePath: './content',
  collections: {
    // These collections are stored in ClickHouse only, no MDX files
    analytics: { dbOnly: true },
    sessions: { dbOnly: true },
    logs: { dbOnly: true },
  },
})

Automatically database-only:

  • Auth collections (with auth: true)
  • Internal Payload collections (payload-*)

Only file-backed collections appear under the content directory:

content/
  posts/
    hello-world.mdx      # File storage
    my-second-post.mdx
  pages/
    about.mdx

Auth & Internal Collections

Auth collections (users, etc.) and internal Payload collections (payload-*) are stored in ClickHouse only - never written to files:

  • Passwords are hashed, not stored in plaintext
  • Verification tokens and reset tokens are database-only
  • No sensitive data in your Git history
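
The storage rules in this section can be summarized as a small predicate. This is a minimal sketch under stated assumptions: `usesFileStorage`, `CollectionLike`, and `CollectionOverrides` are hypothetical names for illustration. The package does export an `isFileCollection` utility, but its signature is not shown here and this sketch does not claim to match it.

```typescript
// Hypothetical, simplified view of a Payload collection config.
interface CollectionLike {
  slug: string
  auth?: boolean | object
}

// Hypothetical shape of the adapter's per-collection `collections` option.
type CollectionOverrides = Record<string, { dbOnly?: boolean }>

// Returns true when documents of this collection would also be written as MDX files.
function usesFileStorage(
  collection: CollectionLike,
  overrides: CollectionOverrides = {},
): boolean {
  if (collection.auth) return false // auth collections stay in ClickHouse only
  if (collection.slug.startsWith('payload-')) return false // internal Payload collections
  if (overrides[collection.slug]?.dbOnly) return false // explicit dbOnly: true
  return true
}
```

With these rules, a `posts` collection would be file-backed, while `users` (with auth enabled), `payload-preferences`, and any collection marked `dbOnly: true` would not.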

MDX File Format

Documents are stored as MDX files with YAML frontmatter:

---
id: my-post
status: published
publishedAt: 2024-01-15T10:30:00Z
views: 1234
createdAt: 2024-01-10T08:00:00Z
updatedAt: 2024-01-15T10:30:00Z
---

# My Post Title

This is the main content of the post. It supports **markdown** formatting,
including code blocks, lists, and more.

## Code Example

```typescript
export function hello() {
  console.log('Hello, world!')
}
```

Field Mapping

| Payload Field Type | MDX Representation |
|-------------------|-------------------|
| text, number, email, etc. | YAML frontmatter |
| richText | Markdown content after `# Title` |
| date, point, json | YAML frontmatter (serialized) |
| relationship | ID reference in frontmatter |
| array, blocks | YAML array in frontmatter |
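
To make the mapping concrete, the sketch below separates a file into its frontmatter fields and its Markdown body. It is a simplified illustration, not the adapter's parser: `splitMdx` is a hypothetical helper that only understands flat `key: value` lines and keeps every value as a string, whereas a real implementation would presumably use a full YAML parser.

```typescript
interface ParsedMdx {
  frontmatter: Record<string, string>
  body: string
}

// Split an MDX source string into frontmatter key/value pairs and the Markdown body.
// Assumption: frontmatter is a flat list of "key: value" lines between two "---" lines.
function splitMdx(source: string): ParsedMdx {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source)
  if (!match) {
    return { frontmatter: {}, body: source.trim() }
  }
  const [, rawFrontmatter = '', rawBody = ''] = match
  const frontmatter: Record<string, string> = {}
  for (const line of rawFrontmatter.split(/\r?\n/)) {
    const separator = line.indexOf(':') // first colon only, so ISO dates stay intact
    if (separator === -1) continue
    frontmatter[line.slice(0, separator).trim()] = line.slice(separator + 1).trim()
  }
  return { frontmatter, body: rawBody.trim() }
}
```

Splitting on the first colon only keeps timestamp values such as `2024-01-15T10:30:00Z` in one piece.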

Querying

Standard Payload Queries

All Payload query operators work as expected:

// Find published posts
const posts = await payload.find({
  collection: 'posts',
  where: {
    status: { equals: 'published' },
    views: { greater_than: 100 },
  },
  sort: '-publishedAt',
  limit: 10,
})

Full-Text Search

Search across document content:

import { fullTextSearch } from '@mdxdb/payload/search/queries'

const results = await fullTextSearch({
  client: adapter.client,
  database: adapter.database,
  tableName: adapter.tables.search,
  query: 'typescript react hooks',
  collection: 'posts', // optional: filter by collection
  ns: 'default', // optional: filter by namespace
  limit: 20,
})

// Returns: [{ id, docId, collection, content, path, score }]

Vector Search

Find semantically similar content:

import { vectorSearch } from '@mdxdb/payload/search/queries'

const results = await vectorSearch({
  client: adapter.client,
  database: adapter.database,
  tableName: adapter.tables.search,
  embedding: queryVector, // 1024-dimension float array
  collection: 'posts',
  limit: 10,
})

// Returns: [{ id, docId, collection, content, path, score }]

Search & Embeddings

How It Works

  1. Document Creation/Update - Content is chunked at heading boundaries
  2. Chunk Storage - Each chunk stored with path hierarchy (e.g., posts > Introduction > Getting Started)
  3. Embedding Generation - Background process generates vectors via Workers AI
  4. Search - Full-text via inverted index, semantic via HNSW vector index
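
Steps 1 and 2 can be sketched as follows. This is an illustration under assumptions rather than the package's chunker: `chunkByHeadings` and `HeadingChunk` are hypothetical names, heading text is carried in `path` instead of `content`, and the ~1500-token size cap is not enforced.

```typescript
// Hypothetical chunk shape: a subset of the SearchChunk fields.
interface HeadingChunk {
  chunkIndex: number
  path: string
  content: string
}

const FENCE = '`'.repeat(3) // Markdown code fence marker

function chunkByHeadings(collection: string, markdown: string): HeadingChunk[] {
  const chunks: HeadingChunk[] = []
  const headings: string[] = [] // headings[level - 1] holds the active title at that depth
  let buffer: string[] = []
  let inFence = false

  // Emit the buffered lines as one chunk, labelled with the current heading trail.
  const flush = (): void => {
    const content = buffer.join('\n').trim()
    if (content) {
      const path = [collection, ...headings.filter(Boolean)].join(' > ')
      chunks.push({ chunkIndex: chunks.length, path, content })
    }
    buffer = []
  }

  for (const line of markdown.split('\n')) {
    // Track fenced code blocks so "# comment" lines inside them are not treated as headings.
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence
    const match = inFence ? null : /^(#{1,6})\s+(.+)$/.exec(line)
    if (!match) {
      buffer.push(line)
      continue
    }
    flush() // a heading closes the previous chunk
    const [, hashes = '', title = ''] = match
    headings.length = hashes.length - 1 // drop headings at this depth and deeper
    headings[hashes.length - 1] = title.trim()
  }
  flush()
  return chunks
}
```

For a `posts` document with an `# Introduction` heading followed by `## Getting Started`, the text under the second heading would get the path `posts > Introduction > Getting Started`.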

Chunk Structure

interface SearchChunk {
  id: string // "{docId}_{chunkIndex}"
  ns: string // Namespace
  collection: string // Collection slug
  docId: string // Parent document ID
  chunkIndex: number // Position in document
  path: string // Heading hierarchy
  content: string // Chunk text (max ~1500 tokens)
  embedding: number[] // 1024-dim vector (when ready)
  status: 'pending' | 'ready' | 'failed'
}

Processing Pending Embeddings

import { processPendingChunks } from '@mdxdb/payload/embedding/processor'

// Process all pending chunks
await processPendingChunks({
  client: adapter.client,
  database: adapter.database,
  tableName: adapter.tables.search,
  embeddingConfig: adapter.embeddingConfig,
  batchSize: 50,
})
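
As a rough picture of what `batchSize` plausibly controls, the sketch below groups pending chunks into fixed-size batches, each of which would then be sent to the embedding model in one request. `pendingBatches` and `ChunkRef` are hypothetical names; the internals of `processPendingChunks` are not documented here and may differ.

```typescript
type ChunkStatus = 'pending' | 'ready' | 'failed'

// Hypothetical minimal chunk reference: just enough to select and group chunks.
interface ChunkRef {
  id: string
  status: ChunkStatus
}

// Select chunks still waiting for an embedding and group them into batches.
function pendingBatches<T extends ChunkRef>(chunks: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer')
  }
  const pending = chunks.filter((chunk) => chunk.status === 'pending')
  const batches: T[][] = []
  for (let start = 0; start < pending.length; start += batchSize) {
    batches.push(pending.slice(start, start + batchSize))
  }
  return batches
}
```

With 120 pending chunks and a batch size of 50, this yields batches of 50, 50, and 20.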

ClickHouse Schema

Data Table

CREATE TABLE mdxdb_data (
  ns String,                    -- Namespace
  collection String,            -- Collection slug
  id String,                    -- Document ID
  data String,                  -- JSON document data
  filepath Nullable(String),    -- File path (null for DB-only)
  gitHash Nullable(String),     -- Last commit hash
  gitAuthor Nullable(String),   -- Last commit author
  gitDate Nullable(DateTime64), -- Last commit date
  gitMessage Nullable(String),  -- Last commit message
  v DateTime64(3),              -- Version timestamp
  deletedAt Nullable(DateTime64(3))
) ENGINE = ReplacingMergeTree(v)
ORDER BY (ns, collection, id)
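
`ReplacingMergeTree(v)` keeps, for each sorting key `(ns, collection, id)`, the row with the highest `v`. Deduplication happens during background merges, so a query can still see older versions unless it uses `FINAL` or an equivalent aggregation. The sketch below models that "latest version wins" rule in memory. Treating a non-null `deletedAt` as a soft delete is an assumption based on the column name, and `currentRows` is a hypothetical helper, not part of the package.

```typescript
// Simplified row shape; timestamps are plain numbers here for brevity.
interface DataRow {
  ns: string
  collection: string
  id: string
  data: string
  v: number
  deletedAt: number | null
}

// Keep only the newest version of each (ns, collection, id) key, then drop
// rows whose newest version is marked as deleted.
function currentRows(rows: readonly DataRow[]): DataRow[] {
  const latest = new Map<string, DataRow>()
  for (const row of rows) {
    const key = JSON.stringify([row.ns, row.collection, row.id])
    const seen = latest.get(key)
    if (!seen || row.v >= seen.v) latest.set(key, row) // on ties, the later row wins
  }
  return Array.from(latest.values()).filter((row) => row.deletedAt === null)
}
```

Under this model an update is an insert with a newer `v`, and a delete is an insert whose newest version carries a `deletedAt` value.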

Search Table

CREATE TABLE mdxdb_search (
  -- ... chunk fields ...
  embedding Array(Float32),     -- 1024-dim vector
  INDEX idx_fts content TYPE full_text,
  INDEX idx_vec embedding TYPE vector_similarity('hnsw', 'cosineDistance')
) ENGINE = ReplacingMergeTree(updatedAt)
ORDER BY (ns, collection, docId, chunkIndex)
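
The vector index is built on `cosineDistance`, which is 1 minus the cosine similarity of two vectors. The sketch below computes that distance and ranks candidates by brute force, which is the exact result an HNSW index approximates at much lower cost. `nearest` is a hypothetical helper for illustration, not part of the package.

```typescript
// Cosine distance: 0 for vectors pointing the same way, 1 for orthogonal vectors.
function cosineDistance(a: readonly number[], b: readonly number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new RangeError('vectors must be non-empty and of equal length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  a.forEach((x, i) => {
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  })
  if (normA === 0 || normB === 0) {
    throw new RangeError('cosine distance is undefined for zero vectors')
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Exhaustive nearest-neighbour search: score every item, sort ascending by distance.
function nearest<T extends { embedding: readonly number[] }>(
  query: readonly number[],
  items: readonly T[],
  limit: number,
): Array<{ item: T; distance: number }> {
  return items
    .map((item) => ({ item, distance: cosineDistance(query, item.embedding) }))
    .sort((left, right) => left.distance - right.distance)
    .slice(0, limit)
}
```

Brute force is linear in the number of chunks per query, which is why an approximate index matters once the search table grows.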

Migrations

Create data transformation migrations (MongoDB-style, not SQL schema):

// migrations/20240115_add_default_status.ts
import type { MigrateUpArgs, MigrateDownArgs } from '@mdxdb/payload'

export async function up({ payload }: MigrateUpArgs): Promise<void> {
  const { docs } = await payload.find({
    collection: 'posts',
    where: { status: { exists: false } },
    limit: 0,
  })

  for (const doc of docs) {
    await payload.update({
      collection: 'posts',
      id: doc.id,
      data: { status: 'draft' },
    })
  }
}

export async function down({ payload }: MigrateDownArgs): Promise<void> {
  // Reverse the migration if needed
}

Run migrations:

payload migrate           # Run pending migrations
payload migrate:down      # Roll back last batch
payload migrate:status    # Check migration status
payload migrate:fresh     # Reset and re-run all

CLI

# Sync MDX files to ClickHouse
npx mdxdb sync --path ./content

# Watch for changes and auto-sync
npx mdxdb watch --path ./content

# Rebuild search index
npx mdxdb rebuild --path ./content

# Process pending embeddings
npx mdxdb embed --batch-size 50

Directory Structure

Default organization by collection slug:

content/
  posts/
    hello-world.mdx
    my-second-post.mdx
  pages/
    about.mdx
    contact.mdx
  site-settings.mdx      # Global

With admin groups:

content/
  blog/                   # admin.group: 'Blog'
    posts/
    categories/
  settings/               # admin.group: 'Settings'
    navigation.mdx
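
The two layouts above suggest a path rule along these lines. This is a sketch under assumptions: `resolveDocumentPath` and `toDirName` are hypothetical helpers, the file name is assumed to be the document ID plus `.mdx`, and group labels are assumed to be lowercased and hyphenated to form directory names. The `pathPattern` collection option is not modeled.

```typescript
// Hypothetical, simplified location info for a collection.
interface CollectionLocation {
  slug: string
  adminGroup?: string
}

// Turn a label such as "Blog" into a directory name such as "blog".
function toDirName(label: string): string {
  return label
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
}

// Build "<basePath>/[<group>/]<slug>/<id>.mdx" using forward slashes.
function resolveDocumentPath(basePath: string, collection: CollectionLocation, id: string): string {
  const segments = [basePath.replace(/\/+$/, '')]
  if (collection.adminGroup) segments.push(toDirName(collection.adminGroup))
  segments.push(collection.slug, `${id}.mdx`)
  return segments.join('/')
}
```

For example, a `posts` document with ID `hello-world` would resolve to `./content/posts/hello-world.mdx` without a group, and to `./content/blog/posts/hello-world.mdx` when `admin.group` is `'Blog'`.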

Git Integration

Automatic Metadata

Every document tracks its Git history:

const post = await payload.findByID({
  collection: 'posts',
  id: 'my-post',
})

// Access via ClickHouse query
// gitHash, gitAuthor, gitDate, gitMessage
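
One way such metadata could be collected is from the output of `git log -1 --format=%H%x1f%an%x1f%aI%x1f%s -- <file>`, which prints the commit hash, author name, ISO 8601 author date, and subject separated by the unit-separator byte. The sketch below only parses a line in that format. `parseGitLogLine` is a hypothetical helper; how the package's `getGitMetadata` actually works is not documented here.

```typescript
// Field names follow the git columns of the data table.
interface GitMetadata {
  gitHash: string
  gitAuthor: string
  gitDate: string
  gitMessage: string
}

const FIELD_SEPARATOR = '\x1f' // produced by %x1f in the git format string

// Parse one line of "hash<US>author<US>date<US>subject"; returns null if malformed.
function parseGitLogLine(line: string): GitMetadata | null {
  const [gitHash, gitAuthor, gitDate, ...rest] = line.trim().split(FIELD_SEPARATOR)
  if (!gitHash || !gitAuthor || !gitDate || rest.length === 0) return null
  // Re-join the remainder in case the subject itself contains the separator.
  return { gitHash, gitAuthor, gitDate, gitMessage: rest.join(FIELD_SEPARATOR) }
}
```

A control character is used as the separator because commit subjects can contain commas, pipes, or tabs.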

Best Practices

  1. Commit frequently - Small, focused commits for better history
  2. Use branches - Feature branches for content changes
  3. Review in PRs - Human-readable diffs in GitHub
  4. Automate deploys - Push to main triggers site rebuild

Node.js Client

For Node.js environments, use the /node export for the native ClickHouse client with automatic binary installation and server management:

import { connect, startServer, stopServer, isServerRunning } from '@mdxdb/payload/node'

// Automatically downloads ClickHouse binary and starts server
const client = await connect({
  database: 'mydb',
  // Optional: custom install directory (default: ~/.mdxdb)
  installDir: '~/.mdxdb',
  // Optional: custom port (default: 8123)
  port: 8123,
})

// Query using native client
const result = await client.query({
  query: 'SELECT * FROM mydb.data LIMIT 10',
  format: 'JSONEachRow',
})

Server Management

Control the ClickHouse server lifecycle:

import {
  ensureBinary,
  getInstallPaths,
  isServerRunning,
  startServer,
  stopServer,
  waitForReady,
} from '@mdxdb/payload/node'

// Check if binary is installed, download if not
const binaryPath = await ensureBinary({ installDir: '~/.mdxdb' })

// Get installation paths
const paths = getInstallPaths({ installDir: '~/.mdxdb' })
// { binaryPath, binDir, dataDir, logDir, installDir }

// Check server status
const status = await isServerRunning({ port: 8123 })
if (!status.isRunning) {
  // Start as daemon
  await startServer({
    binaryPath: paths.binaryPath,
    dataDir: paths.dataDir,
    logDir: paths.logDir,
    httpPort: 8123,
  })
}

// Wait for server to be ready (with timeout)
await waitForReady({ port: 8123, timeout: 30000 })

// Gracefully stop server
await stopServer({ port: 8123 })

Installation Paths

ClickHouse is installed to ~/.mdxdb by default:

~/.mdxdb/
  bin/
    clickhouse          # Binary (~500MB)
  data/
    config.xml          # Auto-generated config
    ...                 # ClickHouse data files
  logs/
    clickhouse-server.log
    clickhouse-server.err.log

API Reference

Adapter Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| basePath | string | required | Base directory for content files |
| collections | object | {} | Per-collection config (see below) |
| clickhousePath | string | ~/.mdxdb/data | ClickHouse data directory |
| clickhousePort | number | 9000 | ClickHouse server port |
| database | string | 'mdxdb' | Database name |
| ns | string | 'default' | Namespace for multi-tenancy |
| tablePrefix | string | 'mdxdb' | Table name prefix |

Collection Options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| dbOnly | boolean | false | Store in ClickHouse only, no MDX files |
| pathPattern | string | - | Custom file path pattern |
| template | string | - | Custom MDX template |

Exported Types

import type { MdxdbAdapter, MdxdbAdapterArgs, MigrateUpArgs, MigrateDownArgs } from '@mdxdb/payload'

Exported Functions

// Node.js client with auto-install (recommended for Node.js)
import {
  connect,
  createClickHouseClient,
  ensureBinary,
  getInstallPaths,
  isServerRunning,
  startServer,
  stopServer,
  waitForReady,
} from '@mdxdb/payload/node'

// Search
import { fullTextSearch, vectorSearch } from '@mdxdb/payload/search/queries'

// Embedding
import { processPendingChunks } from '@mdxdb/payload/embedding/processor'

// Git metadata
import { getGitMetadata } from '@mdxdb/payload/git/metadata'

// Utilities
import { isFileCollection } from '@mdxdb/payload/utilities/isFileCollection'

Performance

When to Use mdxdb

Ideal for:

  • Content-heavy sites (blogs, docs, marketing pages)
  • Git-based content workflows
  • Teams comfortable with GitHub
  • Projects needing human-readable content storage
  • Sites with semantic search requirements

Consider alternatives for:

  • High-frequency writes (>100/sec)
  • Complex multi-document transactions
  • Real-time collaborative editing
  • Very large collections (>100k documents without pagination)

Optimization Tips

  1. Use pagination - Don't fetch all documents at once
  2. Index strategically - ClickHouse handles most queries well
  3. Batch embeddings - Process in batches of 50-100
  4. Namespace separation - Use namespaces for multi-tenant isolation

Limitations

  • No transactions - ClickHouse and files don't share transactional boundaries
  • No Payload versions - Use Git history instead of Payload's version system
  • Eventual consistency - File writes may briefly lag behind ClickHouse
  • No real-time sync - Changes require explicit sync for multi-instance setups

Troubleshooting

ClickHouse won't start

# Check if port is in use
lsof -i :9000

# Remove stale lock files
rm -f ~/.mdxdb/data/clickhouse-server.pid

Embeddings not generating

# Check for pending chunks
npx mdxdb status

# Verify API credentials
echo $CLOUDFLARE_ACCOUNT_ID
echo $CLOUDFLARE_API_TOKEN

File/database mismatch

# Full resync from files
npx mdxdb rebuild --path ./content

Contributing

# Clone the repo
git clone https://github.com/mdxdb/payload.git

# Install dependencies
pnpm install

# Run tests
pnpm test

# Build
pnpm build

# Lint
pnpm lint

License

MIT


Built with love for content creators who believe in open, readable, versionable content.