Package Exports

@mdxdb/payload
@mdxdb/payload/embedding/processor
@mdxdb/payload/git/metadata
@mdxdb/payload/node
@mdxdb/payload/search/queries
@mdxdb/payload/utilities/isFileCollection


@mdxdb/payload

A hybrid database adapter for Payload CMS that combines human-readable MDX files with powerful ClickHouse querying. Version your content with Git while enjoying full-text search, vector similarity, and SQL-level query performance.

Why mdxdb?

| Traditional DB | mdxdb |
|-------------------|-------------------|
| Opaque binary storage | Human-readable `.mdx` files |
| Vendor lock-in | Plain files, zero lock-in |
| Complex backup/restore | Git push/pull |
| Limited history | Full Git history with blame |
| Review in custom UI | Review in GitHub PRs |

Best of both worlds: Content lives as files you can read, edit, and version with Git. Queries run on ClickHouse for SQL-level performance with full-text and vector search.

Features

Storage & Versioning

Human-readable MDX - Documents stored as .mdx files with YAML frontmatter
Git-native - Full history, branching, blame, and PR workflows
GitHub-friendly - Edit content directly on GitHub, review changes in PRs

Query Performance

ClickHouse-powered - Automatic local server for fast SQL queries
Full-text search - Inverted indexes with relevance scoring
Vector similarity - HNSW indexes for semantic search
Smart chunking - Markdown-aware content splitting for search

AI-Ready

Embedding pipeline - Generate embeddings with Workers AI (@cf/baai/bge-m3)
1024-dimension vectors - State-of-the-art semantic search
Background processing - Non-blocking embedding generation

Architecture

Dual storage - Content collections: files + ClickHouse
Database-only mode - Auth & internal collections skip files entirely
Namespace support - Multi-tenant and cross-app search
Git metadata - Track author, commit hash, and message per document

Installation

npm install @mdxdb/payload
# or
pnpm add @mdxdb/payload
# or
yarn add @mdxdb/payload

ClickHouse is downloaded automatically on first run (~50MB, cached in `~/.mdxdb`).

Quick Start

// payload.config.ts
import { buildConfig } from 'payload'
import { mdxdbAdapter } from '@mdxdb/payload'

export default buildConfig({
  db: mdxdbAdapter({
    basePath: './content',
  }),
  collections: [
    {
      slug: 'posts',
      fields: [
        { name: 'title', type: 'text', required: true },
        { name: 'status', type: 'select', options: ['draft', 'published'] },
        { name: 'content', type: 'richText' },
      ],
    },
  ],
})

This creates documents like:

content/
  posts/
    my-first-post.mdx
    another-post.mdx

Configuration

mdxdbAdapter({
  // Required: Base directory for content files
  basePath: './content',

  // Optional: ClickHouse data directory (default: ~/.mdxdb/data)
  clickhousePath: '~/.mdxdb/data',

  // Optional: ClickHouse server port (default: 9000)
  clickhousePort: 9000,

  // Optional: Database name (default: 'mdxdb')
  database: 'mdxdb',

  // Optional: Namespace for multi-tenancy (default: 'default')
  ns: 'my-app',

  // Optional: Table name prefix (default: 'mdxdb')
  tablePrefix: 'mdxdb',

  // Optional: Embedding configuration
  embedding: {
    // Workers AI account ID
    accountId: process.env.CLOUDFLARE_ACCOUNT_ID,
    // Workers AI API token
    apiToken: process.env.CLOUDFLARE_API_TOKEN,
    // Model (default: '@cf/baai/bge-m3')
    model: '@cf/baai/bge-m3',
    // Batch size (default: 100)
    batchSize: 100,
  },
})
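
To make the defaults above concrete, here is a minimal sketch of how they could be applied to a partial configuration. The `resolveOptions` function and the `AdapterOptions` / `ResolvedOptions` types are hypothetical names for illustration only; they are not part of the package's API, and the adapter may resolve options differently.

```typescript
// Hypothetical option shapes mirroring the documented adapter options.
interface EmbeddingOptions {
  accountId?: string
  apiToken?: string
  model?: string
  batchSize?: number
}

interface AdapterOptions {
  basePath: string
  clickhousePath?: string
  clickhousePort?: number
  database?: string
  ns?: string
  tablePrefix?: string
  embedding?: EmbeddingOptions
}

interface ResolvedOptions {
  basePath: string
  clickhousePath: string
  clickhousePort: number
  database: string
  ns: string
  tablePrefix: string
  embedding?: EmbeddingOptions & { model: string; batchSize: number }
}

// Fill in the documented defaults; explicitly passed values always win.
function resolveOptions(options: AdapterOptions): ResolvedOptions {
  if (!options.basePath) {
    throw new Error('basePath is required')
  }
  const resolved: ResolvedOptions = {
    basePath: options.basePath,
    clickhousePath: options.clickhousePath ?? '~/.mdxdb/data',
    clickhousePort: options.clickhousePort ?? 9000,
    database: options.database ?? 'mdxdb',
    ns: options.ns ?? 'default',
    tablePrefix: options.tablePrefix ?? 'mdxdb',
  }
  // Embedding stays disabled unless an embedding block is provided.
  if (options.embedding) {
    resolved.embedding = {
      ...options.embedding,
      model: options.embedding.model ?? '@cf/baai/bge-m3',
      batchSize: options.embedding.batchSize ?? 100,
    }
  }
  return resolved
}
```

Only `basePath` is required, so `resolveOptions({ basePath: './content' })` would yield the documented defaults for every other field.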

Storage Model

Content Collections (Default)

Content collections use dual storage:

MDX files - Source of truth, human-readable, Git-versioned
ClickHouse - Query index for fast search and filtering

Database-Only Collections

For collections that don't need file storage (e.g., analytics, logs, sessions), use dbOnly: true:

mdxdbAdapter({
  basePath: './content',
  collections: {
    // These collections are stored in ClickHouse only, no MDX files
    analytics: { dbOnly: true },
    sessions: { dbOnly: true },
    logs: { dbOnly: true },
  },
})

Automatically database-only:

Auth collections (with auth: true)
Internal Payload collections (payload-*)

With database-only collections excluded, only file-backed collections appear under `content/`:

content/
  posts/
    hello-world.mdx      # File storage
    my-second-post.mdx
  pages/
    about.mdx

Auth & Internal Collections

Auth collections (users, etc.) and internal Payload collections (payload-*) are stored in ClickHouse only - never written to files:

Passwords are hashed, not stored in plaintext
Verification tokens and reset tokens are database-only
No sensitive data in your Git history
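
The storage rules above can be summarized as a small predicate. This is a sketch under stated assumptions: `storesFiles`, `CollectionLike`, and `CollectionOverride` are hypothetical names, and the package's own `isFileCollection` utility may have a different signature and additional rules.

```typescript
// Minimal structural view of a Payload collection config (assumption:
// only `slug` and `auth` matter for this decision).
interface CollectionLike {
  slug: string
  auth?: boolean | object
}

// Per-collection adapter override, as in `collections: { logs: { dbOnly: true } }`.
interface CollectionOverride {
  dbOnly?: boolean
}

// Returns true when a collection should be written to MDX files,
// false when it should live in ClickHouse only.
function storesFiles(
  collection: CollectionLike,
  overrides: Record<string, CollectionOverride> = {},
): boolean {
  // Auth collections are never written to files.
  if (collection.auth) return false
  // Internal Payload collections use the `payload-` prefix.
  if (collection.slug.startsWith('payload-')) return false
  // Explicit opt-out via `dbOnly: true`.
  const override = overrides[collection.slug]
  if (override && override.dbOnly) return false
  return true
}
```

For example, `storesFiles({ slug: 'posts' })` is `true`, while `storesFiles({ slug: 'users', auth: true })` is `false`.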

MDX File Format

Documents are stored as MDX files with YAML frontmatter:

---
id: my-post
status: published
publishedAt: 2024-01-15T10:30:00Z
views: 1234
createdAt: 2024-01-10T08:00:00Z
updatedAt: 2024-01-15T10:30:00Z
---

# My Post Title

This is the main content of the post. It supports **markdown** formatting,
including code blocks, lists, and more.

## Code Example

```typescript
export function hello() {
  console.log('Hello, world!')
}
```


Field Mapping

| Payload Field Type | MDX Representation |
|-------------------|-------------------|
| text, number, email, etc. | YAML frontmatter |
| richText | Markdown content after `# Title` |
| date, point, json | YAML frontmatter (serialized) |
| relationship | ID reference in frontmatter |
| array, blocks | YAML array in frontmatter |
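
As an illustration of the mapping above, the sketch below serializes a document into frontmatter plus a Markdown body. It is not the adapter's serializer: `toMdx` and `MdxDocument` are hypothetical, the title is assumed to become the `# Title` heading, and values are written as JSON literals (valid YAML flow style) instead of the block-style YAML shown in the example file. Keys are assumed to be simple identifiers that need no quoting.

```typescript
type FrontmatterValue =
  | string
  | number
  | boolean
  | null
  | FrontmatterValue[]
  | { [key: string]: FrontmatterValue }

interface MdxDocument {
  title: string // rendered as the `# Title` heading
  body: string // Markdown derived from the richText field
  fields: Record<string, FrontmatterValue> // everything else goes to frontmatter
}

// Serialize a document as YAML frontmatter followed by Markdown content.
function toMdx(doc: MdxDocument): string {
  const frontmatter = Object.keys(doc.fields)
    // JSON literals are valid YAML flow values, so no YAML library is needed.
    .map((key) => `${key}: ${JSON.stringify(doc.fields[key])}`)
    .join('\n')
  return `---\n${frontmatter}\n---\n\n# ${doc.title}\n\n${doc.body.trim()}\n`
}
```

A relationship would appear as its ID string, and an array or blocks field as a flow-style list such as `tags: ["a","b"]`.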

Querying

Standard Payload Queries

All Payload query operators work as expected:

```typescript
// Find published posts
const posts = await payload.find({
  collection: 'posts',
  where: {
    status: { equals: 'published' },
    views: { greater_than: 100 },
  },
  sort: '-publishedAt',
  limit: 10,
})
```

Full-Text Search

Search across document content:

import { fullTextSearch } from '@mdxdb/payload/search/queries'

const results = await fullTextSearch({
  client: adapter.client,
  database: adapter.database,
  tableName: adapter.tables.search,
  query: 'typescript react hooks',
  collection: 'posts', // optional: filter by collection
  ns: 'default', // optional: filter by namespace
  limit: 20,
})

// Returns: [{ id, docId, collection, content, path, score }]

Vector Similarity Search

Find semantically similar content:

import { vectorSearch } from '@mdxdb/payload/search/queries'

const results = await vectorSearch({
  client: adapter.client,
  database: adapter.database,
  tableName: adapter.tables.search,
  embedding: queryVector, // 1024-dimension float array
  collection: 'posts',
  limit: 10,
})

// Returns: [{ id, docId, collection, content, path, score }]
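
For intuition about what the vector index computes, here is a brute-force sketch of cosine-distance ranking. It is illustrative only: `cosineDistance` and `nearest` below are local helper functions, not package exports, and an HNSW index returns approximate neighbours without scanning every row. How the returned `score` relates to this distance is not specified here.

```typescript
// Cosine distance: 0 for identical directions, 1 for orthogonal vectors.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine distance is undefined for zero vectors')
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

interface EmbeddedChunk {
  id: string
  embedding: number[]
}

// Exact nearest neighbours by scanning every chunk (smallest distance first).
function nearest(query: number[], chunks: EmbeddedChunk[], limit: number) {
  return chunks
    .map((chunk) => ({ id: chunk.id, distance: cosineDistance(query, chunk.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, limit)
}
```

The query vector must come from the same embedding model as the stored chunks; otherwise the distances are not comparable.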

Search & Embeddings

How It Works

Document Creation/Update - Content is chunked at heading boundaries
Chunk Storage - Each chunk stored with path hierarchy (e.g., posts > Introduction > Getting Started)
Embedding Generation - Background process generates vectors via Workers AI
Search - Full-text via inverted index, semantic via HNSW vector index
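
The chunking step can be sketched as follows. This is a simplified illustration, not the adapter's chunker: `chunkByHeadings` is a hypothetical function, it splits only at heading boundaries, and it does not enforce a maximum chunk size. It does skip `#` lines inside fenced code blocks so that shell comments are not mistaken for headings.

```typescript
interface Chunk {
  id: string // "{docId}_{chunkIndex}"
  docId: string
  chunkIndex: number
  path: string // e.g. "posts > Introduction > Getting Started"
  content: string
}

// Matches the opening or closing line of a fenced code block.
const FENCE = /^(`{3}|~{3})/
const HEADING = /^(#{1,6})\s+(.*)$/

function chunkByHeadings(collection: string, docId: string, markdown: string): Chunk[] {
  const chunks: Chunk[] = []
  const headings: string[] = [] // headings[level - 1] = current heading at that level
  let buffer: string[] = []
  let inFence = false

  // Emit the buffered lines as one chunk under the current heading path.
  const flush = (): void => {
    const content = buffer.join('\n').trim()
    buffer = []
    if (content === '') return
    const path = [collection].concat(headings.filter((h) => Boolean(h))).join(' > ')
    const chunkIndex = chunks.length
    chunks.push({ id: `${docId}_${chunkIndex}`, docId, chunkIndex, path, content })
  }

  for (const line of markdown.split('\n')) {
    if (FENCE.test(line.trim())) inFence = !inFence
    const match = inFence ? null : HEADING.exec(line)
    if (match) {
      flush() // close the chunk that belongs to the previous heading
      const level = match[1].length
      headings.length = level - 1 // drop headings at this level and deeper
      headings[level - 1] = match[2].trim()
    }
    buffer.push(line)
  }
  flush()
  return chunks
}
```

Each heading line stays in its own chunk's `content`, so a chunk remains readable on its own when returned from search.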

Chunk Structure

interface SearchChunk {
  id: string // "{docId}_{chunkIndex}"
  ns: string // Namespace
  collection: string // Collection slug
  docId: string // Parent document ID
  chunkIndex: number // Position in document
  path: string // Heading hierarchy
  content: string // Chunk text (max ~1500 tokens)
  embedding: number[] // 1024-dim vector (when ready)
  status: 'pending' | 'ready' | 'failed'
}

Processing Pending Embeddings

import { processPendingChunks } from '@mdxdb/payload/embedding/processor'

// Process all pending chunks
await processPendingChunks({
  client: adapter.client,
  database: adapter.database,
  tableName: adapter.tables.search,
  embeddingConfig: adapter.embeddingConfig,
  batchSize: 50,
})
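
Conceptually, processing pending chunks means grouping them into batches and recording a status per chunk. The sketch below shows that bookkeeping in isolation, under stated assumptions: `planBatches` and `applyBatchResult` are hypothetical helpers, the embedding request itself is left out, and a failed request is assumed to mark the whole batch as `failed`.

```typescript
type ChunkStatus = 'pending' | 'ready' | 'failed'

interface PendingChunk {
  id: string
  content: string
  status: ChunkStatus
  embedding?: number[]
}

// Group pending chunks into batches of at most `batchSize`.
function planBatches(chunks: PendingChunk[], batchSize: number): PendingChunk[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer')
  }
  const pending = chunks.filter((chunk) => chunk.status === 'pending')
  const batches: PendingChunk[][] = []
  for (let i = 0; i < pending.length; i += batchSize) {
    batches.push(pending.slice(i, i + batchSize))
  }
  return batches
}

// Apply the outcome of one embedding request to its batch.
// `vectors` is null when the request failed; otherwise vectors[i] belongs to batch[i].
function applyBatchResult(batch: PendingChunk[], vectors: number[][] | null): PendingChunk[] {
  return batch.map((chunk, i) => {
    const vector = vectors ? vectors[i] : undefined
    return vector
      ? { ...chunk, status: 'ready' as const, embedding: vector }
      : { ...chunk, status: 'failed' as const }
  })
}
```

Keeping failures per batch means one bad request does not block the rest of the queue, and failed chunks can be retried later.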

ClickHouse Schema

Data Table

CREATE TABLE mdxdb_data (
  ns String,                    -- Namespace
  collection String,            -- Collection slug
  id String,                    -- Document ID
  data String,                  -- JSON document data
  filepath Nullable(String),    -- File path (null for DB-only)
  gitHash Nullable(String),     -- Last commit hash
  gitAuthor Nullable(String),   -- Last commit author
  gitDate Nullable(DateTime64), -- Last commit date
  gitMessage Nullable(String),  -- Last commit message
  v DateTime64(3),              -- Version timestamp
  deletedAt Nullable(DateTime64(3))
) ENGINE = ReplacingMergeTree(v)
ORDER BY (ns, collection, id)
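
`ReplacingMergeTree(v)` deduplicates rows that share the same sorting key during background merges, keeping the row with the highest `v`. Until a merge happens, several versions of a document can coexist, so reads have to resolve the current one. The sketch below mirrors that resolution in memory; `currentRows` is a hypothetical helper, and treating a non-null `deletedAt` as a soft delete is an assumption about how the adapter uses that column.

```typescript
interface DataRow {
  ns: string
  collection: string
  id: string
  data: string // JSON document data
  v: number // version timestamp (milliseconds since epoch)
  deletedAt: number | null
}

// Keep the highest-version row per (ns, collection, id), then drop soft-deleted rows.
function currentRows(rows: DataRow[]): DataRow[] {
  const latest = new Map<string, DataRow>()
  for (const row of rows) {
    const key = JSON.stringify([row.ns, row.collection, row.id])
    const seen = latest.get(key)
    // On equal versions the later row wins, matching insertion order.
    if (!seen || row.v >= seen.v) latest.set(key, row)
  }
  const result: DataRow[] = []
  latest.forEach((row) => {
    if (row.deletedAt === null) result.push(row)
  })
  return result
}
```

The deleted-row filter runs after version resolution, so a delete written as a newer version hides all older versions of the same document.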

Search Table

CREATE TABLE mdxdb_search (
  -- ... chunk fields ...
  embedding Array(Float32),     -- 1024-dim vector
  INDEX idx_fts content TYPE full_text,
  INDEX idx_vec embedding TYPE vector_similarity('hnsw', 'cosineDistance')
) ENGINE = ReplacingMergeTree(updatedAt)
ORDER BY (ns, collection, docId, chunkIndex)

Migrations

Migrations transform existing documents through the Payload API (MongoDB-style) rather than altering a SQL schema:

// migrations/20240115_add_default_status.ts
import type { MigrateUpArgs, MigrateDownArgs } from '@mdxdb/payload'

export async function up({ payload }: MigrateUpArgs): Promise<void> {
  const { docs } = await payload.find({
    collection: 'posts',
    where: { status: { exists: false } },
    limit: 0,
  })

  for (const doc of docs) {
    await payload.update({
      collection: 'posts',
      id: doc.id,
      data: { status: 'draft' },
    })
  }
}

export async function down({ payload }: MigrateDownArgs): Promise<void> {
  // Reverse the migration if needed
}

Run migrations:

payload migrate           # Run pending migrations
payload migrate:down      # Roll back last batch
payload migrate:status    # Check migration status
payload migrate:fresh     # Reset and re-run all

CLI

# Sync MDX files to ClickHouse
npx mdxdb sync --path ./content

# Watch for changes and auto-sync
npx mdxdb watch --path ./content

# Rebuild search index
npx mdxdb rebuild --path ./content

# Process pending embeddings
npx mdxdb embed --batch-size 50

Directory Structure

Default organization by collection slug:

content/
  posts/
    hello-world.mdx
    my-second-post.mdx
  pages/
    about.mdx
    contact.mdx
  site-settings.mdx      # Global

With admin groups:

content/
  blog/                   # admin.group: 'Blog'
    posts/
    categories/
  settings/               # admin.group: 'Settings'
    navigation.mdx
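
The two layouts above follow a simple rule, sketched below. This is an illustration, not the adapter's resolver: `resolveFilePath` is a hypothetical function, the group folder is assumed to be the lowercased `admin.group` value, and a custom `pathPattern` would override this default.

```typescript
interface PathInput {
  basePath: string
  slug: string // collection or global slug
  id?: string // document ID; omitted for globals
  group?: string // value of admin.group, if any
}

// Build the default file path for a document or global.
function resolveFilePath({ basePath, slug, id, group }: PathInput): string {
  const segments = [basePath.replace(/\/+$/, '')] // strip trailing slashes
  if (group) segments.push(group.toLowerCase())
  if (id === undefined) {
    // Globals are a single file named after their slug.
    segments.push(`${slug}.mdx`)
  } else {
    // Collection documents live in a folder named after the collection slug.
    segments.push(slug, `${id}.mdx`)
  }
  return segments.join('/')
}
```

For example, a `posts` document with ID `hello-world` in the `Blog` group resolves to `./content/blog/posts/hello-world.mdx`.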

Git Integration

Automatic Metadata

Every document tracks its Git history:

const post = await payload.findByID({
  collection: 'posts',
  id: 'my-post',
})

// Git metadata is stored in the ClickHouse data table and is read via a
// ClickHouse query: gitHash, gitAuthor, gitDate, gitMessage
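
One way such metadata can be collected is from `git log -1 --format=%H%x1f%an%x1f%aI%x1f%s -- <file>`, which prints the last commit's hash, author name, ISO 8601 author date, and subject separated by the ASCII unit separator. The sketch below parses that output. `parseGitLogLine` is a hypothetical helper and is unrelated to the signature of the package's `getGitMetadata`, which is not documented here.

```typescript
interface GitMetadata {
  gitHash: string
  gitAuthor: string
  gitDate: string // ISO 8601, as printed by %aI
  gitMessage: string
}

// Format string for `git log -1 --format=<...> -- <file>`.
const GIT_LOG_FORMAT = '%H%x1f%an%x1f%aI%x1f%s'
// %x1f in the format emits this separator, which should not occur in commit subjects.
const SEPARATOR = '\u001f'

// Parse one line of output; returns null when the file has no commits.
function parseGitLogLine(line: string): GitMetadata | null {
  const parts = line.trim().split(SEPARATOR)
  if (parts.length < 4 || parts[0] === '') return null
  const [gitHash, gitAuthor, gitDate] = parts
  // Re-join in case the subject itself contained the separator.
  const gitMessage = parts.slice(3).join(SEPARATOR)
  return { gitHash, gitAuthor, gitDate, gitMessage }
}
```

Using a control character as the separator avoids ambiguity with characters such as `|` or `,` that commonly appear in author names and commit subjects.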

Best Practices

Commit frequently - Small, focused commits for better history
Use branches - Feature branches for content changes
Review in PRs - Human-readable diffs in GitHub
Automate deploys - Push to main triggers site rebuild

Node.js Client

For Node.js environments, use the /node export for the native ClickHouse client with automatic binary installation and server management:

import { connect, startServer, stopServer, isServerRunning } from '@mdxdb/payload/node'

// Automatically downloads ClickHouse binary and starts server
const client = await connect({
  database: 'mydb',
  // Optional: custom install directory (default: ~/.mdxdb)
  installDir: '~/.mdxdb',
  // Optional: custom port (default: 8123)
  port: 8123,
})

// Query using native client
const result = await client.query({
  query: 'SELECT * FROM mydb.data LIMIT 10',
  format: 'JSONEachRow',
})

Server Management

Control the ClickHouse server lifecycle:

import {
  ensureBinary,
  getInstallPaths,
  isServerRunning,
  startServer,
  stopServer,
  waitForReady,
} from '@mdxdb/payload/node'

// Check if binary is installed, download if not
const binaryPath = await ensureBinary({ installDir: '~/.mdxdb' })

// Get installation paths
const paths = getInstallPaths({ installDir: '~/.mdxdb' })
// { binaryPath, binDir, dataDir, logDir, installDir }

// Check server status
const status = await isServerRunning({ port: 8123 })
if (!status.isRunning) {
  // Start as daemon
  await startServer({
    binaryPath: paths.binaryPath,
    dataDir: paths.dataDir,
    logDir: paths.logDir,
    httpPort: 8123,
  })
}

// Wait for server to be ready (with timeout)
await waitForReady({ port: 8123, timeout: 30000 })

// Gracefully stop server
await stopServer({ port: 8123 })

Installation Paths

ClickHouse is installed to ~/.mdxdb by default:

~/.mdxdb/
  bin/
    clickhouse          # Binary (~500MB)
  data/
    config.xml          # Auto-generated config
    ...                 # ClickHouse data files
  logs/
    clickhouse-server.log
    clickhouse-server.err.log

API Reference

Adapter Options

| Option | Type | Default | Description |
|-------------------|-------------------|-------------------|-------------------|
| `basePath` | `string` | required | Base directory for content files |
| `collections` | `object` | `{}` | Per-collection config (see below) |
| `clickhousePath` | `string` | `~/.mdxdb/data` | ClickHouse data directory |
| `clickhousePort` | `number` | `9000` | ClickHouse server port |
| `database` | `string` | `'mdxdb'` | Database name |
| `ns` | `string` | `'default'` | Namespace for multi-tenancy |
| `tablePrefix` | `string` | `'mdxdb'` | Table name prefix |

Collection Options:

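| Option | Type | Default | Description |
|-------------------|-------------------|-------------------|-------------------|
| `dbOnly` | `boolean` | `false` | Store in ClickHouse only, no MDX files |
| `pathPattern` | `string` | - | Custom file path pattern |
| `template` | `string` | - | Custom MDX template |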

Exported Types

import type { MdxdbAdapter, MdxdbAdapterArgs, MigrateUpArgs, MigrateDownArgs } from '@mdxdb/payload'

Exported Functions

// Node.js client with auto-install (recommended for Node.js)
import {
  connect,
  createClickHouseClient,
  ensureBinary,
  getInstallPaths,
  isServerRunning,
  startServer,
  stopServer,
  waitForReady,
} from '@mdxdb/payload/node'

// Search
import { fullTextSearch, vectorSearch } from '@mdxdb/payload/search/queries'

// Embedding
import { processPendingChunks } from '@mdxdb/payload/embedding/processor'

// Git metadata
import { getGitMetadata } from '@mdxdb/payload/git/metadata'

// Utilities
import { isFileCollection } from '@mdxdb/payload/utilities/isFileCollection'

Performance

When to Use mdxdb

Ideal for:

Content-heavy sites (blogs, docs, marketing pages)
Git-based content workflows
Teams comfortable with GitHub
Projects needing human-readable content storage
Sites with semantic search requirements

Consider alternatives for:

High-frequency writes (>100/sec)
Complex multi-document transactions
Real-time collaborative editing
Very large collections (>100k documents without pagination)

Optimization Tips

Use pagination - Don't fetch all documents at once
Index strategically - ClickHouse handles most queries well
Batch embeddings - Process in batches of 50-100
Namespace separation - Use namespaces for multi-tenant isolation

Limitations

No transactions - ClickHouse and files don't share transactional boundaries
No Payload versions - Use Git history instead of Payload's version system
Eventual consistency - File writes may briefly lag behind ClickHouse
No real-time sync - Changes require explicit sync for multi-instance setups

Troubleshooting

ClickHouse won't start

# Check if port is in use
lsof -i :9000

# Remove a stale PID file (only when no ClickHouse server is running)
rm -f ~/.mdxdb/data/clickhouse-server.pid

Embeddings not generating

# Check for pending chunks
npx mdxdb status

# Verify API credentials
echo $CLOUDFLARE_ACCOUNT_ID
echo $CLOUDFLARE_API_TOKEN

File/database mismatch

# Full resync from files
npx mdxdb rebuild --path ./content

Contributing

# Clone the repo
git clone https://github.com/mdxdb/payload.git

# Install dependencies
pnpm install

# Run tests
pnpm test

# Build
pnpm build

# Lint
pnpm lint

License

MIT

Built with love for content creators who believe in open, readable, versionable content.