webscribe-mcp 2.0.1 (MIT)

    WebScribe — The web's scribe. Crawls, chunks and semantically indexes any HTTP documentation for AI copilot agents.

    Package Exports

    • webscribe-mcp
    • webscribe-mcp/build/index.js

    This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (webscribe-mcp) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

    Readme

    WebScribe — The web's scribe that writes docs for your AI agents

    The web's scribe — writes any HTTP documentation into your AI agent's context.



    What is WebScribe?

    WebScribe is an MCP server that crawls any public documentation website and makes it instantly searchable by AI agents. Think of it as a scribe sitting between your copilot and the internet — reading docs, understanding their structure, and delivering precise, context-rich answers.

    Agent asks: "How do I add a loading spinner to an MUI Button?"
                        ↓
    WebScribe searches its locally indexed chunks
                        ↓
    Returns: {
      heading_path: ["Button", "Loading", "API"],
      content: "Use the `loading` prop on LoadingButton from @mui/lab.
                <LoadingButton loading variant='contained'>Submit</LoadingButton>",
      score: 4.1
    }

    No more hallucinations from outdated training data. No more full-page dumps. Just the right chunk of the right doc, every time.


    ✨ Features

    | Feature | Description |
    |---|---|
    | FTS5 Full-Text Search | SQLite FTS5 with BM25 relevance ranking — not `LIKE %query%` |
    | 🧠 Semantic Chunking | Splits pages at H1/H2/H3 boundaries with a full breadcrumb `heading_path` (see the sketch below) |
    | 🚀 Concurrent Crawling | 3 parallel Playwright pages via p-limit |
    | 🗺️ Sitemap Discovery | Auto-fetches sitemap.xml before link-following for complete coverage |
    | 🤖 robots.txt Compliance | Respects allow/disallow + Crawl-Delay per domain |
    | 🔄 Smart Re-crawl | ETag / Last-Modified check — unchanged pages are skipped automatically |
    | 🎯 Site Profiles | Bundled per-framework CSS extractors (Docusaurus, VitePress, MkDocs, Nextra…) |
    | 💻 Code Language Detection | Detects `language-*` / `prism-*` / `hljs-*` classes → explicit fenced code blocks |
    | 📦 Structured JSON Output | Search returns `{ heading_path, content, score }` arrays — not markdown blobs |
    | 🔒 MIT Only | Zero GPL / LGPL / copyleft dependencies |
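
    To make the chunking concrete, here is a minimal sketch of heading-based splitting with a breadcrumb stack. The Chunk shape and function name are illustrative; the real chunker.ts may differ in details.

    interface Chunk {
      heading_path: string[];
      content: string;
    }

    // Split Markdown at H1–H3 boundaries, carrying a breadcrumb stack.
    function chunkMarkdown(markdown: string): Chunk[] {
      const chunks: Chunk[] = [];
      const stack: string[] = [];      // current H1/H2/H3 breadcrumb
      let buffer: string[] = [];

      const flush = () => {
        const content = buffer.join("\n").trim();
        if (content) chunks.push({ heading_path: [...stack], content });
        buffer = [];
      };

      for (const line of markdown.split("\n")) {
        const m = /^(#{1,3})\s+(.*)$/.exec(line);
        if (m) {
          flush();
          // Pop headings at or below the new level, then push the new one.
          stack.length = Math.min(stack.length, m[1].length - 1);
          stack.push(m[2].trim());
        } else {
          buffer.push(line);
        }
      }
      flush();
      return chunks;
    }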

    🚀 Quick Start

    Add to your MCP client configuration (Claude Desktop, Cursor, VS Code, etc.):

    {
      "mcpServers": {
        "webscribe": {
          "command": "npx",
          "args": ["-y", "webscribe-mcp@latest"]
        }
      }
    }

    Run locally

    git clone https://github.com/yourname/http-crawl-mcp
    cd http-crawl-mcp
    npm install
    npm run dev

    Requirements: Node.js ≥ 18, macOS / Linux / Windows (WSL recommended)


    🛠️ MCP Tools

    read_and_extract_page

    Visits a single URL, strips navigation/sidebars, and returns clean Markdown with language-tagged code blocks.

    // Input
    { "url": "https://cube.dev/docs/product/data-modeling/overview" }

    crawl_documentation_site

    BFS crawl from a starting URL. Discovers pages via sitemap.xml first, then link-following.
    Respects robots.txt. Skips unchanged pages. Expands sub-tabs automatically.

    // Input
    {
      "start_url": "https://nextjs.org/docs",
      "max_pages": 50,
      "version": "v15",
      "url_glob": "**/docs/**",
      "expand_tabs": true
    }
    
    // Output
    { "crawled_count": 47, "urls": ["https://nextjs.org/docs/...", "..."] }

    search_crawled_docs

    BM25-ranked full-text search over all indexed chunks. Returns structured JSON with heading breadcrumbs.

    // Input
    { "query": "button aria label accessibility" }
    
    // Output
    {
      "results": [
        {
          "url": "https://saltdesignsystem.com/salt/components/button/accessibility",
          "title": "Button — Salt Design System",
          "heading_path": ["Button", "Accessibility", "Keyboard Navigation"],
          "content": "The button component supports Space and Enter key activation...",
          "score": 3.82,
          "version": "latest"
        }
      ]
    }

    heading_path is the key differentiator — agents get full breadcrumb context so they know exactly where in the docs the answer sits.
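
    On the query side, a single FTS5 statement via better-sqlite3 is enough. A sketch, assuming the schema outlined under Storage below; note that SQLite's bm25() returns lower-is-better (negative) scores, so negating them to match the positive scores shown above is an assumption about the real code:

    import Database from "better-sqlite3";

    const db = new Database(`${process.env.HOME}/.universal-docs-mcp/documents.db`);

    function search(query: string, max = 10) {
      return db
        .prepare(
          `SELECT c.url, c.title, c.heading_path, c.content, c.version,
                  -bm25(chunks_fts) AS score
             FROM chunks_fts
             JOIN chunks c ON c.id = chunks_fts.rowid
            WHERE chunks_fts MATCH ?
            ORDER BY score DESC
            LIMIT ?`
        )
        .all(query, max)
        .map((row: any) => ({ ...row, heading_path: JSON.parse(row.heading_path) }));
    }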


    crawl_component_docs

    Two-phase crawler for component library index pages.
    Phase 1 discovers component links. Phase 2 crawls each component + sub-tabs (/usage, /examples, /api, /accessibility) concurrently.

    { "index_url": "https://saltdesignsystem.com/salt/components/", "max_pages": 200 }

    Works with Salt DS, MUI, Ant Design, Chakra UI, Radix, and any similar design system.
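
    Conceptually, phase 1 scrapes the index page for component links and phase 2 fans out over each component's sub-tabs. A sketch with hypothetical helpers:

    const SUB_TABS = ["", "/usage", "/examples", "/api", "/accessibility"];

    // Hypothetical helpers, as in the BFS sketch above.
    declare function discoverLinks(indexUrl: string): Promise<string[]>;
    declare function processPage(url: string): Promise<string[]>;

    async function crawlComponentDocs(indexUrl: string, maxPages: number) {
      // Phase 1: collect component links from the index page.
      const components = await discoverLinks(indexUrl);

      // Phase 2: fan out over each component page and its sub-tabs.
      const targets = components
        .flatMap((base) => SUB_TABS.map((tab) => base.replace(/\/$/, "") + tab))
        .slice(0, maxPages);

      await Promise.all(targets.map((url) => processPage(url)));
    }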


    get_document

    Retrieves the full cached Markdown for a specific URL. Use when a search chunk needs more surrounding context.

    { "url": "https://nextjs.org/docs/routing/introduction", "version": "v15" }

    get_index_stats

    Returns the total pages and semantic chunks in the local index.

    // Output
    { "total_pages": 142, "total_chunks": 891 }

    ⚙️ Architecture

    ┌─────────────────────────────────────────────────────────┐
    │         MCP Client (Cursor · Claude Desktop · Copilot)  │
    └────────────────────────┬────────────────────────────────┘
                             │  JSON-RPC / stdio
    ┌────────────────────────▼────────────────────────────────┐
    │               WebScribe  MCP Server                     │
    │                   src/index.ts                          │
    └──────┬──────────────────────────┬───────────────────────┘
           │                          │
    ┌──────▼──────────┐    ┌──────────▼──────────────────────┐
    │  crawler.ts     │    │  db.ts  (SQLite WAL)             │
    │                 │    │                                  │
    │  sitemap.ts     │    │  documents_v2  url·version·etag  │
    │  robots.ts      │    │  chunks        heading_path·text │
    │  profiles.ts    │    │  chunks_fts    FTS5 BM25 index   │
    │  chunker.ts     │    └──────────────────────────────────┘
    │  utils.ts       │
    └─────────────────┘

    Crawl pipeline

    | Step | What happens |
    |---|---|
    | ① Sitemap | Fetch /sitemap.xml — handles sitemapindex recursion, seeds the full URL queue |
    | ② robots.txt | Parse once per domain, cache in-memory, skip disallowed URLs (sketched below) |
    | ③ ETag check | HEAD request — skip pages whose ETag/Last-Modified hasn't changed (sketched below) |
    | ④ Playwright | Full SPA render + reveal hidden content (show-code buttons, accordions) |
    | ⑤ Profile match | Framework-specific CSS selectors (or generic `<main>` fallback) |
    | ⑥ Noise removal | Strip nav, sidebar, footer, cookie banners, ads |
    | ⑦ Markdown | Turndown + GFM + explicit code language tags |
    | ⑧ Chunking | Split at H1/H2/H3 → `{ heading_path: [...], content: "..." }` |
    | ⑨ Persist | Full doc + chunks stored; FTS5 index updated via SQLite triggers |
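
    Steps ② and ③ in one combined sketch: robots-parser is the real dependency named below, while the per-domain cache and the conditional-request logic are assumptions about how the skip works:

    import robotsParser from "robots-parser";

    const robotsCache = new Map<string, ReturnType<typeof robotsParser>>();

    async function allowedAndChanged(url: string, cachedEtag?: string) {
      // ② robots.txt: parse once per domain, cache in memory.
      const origin = new URL(url).origin;
      if (!robotsCache.has(origin)) {
        const txt = await fetch(`${origin}/robots.txt`).then((r) => (r.ok ? r.text() : ""));
        robotsCache.set(origin, robotsParser(`${origin}/robots.txt`, txt));
      }
      // "webscribe-mcp" as the user-agent string is an assumption.
      if (robotsCache.get(origin)!.isAllowed(url, "webscribe-mcp") === false) return false;

      // ③ ETag check: HEAD request; an unchanged page is skipped.
      const head = await fetch(url, { method: "HEAD" });
      return head.headers.get("etag") !== cachedEtag; // real code likely also checks Last-Modified
    }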

    🎯 Supported Frameworks

    | Profile | Matched Sites |
    |---|---|
    | Docusaurus | React, any Docusaurus v2/v3 site |
    | VitePress | Vue.js docs, Vite, Vitest |
    | MkDocs Material | FastAPI, Django, Python ecosystem |
    | Nextra | Vercel docs, SWR, Next.js |
    | ReadTheDocs / Sphinx | Python, ReadTheDocs.io |
    | Cube.dev | cube.dev/docs |
    | Stripe | stripe.com/docs |
    | Generic | Everything else — semantic `<main>` / `[role="main"]` |
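
    A profile reduces to a detection check plus a content selector, with the generic fallback last. A sketch; the class-name selectors are illustrative guesses, not the bundled profiles:

    interface SiteProfile {
      name: string;
      match: (doc: Document) => boolean;
      contentSelector: string;
    }

    const profiles: SiteProfile[] = [
      { name: "Docusaurus", match: (d) => !!d.querySelector(".theme-doc-markdown"), contentSelector: ".theme-doc-markdown" },
      { name: "VitePress",  match: (d) => !!d.querySelector(".vp-doc"),             contentSelector: ".vp-doc" },
      // ...remaining profiles, with the generic fallback last:
      { name: "Generic",    match: () => true, contentSelector: "main, [role='main']" },
    ];

    const pickContent = (doc: Document) =>
      doc.querySelector(profiles.find((p) => p.match(doc))!.contentSelector);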

    💾 Storage

    Data is stored in ~/.universal-docs-mcp/documents.db — survives npx package updates.

    ~/.universal-docs-mcp/documents.db
    ├── documents_v2   — full page markdown per (url, version), with etag cache headers
    ├── chunks         — heading-based sections with JSON heading_path breadcrumb
    └── chunks_fts     — FTS5 virtual table powering BM25 search  

    Multiple doc versions coexist: crawl with version: "v14" and version: "v15" — they live side by side.
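
    A plausible shape for that schema, assuming an external-content FTS5 table kept in sync by triggers (as the Persist step of the pipeline mentions); column names are guesses consistent with the tool outputs above:

    import Database from "better-sqlite3";

    const db = new Database("documents.db");
    db.pragma("journal_mode = WAL"); // SQLite WAL, per the architecture diagram

    db.exec(`
      CREATE TABLE IF NOT EXISTS documents_v2 (
        url TEXT NOT NULL, version TEXT NOT NULL,
        markdown TEXT, etag TEXT, last_modified TEXT,
        PRIMARY KEY (url, version)
      );
      CREATE TABLE IF NOT EXISTS chunks (
        id INTEGER PRIMARY KEY,
        url TEXT, title TEXT, version TEXT,
        heading_path TEXT, -- JSON array breadcrumb
        content TEXT
      );
      CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts
        USING fts5(content, content='chunks', content_rowid='id');
      CREATE TRIGGER IF NOT EXISTS chunks_ai AFTER INSERT ON chunks BEGIN
        INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
      END;
    `);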


    🧪 Tests

    # Unit tests — no browser, ~5 seconds
    npm run test:unit
    
    # Live integration test — Playwright crawl of cube.dev
    npm run test:integration
    
    # Full suite
    npm test

    | Test file | Tests | What it covers |
    |---|---|---|
    | tests/chunker.test.ts | 5 | Heading splitting, breadcrumbs, preamble, stack reset, edge cases |
    | tests/robots.test.ts | 3 | Live allow/disallow (example.com, httpbin.org), cache re-use |
    | tests/sitemap.test.ts | 3 | Missing sitemap fallback, live parse (1044 URLs from cube.dev), path filter |
    | tests/db.test.ts | 5 | FTS5 insert+search, BM25 ranking, ETag cache, empty result |
    | tests/integration.test.ts | 6 | End-to-end: crawl → chunk → FTS5 search (live against cube.dev) |

    Smoke tests (manual/interactive):

    npm run smoke:extraction    # Single page extract on example.com
    npm run smoke:crawler       # BFS crawler on turndown.js.org
    npm run smoke:card          # Salt DS Card component extraction
    npm run smoke:mcp-client    # All 6 MCP tools via stdio
    npm run smoke:mcp-crawl     # Full crawl + search flow via MCP

    📦 Dependencies

    All runtime dependencies are permissively licensed — no GPL, LGPL, or copyleft.

    | Package | Purpose | License |
    |---|---|---|
    | @modelcontextprotocol/sdk | MCP server protocol | MIT |
    | playwright | Headless Chromium SPA rendering | Apache-2.0 ¹ |
    | better-sqlite3 | SQLite + FTS5 storage | MIT |
    | jsdom | HTML DOM parsing | MIT |
    | turndown + turndown-plugin-gfm | HTML → Markdown + GFM tables | MIT |
    | p-limit | Concurrency pool | MIT |
    | robots-parser | robots.txt parsing & compliance | MIT |

    ¹ Apache-2.0 is permissive (not copyleft). The Chromium binary uses the BSD-style Chromium license.


    📄 License

    MIT © Rohit