JSPM

vecpdf

0.0.1
  • ESM via JSPM
  • ES Module Entrypoint
  • Export Map
  • Keywords
  • License
  • Repository URL
  • TypeScript Types
  • README
  • Created
  • Published
  • Downloads 4
  • Score
    100M100P100Q33698F
  • License MIT

CLI tool to process PDFs and create local vector databases using ChromaDB

Package Exports

  • vecpdf
  • vecpdf/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (vecpdf) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

vecpdf icon

Badges

Install npm Publish Install size Downloads License

vecpdf — PDF → ChromaDB (HTTP server)

vecpdf is a tiny CLI that:

  • an excuse to not need to rely on Pinecone, etc.
  • extracts text from a PDF (via Python PyMuPDF),
  • splits the text into chunks (token-aware when tiktoken is available),
  • and indexes those chunks into a ChromaDB collection over HTTP.

Note: Chroma is a local vector database. vecpdf talks to a running Chroma server (default http://localhost:8000). Reminder - vectors will live inside the Chroma server, not in your project folder.


Requirements

  • Python with:
    pip install PyMuPDF tiktoken
    (tiktoken is optional, but gives nicer chunking.)
  • ChromaDB server running locally (HTTP). By default, vecpdf uses http://localhost:8000.

Use a specific Python (virtualenv)

# PowerShell example (Windows)
$env:VECPDF_PYTHON="C:\Path\to\your\venv\Scripts\python.exe"

# macOS/Linux example
export VECPDF_PYTHON="$HOME/.venvs/vecpdf/bin/python"

Chroma server URL

Default: http://localhost:8000
To use a different server:

export CHROMA_URL="http://localhost:8001"

Quick Start

Create a tiny sample PDF:

python - <<'PY'
import fitz
doc = fitz.open()
page = doc.new_page()
page.insert_text((72,72), "Neural networks learn by adjusting weights.\nEmbeddings map meaning to vectors.")
doc.save("sample.pdf"); doc.close()
PY

Process the PDF:

# Basic usage (indexes into the 'documents' collection)
vecpdf process sample.pdf

# Append to an existing collection instead of recreating it
vecpdf process sample.pdf --keep-existing

# Use a custom chunk ID prefix (helps avoid collisions + label sources)
vecpdf process sample.pdf --id-prefix "paperA_"

# Adjust chunk size (tokens)
vecpdf process sample.pdf -s 800

Query the collection:

# Top 3 results (preview)
vecpdf query "neural networks" -c documents -n 3

# Print full text for each result
vecpdf query "neural networks" -c documents -n 3 --full

CLI Reference

vecpdf process <pdf-path> [options]

  • <pdf-path>: Path to your PDF file (required)
  • -c, --collection <name>: Chroma collection name (default: documents)
  • -s, --chunk-size <size>: Token chunk size (default: 500)
  • --python-script <path>: Use your own Python script (advanced)
  • --keep-existing: Append to existing collection instead of recreating it
  • --id-prefix <prefix>: Custom prefix for new chunk IDs (default: chunk_)

vecpdf query <query-text> [options]

  • <query-text>: Text to search for (required)
  • -c, --collection <name>: Collection name (default: documents)
  • -n, --results <number>: Number of results to return (default: 5)
  • --full: Show full text for each result (instead of a preview)

Where data lives

  • vecpdf talks to a running Chroma server over HTTP (default http://localhost:8000).
  • Documents and vectors are stored by that server (not in a local ./vectordb folder).

Troubleshooting

Python extraction errors

  • Make sure PyMuPDF is installed:
    pip install PyMuPDF
  • If tiktoken is missing, vecpdf falls back to a simple character split (still works).

Embedding/Indexing errors

  • Your Chroma server needs an embedder. One path is:
    pip install chromadb sentence-transformers
  • If you see duplicate-ID errors, try a different --id-prefix or run without --keep-existing.

No results

  • Increase -n, try a simpler query, or confirm the -c collection name.

License

MIT