vecpdf — PDF → ChromaDB (HTTP server)

vecpdf is a tiny CLI that:

extracts text from a PDF (via Python PyMuPDF),
splits the text into chunks (token-aware when tiktoken is available),
and indexes those chunks into a ChromaDB collection over HTTP.

It doubles as an excuse not to rely on Pinecone and similar hosted services.

Note: Chroma is a local vector database. vecpdf talks to a running Chroma server (default http://localhost:8000). Vectors live inside the Chroma server, not in your project folder.

Requirements

Python with:
```
pip install PyMuPDF tiktoken
```
(tiktoken is optional, but gives nicer chunking.)
ChromaDB server running locally (HTTP). By default, vecpdf uses http://localhost:8000.

Use a specific Python (virtualenv)

```
# PowerShell example (Windows)
$env:VECPDF_PYTHON="C:\Path\to\your\venv\Scripts\python.exe"

# macOS/Linux example
export VECPDF_PYTHON="$HOME/.venvs/vecpdf/bin/python"
```
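
Before running vecpdf, it can help to confirm that the interpreter you pointed VECPDF_PYTHON at can actually import the Python dependencies. The script below is a minimal sketch and not part of vecpdf: `missing_modules` is a hypothetical helper, and falling back to the current interpreter when VECPDF_PYTHON is unset is an assumption made for this example only.

```python
import os
import subprocess
import sys


def missing_modules(python_exe, modules):
    """Return the names in `modules` that `python_exe` cannot import."""
    # The probe runs inside the target interpreter and prints one
    # missing module name per line.
    probe = (
        "import importlib.util, sys\n"
        "missing = [m for m in sys.argv[1:] if importlib.util.find_spec(m) is None]\n"
        "print('\\n'.join(missing))"
    )
    result = subprocess.run(
        [python_exe, "-c", probe, *modules],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    # Assumption for this sketch: use the current interpreter if the
    # variable is unset. vecpdf's own fallback may differ.
    python_exe = os.environ.get("VECPDF_PYTHON", sys.executable)
    # PyMuPDF is imported under the name "fitz"; tiktoken is optional.
    print(missing_modules(python_exe, ["fitz", "tiktoken"]))
```

An empty list means both modules are importable. If only tiktoken is listed, vecpdf should still work, just with simpler chunking.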

Chroma server URL

Default: http://localhost:8000
To use a different server:

```
export CHROMA_URL="http://localhost:8001"
```
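
If you want to reach the same server from your own Python code, the URL has to be split into a host and a port. The following is a rough sketch of that lookup, assuming the same CHROMA_URL variable and default as above; `resolve_chroma_endpoint` is a hypothetical name, not something vecpdf exports.

```python
import os
from urllib.parse import urlparse

DEFAULT_CHROMA_URL = "http://localhost:8000"  # default stated in this README


def resolve_chroma_endpoint(env=os.environ):
    """Return (host, port) from CHROMA_URL, or from the default URL."""
    url = env.get("CHROMA_URL") or DEFAULT_CHROMA_URL
    parsed = urlparse(url)
    if parsed.hostname is None:
        # Typically a missing scheme, e.g. "localhost:8001".
        raise ValueError(f"CHROMA_URL is not a valid URL: {url!r}")
    # Use the scheme's standard port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


if __name__ == "__main__":
    print(resolve_chroma_endpoint())
```

The resulting pair can be passed to the Chroma Python client, for example `chromadb.HttpClient(host=host, port=port)`, if you have chromadb installed.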

Quick Start

Create a tiny sample PDF:

```
python - <<'PY'
import fitz
doc = fitz.open()
page = doc.new_page()
page.insert_text((72,72), "Neural networks learn by adjusting weights.\nEmbeddings map meaning to vectors.")
doc.save("sample.pdf"); doc.close()
PY
```

Process the PDF:

```
# Basic usage (indexes into the 'documents' collection)
vecpdf process sample.pdf

# Append to an existing collection instead of recreating it
vecpdf process sample.pdf --keep-existing

# Use a custom chunk ID prefix (helps avoid collisions + label sources)
vecpdf process sample.pdf --id-prefix "paperA_"

# Adjust chunk size (tokens)
vecpdf process sample.pdf -s 800
```

Query the collection:

```
# Top 3 results (preview)
vecpdf query "neural networks" -c documents -n 3

# Print full text for each result
vecpdf query "neural networks" -c documents -n 3 --full
```

CLI Reference

`vecpdf process <pdf-path> [options]`

<pdf-path>: Path to your PDF file (required)
-c, --collection <name>: Chroma collection name (default: documents)
-s, --chunk-size <size>: Token chunk size (default: 500)
--python-script <path>: Use your own Python script (advanced)
--keep-existing: Append to existing collection instead of recreating it
--id-prefix <prefix>: Custom prefix for new chunk IDs (default: chunk_)

`vecpdf query <query-text> [options]`

<query-text>: Text to search for (required)
-c, --collection <name>: Collection name (default: documents)
-n, --results <number>: Number of results to return (default: 5)
--full: Show full text for each result (instead of a preview)
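
To give a feel for what `-n` controls, here is a toy illustration of nearest-neighbour ranking over vectors. It is not how vecpdf or Chroma is implemented: the two-dimensional vectors are made up, real embeddings come from the embedder on the Chroma side, and the distance metric Chroma uses depends on how the collection is configured. Cosine similarity is used here only because it is easy to read.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec, chunks, n=5):
    """chunks is a list of (chunk_id, vector); return the n best (id, score) pairs."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up vectors standing in for embedded chunks.
chunks = [
    ("chunk_0", [1.0, 0.0]),
    ("chunk_1", [0.7, 0.7]),
    ("chunk_2", [0.0, 1.0]),
]
print([cid for cid, _ in top_n([1.0, 0.1], chunks, n=2)])
```

Raising `n` simply returns more of the ranked list, which is why increasing `-n` can surface a chunk that a smaller value cut off.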

Where data lives

vecpdf talks to a running Chroma server over HTTP (default http://localhost:8000).
Documents and vectors are stored by that server (not in a local ./vectordb folder).

Troubleshooting

Python extraction errors

Make sure PyMuPDF is installed:
```
pip install PyMuPDF
```
If tiktoken is missing, vecpdf falls back to a simple character split (still works).
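
The difference between the two chunking modes can be sketched as follows. This is an illustration under assumptions, not vecpdf's actual script: the `cl100k_base` encoding, the `chunk_text` name, and the figure of roughly four characters per token in the fallback are all choices made for this example.

```python
def chunk_text(text, chunk_size=500, chars_per_token=4):
    """Split text into chunks of about `chunk_size` tokens each."""
    try:
        import tiktoken

        # Assumed encoding; vecpdf may use a different one.
        enc = tiktoken.get_encoding("cl100k_base")
    except Exception:
        # tiktoken is not installed, or its encoding data is unavailable.
        enc = None

    if enc is not None:
        # Token-aware path: slice the token list, then decode each slice.
        tokens = enc.encode_ordinary(text)
        return [
            enc.decode(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)
        ]

    # Fallback path: fixed-width character windows. Four characters per
    # token is a rough rule of thumb for English, not a vecpdf constant.
    width = chunk_size * chars_per_token
    return [text[i:i + width] for i in range(0, len(text), width)]


if __name__ == "__main__":
    sample = "Neural networks learn by adjusting weights. " * 50
    print(len(chunk_text(sample, chunk_size=20)))
```

Either way every character of the input ends up in exactly one chunk; the token-aware path just keeps chunk sizes closer to what an embedding model counts.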

Embedding/Indexing errors

Your Chroma server needs an embedder (a model that turns text into vectors). One option is:
```
pip install chromadb sentence-transformers
```
If you see duplicate-ID errors, try a different --id-prefix or run without --keep-existing.
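
Why a new prefix helps can be shown with a small sketch. It assumes chunk IDs are built as the prefix followed by a running index, which this README does not confirm (it only documents the default prefix `chunk_`); `make_chunk_ids` is a hypothetical helper.

```python
def make_chunk_ids(n_chunks, prefix="chunk_", start=0):
    """Build IDs such as chunk_0, chunk_1, ... (assumed scheme)."""
    return [f"{prefix}{i}" for i in range(start, start + n_chunks)]


# IDs already in the collection from an earlier run with the default prefix.
existing = set(make_chunk_ids(3))

# Appending a second PDF with the same prefix reuses the low indexes.
same_prefix = make_chunk_ids(2)
print(sorted(existing & set(same_prefix)))  # ['chunk_0', 'chunk_1']

# A distinct prefix keeps the new IDs disjoint from the old ones.
new_prefix = make_chunk_ids(2, prefix="paperA_")
print(sorted(existing & set(new_prefix)))  # []
```

Running without --keep-existing sidesteps the problem differently: the collection is recreated, so there are no old IDs left to collide with.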

No results

Increase -n, try a simpler query, or confirm the -c collection name.

License

MIT