vecpdf — PDF → ChromaDB (HTTP server)
vecpdf is a tiny CLI, and an excuse to not rely on Pinecone and similar hosted services. It:
- extracts text from a PDF (via Python PyMuPDF),
- splits the text into chunks (token-aware when tiktoken is available),
- and indexes those chunks into a ChromaDB collection over HTTP.
Note: Chroma is a local vector database. vecpdf talks to a running Chroma server (default http://localhost:8000). Reminder: vectors live inside the Chroma server, not in your project folder.
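The chunking step can be sketched in Python roughly as follows. This is a minimal illustration, not vecpdf's actual code: the function names, the `cl100k_base` encoding, and the non-overlapping windows are all assumptions.

```python
def chunk_by_chars(text, size):
    """Fallback: split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_tokens(text, size, encoding):
    """Split text into pieces of at most `size` tokens.

    `encoding` is any object with encode() and decode(), such as a tiktoken encoding.
    """
    tokens = encoding.encode(text)
    return [encoding.decode(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def chunk_text(text, size=500):
    """Chunk by tokens when tiktoken is usable, otherwise by characters."""
    try:
        import tiktoken  # optional dependency
        encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    except Exception:
        return chunk_by_chars(text, size)
    return chunk_by_tokens(text, size, encoding)
```

Token-based windows keep each chunk within a predictable token budget, while the character fallback only approximates that budget.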
Requirements
- Python with PyMuPDF and tiktoken (tiktoken is optional, but gives nicer chunking):
  pip install PyMuPDF tiktoken
- ChromaDB server running locally (HTTP). By default, vecpdf uses http://localhost:8000.
Use a specific Python (virtualenv)
Set VECPDF_PYTHON to the Python interpreter vecpdf should use:
# PowerShell example (Windows)
$env:VECPDF_PYTHON="C:\Path\to\your\venv\Scripts\python.exe"
# macOS/Linux example
export VECPDF_PYTHON="$HOME/.venvs/vecpdf/bin/python"
Chroma server URL
Default: http://localhost:8000
To use a different server:
export CHROMA_URL="http://localhost:8001"
Quick Start
Create a tiny sample PDF:
python - <<'PY'
import fitz
doc = fitz.open()
page = doc.new_page()
page.insert_text((72,72), "Neural networks learn by adjusting weights.\nEmbeddings map meaning to vectors.")
doc.save("sample.pdf"); doc.close()
PY
Process the PDF:
# Basic usage (indexes into the 'documents' collection)
vecpdf process sample.pdf
# Append to an existing collection instead of recreating it
vecpdf process sample.pdf --keep-existing
# Use a custom chunk ID prefix (helps avoid collisions + label sources)
vecpdf process sample.pdf --id-prefix "paperA_"
# Adjust chunk size (tokens)
vecpdf process sample.pdf -s 800
Query the collection:
# Top 3 results (preview)
vecpdf query "neural networks" -c documents -n 3
# Print full text for each result
vecpdf query "neural networks" -c documents -n 3 --full
CLI Reference
vecpdf process <pdf-path> [options]
- <pdf-path>: Path to your PDF file (required)
- -c, --collection <name>: Chroma collection name (default: documents)
- -s, --chunk-size <size>: Token chunk size (default: 500)
- --python-script <path>: Use your own Python script (advanced)
- --keep-existing: Append to existing collection instead of recreating it
- --id-prefix <prefix>: Custom prefix for new chunk IDs (default: chunk_)
vecpdf query <query-text> [options]
- <query-text>: Text to search for (required)
- -c, --collection <name>: Collection name (default: documents)
- -n, --results <number>: Number of results to return (default: 5)
- --full: Show full text for each result (instead of a preview)
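If you want to run a similar lookup from Python, the sketch below queries the collection with the Chroma Python client and formats the nested result into preview lines, loosely mirroring the preview and --full modes. The helper names and the 80-character preview width are assumptions, not vecpdf behavior. Note also that `query_texts` is embedded by the Python client's embedding function for the collection, which may differ from whatever was used when indexing.

```python
def format_results(result, full=False, width=80):
    """Turn a Chroma query result (lists nested per query) into printable lines."""
    ids = result["ids"][0]
    docs = result["documents"][0]
    lines = []
    for rank, (chunk_id, text) in enumerate(zip(ids, docs), start=1):
        if not full and len(text) > width:
            text = text[:width] + "..."  # preview: truncate long chunks
        lines.append(f"{rank}. [{chunk_id}] {text}")
    return lines


def query_collection(query, collection_name="documents", n_results=5,
                     host="localhost", port=8000):
    """Query a running Chroma server; requires `pip install chromadb`."""
    import chromadb
    client = chromadb.HttpClient(host=host, port=port)
    collection = client.get_collection(collection_name)
    return collection.query(query_texts=[query], n_results=n_results)
```

With a server running, `print("\n".join(format_results(query_collection("neural networks", n_results=3))))` prints one line per hit.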
Where data lives
- vecpdf talks to a running Chroma server over HTTP (default http://localhost:8000).
- Documents and vectors are stored by that server (not in a local ./vectordb folder).
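To see which server your own scripts would hit, the server address can be resolved the same way the README describes: CHROMA_URL if set, otherwise http://localhost:8000. The sketch below is an illustration under that assumption; the function name and the returned tuple are hypothetical and not part of vecpdf.

```python
import os
from urllib.parse import urlparse

DEFAULT_CHROMA_URL = "http://localhost:8000"


def resolve_chroma_server(env=None):
    """Return (host, port, use_ssl) from CHROMA_URL, or from the default URL."""
    env = os.environ if env is None else env
    url = env.get("CHROMA_URL") or DEFAULT_CHROMA_URL
    parsed = urlparse(url)
    use_ssl = parsed.scheme == "https"
    # Fall back to the scheme's standard port when the URL omits one.
    port = parsed.port or (443 if use_ssl else 80)
    return parsed.hostname, port, use_ssl
```

The resulting values can be passed to `chromadb.HttpClient(host=host, port=port, ssl=use_ssl)`, and calling `heartbeat()` on that client is a quick way to confirm the server is reachable.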
Troubleshooting
Python extraction errors
- Make sure PyMuPDF is installed: pip install PyMuPDF
- If tiktoken is missing, vecpdf falls back to a simple character split (still works).
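A quick way to check whether a given interpreter can see these packages is a small script like the one below. It is a generic sketch, not something shipped with vecpdf, and the function name is hypothetical. PyMuPDF is checked under its import name `fitz`.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    # PyMuPDF is imported as "fitz"; tiktoken is optional.
    for name in missing_modules(["fitz", "tiktoken"]):
        print("Missing module:", name)
```

Run it with the same interpreter that VECPDF_PYTHON points to, so the check reflects the environment vecpdf actually uses.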
Embedding/Indexing errors
- Your Chroma server needs an embedder. One path is: pip install chromadb sentence-transformers
- If you see duplicate-ID errors, try a different --id-prefix or run without --keep-existing.
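To illustrate why a different prefix avoids duplicate IDs, the sketch below assumes chunk IDs are built as the prefix followed by a running counter. That scheme is an assumption for illustration only, and vecpdf's real ID format may differ.

```python
def make_ids(prefix, count):
    """Build chunk IDs as prefix + index (assumed scheme)."""
    return [f"{prefix}{i}" for i in range(count)]


def colliding_ids(existing_ids, new_ids):
    """Return IDs that appear in both the existing collection and the new batch."""
    return sorted(set(existing_ids) & set(new_ids))


existing = make_ids("chunk_", 3)                      # first PDF, default prefix
print(colliding_ids(existing, make_ids("chunk_", 2)))   # same prefix: IDs overlap
print(colliding_ids(existing, make_ids("paperA_", 2)))  # distinct prefix: no overlap
```

Under this assumption, appending a second PDF with --keep-existing and the default prefix reuses IDs already in the collection, while a per-document prefix keeps them distinct.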
No results
- Increase -n, try a simpler query, or confirm the -c collection name.
License
MIT