    vidclaude

    Multimodal video understanding for Claude Code. Extract frames, transcribe audio in 90+ languages, build temporal timelines — all from a single command. No API key needed.

    pip install vidclaude
    vidclaude video.mp4 --mode standard --verbose

    What it does

    Drop a video in, get structured evidence out. Claude in your conversation does the thinking.

    Video File
      │
      ├─ ffmpeg ──────────► Frames (adaptive, shot-aware sampling)
      ├─ faster-whisper ──► Transcript with timestamps (large-v3, 90+ languages)
      ├─ pytesseract ─────► On-screen text / OCR (optional)
      ├─ scene detection ─► Shot boundaries
      │
      └─► Timeline ──► evidence.md ──► Claude reasons over it

    Works with Hindi, English, Spanish, Japanese, Arabic — any of the 90+ languages Whisper supports. Language is auto-detected.

    Prerequisites

    Requirement   Install
    Python 3.10+  python.org
    ffmpeg        Windows: winget install ffmpeg / macOS: brew install ffmpeg / Linux: sudo apt install ffmpeg

    Install

    pip install vidclaude

    That's it. First run downloads the Whisper model (~3GB, one time).

    Usage

    Set up the skill once:

    vidclaude --install-skill

    Then in Claude Code, just say:

    "analyze the video at C:/Users/me/Videos/meeting.mp4"

    "what does the speaker say about the budget in presentation.mp4?"

    "when does the logo appear in intro.mov?"

    Claude runs the extraction, reads the evidence report + frames, and answers your question. Your Max/Pro plan covers everything — no API key needed.

    Follow-up questions about the same video are instant (cached).
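    The cache lookup can be pictured as hashing stable attributes of the file so an unchanged video maps to the same cache entry. A hypothetical sketch (the real vidclaude key derivation is not documented here):

```python
import hashlib
from pathlib import Path

def cache_key(video_path: str) -> str:
    """Derive a short cache key from a video's path, size, and mtime.

    Hypothetical sketch of how unchanged videos could be recognized
    between runs; vidclaude's real key scheme may differ.
    """
    p = Path(video_path)
    st = p.stat()
    raw = f"{p.resolve()}:{st.st_size}:{int(st.st_mtime)}".encode()
    return hashlib.sha256(raw).hexdigest()[:8]  # short hex id
```

    As long as the file's path, size, and modification time are unchanged, the key is stable, so a second question about the same video reuses the extracted evidence instead of re-running ffmpeg and Whisper.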

    From the command line

    # Standard analysis — good for most videos
    vidclaude video.mp4 --mode standard --verbose
    
    # Quick — fewer frames, faster
    vidclaude video.mp4 --mode quick
    
    # Deep — dense frames, full OCR, detailed
    vidclaude video.mp4 --mode deep --verbose
    
    # Batch process a folder
    vidclaude ./videos/ --verbose
    
    # Skip audio transcription
    vidclaude video.mp4 --no-audio
    
    # Force fresh extraction (ignore cache)
    vidclaude video.mp4 --no-cache

    Output

    Every run creates a .vidcache/ directory next to the video:

    .vidcache/a3f7b2c1/
      evidence.md        ← Human-readable report (Claude reads this)
      frames/            ← Extracted JPEG frames
      transcript.json    ← Timestamped transcript
      timeline.json      ← Unified event timeline
      meta.json          ← Video metadata
      shots.json         ← Shot boundaries
      ocr.json           ← On-screen text
      summaries.json     ← Scene/chapter summaries
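    Because these are plain JSON files, a script (or Claude itself) can consume them directly. A sketch that scans transcript.json for a keyword; the `segments`, `start`, and `text` field names are illustrative guesses, not documented:

```python
import json
from pathlib import Path

def find_mentions(cache_dir: str, word: str) -> list[tuple[float, str]]:
    """Return (start_time, text) for transcript segments mentioning a word.

    Assumes transcript.json holds {"segments": [{"start": ..., "text": ...}]};
    those field names are assumptions about the cache format.
    """
    data = json.loads(Path(cache_dir, "transcript.json").read_text(encoding="utf-8"))
    return [
        (seg["start"], seg["text"])
        for seg in data.get("segments", [])
        if word.lower() in seg["text"].lower()
    ]
```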

    Modes

    Mode      Frames                Whisper model  OCR         Best for
    quick     ~20, uniform          base           skip        Short clips, fast overview
    standard  ~60, shot-aware       large-v3       keyframes   General use
    deep      ~150, burst sampling  large-v3       all frames  Long videos, detailed analysis

    Smart frame budget: If a video is too long for the frame limit, FPS is automatically reduced. Shot boundary frames are always prioritized.
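    The budget rule amounts to: sample at the requested FPS unless that would exceed the mode's frame cap, otherwise spread the cap evenly over the full duration. A simplified illustration, not the actual vidclaude code:

```python
def effective_fps(duration_s: float, requested_fps: float, max_frames: int) -> float:
    """Shrink the sampling rate so duration * fps never exceeds the frame cap.

    Simplified illustration of the documented budget rule; shot-boundary
    frames would still be kept on top of this uniform budget.
    """
    if duration_s <= 0 or duration_s * requested_fps <= max_frames:
        return requested_fps
    return max_frames / duration_s
```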

    CLI Reference

    vidclaude [input] [options]
    
    positional:
      input                    Video file or folder
    
    setup:
      --install-skill          Copy SKILL.md to current dir for Claude Code
    
    processing:
      --mode {quick,standard,deep}   Processing mode (default: standard)
      -f, --fps N              Override frames per second
      -m, --max-frames N       Override max frame count
      --no-audio               Skip audio transcription
      --no-ocr                 Skip OCR extraction
      --no-cache               Force re-extraction
    
    output:
      --extract                Print cache path summary
      -o, --output FILE        Write output to file
      --verbose                Show detailed progress
      --batch-summary          Cross-video summary for folders

    How it works

    Built on a multi-layer video understanding architecture:

    • Layer A — Ingestion: Validates format (MP4, MOV, MKV, WebM, AVI), extracts metadata via ffprobe
    • Layer B — Segmentation: Detects shot boundaries using ffmpeg's scene change filter
    • Layer C — Adaptive Sampling: Content-aware frame selection — more frames at scene transitions, smart frame budgets per mode
    • Layer D — Audio: faster-whisper with large-v3 for multilingual transcription with word-level timestamps and VAD filtering
    • Layer E — OCR: pytesseract text extraction from key frames (optional)
    • Layer G — Timeline: Merges speech, OCR, and scene events into a single time-sorted list
    • Layer I — Memory: Hierarchical summaries for longer videos (scene → chapter → global)
    • Layer J — Evidence Assembly: Generates evidence.md with frame references, transcript, timeline for Claude to reason over
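    Layer G's merge step reduces to tagging each event with its source layer and sorting by timestamp. A minimal sketch, assuming every event dict carries a `t` key in seconds (an assumed name, not the package's actual schema):

```python
def merge_timeline(speech: list[dict], ocr: list[dict], scenes: list[dict]) -> list[dict]:
    """Tag each event with its source layer and sort by timestamp.

    Assumes events are dicts with a 't' key in seconds (an assumed name);
    a minimal sketch of Layer G, not the actual implementation.
    """
    events = (
        [{**e, "kind": "speech"} for e in speech]
        + [{**e, "kind": "ocr"} for e in ocr]
        + [{**e, "kind": "scene"} for e in scenes]
    )
    return sorted(events, key=lambda e: e["t"])
```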

    Intent-aware processing

    The tool classifies your question and adjusts the pipeline:

    Question type      Example                             What happens
    Description        "What happens in this video?"       Balanced extraction
    Moment retrieval   "When does the person stand up?"    Prioritizes transcript + timeline
    Temporal ordering  "Does X happen before Y?"           Prioritizes timeline events
    Counting           "How many cars appear?"             Denser frame sampling
    OCR / text         "What text is on the slide?"        Prioritizes OCR extraction
    Speech             "What did they say about revenue?"  Prioritizes transcript
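    A classifier like this can be approximated with ordered keyword rules, falling back to "description" when nothing matches. A toy sketch, not vidclaude's actual implementation:

```python
import re

# Checked in order; first match wins. Patterns are illustrative guesses.
INTENT_PATTERNS = {
    "moment":   r"\bwhen\b",
    "ordering": r"\bbefore\b|\bafter\b",
    "counting": r"\bhow many\b",
    "ocr":      r"\btext\b|\bslide\b|\bwritten\b",
    "speech":   r"\bsay\b|\bsaid\b|\bspeak",
}

def classify_intent(question: str) -> str:
    """Map a question onto one of the intent classes, defaulting to description."""
    q = question.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if re.search(pattern, q):
            return intent
    return "description"
```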

    Optional extras

    # OCR support (also needs Tesseract binary installed)
    pip install pytesseract
    
    # Standalone API mode (use outside Claude Code)
    pip install anthropic
    export ANTHROPIC_API_KEY=sk-...
    vidclaude video.mp4 --api -q "What happens in this video?"

    Examples

    Analyze a meeting recording:

    vidclaude meeting.mp4 --mode standard --verbose
    # → 55 frames, full transcript, timeline
    # → In Claude Code: "summarize the key decisions from this meeting"

    Review security footage:

    vidclaude ./cameras/ --mode deep --verbose
    # → Batch processes all videos in the folder
    # → Dense frame extraction catches fast events

    Extract text from a lecture:

    pip install pytesseract
    vidclaude lecture.mp4 --mode standard --verbose
    # → Captures slide text via OCR + speaker transcript

    Quick check on a short clip:

    vidclaude clip.mp4 --mode quick
    # → 20 frames, fast whisper, done in seconds

    Troubleshooting

    Problem                            Fix
    ffmpeg not found                   Install it: winget install ffmpeg (Windows) / brew install ffmpeg (macOS)
    No module named 'faster_whisper'   pip install faster-whisper
    Slow first run                     Normal: the Whisper large-v3 model (~3GB) downloads once
    Wrong language detected            Whisper auto-detects the language and is usually correct; even on a misdetection, the transcript still captures the phonetics
    Large .vidcache folder             Delete it to free space: rm -rf .vidcache/
    Want fresh extraction              Use the --no-cache flag
    OCR not working                    Install both pytesseract and the Tesseract binary

    Project structure

    vidclaude/
      cli.py          Argument parsing, orchestration, caching
      models.py       Data model (VideoMeta, Frame, Shot, TranscriptChunk, etc.)
      ingest.py       Video validation + ffprobe metadata
      segment.py      Shot detection + adaptive frame sampling
      audio.py        faster-whisper transcription (large-v3)
      ocr.py          pytesseract text extraction
      intent.py       Question intent classification
      timeline.py     Temporal event merging
      memory.py       Hierarchical summaries
      reason.py       Evidence assembly + optional API mode
      util.py         ffmpeg helpers, image encoding, caching
      SKILL.md        Claude Code skill definition (bundled)

    License

    MIT