    vidclaude

    Multimodal video understanding for Claude Code. Extract frames, transcribe audio in 90+ languages, build temporal timelines — all from a single command. No API key needed.

    pip install vidclaude
    vidclaude video.mp4 --mode standard --verbose

    What it does

    Drop a video in, get structured evidence out. Claude in your conversation does the thinking.

    Video File
      │
      ├─ ffmpeg ──────────► Frames (adaptive, shot-aware sampling)
      ├─ faster-whisper ──► Transcript with timestamps (large-v3, 90+ languages)
      ├─ pytesseract ─────► On-screen text / OCR (optional)
      ├─ scene detection ─► Shot boundaries
      │
      └─► Timeline ──► evidence.md ──► Claude reasons over it

    Works with Hindi, English, Spanish, Japanese, Arabic — any of the 90+ languages Whisper supports. Language is auto-detected.

    Prerequisites

    Requirement   Install
    Python 3.10+  python.org
    ffmpeg        Windows: winget install ffmpeg / macOS: brew install ffmpeg / Linux: sudo apt install ffmpeg

    Install

    pip install vidclaude

    That's it. First run downloads the Whisper model (~3GB, one time).

    Usage

    Set up the skill once:

    vidclaude --install-skill

    Then in Claude Code, just say:

    "analyze the video at C:/Users/me/Videos/meeting.mp4"

    "what does the speaker say about the budget in presentation.mp4?"

    "when does the logo appear in intro.mov?"

    Claude runs the extraction, reads the evidence report + frames, and answers your question. Your Max/Pro plan covers everything — no API key needed.

    Follow-up questions about the same video are instant (cached).
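    The cache lookup can be pictured as hashing stable attributes of the file so an unchanged video maps to the same cache entry. A hypothetical sketch (the real vidclaude key derivation is not documented here):

```python
import hashlib
from pathlib import Path

def cache_key(video_path: str) -> str:
    """Derive a short cache key from a video's path, size, and mtime.

    Hypothetical sketch of how unchanged videos could be recognized
    between runs; vidclaude's real key scheme may differ.
    """
    p = Path(video_path)
    st = p.stat()
    raw = f"{p.resolve()}:{st.st_size}:{int(st.st_mtime)}".encode()
    return hashlib.sha256(raw).hexdigest()[:8]  # short hex id
```

    As long as the file's path, size, and modification time are unchanged, the key is stable, so a second question about the same video reuses the extracted evidence instead of re-running ffmpeg and Whisper.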

    From the command line

    # Standard analysis — good for most videos
    vidclaude video.mp4 --mode standard --verbose
    
    # Quick — fewer frames, faster
    vidclaude video.mp4 --mode quick
    
    # Deep — dense frames, full OCR, detailed
    vidclaude video.mp4 --mode deep --verbose
    
    # Batch process a folder
    vidclaude ./videos/ --verbose
    
    # Skip audio transcription
    vidclaude video.mp4 --no-audio
    
    # Force fresh extraction (ignore cache)
    vidclaude video.mp4 --no-cache

    Output

    Every run creates a .vidcache/ directory next to the video:

    .vidcache/a3f7b2c1/
      evidence.md        ← Human-readable report (Claude reads this)
      frames/            ← Extracted JPEG frames
      transcript.json    ← Timestamped transcript
      timeline.json      ← Unified event timeline
      meta.json          ← Video metadata
      shots.json         ← Shot boundaries
      ocr.json           ← On-screen text
      summaries.json     ← Scene/chapter summaries
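    Because these are plain JSON files, a script (or Claude itself) can consume them directly. A sketch that scans transcript.json for a keyword; the `segments`, `start`, and `text` field names are illustrative guesses, not documented:

```python
import json
from pathlib import Path

def find_mentions(cache_dir: str, word: str) -> list[tuple[float, str]]:
    """Return (start_time, text) for transcript segments mentioning a word.

    Assumes transcript.json holds {"segments": [{"start": ..., "text": ...}]};
    those field names are assumptions about the cache format.
    """
    data = json.loads(Path(cache_dir, "transcript.json").read_text(encoding="utf-8"))
    return [
        (seg["start"], seg["text"])
        for seg in data.get("segments", [])
        if word.lower() in seg["text"].lower()
    ]
```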

    Modes

    Mode      Frames                Whisper model  OCR         Best for
    quick     ~20, uniform          base           skip        Short clips, fast overview
    standard  ~60, shot-aware       large-v3       keyframes   General use
    deep      ~150, burst sampling  large-v3       all frames  Long videos, detailed analysis

    Smart frame budget: If a video is too long for the frame limit, FPS is automatically reduced. Shot boundary frames are always prioritized.
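    The budget rule amounts to: sample at the requested FPS unless that would exceed the mode's frame cap, otherwise spread the cap evenly over the full duration. A simplified illustration, not the actual vidclaude code:

```python
def effective_fps(duration_s: float, requested_fps: float, max_frames: int) -> float:
    """Shrink the sampling rate so duration * fps never exceeds the frame cap.

    Simplified illustration of the documented budget rule; shot-boundary
    frames would still be kept on top of this uniform budget.
    """
    if duration_s <= 0 or duration_s * requested_fps <= max_frames:
        return requested_fps
    return max_frames / duration_s
```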

    CLI Reference

    vidclaude [input] [options]
    
    positional:
      input                    Video file or folder
    
    setup:
      --install-skill          Copy SKILL.md to current dir for Claude Code
    
    processing:
      --mode {quick,standard,deep}   Processing mode (default: standard)
      -f, --fps N              Override frames per second
      -m, --max-frames N       Override max frame count
      --no-audio               Skip audio transcription
      --no-ocr                 Skip OCR extraction
      --no-cache               Force re-extraction
    
    output:
      --extract                Print cache path summary
      -o, --output FILE        Write output to file
      --verbose                Show detailed progress
      --batch-summary          Cross-video summary for folders

    How it works

    Built on a multi-layer video understanding architecture:

    • Layer A — Ingestion: Validates format (MP4, MOV, MKV, WebM, AVI), extracts metadata via ffprobe
    • Layer B — Segmentation: Detects shot boundaries using ffmpeg's scene change filter
    • Layer C — Adaptive Sampling: Content-aware frame selection — more frames at scene transitions, smart frame budgets per mode
    • Layer D — Audio: faster-whisper with large-v3 for multilingual transcription with word-level timestamps and VAD filtering
    • Layer E — OCR: pytesseract text extraction from key frames (optional)
    • Layer G — Timeline: Merges speech, OCR, and scene events into a single time-sorted list
    • Layer I — Memory: Hierarchical summaries for longer videos (scene → chapter → global)
    • Layer J — Evidence Assembly: Generates evidence.md with frame references, transcript, timeline for Claude to reason over
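    Layer G's merge step reduces to tagging each event with its source layer and sorting by timestamp. A minimal sketch, assuming every event dict carries a `t` key in seconds (an assumed name, not the package's actual schema):

```python
def merge_timeline(speech: list[dict], ocr: list[dict], scenes: list[dict]) -> list[dict]:
    """Tag each event with its source layer and sort by timestamp.

    Assumes events are dicts with a 't' key in seconds (an assumed name);
    a minimal sketch of Layer G, not the actual implementation.
    """
    events = (
        [{**e, "kind": "speech"} for e in speech]
        + [{**e, "kind": "ocr"} for e in ocr]
        + [{**e, "kind": "scene"} for e in scenes]
    )
    return sorted(events, key=lambda e: e["t"])
```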

    Intent-aware processing

    The tool classifies your question and adjusts the pipeline:

    Question type      Example                             What happens
    Description        "What happens in this video?"       Balanced extraction
    Moment retrieval   "When does the person stand up?"    Prioritizes transcript + timeline
    Temporal ordering  "Does X happen before Y?"           Prioritizes timeline events
    Counting           "How many cars appear?"             Denser frame sampling
    OCR / text         "What text is on the slide?"        Prioritizes OCR extraction
    Speech             "What did they say about revenue?"  Prioritizes transcript
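    A classifier like this can be approximated with ordered keyword rules, falling back to "description" when nothing matches. A toy sketch, not vidclaude's actual implementation:

```python
import re

# Checked in order; first match wins. Patterns are illustrative guesses.
INTENT_PATTERNS = {
    "moment":   r"\bwhen\b",
    "ordering": r"\bbefore\b|\bafter\b",
    "counting": r"\bhow many\b",
    "ocr":      r"\btext\b|\bslide\b|\bwritten\b",
    "speech":   r"\bsay\b|\bsaid\b|\bspeak",
}

def classify_intent(question: str) -> str:
    """Map a question onto one of the intent classes, defaulting to description."""
    q = question.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if re.search(pattern, q):
            return intent
    return "description"
```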

    Optional extras

    # OCR support (also needs Tesseract binary installed)
    pip install pytesseract
    
    # Standalone API mode (use outside Claude Code)
    pip install anthropic
    export ANTHROPIC_API_KEY=sk-...
    vidclaude video.mp4 --api -q "What happens in this video?"

    Examples

    Analyze a meeting recording:

    vidclaude meeting.mp4 --mode standard --verbose
    # → 55 frames, full transcript, timeline
    # → In Claude Code: "summarize the key decisions from this meeting"

    Review security footage:

    vidclaude ./cameras/ --mode deep --verbose
    # → Batch processes all videos in the folder
    # → Dense frame extraction catches fast events

    Extract text from a lecture:

    pip install pytesseract
    vidclaude lecture.mp4 --mode standard --verbose
    # → Captures slide text via OCR + speaker transcript

    Quick check on a short clip:

    vidclaude clip.mp4 --mode quick
    # → 20 frames, fast whisper, done in seconds

    Troubleshooting

    Problem                            Fix
    ffmpeg not found                   Install it: winget install ffmpeg (Windows) / brew install ffmpeg (macOS)
    No module named 'faster_whisper'   pip install faster-whisper
    Slow first run                     Normal: the Whisper large-v3 model (~3GB) downloads once
    Wrong language detected            Whisper auto-detects the language and is usually correct; even on a misdetection, the transcript still captures the phonetics
    Large .vidcache folder             Delete it to free space: rm -rf .vidcache/
    Want fresh extraction              Use the --no-cache flag
    OCR not working                    Install both pytesseract and the Tesseract binary

    Project structure

    vidclaude/
      cli.py          Argument parsing, orchestration, caching
      models.py       Data model (VideoMeta, Frame, Shot, TranscriptChunk, etc.)
      ingest.py       Video validation + ffprobe metadata
      segment.py      Shot detection + adaptive frame sampling
      audio.py        faster-whisper transcription (large-v3)
      ocr.py          pytesseract text extraction
      intent.py       Question intent classification
      timeline.py     Temporal event merging
      memory.py       Hierarchical summaries
      reason.py       Evidence assembly + optional API mode
      util.py         ffmpeg helpers, image encoding, caching
      SKILL.md        Claude Code skill definition (bundled)

    License

    MIT