Que fait video-digest ?
Overview
Turn any video into a single searchable Markdown digest. For each clip the skill
produces three things and stitches them into one digest.md:
- Transcript — the spoken words (faster-whisper, on-device).
- On-screen text — every unique line of text visible in the frames (OCR), with URLs and @handles pulled out separately.
- Contact-sheet montages — grids of frames spanning the whole clip, so you can read its visual story at a glance.
Everything runs locally. No network calls, no accounts, no telemetry. The core
needs only ffmpeg; transcript and OCR are optional and degrade gracefully.
When to use
- "What's in this video / these 200 videos?" without watching them.
- Building a searchable catalogue of reels, talks, tutorials, or footage.
- Recovering the links and handles flashed on screen in a clip.
- Getting a fast visual overview (montages) of long or many recordings.
Not for: live streams, editing/cutting video, or generating video. This reads video and emits text + image summaries.
Quick reference
scripts/check-deps.sh # what's installed + how to get the rest
scripts/video-digest.sh clip.mp4 # one video → ./video-digest-out/clip/digest.md
scripts/video-digest.sh --out ~/digests *.mp4 # batch a whole folder
scripts/video-digest.sh --lang en talk.mov # hint the transcript language
scripts/video-digest.sh --no-transcript reel.mp4 # skip audio, just visuals+text
scripts/video-digest.sh --ocr-engine tesseract x.mp4 # force a specific OCR engine
| Flag | Purpose | Default |
|---|---|---|
-o, --out DIR | output root | ./video-digest-out |
--fps N | OCR frame sampling rate | 2 |
--montage-frames N | frames spread across the clip | 90 |
--tile CxR | montage grid | 3x3 |
--ocr-engine E | auto|vision|tesseract|none | auto |
--whisper-model M | tiny|base|small|medium(+.en) | small |
--whisper-python P | interpreter that has faster-whisper | auto-detected |
--lang CODE | transcript language hint | auto-detect |
--no-transcript / --no-ocr / --no-montage | skip a stage | — |
--force | re-run stages even if cached | off |
Run scripts/video-digest.sh --help for the full list.
How it works
The entry point scripts/video-digest.sh orchestrates three small, independent
tools you can also run on their own:
| Stage | Script | Engine |
|---|---|---|
| Montages | scripts/montage.sh | ffmpeg tile filter |
| On-screen text | scripts/ocr.sh (+ ocr_vision.swift) | macOS Vision (default) or tesseract |
| Transcript | scripts/transcribe.py | faster-whisper |
Each stage is resumable — completed work is marked .done / cached and
skipped on re-runs unless you pass --force. The digest is always rebuilt from
whatever stages are present, so partial runs still produce a useful file.
OCR engine selection is automatic: on macOS it uses the built-in Vision
framework (no install, on-device), otherwise tesseract if present, otherwise
it skips OCR with a warning. See references/setup.md for installing optional
backends and references/output-format.md for the exact output layout.
Common mistakes
- Empty transcript → faster-whisper isn't installed in the interpreter being
used. Run
scripts/check-deps.sh, thenpip install faster-whisper, or point at an existing env with--whisper-python /path/to/python. - No on-screen text on Linux → install
tesseract(apt install tesseract-ocr); the Vision engine is macOS-only. - Huge montage count on short clips → lower
--montage-frames(e.g.30). - Wrong language transcript → pass
--lang(e.g.--lang sk), or use a non-.enmodel for multilingual audio.
Example
examples/sample-digest/ contains a tiny generated clip and the digest.md it
produces — transcript, one detected URL, one @handle, and a montage sheet.
Reproduce it with (the transcript line needs faster-whisper installed):
scripts/video-digest.sh --montage-frames 9 examples/sample-digest/sample.mp4