haoyiyin/sadtalker-mac

Agent skill: one-command talking-head video generation on macOS. Give it a photo + audio → get a digital human video. Works with any AI agent (Claude Code, Pi, Cursor, Codex, Copilot).

Works with: Claude Code, Codex CLI, Cursor
npx skills add haoyiyin/sadtalker-mac

Documentation

When to Use

Use this skill when the user wants to create a talking-head or digital-human video from a portrait photo and an audio file (spoken-word video, digital-human generation, photo talk). Triggers: "生成口播视频" (generate a spoken-word video), "make this photo talk", "照片说话" (make a photo talk), "数字人视频" (digital-human video), "sadtalker", "talking head", "avatar video from photo". Applies to macOS only (Apple Silicon or Intel).

Procedure

  1. Locate the scripts: they live in the same directory as this SKILL.md. Find them with: ls "$(dirname <path-to-this-SKILL.md>)" — you'll see setup.sh and generate.sh alongside this file.
  2. Check prerequisites: ffmpeg, conda env sadtalker, ~/SadTalker/checkpoints/. If anything is missing, run setup.sh (first time only, ~15 min, ~2.5GB download).
  3. Confirm the user has provided a portrait photo (.jpg/.png) and speech audio file (.wav/.mp3). If audio is in another format, convert with ffmpeg first.
  4. Run generation: bash generate.sh <photo> <audio> [--enhancer gfpgan] [--still] [--preprocess full|resize|crop]. This invokes conda run -n sadtalker python inference.py under the hood.
  5. The output .mp4 appears in ~/SadTalker/results//. Open it with: open <path> (macOS). Report the path to the user.
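
The prerequisite check in step 2 could be scripted roughly as follows. This is a minimal sketch, not the contents of setup.sh: the function name missing_prereqs and the SADTALKER_HOME override are hypothetical, and it assumes the conda environment is listed by name in `conda env list`.

```shell
#!/usr/bin/env bash
# Hypothetical override for illustration; the skill itself uses ~/SadTalker.
SADTALKER_HOME="${SADTALKER_HOME:-$HOME/SadTalker}"

# Print one line per missing prerequisite; print nothing if all are present.
missing_prereqs() {
  command -v ffmpeg >/dev/null 2>&1 || echo "ffmpeg"
  # `conda env list` prints one environment per line, name in the first column.
  if ! command -v conda >/dev/null 2>&1; then
    echo "conda"
  elif ! conda env list | awk '{print $1}' | grep -qx "sadtalker"; then
    echo "conda env sadtalker"
  fi
  [ -d "$SADTALKER_HOME/checkpoints" ] || echo "checkpoints"
  return 0
}

missing="$(missing_prereqs)"
if [ -n "$missing" ]; then
  # Join the missing items onto one comma-separated line.
  echo "Missing: $(echo "$missing" | paste -sd, -)"
  echo "Run setup.sh first."
else
  echo "All prerequisites found."
fi
```

If the check reports anything missing, running setup.sh as described in step 2 is the intended fix; the sketch only reports and does not install anything itself.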

Pitfalls

  • Python MUST be 3.10 inside the conda env: 3.8 lacks Apple Silicon wheels, and 3.11+ has no scikit-image==0.19.3 wheel.
  • With 8GB RAM, close Chrome and your IDE before running. SadTalker runs on CPU only on Mac; no GPU path exists.
  • dlib must be installed separately on Mac: conda run -n sadtalker pip install dlib. A missing dlib is the #1 error on M1.
  • Install torch WITHOUT a CUDA suffix: plain pip install torch torchvision torchaudio. If torch.__version__ contains '+cu', reinstall.
  • ffmpeg via brew, not conda. Conda's ffmpeg sometimes lacks needed codecs.
  • First run downloads ~2GB checkpoints. In China, set HF_ENDPOINT=https://hf-mirror.com first.
  • GFPGAN enhancer (--enhancer gfpgan) roughly doubles processing time. Skip for quick previews.
  • --still mode expects full-body source photos when combined with --preprocess full.
  • Audio must be .wav or .mp3. Other formats error with 'Header missing'. Convert with ffmpeg: ffmpeg -i input.m4a output.wav.
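
The audio-format pitfall can be handled with a small wrapper like the one below. It is a sketch under assumptions: ensure_supported_audio is a hypothetical helper, not part of generate.sh, and it writes the converted .wav next to the input file.

```shell
# Echo the path to pass to generate.sh: the input itself if it is already
# .wav/.mp3, otherwise a .wav converted next to it with ffmpeg.
ensure_supported_audio() {
  local input="$1"
  local ext
  # Lowercase the extension with tr (macOS ships bash 3.2, which lacks ${var,,}).
  ext="$(printf '%s' "${input##*.}" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    wav|mp3)
      printf '%s\n' "$input"
      ;;
    *)
      local output="${input%.*}.wav"
      # -y overwrites an existing output file; -loglevel error silences progress logs.
      ffmpeg -y -loglevel error -i "$input" "$output" || return 1
      printf '%s\n' "$output"
      ;;
  esac
}
```

A call might look like bash generate.sh portrait.jpg "$(ensure_supported_audio voice.m4a)", where portrait.jpg and voice.m4a are placeholder file names.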

Verification

  1. conda run -n sadtalker python -c 'import torch; print(torch.__version__)' (the output must NOT contain '+cu')
  2. ls ~/SadTalker/checkpoints/ — must contain auido2exp_00300-model.pth, auido2pose_00140-model.pth, epoch_20.pth, and other .pth files
  3. conda run -n sadtalker python -c 'import dlib; print("ok")' — must print 'ok' (no Illegal Hardware Instruction)
  4. Smoke test: cd ~/SadTalker && conda run -n sadtalker python inference.py --driven_audio examples/driven_audio/bus_chinese.wav --source_image examples/source_image/full_body_1.png — produces output in results/
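
Checks 1 and 2 can be wrapped in two small helpers, sketched below. The function names are hypothetical, and only the three checkpoint files named in check 2 are tested; the other .pth files are not enumerated in this document, so the sketch does not guess them.

```shell
# Return success if a torch version string names a CUDA build, e.g. "2.1.0+cu118".
is_cuda_build() {
  case "$1" in
    *+cu*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Print each checkpoint named in check 2 that is absent from the given directory.
# The file names are spelled exactly as in check 2 ("auido", not "audio").
missing_checkpoints() {
  local dir="$1" name
  for name in auido2exp_00300-model.pth auido2pose_00140-model.pth epoch_20.pth; do
    [ -f "$dir/$name" ] || printf '%s\n' "$name"
  done
}

# Example usage (requires the sadtalker conda env, so it is left commented out):
# version="$(conda run -n sadtalker python -c 'import torch; print(torch.__version__)')"
# if is_cuda_build "$version"; then echo "CUDA build found: reinstall plain torch"; fi
# missing_checkpoints "$HOME/SadTalker/checkpoints"
```

An empty result from missing_checkpoints only means the three named files exist; it does not prove the checkpoint set is complete.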