codecell-germany/gemini-pdf-img-ocr-agent-skill

Gemini-powered OCR via a global CLI, with onboarding-first setup, Markdown-first output, and explicit PDF mode control.

Compatible avec~Claude CodeCodex CLI~CursorGemini CLI
npx add-skill codecell-germany/gemini-pdf-img-ocr-agent-skill

name: gemini-ocr-cli description: "Gemini-powered OCR via a global CLI, with onboarding-first setup, Markdown-first output, and explicit PDF mode control."

gemini-ocr-cli

When to use

Use this skill when a task needs high-quality OCR from a local terminal workflow and the host model’s own OCR is not reliable enough.

Use it especially when an agent must:

  • OCR a local image into faithful Markdown,
  • OCR a local PDF into faithful Markdown,
  • preserve page structure, tables, and labels as well as possible,
  • work through a deterministic CLI instead of ad-hoc multimodal prompting.

Preconditions

  • Ensure the CLI is globally available on PATH:
    • preferred: npm install -g @codecell-germany/gemini-ocr-agent-skill
    • verify: gemini-ocr --help
  • Install the skill payload if needed:
    • gemini-ocr-skill install --force
  • Required secret env:
    • GEMINI_API_KEY=<api-key>
  • Supported alias secret env:
    • GOOGLE_GENERATIVE_AI_API_KEY=<api-key>
  • Supported fallback secret env:
    • GOOGLE_API_KEY=<api-key>
  • Optional model override:
    • GEMINI_OCR_MODEL=gemini-3-flash-preview

Core workflow

  1. Verify the public CLI surface:
  • gemini-ocr --help
  1. Validate the environment:
  • gemini-ocr doctor --json
  1. If the environment is incomplete, print the setup guide:
  • gemini-ocr setup --language en
  • gemini-ocr setup --language de
  1. Export one of the supported API key env vars and rerun:
  • gemini-ocr doctor --json
  1. OCR an image:
  • gemini-ocr scan-image /absolute/path/to/image.png
  1. OCR a PDF:
  • gemini-ocr scan-pdf /absolute/path/to/document.pdf
  1. Use JSON output only when the calling workflow needs the structured OCR object:
  • gemini-ocr scan-pdf /absolute/path/to/document.pdf --format json

Guardrails

  • Use the public CLI names gemini-ocr and gemini-ocr-skill.
  • Do not bypass the product surface with repo-local entrypoints such as node dist/index.js.
  • Do not call hidden installed runtime paths such as ~/.codex/tools/gemini-ocr-cli/dist/index.js.
  • Inputs are sent to Gemini for remote processing. Do not use the tool on documents that must stay fully local unless that remote-processing policy is acceptable.
  • Default output is Markdown on stdout.
  • Diagnostics and warnings belong on stderr.
  • --format json is the machine-readable escape hatch.
  • --pdf-mode auto can retry a failed native PDF request as raster OCR when pdftoppm is available.
  • --pdf-mode raster requires pdftoppm.
  • API keys remain in shell env. Do not paste them into prompts, tickets, screenshots, or chats.
  • Prefer doctor before the first real OCR run in a new shell or environment.

References

  • Main overview: references/overview.md
  • Agent onboarding: references/agent-onboarding.md
  • OCR first run: references/ocr-first-run.md
  • Command cheat sheet: references/command-cheatsheet.md
  • Architecture: knowledge/ARCHITECTURE.md
  • Release checklist: knowledge/RELEASE_CHECKLIST.md

Skills associés