codecell-germany/gemini-pdf-img-ocr-agent-skill

Gemini-powered OCR via a global CLI, with onboarding-first setup, Markdown-first output, and explicit PDF mode control.

Works with: Claude Code, Codex CLI, Cursor, Gemini CLI
npx skills add codecell-germany/gemini-pdf-img-ocr-agent-skill

Documentation

gemini-ocr-cli

When to use

Use this skill when a task needs high-quality OCR from a local terminal workflow and the host model’s own OCR is not reliable enough.

Use it especially when an agent must:

  • OCR a local image into faithful Markdown,
  • OCR a local PDF into faithful Markdown,
  • preserve page structure, tables, and labels as well as possible,
  • work through a deterministic CLI instead of ad-hoc multimodal prompting.

Preconditions

  • Ensure the CLI is globally available on PATH:
    • preferred: npm install -g @codecell-germany/gemini-ocr-agent-skill
    • verify: gemini-ocr --help
  • Install the skill payload if needed:
    • gemini-ocr-skill install --force
  • Required secret env (the alias or fallback below also works):
    • GEMINI_API_KEY=<api-key>
  • Supported alias secret env:
    • GOOGLE_GENERATIVE_AI_API_KEY=<api-key>
  • Supported fallback secret env:
    • GOOGLE_API_KEY=<api-key>
  • Optional model override:
    • GEMINI_OCR_MODEL=gemini-3-flash-preview
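
The three key variables above suggest a lookup order. The following is a minimal sketch, assuming the precedence follows the order in which they are listed (primary, alias, fallback); the CLI's actual resolution order is not confirmed here. `gemini_ocr_key_source` is a hypothetical helper, not part of the tool. It prints only the variable name, so the key itself never reaches logs or prompts.

```shell
# Print the NAME (never the value) of the first API key variable that is set.
# Assumed precedence: GEMINI_API_KEY, then GOOGLE_GENERATIVE_AI_API_KEY,
# then GOOGLE_API_KEY. The CLI's real lookup order may differ.
gemini_ocr_key_source() {
  if [ -n "${GEMINI_API_KEY:-}" ]; then
    printf '%s\n' GEMINI_API_KEY
  elif [ -n "${GOOGLE_GENERATIVE_AI_API_KEY:-}" ]; then
    printf '%s\n' GOOGLE_GENERATIVE_AI_API_KEY
  elif [ -n "${GOOGLE_API_KEY:-}" ]; then
    printf '%s\n' GOOGLE_API_KEY
  else
    # Nothing usable is set: report on stderr and signal failure.
    printf '%s\n' 'no Gemini API key variable is set' >&2
    return 1
  fi
}
```

Running `gemini_ocr_key_source` before `gemini-ocr doctor --json` gives a quick local hint about which variable is present, without echoing the secret.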

Core workflow

  1. Verify the public CLI surface:
    • gemini-ocr --help
  2. Validate the environment:
    • gemini-ocr doctor --json
  3. If the environment is incomplete, print the setup guide:
    • gemini-ocr setup --language en
    • gemini-ocr setup --language de
  4. Export one of the supported API key env vars and rerun:
    • gemini-ocr doctor --json
  5. OCR an image:
    • gemini-ocr scan-image /absolute/path/to/image.png
  6. OCR a PDF:
    • gemini-ocr scan-pdf /absolute/path/to/document.pdf
  7. Use JSON output only when the calling workflow needs the structured OCR object:
    • gemini-ocr scan-pdf /absolute/path/to/document.pdf --format json
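
The workflow above can be wrapped in a small shell function. This is a sketch under stated assumptions, not the project's own script: `ocr_to_markdown` is a hypothetical name, the choice of subcommand by file extension is an illustration, and it assumes `gemini-ocr doctor` exits non-zero when the environment is incomplete, which the documentation above does not state.

```shell
# Run the environment check, then OCR one file and save the Markdown.
# Usage: ocr_to_markdown <input.pdf|image> <output.md>
ocr_to_markdown() {
  input=$1
  output=$2

  # Assumption: `doctor` exits non-zero when the environment is incomplete.
  if ! gemini-ocr doctor --json >/dev/null; then
    printf '%s\n' 'doctor failed; try: gemini-ocr setup --language en' >&2
    return 1
  fi

  # Pick the subcommand from the file extension (PDF vs. anything else).
  case $input in
    *.pdf|*.PDF) subcommand=scan-pdf ;;
    *)           subcommand=scan-image ;;
  esac

  # Markdown arrives on stdout and is saved; stderr is left untouched,
  # so diagnostics and warnings still reach the terminal.
  gemini-ocr "$subcommand" "$input" >"$output"
}
```

For example, `ocr_to_markdown /absolute/path/to/document.pdf /absolute/path/to/document.md` would write the OCR result next to the source file. Plain POSIX sh has no `local`, so the variables `input`, `output`, and `subcommand` are global to the calling shell.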

Guardrails

  • Use the public CLI names gemini-ocr and gemini-ocr-skill.
  • Do not bypass the product surface with repo-local entrypoints such as node dist/index.js.
  • Do not call hidden installed runtime paths such as ~/.codex/tools/gemini-ocr-cli/dist/index.js.
  • Inputs are sent to Gemini for remote processing. Do not use the tool on documents that must stay fully local unless that remote-processing policy is acceptable.
  • Default output is Markdown on stdout.
  • Diagnostics and warnings belong on stderr.
  • --format json is the machine-readable escape hatch.
  • --pdf-mode auto can retry a failed native PDF request as raster OCR when pdftoppm is available.
  • --pdf-mode raster requires pdftoppm.
  • API keys remain in shell env. Do not paste them into prompts, tickets, screenshots, or chats.
  • Prefer doctor before the first real OCR run in a new shell or environment.
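
The `pdftoppm` dependency noted above can be checked before a scan starts. The following is a minimal sketch; `check_pdf_mode` is a hypothetical helper, and it only knows the two mode values named in the guardrails (`auto` and `raster`).

```shell
# Succeed when the requested PDF mode can run on this machine.
# Only `auto` and `raster` are named in the guardrails above; any other
# value is passed through unchecked.
check_pdf_mode() {
  mode=$1
  if [ "$mode" = raster ] && ! command -v pdftoppm >/dev/null 2>&1; then
    # Raster mode cannot work without pdftoppm: fail early.
    printf '%s\n' '--pdf-mode raster requires pdftoppm on PATH' >&2
    return 1
  fi
  if [ "$mode" = auto ] && ! command -v pdftoppm >/dev/null 2>&1; then
    # Auto mode still runs, but loses its raster retry path.
    printf '%s\n' 'note: pdftoppm not found; auto mode cannot fall back to raster OCR' >&2
  fi
  return 0
}
```

A possible use is `check_pdf_mode raster && gemini-ocr scan-pdf /absolute/path/to/document.pdf --pdf-mode raster > document.md 2> ocr.log`, which keeps the Markdown on stdout and the diagnostics on stderr in separate files.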

References

  • Main overview: references/overview.md
  • Agent onboarding: references/agent-onboarding.md
  • OCR first run: references/ocr-first-run.md
  • Command cheat sheet: references/command-cheatsheet.md
  • Architecture: knowledge/ARCHITECTURE.md
  • Release checklist: knowledge/RELEASE_CHECKLIST.md
