vlm
API-backed vision-language Q&A. Zero GPU: every provider is a remote
endpoint. Images are gap-native uint8 [H, W, 3] numpy arrays, PNG-encoded
on the wire.
Providers
Selected by GAP_VLM_PROVIDER (default openrouter); each tool also accepts
a per-call provider= override.
| Provider | Backend | Config (env) |
|---|---|---|
openrouter | OpenRouter's OpenAI-compatible chat-completions API | OPENROUTER_API_KEY (or GAP_VLM_API_KEY); GAP_VLM_MODEL (default gemini-3.1-flash-lite-preview, see DEFAULT_MODEL in tools.py); set GAP_VLM_BASE_URL to point at another OpenAI-compatible server (e.g. a local vLLM) |
vertex | Vertex AI via google-genai (Gemini models) | GAP_VLM_MODEL, GAP_VLM_PROJECT_ID, GAP_VLM_REGION |
The vertex provider lazy-imports google-genai — install the engine's vertex
extra first: pip install "graph-as-policy[vertex]".
When to use
- Semantic scene checks and checkpoint verification (
vlm.query_yes_no). - Free-form scene descriptions or attribute queries (
vlm.query). - Prefer
gemini-er.detectwhen you need pixel-space bounding boxes, andmolmo.point_promptwhen you need a single click point.
Notes
vlm.query_yes_nocoerces with the source-verbatim rule: answer is true iff"yes"appears in the lowercased reply.- Requests carry no system prompt and no temperature knob (mirrors the
original
vlm.v1proto); both providers pintemperature: 0.0.