azrabano23/evalkit

Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.

What is evalkit?

evalkit is a Claude Code agent skill for evaluating LLMs rigorously: confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, and contamination checks. It is a drop-in skill with a numpy stats core validated against ground truth.

Compatible with: Claude Code, Codex CLI, Cursor

Part of Agent Workflows

npx skills add azrabano23/evalkit

Documentation

What does evalkit do?

Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.
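
To make two of the listed techniques concrete, here is a minimal standard-library sketch of the unbiased pass@k estimator and a Wilson score confidence interval for an accuracy. It is an illustration only and not evalkit's code or API: the function names `pass_at_k` and `wilson_interval` are hypothetical, and the skill itself is described as using a numpy core, which this sketch does not reproduce.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples of which c passed.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one pass.
    (Hypothetical helper, not part of evalkit.)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage.
    (Hypothetical helper, not part of evalkit.)
    """
    if trials <= 0 or not (0 <= successes <= trials):
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # 10 samples per problem, 3 passed: estimate pass@1 and pass@5.
    print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
    # 50 correct answers out of 100 benchmark items.
    print(wilson_interval(50, 100))
```

Averaging `pass_at_k` over problems gives a benchmark-level score. The Wilson interval is chosen here because it stays inside [0, 1] and behaves sensibly for small samples or accuracies near 0 or 1, where the simpler normal approximation does not.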

Related skills

steipete/bluebubbles

Send and manage iMessages via BlueBubbles, including attachments, tapbacks, edits, replies, and groups.

steipete/eightctl

Control Eight Sleep pods (status, temperature, alarms, schedules).

steipete/blucli

BluOS CLI (blu) for discovery, playback, grouping, and volume.

steipete/bear-notes

Create, search, and manage Bear notes via grizzly CLI.

steipete/camsnap

Capture frames or clips from RTSP/ONVIF cameras.

steipete/gifgrep

Search GIF providers with CLI/TUI, download results, and extract stills/sheets.
