azrabano23/evalkit

Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.

What is evalkit?

evalkit is a Claude Code agent skill that evaluates LLMs the right way: confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, and contamination checks. It is a drop-in skill with a numpy stats core validated against ground truth.

Works with: Claude Code, Codex CLI, Cursor

Part of: Agent Workflows

npx skills add azrabano23/evalkit
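
To make two of the listed techniques concrete, here is a minimal numpy sketch of the standard unbiased pass@k estimator and a Wilson confidence interval for a pass rate. It is not evalkit's code or API: the function names pass_at_k and wilson_interval are hypothetical, chosen for illustration, and the skill's own implementation may differ.

```python
import math

import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that passed, k: budget.
    (Hypothetical helper name; not taken from evalkit.)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    # Product form of C(n - c, k) / C(n, k); avoids huge binomial coefficients.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%).

    (Hypothetical helper name; not taken from evalkit.)
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # 10 samples for one problem, 3 of them correct.
    print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
    print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
    # 50 of 100 benchmark items solved.
    lo, hi = wilson_interval(50, 100)
    print(f"accuracy 0.500, 95% Wilson CI [{lo:.3f}, {hi:.3f}]")
```

The product form is used for pass@k because computing the two binomial coefficients directly can overflow for large n. The Wilson interval is one common choice for small benchmarks, where the plain normal approximation can give bounds outside [0, 1].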

Related Skills

steipete/bluebubbles

Send and manage iMessages via BlueBubbles, including attachments, tapbacks, edits, replies, and groups.

steipete/eightctl

Control Eight Sleep pods (status, temperature, alarms, schedules).

steipete/blucli

BluOS CLI (blu) for discovery, playback, grouping, and volume.

steipete/bear-notes

Create, search, and manage Bear notes via grizzly CLI.

steipete/camsnap

Capture frames or clips from RTSP/ONVIF cameras.

steipete/gifgrep

Search GIF providers with CLI/TUI, download results, and extract stills/sheets.
