
azrabano23/evalkit

Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.

Works with: Claude Code, Codex CLI, Cursor
npx skills add azrabano23/evalkit
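
To make one term from the description concrete: "unbiased pass@k" usually refers to the estimator 1 - C(n-c, k) / C(n, k), computed from n generated samples per problem of which c pass. The sketch below is an illustration of that formula only. It is not taken from evalkit, and the function name `pass_at_k` is a hypothetical choice, not the skill's API.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed, k: budget.
    Hypothetical helper for illustration; not evalkit's interface.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # C(n-c, k) / C(n, k) written as a product to avoid huge binomials.
    failure_ratio = np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
    return float(1.0 - failure_ratio)


if __name__ == "__main__":
    # 10 samples, 3 passing, budget of 1 sample.
    print(pass_at_k(n=10, c=3, k=1))
```

With k = 1 the estimate reduces to the plain pass rate c / n, which is a quick sanity check on any implementation.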



Related Skills