
azrabano23/evalkit

Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.

Supported agents: Claude Code, Codex CLI, Cursor
npx skills add azrabano23/evalkit
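
For readers unfamiliar with the "unbiased pass@k" mentioned in the description, the sketch below shows the standard combinatorial estimator, 1 - C(n-c, k) / C(n, k), computed as a numerically stable product. It is an illustration only: the function name `pass_at_k` and its signature are hypothetical and are not taken from evalkit's code, which may implement this differently.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate (hypothetical helper, not evalkit's API).

    n: total samples generated for one problem
    c: number of those samples that passed
    k: budget of attempts being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures exist, so every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a product to avoid huge binomials.
    failures_only = np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
    return float(1.0 - failures_only)


if __name__ == "__main__":
    # 3 passing samples out of 10, evaluated at k = 1 and k = 5.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

A per-benchmark score would typically be the mean of this estimate across problems. The naive alternative, 1 - (1 - c/n)^k, is biased when n is small, which is presumably why the description stresses the unbiased form.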

