azrabano23/evalkit

Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.

Compatible with: Claude Code, Codex CLI, Cursor
npx skills add azrabano23/evalkit

Documentation

Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.
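
As an illustration of one item in the description above, here is a minimal sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a problem and c is the number that passed. The function name `pass_at_k` and its signature are assumptions made for this example; they are not taken from evalkit, whose actual API is not shown on this page.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (hypothetical helper, not evalkit's API).

    n: total samples generated, c: samples that passed, k: budget being scored.
    Computes 1 - C(n - c, k) / C(n, k) as a running product to avoid
    overflowing on large binomial coefficients.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    # C(n - c, k) / C(n, k) == prod over i in (n - c + 1 .. n) of (1 - k / i)
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


if __name__ == "__main__":
    # With k = 1 the estimate reduces to the plain pass rate c / n.
    print(pass_at_k(n=10, c=3, k=1))
    # Benchmark-level pass@k is the mean of the per-problem estimates.
    per_problem = [pass_at_k(20, c, 5) for c in (0, 2, 20)]
    print(np.mean(per_problem))
```

Averaging the per-problem estimates gives the benchmark-level score. The naive alternative, 1 - (1 - c/n)^k, plugs the empirical pass rate into the formula and is a biased estimator, which is presumably what the description's "unbiased" refers to.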

Related skills