azrabano23/evalkit
Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.
npx add-skill azrabano23/evalkit
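To make "unbiased pass@k" in the description concrete, here is a minimal sketch of the standard combinatorial estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c passed. The function name `pass_at_k` and its signature are assumptions made for illustration, not evalkit's actual API, which this listing does not show. It uses only the Python standard library rather than numpy.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate (hypothetical helper, not evalkit's API).

    n: total samples generated for one problem
    c: number of those samples that passed
    k: budget of samples being evaluated
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains at least one pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

Averaging this per-problem value over a benchmark gives the overall pass@k. Estimating it as 1 - (1 - c/n)^k instead is biased, which is the distinction the word "unbiased" refers to.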
Export your Bilibili follow list, classify it with ChatGPT/Claude/Gemini or other frontier LLMs, then sync the groups back
🛠️ Explore AI agent architectures with practical design patterns to build robust systems across frameworks.
my agent skills
Repo to track and distribute my Claude Code skills.
Portable AI agent skills pack — 17 consolidated skills for Claude Code, Cursor, Copilot, Windsurf, and Codex
🚀 Enhance your coding agent with a collection of versatile skills compatible with Claude Code, Codex CLI, Amp, and Droid for improved functionality.