azrabano23/evalkit

Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.

What is evalkit?

evalkit is a Claude Code agent skill that evaluates LLMs the right way: confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, and contamination checks. It is a drop-in skill with a numpy stats core validated against ground truth.

Compatible with: Claude Code, Codex CLI, Cursor

Part of Agent Workflows

npx skills add azrabano23/evalkit

Documentation

What does evalkit do?

Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.
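
To make "unbiased pass@k" concrete, here is a minimal sketch of the standard combinatorial estimator, 1 - C(n-c, k) / C(n, k), computed in a numerically stable product form with numpy. The function name `pass_at_k` and its parameters are hypothetical and chosen for illustration; this is not evalkit's actual API, which is not shown on this page.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed (hypothetical helper).

    Uses 1 - C(n - c, k) / C(n, k), rewritten as a product so that large
    binomial coefficients are never formed explicitly.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains at least one pass.
    if n - c < k:
        return 1.0
    # C(n - c, k) / C(n, k) == prod_{i = n-c+1}^{n} (1 - k / i)
    terms = 1.0 - k / np.arange(n - c + 1, n + 1)
    return float(1.0 - np.prod(terms))


if __name__ == "__main__":
    # 10 samples per problem, 3 of them correct, estimating pass@1 and pass@5.
    print(round(pass_at_k(10, 3, 1), 4))
    print(round(pass_at_k(10, 3, 5), 4))
```

For k = 1 the estimate reduces to the plain pass rate c / n. For larger k it stays unbiased as long as at least k samples were drawn per problem, which the naive estimate 1 - (1 - c/n)^k does not.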

Related skills

steipete/bluebubbles

Send and manage iMessages via BlueBubbles, including attachments, tapbacks, edits, replies, and groups.

steipete/eightctl

Control Eight Sleep pods (status, temperature, alarms, schedules).

steipete/blucli

BluOS CLI (blu) for discovery, playback, grouping, and volume.

steipete/bear-notes

Create, search, and manage Bear notes via grizzly CLI.

steipete/camsnap

Capture frames or clips from RTSP/ONVIF cameras.

steipete/gifgrep

Search GIF providers with CLI/TUI, download results, and extract stills/sheets.
