azrabano23/evalkit
Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.
npx skills add azrabano23/evalkit
Zero-dependency Claude Code plugin that catches speculation, invented causality, and fake citations before they pollute your context. Install in one command, works offline, no API keys needed.
Sanitized Codex skill for medical case report web decks
Agent skills and eval prompts for vibe-coding plans, review loops, commit messages, prose, and Minecraft modding.
Local Claude Code factory that turns source documents into validated, provenance-tracked subagent packages.
An openclaw skill for using the x402 Singularity layer, enabling your agents to consume and create x402-enabled endpoints.
MDD (Mestre de Direção Dinâmica): a Claude Code skill that turns any topic into a complete video direction package (storyboard + prompts)