azrabano23/evalkit
Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.
npx add-skill azrabano23/evalkit
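To make "unbiased pass@k" concrete, here is a minimal sketch of the standard combinatorial estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that passed. The function name `pass_at_k` and its signature are assumptions made for illustration, not evalkit's actual API, and this version uses only the Python standard library rather than numpy.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (hypothetical helper, not evalkit's API).

    n: total samples generated for the problem
    c: number of those samples that passed
    k: budget of samples the metric assumes you may draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no pass.
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


if __name__ == "__main__":
    # With k = 1 the estimator reduces to the plain pass rate c / n.
    print(pass_at_k(n=10, c=3, k=1))
    # A larger budget raises the chance that at least one sample passes.
    print(pass_at_k(n=10, c=3, k=5))
```

Averaging this per-problem value over a benchmark gives the overall pass@k. The naive alternative, 1 - (1 - c/n)^k, is biased for finite n, which is why the subset-counting form is usually preferred.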
One file. 9 questions. Stops AI agents from doing the obviously-wrong thing. MIT. Works with Claude Code, Codex, Cursor, OpenClaw.
Enable autonomous AI agents to collaborate, run experiments, and innovate together on shared codebases through an agent-focused platform.
Claude Code AgentOS — operator + dispatcher template with skills (Russian, batteries-included)
Automate Scrapingbee tasks via Rube MCP (Composio). Always search tools first for current schemas.
terminally online
Open-source AI-first Identity and Access Management with MCP, A2A, OAuth 2.1, OIDC, SAML, WebAuthn, TOTP, and MFA support.