azrabano23/evalkit

Evaluate LLMs the right way: confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, and contamination checks. A drop-in agent skill with a NumPy stats core validated against ground truth.

Supports: Claude Code, Codex CLI, Cursor

npx skills add azrabano23/evalkit
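
The description lists unbiased pass@k among the skill's features. As a rough illustration of what that term usually means, here is a minimal sketch of the widely used estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The function name `pass_at_k` and its signature are assumptions made for this example; this is not taken from evalkit's code and may differ from how the skill implements it.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Hypothetical helper for illustration; not evalkit's actual API.
    Computes 1 - C(n - c, k) / C(n, k) in product form so that large
    binomial coefficients are never materialised.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # C(n - c, k) / C(n, k) == prod_{i = n-c+1}^{n} (1 - k / i)
    terms = 1.0 - k / np.arange(n - c + 1, n + 1)
    return float(1.0 - np.prod(terms))


if __name__ == "__main__":
    # With k = 1 the estimate reduces to the plain pass rate c / n.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=4, c=2, k=2))
```

The naive alternative, 1 - (1 - c/n)^k, is biased when n is small, which is why the combinatorial form is generally preferred. Per-problem estimates are typically averaged over the benchmark.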

Related skills

framerslab/agentos-skills-registry

Curated skills registry SDK for AgentOS — lazy-loading DI of SKILL.md prompt modules.

community

testcontainers-integration-tests

Write integration tests using TestContainers for .NET with xUnit. Covers infrastructure testing with real databases, message queues, and caches in Docker containers instead of mocks.

community

Wjl1224734792/Jarvis-Agent-Factory

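A cross-platform multi-agent AI coding assistant configuration set plus an MCP orchestration engine. A complete software development pipeline from idea to delivery, supporting three platforms: Claude Code, OpenCode, and Codex.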

community

stanfordnlp/dspy

Programmatic input mapping structures translating algorithmic criteria into reliable structured text generation directives.

community

affaan-m/kotlin-patterns

Idiomatic Kotlin patterns, best practices, and conventions for building robust, efficient, and maintainable Kotlin applications with coroutines, null safety, and DSL builders.

community

roy668899/skill-architect-zh

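A Claude / OpenClaw skill for the "I want to build a skill but can't articulate the requirements" problem. Through guided discussion, it turns a vague idea into a precise skill specification and automatically generates a complete skill directory with version management and discussion records.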

community