azrabano23/evalkit
Evaluate LLMs the right way — confidence intervals, unbiased pass@k, significance testing, bias-controlled LLM-as-judge, contamination checks. A drop-in agent skill with a numpy stats core validated against ground truth.
npx skills add azrabano23/evalkit
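As an illustration of the "unbiased pass@k" item in the description, here is a minimal sketch of the standard combinatorial estimator, 1 - C(n-c, k) / C(n, k), computed as a numerically stable product. The function name `pass_at_k` and its signature are hypothetical and are not taken from evalkit; this shows the general technique only, not the skill's own code.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed, k: budget being scored.
    Hypothetical helper for illustration; not evalkit's API.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains at least one pass.
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a product to avoid huge binomials.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Example: 10 samples, 3 correct, scored at k = 1.
estimate = pass_at_k(10, 3, 1)
```

With k = 1 the estimate reduces to the plain success rate c / n. A benchmark-level score would average this per-problem value over all problems.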
Curated skills registry SDK for AgentOS — lazy-loading DI of SKILL.md prompt modules.
Write integration tests using TestContainers for .NET with xUnit. Covers infrastructure testing with real databases, message queues, and caches in Docker containers instead of mocks.
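Cross-platform multi-agent AI coding assistant configuration set + MCP orchestration engine. A complete software development pipeline from idea to delivery, supporting three platforms: Claude Code / OpenCode / Codex.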
Programmatic input mapping structures translating algorithmic criteria into reliable structured text generation directives.
Idiomatic Kotlin patterns, best practices, and conventions for building robust, efficient, and maintainable Kotlin applications with coroutines, null safety, and DSL builders.
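A Claude / OpenClaw Skill for when you want to build a Skill but can't articulate the requirements. Through guided discussion, it helps turn a vague idea into a precise Skill specification and automatically generates a complete Skill directory with version management and discussion records.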