ebarti/skills

📚 Agent skills distilled from technical books: AI Engineering, Context Engineering, Designing Data-Intensive Applications, and more. Agent-agnostic, plain Markdown. Give your AI agent a bookshelf.

์ง€์› ๋Œ€์ƒ~Claude Code~Codex CLI~Cursor
npx skills add ebarti/skills

Documentation

AI Evaluation

Knowledge from "AI Engineering" by Chip Huyen (Chapters 3-4). Practical methods for evaluating foundation models and AI systems built on top of them.

Quick Start

  1. Check guidelines.md to find which files to load for your task
  2. Load only relevant files (each topic has knowledge.md, rules.md, examples.md)
  3. Apply guidance to your work

Contents

References

| Category | Purpose |
|----------|---------|
| language-modeling-metrics | Entropy, cross-entropy, perplexity, bits-per-character |
| exact-evaluation | Functional correctness, exact match, lexical/semantic similarity, embeddings |
| ai-as-judge | When to use AI judges, how to prompt them, limitations and biases |
| comparative-evaluation | Ranking models with pairwise comparisons, Bradley-Terry, scalability challenges |
| evaluation-criteria | Domain capability, generation (factual, safety), instruction-following, cost/latency |
| model-selection | Selection workflow, open source vs API, navigating public benchmarks |
| evaluation-pipeline | End-to-end pipeline design, scoring rubrics, evaluation methods |

Workflows

| Task | Workflow |
|------|----------|
| Choose a model (build vs buy, open source vs API) | workflows/select-model.md |
| Design an end-to-end evaluation pipeline | workflows/design-eval-pipeline.md |

Guidelines

See guidelines.md for task-based file selection.

Related skills