AI Evaluation
Knowledge from "AI Engineering" by Chip Huyen (Chapters 3-4). Practical methods for evaluating foundation models and AI systems built on top of them.
Quick Start
- Check guidelines.md to find which files to load for your task
- Load only relevant files (each topic has knowledge.md, rules.md, examples.md)
- Apply guidance to your work
Contents
References
| Category | Purpose |
|---|---|
| language-modeling-metrics | Entropy, cross-entropy, perplexity, bits-per-character |
| exact-evaluation | Functional correctness, exact match, lexical/semantic similarity, embeddings |
| ai-as-judge | When to use AI judges, how to prompt them, limitations and biases |
| comparative-evaluation | Ranking models with pairwise comparisons, Bradley-Terry, scalability challenges |
| evaluation-criteria | Domain capability, generation (factual, safety), instruction-following, cost/latency |
| model-selection | Selection workflow, open source vs API, navigating public benchmarks |
| evaluation-pipeline | End-to-end pipeline design, scoring rubrics, evaluation methods |
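
As a quick orientation for the language-modeling-metrics category, the sketch below shows how cross-entropy, perplexity, and bits-per-character relate when computed from per-token log-probabilities. It is an illustration under stated assumptions, not code from the reference files: the function names and the sample log-probabilities are hypothetical, and log-probabilities are assumed to be natural logs.

```python
import math

def cross_entropy_nats(token_logprobs):
    """Average negative log-likelihood per token, in nats."""
    return -sum(token_logprobs) / len(token_logprobs)

def perplexity(token_logprobs):
    """Perplexity is the exponential of the per-token cross-entropy."""
    return math.exp(cross_entropy_nats(token_logprobs))

def bits_per_character(token_logprobs, num_chars):
    """Total negative log-likelihood converted to bits, divided by character count."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / num_chars

# Hypothetical example: 4 tokens, each assigned probability 0.25,
# covering a 16-character string.
logprobs = [math.log(0.25)] * 4
ppl = perplexity(logprobs)
bpc = bits_per_character(logprobs, num_chars=16)
```

Here the model is as uncertain as a uniform choice among four options per token, so the perplexity comes out at about 4, and the 8 total bits spread over 16 characters give about 0.5 bits per character.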
Workflows
| Task | Workflow |
|---|---|
| Choose a model (build vs buy, open source vs API) | workflows/select-model.md |
| Design an end-to-end evaluation pipeline | workflows/design-eval-pipeline.md |
Guidelines
See guidelines.md for task-based file selection.