You can’t manage what you can’t measure. Evaluation is the most underrated skill in LLM engineering. Today you learn every tool in the measurement toolkit.
— Day 35 Principle
I. Evaluation Methods
Perplexity (lower is better) measures language modeling quality. Benchmarks (MMLU, HellaSwag, HumanEval) test specific capabilities. Human evaluation remains the gold standard for open-ended quality.
# Perplexity: exp of mean token-level cross-entropy loss
import torch
perplexity = torch.exp(loss)  # loss is the average cross-entropy over tokens
# MMLU: multiple-choice accuracy across 57 subjects
# HumanEval: code generation pass@k (probability at least one of k samples passes)
# LLM-as-judge: use a strong model such as GPT-4 to rate outputs
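The pass@k metric mentioned above is usually computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass the tests, and estimate pass@k as 1 - C(n-c, k)/C(n, k). A minimal sketch (function name is my own):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots; a pass is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 10 samples, 5 correct: pass@1 = 0.5
print(pass_at_k(10, 5, 1))
```

Naively computing c/n and raising it to a power is biased for small n, which is why the combinatorial form is preferred.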
When Benchmarks Lie
Models can overfit to benchmarks through data contamination. A model that scores 90% on MMLU may have seen the test questions during training. Always use multiple evaluation methods and be skeptical of single-metric claims.
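One practical contamination check is n-gram overlap: if a large fraction of a test item's n-grams appear verbatim in the training corpus, the item is suspect. A rough sketch, assuming whitespace tokenization (function names and the n=8 default are illustrative choices, not a standard):

```python
def ngrams(text: str, n: int = 8) -> set[str]:
    """All word-level n-grams in a text."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(test_item: str, train_corpus: list[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams found verbatim in training docs."""
    test = ngrams(test_item, n)
    if not test:
        return 0.0
    train: set[str] = set()
    for doc in train_corpus:
        train |= ngrams(doc, n)
    return len(test & train) / len(test)
```

Real decontamination pipelines normalize text and hash n-grams for scale, but the flag is the same: high overlap means the benchmark score may reflect memorization rather than capability.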
V. Deliverables
- Perplexity computation
- MMLU evaluation
- HumanEval for code
- LLM-as-judge
- Human evaluation protocol
- Phase 3 review
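For the human evaluation protocol deliverable, a common design is blind pairwise comparison: show raters two outputs in randomized order (to mitigate position bias) and aggregate votes into a win rate. A minimal sketch under those assumptions (function names are my own):

```python
import random

def randomized_pair(a: str, b: str, rng: random.Random) -> tuple[str, str, bool]:
    """Return the two outputs in random order; the flag records a swap
    so votes can be mapped back to the underlying models."""
    if rng.random() < 0.5:
        return a, b, False
    return b, a, True

def win_rate(votes: list[str]) -> float:
    """Win rate for model A over pairwise votes; ties count as half a win."""
    score = sum(1.0 if v == "A" else 0.5 if v == "tie" else 0.0 for v in votes)
    return score / len(votes)

# e.g. A wins 2, loses 1, ties 1 -> win rate 0.625
print(win_rate(["A", "B", "tie", "A"]))
```

The same aggregation works for LLM-as-judge votes, with the added caveat that judge models have their own position and verbosity biases, so randomization matters there too.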
Phase 3 complete. You understand LLM architecture, training, alignment, and evaluation at production depth. Phase 4: RAG.
— Day 35 Closing