You can’t manage what you can’t measure. Evaluation is the most underrated skill in LLM engineering. Today you learn every tool in the measurement toolkit.
— Day 35 Principle
I. Evaluation Methods
Perplexity (lower is better) measures language modeling quality. Benchmarks (MMLU, HellaSwag, HumanEval) test specific capabilities. Human evaluation remains the gold standard for open-ended quality.
# Perplexity: exp of mean token-level cross-entropy loss
import torch
perplexity = torch.exp(loss)  # loss is the average cross-entropy over tokens
# MMLU: multiple-choice accuracy across 57 subjects
# HumanEval: code generation pass@k (probability at least one of k samples passes)
# LLM-as-judge: use a strong model such as GPT-4 to rate outputs
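The pass@k metric mentioned above is usually computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass the tests, and estimate pass@k as 1 - C(n-c, k)/C(n, k). A minimal sketch (function name is my own):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots; a pass is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 10 samples, 5 correct: pass@1 = 0.5
print(pass_at_k(10, 5, 1))
```

Naively computing c/n and raising it to a power is biased for small n, which is why the combinatorial form is preferred.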
When Benchmarks Lie
Models can overfit to benchmarks through data contamination. A model that scores 90% on MMLU may have seen the test questions during training. Always use multiple evaluation methods and be skeptical of single-metric claims.
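One practical contamination check is n-gram overlap: if a large fraction of a test item's n-grams appear verbatim in the training corpus, the item is suspect. A rough sketch, assuming whitespace tokenization (function names and the n=8 default are illustrative choices, not a standard):

```python
def ngrams(text: str, n: int = 8) -> set[str]:
    """All word-level n-grams in a text."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(test_item: str, train_corpus: list[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams found verbatim in training docs."""
    test = ngrams(test_item, n)
    if not test:
        return 0.0
    train: set[str] = set()
    for doc in train_corpus:
        train |= ngrams(doc, n)
    return len(test & train) / len(test)
```

Real decontamination pipelines normalize text and hash n-grams for scale, but the flag is the same: high overlap means the benchmark score may reflect memorization rather than capability.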
V. Deliverables
- Perplexity computation
- MMLU evaluation
- HumanEval for code
- LLM-as-judge
- Human evaluation protocol
- Phase 3 review
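For the human evaluation protocol deliverable, a common design is blind pairwise comparison: show raters two outputs in randomized order (to mitigate position bias) and aggregate votes into a win rate. A minimal sketch under those assumptions (function names are my own):

```python
import random

def randomized_pair(a: str, b: str, rng: random.Random) -> tuple[str, str, bool]:
    """Return the two outputs in random order; the flag records a swap
    so votes can be mapped back to the underlying models."""
    if rng.random() < 0.5:
        return a, b, False
    return b, a, True

def win_rate(votes: list[str]) -> float:
    """Win rate for model A over pairwise votes; ties count as half a win."""
    score = sum(1.0 if v == "A" else 0.5 if v == "tie" else 0.0 for v in votes)
    return score / len(votes)

# e.g. A wins 2, loses 1, ties 1 -> win rate 0.625
print(win_rate(["A", "B", "tie", "A"]))
```

The same aggregation works for LLM-as-judge votes, with the added caveat that judge models have their own position and verbosity biases, so randomization matters there too.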
Phase 3 complete. You understand LLM architecture, training, alignment, and evaluation at production depth. Phase 4: RAG.
— Day 35 Closing