Abstract. An LLM evaluation is a measurement instrument. Like all instruments, it has a signal model, a noise floor, systematic biases, and a validity boundary beyond which its readings are meaningless. This document derives the mathematics of each: the decomposition of a benchmark score into signal, variance, and contamination terms; inter-rater reliability measures for human evaluation; the calibration gap between stated confidence and empirical accuracy; Goodhart's Law formalised as optimisation-induced proxy decoupling; and the construct validity chain from observable behavior to inferred capability to alignment. The goal is to give practitioners the formal tools to characterise precisely what a score means — and when it means nothing.
§ 1
The Evaluation Taxonomy
Three distinct measurement objects — capability, alignment, deployment behavior
The LLM evaluation literature conflates three fundamentally distinct objects, often without acknowledging that they require different measurement instruments, different validity assumptions, and different statistical treatments. Before measuring, one must define precisely what is being measured.
Exhibit 1 — Evaluation Taxonomy: Three Objects, Five Modes, One Critical Error full map
The critical error in LLM evaluation practice: treating all three columns as interchangeable. A model that tops MMLU (capability) can simultaneously fail TruthfulQA (alignment) and underperform in production (deployment). Scores from different columns cannot be averaged, ranked, or compared without an explicit aggregation model that assigns weights — and that weighting is a normative judgment, not a technical one.
§ 2
The Measurement Model
Decomposing a benchmark score into signal, variance, and bias terms
An evaluation score is not a direct reading of model capability. It is a measurement with a noise floor, systematic biases, and a construct validity gap. The correct model is:
Score Decomposition — what a benchmark score S actually contains
S = theta + epsilon_sampling + epsilon_prompt + delta_contamination + delta_construct
theta = true underlying capability on the measured construct
epsilon_sampling = random error from finite test set (n questions)
epsilon_prompt = variance from prompt wording, format, few-shot examples
delta_contamination = systematic positive bias from train/test overlap
delta_construct = systematic bias from construct invalidity (benchmark does not measure what it claims to measure)
Sampling error. For a binary accuracy metric on n i.i.d. questions with true accuracy p, the observed accuracy S = k/n is the MLE. By the CLT:
Sampling Variance — confidence interval on benchmark accuracy
SE(S) = sqrt( p(1-p) / n ) ~= sqrt( S(1-S) / n )
95% CI: S +- 1.96 * sqrt( S(1-S) / n )
n=1000, S=0.85: SE ~= 0.011 | 95% CI = [0.828, 0.872] (tight)
n=100, S=0.85: SE ~= 0.036 | 95% CI = [0.780, 0.920] (useless)
Most MMLU subcategories: n ~= 100-300. Many reported improvements lie within SE.
Minimum detectable difference at n=1000, p=0.85: delta_min ~= 2*SE ~= 2.2pp
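The interval arithmetic above can be checked in a few lines; a minimal sketch using only the normal approximation from the text:

```python
import math

def benchmark_ci(score: float, n: int, z: float = 1.96):
    """Standard error and normal-approximation 95% CI for accuracy S = k/n."""
    se = math.sqrt(score * (1 - score) / n)
    return se, (score - z * se, score + z * se)

se_1k, ci_1k = benchmark_ci(0.85, 1000)   # SE ~ 0.011, CI ~ [0.828, 0.872]
se_100, ci_100 = benchmark_ci(0.85, 100)  # SE ~ 0.036, CI ~ [0.780, 0.920]

# Rule of thumb from the text: a reported improvement smaller than ~2*SE
# is not distinguishable from sampling noise at n = 1000.
delta_min = 2 * se_1k                     # ~ 0.023, i.e. ~2.2pp
```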
Prompt variance. The same model on the same questions can produce accuracy scores varying by 5–15 percentage points across prompt formulations, few-shot example choice, and system prompt content. This variance is rarely reported. A model's score on a benchmark is implicitly conditioned on a specific prompt template — a hidden degree of freedom that is not part of the model's capability.
Prompt Sensitivity — score as a random variable over prompt space Pi
S(pi) = (1/n) sum_i 1[ f(x_i; pi) = y_i ]
pi in Pi = prompt template (few-shot examples, instruction wording, output format)
E_pi[S(pi)] =/= theta in general
Var_pi[S(pi)] = prompt sensitivity — rarely reported, often large
Calibrated reporting: mean S_bar = E_pi[S], std sigma_pi = sqrt(Var_pi[S(pi)]) over a distribution of k >= 5 reasonable prompt templates.
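Reporting a score together with its prompt variance only requires re-running the benchmark under several templates; a sketch with hypothetical per-template scores (the numbers are illustrative, not from any real run):

```python
import statistics

# Hypothetical scores for one model on one benchmark under k = 5
# reasonable prompt templates (illustrative values only).
scores_by_template = [0.84, 0.79, 0.87, 0.81, 0.85]

s_bar = statistics.mean(scores_by_template)      # estimate of E_pi[S(pi)]
sigma_pi = statistics.stdev(scores_by_template)  # estimate of sqrt(Var_pi[S(pi)])

# Calibrated report: a mean and a spread, not a single headline number.
report = f"S = {s_bar:.3f} +/- {sigma_pi:.3f} over {len(scores_by_template)} templates"
```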
§ 3
Benchmark Contamination
The mathematics of train/test overlap and score inflation
Contamination is the most consequential systematic bias in LLM evaluation. If benchmark questions or their paraphrases appear in pretraining data, the model has effectively memorised answers rather than reasoning from capability. The score inflates by delta_c, which is invisible in the headline number.
Contamination Bias — formal decomposition and magnitude
c = |{i : dist(x_i, D_train) < tau}| / |B|   (contamination rate)
dist(x, D) = min_{d in D} edit_distance(x, d) or 1 - max_d cos_sim(embed(x), embed(d))
S_c = c * p_mem + (1-c) * p_clean   (contaminated score)
delta_c = S_c - p_clean = c * (p_mem - p_clean)   (contamination bias)
p_mem = accuracy on contaminated questions (near-perfect if memorised)
p_clean = accuracy on uncontaminated questions (true capability)
Worst case: c=0.30, p_mem=0.95, p_clean=0.70:
delta_c = 0.30 * 0.25 = +7.5pp — invisible in the headline score.
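The worst-case arithmetic above is easy to reproduce; a minimal sketch of the decomposition:

```python
def contaminated_score(c: float, p_mem: float, p_clean: float):
    """S_c = c * p_mem + (1 - c) * p_clean; bias delta_c = c * (p_mem - p_clean)."""
    s_c = c * p_mem + (1 - c) * p_clean
    delta_c = s_c - p_clean
    return s_c, delta_c

# Worst case from the text: 30% contamination, near-memorised accuracy 0.95,
# true capability 0.70 — the headline score silently gains 7.5pp.
s_c, delta_c = contaminated_score(c=0.30, p_mem=0.95, p_clean=0.70)
```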
Exhibit 2 — Contamination Detection Methods and Statistical Power four approaches
Prospective — must be done before training begins.
Cannot be applied retroactively to deployed models.
Provides controlled estimate of model's memorisation rate.
Gold standard for controlled studies; impractical for external evaluators.
Membership Inference Attack
LR(x) = log[ p_model(x) / p_ref(x) ] >= lambda → in train. Min-K% Prob: score x by the mean log-probability of its lowest-probability tokens
Black-box: requires only API access. Retroactive.
High false positive rate — requires per-model calibration.
Min-K% Prob (Shi et al. 2024) is more robust than mean perplexity.
Only available retroactive method for closed-weight models.
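The Min-K% Prob statistic from the table reduces to a few lines once per-token log-probabilities are available; a sketch assuming an API that exposes them (the threshold lambda still needs per-model calibration, as noted above, and the log-prob values here are illustrative):

```python
def min_k_percent_prob(token_logprobs, k: float = 0.2):
    """Mean log-probability of the lowest-probability k fraction of tokens.
    Memorised text has few surprising tokens, so a high (less negative)
    score is evidence of train-set membership (Shi et al. 2024)."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Illustrative log-probs: a memorised sequence is uniformly unsurprising;
# a novel one has a long tail of low-probability tokens.
memorised = [-0.1, -0.2, -0.1, -0.3, -0.1, -0.2, -0.1, -0.4, -0.2, -0.1]
novel     = [-0.5, -2.1, -0.3, -4.2, -0.8, -3.5, -0.4, -1.9, -2.7, -0.6]

score_mem = min_k_percent_prob(memorised)
score_nov = min_k_percent_prob(novel)
# Flag as contaminated when the score exceeds the calibrated lambda.
```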
No single method achieves both high precision and high recall on the general contamination problem. The structural solution is dynamic benchmarks: LiveBench, LMSYS Chatbot Arena live evaluations, and similar approaches generate evaluation data after the model's training cutoff — making contamination structurally impossible.
§ 4
Inter-Rater Reliability and Human Evaluation
Cohen's kappa, Krippendorff's alpha, and where annotation breaks down
For alignment, safety, and preference collection, there is no deterministic ground truth — humans must judge. Human annotation introduces agreement noise that must be quantified before a label can be treated as a signal.
Krippendorff's alpha — generalised IRR for k raters, any scale type
alpha = 1 - D_o / D_e
D_o = observed disagreement
D_e = expected disagreement by chance from marginal label distribution
Handles missing data, ordinal/interval scales, k > 2 raters.
Krippendorff minimum: alpha >= 0.667 tentative, alpha >= 0.800 reliable.
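The section's other statistic, Cohen's kappa, has the same chance-correction structure for the two-rater nominal case: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch on made-up toxicity labels:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for chance."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# Made-up labels from two raters over eight borderline items.
a = ["toxic", "safe", "safe", "toxic", "safe", "safe", "toxic", "safe"]
b = ["toxic", "safe", "toxic", "toxic", "safe", "safe", "safe", "safe"]
k = cohens_kappa(a, b)   # 6/8 raw agreement, chance-corrected to ~0.47
```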
Exhibit 3 — Observed kappa/alpha Values Across LLM Evaluation Tasks where agreement lands in practice
Toxicity / harm (borderline cases)
kappa ~0.20-0.38
Fair at best. Disagreement concentrates on borderline cases — exactly the cases that matter most for safety decisions.
Helpfulness (user preference)
kappa ~0.40-0.55
Moderate. Raters agree on clearly good/bad responses but diverge on length, formality, depth vs. conciseness.
Factual accuracy (verifiable claims)
kappa ~0.60-0.75
Substantial for unambiguous factual claims. Degrades for domain expertise or compound statements.
Code correctness (pass/fail execution)
kappa ~0.85-0.95
Near-perfect — because raters are replaced by test runners. The lesson: replace human annotation with deterministic oracles wherever possible.
Math / logic (exact answer)
kappa ~0.88-0.97
Near-perfect at final-answer level. Degrades substantially when checking intermediate reasoning steps.
IRR is highest precisely when evaluation is least needed (problems with known answers) and lowest when evaluation matters most (safety, alignment, nuanced preference).
§ 5
LLM-as-Judge: Bias Structure and Validity Conditions
The mathematics of using a model to evaluate another model
Using a strong LLM as an automated judge has become the dominant approach for alignment and quality evaluation, replacing costly human annotation. The method has measurable failure modes that must be explicitly corrected.
Judge Score Decomposition — full bias structure
score_judge(A vs B) = theta_true + bias_position + bias_verbosity + bias_self_pref + bias_format + epsilon
bias_position ~ 10-20pp systematic preference for the response in position A
bias_verbosity = preference for longer responses independent of quality
bias_self_pref = judge prefers outputs stylistically similar to its own generations
bias_format = preference for markdown, headers, bullet points
Calibration: rho(score_judge, score_human) ~= 0.60-0.80 (Spearman) for a GPT-4 judge
Position Bias Correction
Compare A-then-B and B-then-A; the debiased win rate is the geometric mean sqrt(p * q) where p = win rate in position 1, q = win rate in position 2. Cost: 2× evaluations. Correction for verbosity bias (AlpacaEval 2.0 LC): regress out response length from judge score — score_corrected = score - beta * length.
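The position-debiasing step is mechanical; a sketch with illustrative win rates (the verbosity regression would be fitted the same way on real judge scores and lengths):

```python
import math

def debiased_win_rate(p_pos1: float, p_pos2: float) -> float:
    """Geometric-mean combination of model A's win rate when shown first
    (p_pos1) and when shown second (p_pos2), per the 2x evaluation scheme."""
    return math.sqrt(p_pos1 * p_pos2)

# Illustrative: A wins 72% of pairs when listed first but only 48% when
# listed second — the 24pp gap is position bias, not a quality difference.
w = debiased_win_rate(0.72, 0.48)   # ~0.59
```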
Self-Preference Validity Threat
A GPT-4 judge evaluating GPT-4 outputs has a circular validity problem: the judge's preferences are not independent of the evaluated model's generation distribution. Solution: use a judge from a different model family, or use multiple diverse judges.
§ 6
Calibration
Expected calibration error, reliability diagrams, and what RLHF does to uncertainty
A model is calibrated if its stated confidence equals its empirical accuracy: when it says it is 80% confident, it should be correct 80% of the time. Calibration is a property of uncertainty estimates, separate from accuracy.
Calibration Error Measures — ECE, MCE, and proper scoring rules
Expected Calibration Error (ECE):
ECE = sum_{b=1}^{B} (|B_b| / n) * |acc(B_b) - conf(B_b)|
B_b = samples with predicted confidence in bin b
ECE in [0,1]. Well-calibrated: ECE ~= 0.02-0.05.
Maximum Calibration Error (MCE):
MCE = max_b |acc(B_b) - conf(B_b)|
Proper Scoring Rules:
NLL = -(1/n) sum_i [ y_i * log(p_i) + (1-y_i) * log(1-p_i) ]
BS = (1/n) sum_i (p_i - y_i)^2   (Brier score)
Both uniquely minimised by p_i = P(y_i=1 | x_i). ECE is NOT a proper scoring rule.
Post-hoc calibration — temperature scaling:
T* = arg min_{T>0} NLL( sigma(logits / T), y )
Single parameter T fitted on a held-out calibration set. Does not change accuracy.
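The ECE definition translates directly into code; a sketch with equal-width bins (bin count and bin-edge handling are the usual implementation choices):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins: int = 10) -> float:
    """ECE = sum_b (|B_b| / n) * |acc(B_b) - conf(B_b)| over equal-width bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

# Perfectly calibrated toy case: 80% confidence, 80% empirical accuracy.
ece_ok = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)   # 0.0
# Overconfident case: 90% confidence but 60% accuracy -> ECE = 0.3.
ece_bad = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
```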
Exhibit 4 — Reliability Diagrams: Three Calibration Pathologies confidence vs empirical accuracy
RLHF training systematically degrades calibration toward underconfidence. Temperature scaling corrects this without changing accuracy: find T* = arg min NLL(sigma(logits/T), y) on a held-out calibration set.
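Temperature scaling itself is a one-parameter fit. A sketch on a synthetic binary task where the logits have been artificially doubled, so the recovered T* should land near 2; the grid search is a stand-in for a proper 1-D optimiser such as scipy.optimize.minimize_scalar:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(p, y):
    """Binary negative log-likelihood, the objective temperature scaling minimises."""
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_temperature(logits, y, grid=np.linspace(0.25, 5.0, 400)):
    """T* = argmin_T NLL(sigmoid(logits / T), y), here by 1-D grid search."""
    losses = [nll(sigmoid(logits / t), y) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic overconfident model: logits are double the calibrated log-odds.
rng = np.random.default_rng(0)
z_true = rng.normal(0.0, 1.0, 2000)
y = (rng.random(2000) < sigmoid(z_true)).astype(float)
t_star = fit_temperature(2.0 * z_true, y)   # should recover T* near 2
```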
§ 7
Goodhart's Law — The Mathematics of Metric Collapse
When a measure becomes a target, the correlation to the underlying construct collapses
Goodhart's Law (1975), formalised for ML by Krakovna et al. (2020) and Gao et al. (2022), describes the failure mode of optimising for a proxy metric: as optimisation pressure increases, the proxy decouples from the underlying construct it was designed to measure.
Goodhart's Law — formal statement and overoptimisation model
Let M : Theta → R be a metric (e.g., MMLU accuracy, reward model score)
Let U : Theta → R be true utility (actual capability, alignment, safety)
Goodhart's Law: lim_{M(theta) → M_max} Corr(M(theta), U(theta)) → 0
Formalised as overoptimisation (Gao et al. 2022):
U(theta) ~= alpha * sqrt( KL(theta || theta_0) ) - beta * KL(theta || theta_0)
True utility increases as sqrt(KL) (sublinear), then decreases as -beta * KL.
This is the theoretical basis for KL penalties in RLHF.
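The peak of the overoptimisation curve follows from setting dU/dKL = alpha / (2 sqrt(KL)) - beta = 0, which gives KL* = (alpha / (2 beta))^2. A sketch with illustrative coefficients (the alpha and beta values here are made up, not fitted ones):

```python
import math

def true_utility(kl: float, alpha: float = 1.0, beta: float = 0.1) -> float:
    """U(KL) = alpha * sqrt(KL) - beta * KL  (Gao et al. 2022 functional form)."""
    return alpha * math.sqrt(kl) - beta * kl

def kl_star(alpha: float = 1.0, beta: float = 0.1) -> float:
    """Peak of U: dU/dKL = alpha / (2 sqrt(KL)) - beta = 0 -> KL* = (alpha/2beta)^2."""
    return (alpha / (2 * beta)) ** 2

peak = kl_star()   # KL* = 25 for alpha=1, beta=0.1
# Beyond KL*, further optimisation pressure lowers true utility even as
# the proxy metric M keeps improving.
```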
Exhibit 5 — The Overoptimisation Curve: Metric vs. Utility Under Increasing KL Divergence Gao et al. 2022
Four empirical LLM instances: (1) MMLU saturation; (2) sycophancy from RLHF; (3) HumanEval gaming; (4) TruthfulQA/RLHF confident falsehoods. The operational fix is regular benchmark rotation.
§ 8
Construct Validity
The chain from observable behavior to inferred meaning — four links that each can break
Construct validity (Cronbach and Meehl, 1955) is the degree to which an instrument measures the theoretical construct it purports to measure. In LLM evaluation, this is the core problem: does a score on benchmark B actually measure capability or alignment property C?
Exhibit 6 — Construct Validity Chain: Four Inference Levels from string output to claimed value
Observable behavior — the literal token sequence output
The string generated by the model for input x. Not the model's "reasoning," not its "beliefs" — only the token sequence.
MEASURED DIRECTLY
Surface task performance — accuracy on the specific benchmark format
Does the output match the expected answer format? Failure modes: MCQ vs. free-form format sensitivity. Evidence required: cross-format replication.
INFERRED — STEP 1
Capability — the underlying cognitive skill the benchmark probes
Threats: contamination, format gaming, shortcut learning. Evidence required: cross-benchmark correlation, transfer to novel instances.
INFERRED — STEP 2
Alignment — the normative property (safety, helpfulness, honesty)
MMLU measures knowledge breadth, not whether the model uses that knowledge helpfully. IRR is lowest at this level because it requires normative agreement.
INFERRED — STEP 3
Values / character — what the model does under adversarial pressure
No static benchmark can fully evaluate this level — it requires ongoing adversarial red-teaming, behavioral monitoring, and mechanistic interpretability.
INFERRED — STEP 4
Each inference step multiplies uncertainty. A high MMLU score (steps 1-2) says little about alignment or values under pressure (steps 3-4). Never report a capability score as evidence for an alignment claim without explicit justification of the intermediate inference steps.
Convergent and discriminant validity. A valid measure of construct C should: (a) correlate with other measures of the same construct (convergent validity) and (b) not correlate with measures of different constructs (discriminant validity). MMLU, ARC, and HellaSwag correlate at r ≈ 0.85–0.95 across models, supporting convergent validity for "language understanding." But helpfulness and harmlessness should be approximately independent — yet RLHF-trained models show a confounded tradeoff driven by sycophancy.
§ 9
When to Trust a Score — The Trust Matrix
Decision framework across five validity dimensions, with benchmark audit
A benchmark score is trustworthy only when all five validity axes are controlled. A score failing three or more should not be used to make comparative capability claims.
Harness complexity — test environment errors can mask failures.
No widely-used benchmark achieves high trust on all five dimensions simultaneously. The best-performing are Chatbot Arena (live data, large n, diverse prompts) and SWE-bench (deterministic oracle, real engineering tasks). MMLU — the most commonly cited — has high contamination risk and a large construct validity gap.