Mathematical Deep Dive · Measurement Theory · NLP Evaluation · February 2026

LLM Evaluation
A mathematical treatment of what benchmark scores measure, how they break, and when to trust them

Sections: 9 · Exhibits: 8 · Scope: measurement theory, contamination, calibration, construct validity, trust matrix
Abstract. An LLM evaluation is a measurement instrument. Like all instruments, it has a signal model, a noise floor, systematic biases, and a validity boundary beyond which its readings are meaningless. This document derives the mathematics of each: the decomposition of a benchmark score into signal, variance, and contamination terms; inter-rater reliability measures for human evaluation; the calibration gap between stated confidence and empirical accuracy; Goodhart's Law formalised as optimisation-induced proxy decoupling; and the construct validity chain from observable behavior to inferred capability to alignment. The goal is to give practitioners the formal tools to characterise precisely what a score means — and when it means nothing.
§ 1
The Evaluation Taxonomy
Three distinct measurement objects — capability, alignment, deployment behavior

The LLM evaluation literature conflates three fundamentally distinct objects, often without acknowledging that they require different measurement instruments, different validity assumptions, and different statistical treatments. Before measuring, one must define precisely what is being measured.

Exhibit 1 — Evaluation Taxonomy: Three Objects, Five Modes, One Critical Error full map
CAPABILITY
Knowledge breadth: MMLU, TriviaQA, NaturalQuestions
Reasoning depth: GSM8K, MATH, ARC, BIG-Bench
Code generation: HumanEval, MBPP, SWE-bench
Language understanding: GLUE, SuperGLUE, HellaSwag
Long-context recall: NIAH (needle-in-a-haystack)
Instruction following: IFEval, FollowBench
Measure: automated, deterministic. Ground truth: objective. Primary threat: contamination.

ALIGNMENT
Harmlessness: ToxiGen, BBQ, WinoBias
Honesty / calibration: TruthfulQA, ECE measurement
Helpfulness: MT-Bench, AlpacaEval, Chatbot Arena
Value alignment: ETHICS benchmark, ValueBench
Refusal appropriateness: StrongREJECT, WildGuard
Sycophancy: TruthfulQA adversarial, Perez et al. 2023
Measure: human + LLM-as-judge. Ground truth: contested, normative. Primary threat: label ambiguity, low IRR.

DEPLOYMENT BEHAVIOR
Latency / throughput: TTFT, tokens/sec (operational, not eval quality)
Perturbation robustness: PromptBench, AdvGLUE
Consistency / variance: sigma across temperatures and seeds
Task-specific accuracy: domain benchmarks, A/B tests
Cost-quality frontier: Pareto curve vs. API cost
Human preference in context: live A/B tests, RLHF reward models
Measure: production systems. Ground truth: business outcome. Primary threat: distribution shift.
The critical error in LLM evaluation practice: treating all three columns as interchangeable. A model that tops MMLU (capability) can simultaneously fail TruthfulQA (alignment) and underperform in production (deployment). Scores from different columns cannot be averaged, ranked, or compared without an explicit aggregation model that assigns weights — and that weighting is a normative judgment, not a technical one.
§ 2
The Measurement Model
Decomposing a benchmark score into signal, variance, and bias terms

An evaluation score is not a direct reading of model capability. It is a measurement with a noise floor, systematic biases, and a construct validity gap. The correct model is:

Score Decomposition — what a benchmark score S actually contains
S = theta + epsilon_sampling + epsilon_prompt + delta_contamination + delta_construct
theta = true underlying capability on the measured construct
epsilon_sampling = random error from finite test set (n questions)
epsilon_prompt = variance from prompt wording, format, few-shot examples
delta_contamination = systematic positive bias from train/test overlap
delta_construct = systematic bias from construct invalidity (benchmark does not measure what it claims to measure)

Sampling error. For a binary accuracy metric on n i.i.d. questions with true accuracy p, the observed accuracy S = k/n is the MLE. By the CLT:

Sampling Variance — confidence interval on benchmark accuracy
SE(S) = sqrt( p(1-p) / n ) ~= sqrt( S(1-S) / n )
95% CI: S +- 1.96 * sqrt( S(1-S) / n )
n=1000, S=0.85: SE ~= 0.011 | 95% CI = [0.828, 0.872] (tight)
n=100, S=0.85: SE ~= 0.036 | 95% CI = [0.780, 0.920] (useless)
Most MMLU subcategories: n ~= 100-300. Many reported improvements lie within SE.
Minimum detectable difference at n=1000, p=0.85: delta_min ~= 2*SE ~= 2.2pp
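The interval above can be computed directly as a sanity check. A minimal sketch (function name is illustrative, not from any library):

```python
import math

def accuracy_ci(s, n, z=1.96):
    """Normal-approximation standard error and CI for accuracy s on n i.i.d. items."""
    se = math.sqrt(s * (1 - s) / n)          # SE(S) ~= sqrt(S(1-S)/n)
    return se, (s - z * se, s + z * se)

se_big, ci_big = accuracy_ci(0.85, 1000)     # tight interval
se_small, ci_small = accuracy_ci(0.85, 100)  # interval too wide to rank models
min_detectable = 2 * se_big                  # ~2.2pp at n=1000
```

Running the two cases reproduces the numbers in the block above; any reported improvement smaller than `min_detectable` is indistinguishable from sampling noise.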

Prompt variance. The same model on the same questions can produce accuracy scores varying by 5–15 percentage points across prompt formulations, few-shot example choice, and system prompt content. This variance is rarely reported. A model's score on a benchmark is implicitly conditioned on a specific prompt template — a hidden degree of freedom that is not part of the model's capability.

Prompt Sensitivity — score as a random variable over prompt space Pi
S(pi) = (1/n) sum_i 1[ f(x_i; pi) = y_i ]
pi in Pi = prompt template (few-shot examples, instruction wording, output format)
E_pi[S(pi)] =/= theta in general
Var_pi[S(pi)] = prompt sensitivity — rarely reported, often large
Calibrated reporting: mean S_bar = E_pi[S(pi)] and std sigma_pi = sqrt(Var_pi[S(pi)]) over a distribution of k >= 5 reasonable prompt templates.
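Calibrated reporting over a prompt distribution reduces to a mean and sample standard deviation over per-template scores. A sketch with hypothetical scores:

```python
from statistics import mean, stdev

def prompt_sensitivity(template_scores):
    """Mean score and std (sigma_pi) over k >= 5 prompt templates."""
    if len(template_scores) < 5:
        raise ValueError("report over at least 5 templates")
    return mean(template_scores), stdev(template_scores)

# Hypothetical accuracies of one model under five prompt templates:
s_bar, sigma_pi = prompt_sensitivity([0.85, 0.81, 0.88, 0.79, 0.84])
```

Here sigma_pi is several percentage points — larger than the sampling SE at n=1000, which is why reporting a single-template score overstates precision.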
§ 3
Benchmark Contamination
The mathematics of train/test overlap and score inflation

Contamination is the most consequential systematic bias in LLM evaluation. If benchmark questions or their paraphrases appear in pretraining data, the model has effectively memorised answers rather than reasoning from capability. The score inflates by delta_c, which is invisible in the headline number.

Contamination Bias — formal decomposition and magnitude
c = |{i : dist(x_i, D_train) < tau}| / |B|   (contamination rate)
dist(x, D) = min_{d in D} edit_distance(x, d)   or   1 - max_{d in D} cos_sim(embed(x), embed(d))
S_c = c * p_mem + (1-c) * p_clean   (contaminated score)
delta_c = S_c - p_clean = c * (p_mem - p_clean)   (contamination bias)
p_mem = accuracy on contaminated questions (near-perfect if memorised)
p_clean = accuracy on uncontaminated questions (true capability)
Worst case: c = 0.30, p_mem = 0.95, p_clean = 0.70: delta_c = 0.30 * 0.25 = +7.5pp — invisible in the headline score.
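The decomposition is a two-line computation; a minimal sketch reproducing the worst case above:

```python
def contamination_bias(c, p_mem, p_clean):
    """S_c = c*p_mem + (1-c)*p_clean; delta_c = S_c - p_clean = c*(p_mem - p_clean)."""
    s_c = c * p_mem + (1 - c) * p_clean
    return s_c, s_c - p_clean

# Worst case from the block above: 30% contamination, near-perfect memorisation.
s_c, delta_c = contamination_bias(c=0.30, p_mem=0.95, p_clean=0.70)
```

The headline score of 0.775 looks like a capable model; 7.5 of those points are memorisation.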
Exhibit 2 — Contamination Detection Methods and Statistical Power four approaches
N-Gram Overlap (Exact)
c_exact = |{i : |ngrams(x_i) ∩ ngrams(D_train)| >= theta}| / |B|
Fast, deterministic. 13-gram match threshold is standard.
Misses semantic paraphrases — underestimates contamination.
Used by: OpenAI (GPT-4 report), most open model releases.
Limitation: same question with different phrasing evades detection entirely.
Embedding Similarity (Semantic)
c_sem = |{i : max_{d in D} cos(embed(x_i), embed(d)) >= tau}| / |B|
Catches paraphrases. tau in [0.85, 0.95] typical.
O(|B| x |D|) comparisons — expensive at web-corpus scale.
Embedding space may not capture answer-level similarity.
Better than n-gram but misses concept-level contamination.
Canary Insertion
Insert synthetic Q_canary into train; P(memorised) = P(model answers Q_canary correctly)
Prospective — must be done before training begins.
Cannot be applied retroactively to deployed models.
Provides controlled estimate of model's memorisation rate.
Gold standard for controlled studies; impractical for external evaluators.
Membership Inference Attack
LR(x) = log p_model(x) - log p_ref(x) >= lambda → x in train. Min-K% Prob: score only the lowest-probability tokens
Black-box: requires only API access. Retroactive.
High false positive rate — requires per-model calibration.
Min-K% Prob (Shi et al. 2024) more robust than mean perplexity.
Only available retroactive method for closed-weight models.
None of these methods achieves simultaneously high precision and high recall on the general contamination problem. The structural solution is dynamic benchmarks: LiveBench, LMSYS Chatbot Arena live evaluations, and similar approaches generate evaluation data after the model's training cutoff — making contamination structurally impossible.
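The exact-match detector in Exhibit 2 reduces to a set intersection over n-grams. A toy sketch — whitespace tokenisation for brevity; production pipelines hash 13-grams into scalable structures such as Bloom filters or suffix arrays rather than materialising the corpus set:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def exact_contamination_rate(benchmark, train_docs, n=13):
    """Fraction of benchmark items sharing at least one n-gram with training text."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc.split(), n)
    flagged = sum(1 for q in benchmark if ngrams(q.split(), n) & train_grams)
    return flagged / len(benchmark)
```

As the exhibit notes, this estimator is a lower bound: a paraphrase with zero shared 13-grams evades it entirely.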
§ 4
Inter-Rater Reliability and Human Evaluation
Cohen's kappa, Krippendorff's alpha, and where annotation breaks down

For alignment, safety, and preference collection, there is no deterministic ground truth — humans must judge. Human annotation introduces agreement noise that must be quantified before a label can be treated as a signal.

Cohen's kappa — agreement beyond chance, two raters, categorical labels
kappa = (P_o - P_e) / (1 - P_e)
P_o = observed agreement = sum_i p_ii
P_e = expected chance agreement = sum_i p_i. * p_.i   (product of row and column marginals)
Interpretation thresholds (Landis and Koch 1977):
kappa < 0.20: slight | [0.20, 0.40): fair | [0.40, 0.60): moderate | [0.60, 0.80): substantial | >= 0.80: near-perfect
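Cohen's kappa is computed directly from a two-rater confusion matrix of label counts. A minimal sketch (the example matrix is hypothetical):

```python
def cohens_kappa(counts):
    """kappa from a square matrix: counts[i][j] = items rater 1 labelled i, rater 2 labelled j."""
    n = sum(sum(row) for row in counts)
    k = len(counts)
    p_o = sum(counts[i][i] for i in range(k)) / n              # observed agreement
    row_marg = [sum(counts[i]) / n for i in range(k)]
    col_marg = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))       # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Binary labels: 85% raw agreement, but 50% expected by chance alone.
kappa = cohens_kappa([[40, 10], [5, 45]])
```

Raw agreement of 0.85 collapses to kappa = 0.70 once chance agreement is removed — "substantial" on the Landis and Koch scale, not near-perfect.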
Krippendorff's alpha — generalised IRR for k raters, any scale type
alpha = 1 - D_o / D_e
D_o = observed disagreement
D_e = expected disagreement by chance from the marginal label distribution
Handles missing data, ordinal/interval scales, and k > 2 raters.
Krippendorff's minima: alpha >= 0.667 tentative, alpha >= 0.800 reliable.
Exhibit 3 — Observed kappa/alpha Values Across LLM Evaluation Tasks where agreement lands in practice
Toxicity / harm (borderline cases)
kappa ~0.20-0.38
Fair at best. Disagreement concentrates on borderline cases — exactly the cases that matter most for safety decisions.
Helpfulness (user preference)
kappa ~0.40-0.55
Moderate. Raters agree on clearly good/bad responses but diverge on length, formality, depth vs. conciseness.
Factual accuracy (verifiable claims)
kappa ~0.60-0.75
Substantial for unambiguous factual claims. Degrades for domain expertise or compound statements.
Code correctness (pass/fail execution)
kappa ~0.85-0.95
Near-perfect — because raters are replaced by test runners. The lesson: replace human annotation with deterministic oracles wherever possible.
Math / logic (exact answer)
kappa ~0.88-0.97
Near-perfect at final-answer level. Degrades substantially when checking intermediate reasoning steps.
IRR is highest precisely when evaluation is least needed (problems with known answers) and lowest when evaluation matters most (safety, alignment, nuanced preference).
§ 5
LLM-as-Judge: Bias Structure and Validity Conditions
The mathematics of using a model to evaluate another model

Using a strong LLM as an automated judge has become the dominant approach for alignment and quality evaluation, replacing costly human annotation. The method has measurable failure modes that must be explicitly corrected.

Judge Score Decomposition — full bias structure
score_judge(A vs B) = theta_true + bias_position + bias_verbosity + bias_self_pref + bias_format + epsilon
bias_position ~ 10-20pp systematic preference for the response shown in position A
bias_verbosity = preference for longer responses independent of quality
bias_self_pref = judge prefers outputs stylistically similar to its own generations
bias_format = preference for markdown, headers, bullet points
Calibration: rho(score_judge, score_human) ~= 0.60-0.80 (Spearman) for a GPT-4 judge

Position Bias Correction

Compare A-then-B and B-then-A orderings; the debiased win rate is the geometric mean sqrt(p * q), where p is the win rate in position 1 and q the win rate in position 2. Cost: 2× evaluations. Verbosity bias is corrected separately (AlpacaEval 2.0 LC) by regressing response length out of the judge score: score_corrected = score - beta * length.
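Both corrections are one-liners. A sketch — beta is assumed to have been fitted beforehand on a calibration set, and the example win rates are hypothetical:

```python
import math

def debiased_win_rate(p_pos1, p_pos2):
    """Geometric mean of A's win rate shown first and shown second."""
    return math.sqrt(p_pos1 * p_pos2)

def length_corrected_score(raw_score, response_length, beta):
    """Verbosity correction: subtract the fitted length effect from the judge score."""
    return raw_score - beta * response_length

# A wins 64% when shown first but only 36% when shown second:
win = debiased_win_rate(0.64, 0.36)
```

The debiased rate of 0.48 reveals that the apparent 64% preference was mostly position bias: under the swap, A is actually slightly below parity.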

Self-Preference Validity Threat

A GPT-4 judge evaluating GPT-4 outputs has a circular validity problem: the judge's preferences are not independent of the evaluated model's generation distribution. Solution: use a judge from a different model family, or use multiple diverse judges.

§ 6
Calibration
Expected calibration error, reliability diagrams, and what RLHF does to uncertainty

A model is calibrated if its stated confidence equals its empirical accuracy: when it says it is 80% confident, it should be correct 80% of the time. Calibration is a property of uncertainty estimates, separate from accuracy.

Calibration Error Measures — ECE, MCE, and proper scoring rules
Expected Calibration Error (ECE):
ECE = sum_{b=1}^{B} (|B_b| / n) * |acc(B_b) - conf(B_b)|
B_b = samples with predicted confidence in bin b
ECE in [0,1]. Well-calibrated: ECE ~= 0.02-0.05.
Maximum Calibration Error (MCE):
MCE = max_b |acc(B_b) - conf(B_b)|
Proper scoring rules:
NLL = -(1/n) sum_i [ y_i * log(p_i) + (1-y_i) * log(1-p_i) ]
BS = (1/n) sum_i (p_i - y_i)^2   (Brier score)
Both are uniquely minimised by p_i = P(y_i=1|x_i). ECE is NOT a proper scoring rule.
Post-hoc calibration — temperature scaling:
T* = arg min_{T>0} NLL( sigma(logits / T), y )
A single parameter T fitted on a held-out calibration set; does not change accuracy.
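The binned ECE estimator above is short enough to write out in full. A minimal sketch with equal-width bins (the example confidences are hypothetical):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_b (|B_b|/n) * |acc(B_b) - conf(B_b)| over equal-width confidence bins."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        avg_acc = sum(correct[i] for i in idx) / len(idx)
        total += (len(idx) / n) * abs(avg_acc - avg_conf)
    return total

# Two populated bins: conf 0.95 with acc 1.0, and conf 0.55 with acc 0.5.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
```

Note the estimator is sensitive to `n_bins` — a known weakness of binned ECE, and one reason it is not a proper scoring rule.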
Exhibit 4 — Reliability Diagrams: Three Calibration Pathologies confidence vs empirical accuracy
Overconfident (common in LLMs): accuracy bars fall below the diagonal; ECE high.
Underconfident (post-RLHF typical): bars sit above the diagonal; the model hedges.
Well-calibrated (target state): bars straddle the diagonal; ECE < 0.05.
RLHF training systematically degrades calibration toward underconfidence. Temperature scaling corrects this without changing accuracy: find T* = arg min NLL(sigma(logits/T), y) on a held-out calibration set.
§ 7
Goodhart's Law — The Mathematics of Metric Collapse
When a measure becomes a target, the correlation to the underlying construct collapses

Goodhart's Law (1975), formalised for ML by Krakovna et al. (2020) and Gao et al. (2022), describes the failure mode of optimising for a proxy metric: as optimisation pressure increases, the proxy decouples from the underlying construct it was designed to measure.

Goodhart's Law — formal statement and overoptimisation model
Let M : Theta → R be a metric (e.g., MMLU accuracy, reward model score)
Let U : Theta → R be true utility (actual capability, alignment, safety)
Goodhart's Law: as M(theta) → M_max under optimisation pressure, Corr(M(theta), U(theta)) → 0
Formalised as overoptimisation (Gao et al. 2022):
U(theta) ~= alpha * sqrt( KL(theta || theta_0) ) - beta * KL(theta || theta_0)
True utility increases as sqrt(KL) (sublinear), then the linear -beta*KL term dominates and utility declines.
This is the theoretical basis for KL penalties in RLHF.
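The stated functional form has a closed-form utility peak: setting dU/dKL = alpha/(2*sqrt(KL)) - beta = 0 gives KL* = (alpha/(2*beta))^2. A sketch using this simplified form (the coefficients are illustrative, not fitted values from Gao et al.):

```python
import math

def true_utility(kl, alpha, beta):
    """U(theta) ~= alpha * sqrt(KL) - beta * KL (simplified overoptimisation form)."""
    return alpha * math.sqrt(kl) - beta * kl

def utility_peak_kl(alpha, beta):
    """dU/dKL = alpha/(2*sqrt(KL)) - beta = 0  =>  KL* = (alpha / (2*beta))**2."""
    return (alpha / (2 * beta)) ** 2

kl_star = utility_peak_kl(alpha=2.0, beta=1.0)  # utility peaks at KL* = 1.0
```

This is the quantity a well-tuned KL penalty targets: past KL*, the proxy metric keeps rising while true utility falls.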
Exhibit 5 — The Overoptimisation Curve: Metric vs. Utility Under Increasing KL Divergence Gao et al. 2022
Figure: value vs. KL divergence from the base policy. M(theta), the proxy metric, rises monotonically; U(theta), true utility, peaks at moderate KL (where a KL penalty keeps the policy) and then declines in the overoptimisation regime, where M stays high while U falls (reward hacking).
Four empirical LLM instances: (1) MMLU saturation; (2) sycophancy from RLHF; (3) HumanEval gaming; (4) TruthfulQA/RLHF confident falsehoods. The operational fix is regular benchmark rotation.
§ 8
Construct Validity
The chain from observable behavior to inferred meaning — four links that each can break

Construct validity (Cronbach and Meehl, 1955) is the degree to which an instrument measures the theoretical construct it purports to measure. In LLM evaluation, this is the core problem: does a score on benchmark B actually measure capability or alignment property C?

Exhibit 6 — Construct Validity Chain: Four Inference Levels from string output to claimed value
Observable behavior — the literal token sequence output
The string generated by the model for input x. Not the model's "reasoning," not its "beliefs" — only the token sequence.
MEASURED DIRECTLY
Surface task performance — accuracy on the specific benchmark format
Does the output match the expected answer format? Failure modes: MCQ vs. free-form format sensitivity. Evidence required: cross-format replication.
INFERRED — STEP 1
Capability — the underlying cognitive skill the benchmark probes
Threats: contamination, format gaming, shortcut learning. Evidence required: cross-benchmark correlation, transfer to novel instances.
INFERRED — STEP 2
Alignment — the normative property (safety, helpfulness, honesty)
MMLU measures knowledge breadth, not whether the model uses that knowledge helpfully. IRR is lowest at this level because it requires normative agreement.
INFERRED — STEP 3
Values / character — what the model does under adversarial pressure
No static benchmark can fully evaluate this level — it requires ongoing adversarial red-teaming, behavioral monitoring, and mechanistic interpretability.
INFERRED — STEP 4
Each inference step multiplies uncertainty. A high MMLU score (steps 1-2) says little about deployment alignment (step 4). Never report a capability score as evidence for an alignment claim without explicit justification of the inference steps.

Convergent and discriminant validity. A valid measure of construct C should: (a) correlate with other measures of the same construct (convergent validity) and (b) not correlate with measures of different constructs (discriminant validity). MMLU, ARC, and HellaSwag correlate at r ≈ 0.85–0.95 across models, supporting convergent validity for "language understanding." But helpfulness and harmlessness should be approximately independent — yet RLHF-trained models show a confounded tradeoff driven by sycophancy.
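Convergent validity reduces to a correlation check between models' scores on benchmarks claiming the same construct. A minimal sketch with hypothetical per-model scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two score vectors (one entry per model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores of four models on two "language understanding" benchmarks:
mmlu_scores = [0.45, 0.58, 0.70, 0.86]
arc_scores  = [0.50, 0.60, 0.74, 0.88]
r = pearson_r(mmlu_scores, arc_scores)  # high r supports convergent validity
```

The same function applied to a capability score and an alignment score should yield a much lower r; a high correlation there would itself be a validity red flag (the two measures are not discriminating).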

§ 9
When to Trust a Score — The Trust Matrix
Decision framework across five validity dimensions, with benchmark audit

A benchmark score is trustworthy only when all five validity dimensions are controlled. A score that falls in the low-trust column on three or more dimensions should not be used to make comparative capability claims.

Exhibit 7 — Five-Dimension Trust Matrix for Evaluation Scores decision framework
Dimension | High Trust | Medium Trust | Low Trust | Diagnostic
1. Sampling variance | n > 1000, SE < 1.5pp | n = 200–1000, SE 2–4pp | n < 200, SE > 4pp | SE = sqrt(S(1-S)/n); an improvement must exceed 2×SE
2. Prompt sensitivity | sigma_prompt < 1pp across 5+ templates | sigma_prompt 1–5pp, reported | sigma_prompt unreported or > 5pp | Run 5+ prompt variations; report mean and std
3. Contamination | Benchmark post-dates training cutoff | N-gram overlap tested; c < 5% | No contamination analysis performed | Min-K% Prob or n-gram overlap; a gap > 5pp indicates inflation
4. Annotation reliability | Deterministic oracle or kappa ≥ 0.80 | Human annotation, kappa in [0.60, 0.80) | Single annotator or kappa < 0.40 | For LLM-as-judge, report rho(judge, human) on ≥ 100 items
5. Construct validity | 3+ benchmarks of the same construct agree | Single benchmark; limitations documented | Capability score cited as alignment evidence | Cross-benchmark correlation for convergent validity
Reporting minimum for a trustworthy comparative claim: score ± SE(n), sigma_prompt over k≥5 templates, contamination rate c, IRR measure, and construct validity inference level.
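The decision rule can be mechanised as a first-pass filter. A sketch — the three-level labels and the "three or more low" rule are the matrix's; the function and key names are illustrative:

```python
def trust_verdict(dims):
    """dims: mapping of the five trust dimensions to 'high' | 'medium' | 'low'."""
    assert len(dims) == 5, "all five dimensions must be assessed"
    lows = sum(1 for v in dims.values() if v == "low")
    if lows >= 3:
        return "untrusted: do not use for comparative capability claims"
    if all(v == "high" for v in dims.values()):
        return "high trust"
    return "qualified: report SE, sigma_prompt, c, IRR, and inference level"

verdict = trust_verdict({
    "sampling": "high", "prompt": "low", "contamination": "low",
    "annotation": "low", "construct": "medium",
})
```

Applied to the example, three low-trust dimensions trigger the untrusted verdict regardless of the large n — exactly the pattern of a widely cited but poorly controlled benchmark.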
Exhibit 8 — Major Benchmarks Audited on All Five Dimensions current state of the field
Benchmark | n (test) | Sampling SE | Contamination | Ground truth / IRR | Construct claim | Primary failure mode
MMLU | 14,042 total; ~150/subcategory | ±2–4pp per subcategory | High: 2021 data, extensively crawled | MCQ, automated, deterministic | Knowledge breadth → claimed "general intelligence" | Contamination + construct leap from MCQ to reasoning
HumanEval | 164 | ±4–6pp on pass@1 | Moderate: GitHub in training data | Unit-test execution; near-perfect IRR | Code generation (narrow scope) | n=164 is very small; SWE-bench is harder and more realistic
MT-Bench | 80 questions | Large on 80 items | Low: multi-turn, harder to contaminate | GPT-4 judge; rho ~= 0.77 to human | Conversational quality → alignment | 80 questions yields wide CIs; judge biases
Chatbot Arena | 100K+ battles | Tiny on Elo ratings | Live data, post-training | Human preference; selection bias in user population | Human preference on diverse natural prompts | User-population selection bias; verbosity inflates win rates
TruthfulQA | 817 | ±2–3pp | Adversarially designed to resist memorisation | GPT-4 judge for free-form answers | Honesty on adversarial Qs → general honesty | Models learn the TruthfulQA distribution; generalisation untested
MATH / GSM8K | 5000 / 1319 | Small at n=5000 | Moderate: math in web text | Deterministic answer matching | Mathematical reasoning (specified domain) | Final-answer matching misses incorrect reasoning chains
SWE-bench | 2,294 (verified: 500) | < 2pp on verified set | GitHub history: partial overlap | Test-suite pass rate; deterministic | Real-world software engineering | Harness complexity: environment errors can mask failures
No widely-used benchmark achieves high trust on all five dimensions simultaneously. The best-performing are Chatbot Arena (live data, large n, diverse prompts) and SWE-bench (deterministic oracle, real engineering tasks). MMLU — the most commonly cited — has high contamination risk and a large construct validity gap.