Mathematical Deep Dive · Measurement Theory · NLP Evaluation · February 2026

LLM Evaluation
A mathematical treatment of what benchmark scores measure, how they break, and when to trust them

Sections: 9 · Exhibits: 8 · Scope: measurement theory, contamination, calibration, construct validity, trust matrix
Abstract. An LLM evaluation is a measurement instrument. Like all instruments, it has a signal model, a noise floor, systematic biases, and a validity boundary beyond which its readings are meaningless. This document derives the mathematics of each: the decomposition of a benchmark score into signal, variance, and contamination terms; inter-rater reliability measures for human evaluation; the calibration gap between stated confidence and empirical accuracy; Goodhart's Law formalised as optimisation-induced proxy decoupling; and the construct validity chain from observable behavior to inferred capability to alignment. The goal is to give practitioners the formal tools to characterise precisely what a score means — and when it means nothing.
§ 1
The Evaluation Taxonomy
Three distinct measurement objects — capability, alignment, deployment behavior

The LLM evaluation literature conflates three fundamentally distinct objects, often without acknowledging that they require different measurement instruments, different validity assumptions, and different statistical treatments. Before measuring, one must define precisely what is being measured.

Exhibit 1 — Evaluation Taxonomy: Three Objects, Five Modes, One Critical Error full map
CAPABILITY
Knowledge breadth: MMLU, TriviaQA, NaturalQuestions
Reasoning depth: GSM8K, MATH, ARC, BIG-Bench
Code generation: HumanEval, MBPP, SWE-bench
Language understanding: GLUE, SuperGLUE, HellaSwag
Long-context recall: NIAH (needle-in-a-haystack)
Instruction following: IFEval, FollowBench
Measure: automated, deterministic. Ground truth: objective. Primary threat: contamination.

ALIGNMENT
Harmlessness: ToxiGen, BBQ, WinoBias
Honesty / calibration: TruthfulQA, ECE measurement
Helpfulness: MT-Bench, AlpacaEval, Chatbot Arena
Value alignment: ETHICS benchmark, ValueBench
Refusal appropriateness: StrongREJECT, WildGuard
Sycophancy: TruthfulQA adversarial, Perez et al. 2023
Measure: human + LLM-as-judge. Ground truth: contested, normative. Primary threat: label ambiguity, low IRR.

DEPLOYMENT BEHAVIOR
Latency / throughput: TTFT, tokens/sec (operational, not eval quality)
Perturbation robustness: PromptBench, AdvGLUE
Consistency / variance: sigma across temperatures and seeds
Task-specific accuracy: domain benchmarks, A/B tests
Cost-quality frontier: Pareto curve vs. API cost
Human preference in context: live A/B tests, RLHF reward models
Measure: production systems. Ground truth: business outcome. Primary threat: distribution shift.
The critical error in LLM evaluation practice: treating all three columns as interchangeable. A model that tops MMLU (capability) can simultaneously fail TruthfulQA (alignment) and underperform in production (deployment). Scores from different columns cannot be averaged, ranked, or compared without an explicit aggregation model that assigns weights — and that weighting is a normative judgment, not a technical one.
§ 2
The Measurement Model
Decomposing a benchmark score into signal, variance, and bias terms

An evaluation score is not a direct reading of model capability. It is a measurement with a noise floor, systematic biases, and a construct validity gap. The correct model is:

Score Decomposition — what a benchmark score S actually contains
S = theta + epsilon_sampling + epsilon_prompt + delta_contamination + delta_construct
theta = true underlying capability on the measured construct
epsilon_sampling = random error from finite test set (n questions)
epsilon_prompt = variance from prompt wording, format, few-shot examples
delta_contamination = systematic positive bias from train/test overlap
delta_construct = systematic bias from construct invalidity (benchmark does not measure what it claims to measure)

Sampling error. For a binary accuracy metric on n i.i.d. questions with true accuracy p, the observed accuracy S = k/n is the MLE. By the CLT:

Sampling Variance — confidence interval on benchmark accuracy
SE(S) = sqrt( p(1-p) / n ) ~= sqrt( S(1-S) / n )
95% CI: S +- 1.96 * sqrt( S(1-S) / n )
n=1000, S=0.85: SE ~= 0.011 | 95% CI = [0.828, 0.872] (tight)
n=100, S=0.85: SE ~= 0.036 | 95% CI = [0.780, 0.920] (useless)
Most MMLU subcategories: n ~= 100-300. Many reported improvements lie within SE.
Minimum detectable difference at n=1000, p=0.85: delta_min ~= 2*SE ~= 2.2pp
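The interval above can be computed directly as a sanity check. A minimal sketch (function name is illustrative, not from any library):

```python
import math

def accuracy_ci(s, n, z=1.96):
    """Normal-approximation standard error and CI for accuracy s on n i.i.d. items."""
    se = math.sqrt(s * (1 - s) / n)          # SE(S) ~= sqrt(S(1-S)/n)
    return se, (s - z * se, s + z * se)

se_big, ci_big = accuracy_ci(0.85, 1000)     # tight interval
se_small, ci_small = accuracy_ci(0.85, 100)  # interval too wide to rank models
min_detectable = 2 * se_big                  # ~2.2pp at n=1000
```

Running the two cases reproduces the numbers in the block above; any reported improvement smaller than `min_detectable` is indistinguishable from sampling noise.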

Prompt variance. The same model on the same questions can produce accuracy scores varying by 5–15 percentage points across prompt formulations, few-shot example choice, and system prompt content. This variance is rarely reported. A model's score on a benchmark is implicitly conditioned on a specific prompt template — a hidden degree of freedom that is not part of the model's capability.

Prompt Sensitivity — score as a random variable over prompt space Pi
S(pi) = (1/n) sum_i 1[ f(x_i; pi) = y_i ]
pi in Pi = prompt template (few-shot examples, instruction wording, output format)
E_pi[S(pi)] =/= theta in general
Var_pi[S(pi)] = prompt sensitivity — rarely reported, often large
Calibrated reporting: mean S_bar = E_pi[S(pi)] and std sigma_pi = sqrt(Var_pi[S(pi)]) over a distribution of k >= 5 reasonable prompt templates.
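Calibrated reporting over a prompt distribution reduces to a mean and sample standard deviation over per-template scores. A sketch with hypothetical scores:

```python
from statistics import mean, stdev

def prompt_sensitivity(template_scores):
    """Mean score and std (sigma_pi) over k >= 5 prompt templates."""
    if len(template_scores) < 5:
        raise ValueError("report over at least 5 templates")
    return mean(template_scores), stdev(template_scores)

# Hypothetical accuracies of one model under five prompt templates:
s_bar, sigma_pi = prompt_sensitivity([0.85, 0.81, 0.88, 0.79, 0.84])
```

Here sigma_pi is several percentage points — larger than the sampling SE at n=1000, which is why reporting a single-template score overstates precision.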
§ 3
Benchmark Contamination
The mathematics of train/test overlap and score inflation

Contamination is the most consequential systematic bias in LLM evaluation. If benchmark questions or their paraphrases appear in pretraining data, the model has effectively memorised answers rather than reasoning from capability. The score inflates by delta_c, which is invisible in the headline number.

Contamination Bias — formal decomposition and magnitude
c = |{i : dist(x_i, D_train) < tau}| / |B|   (contamination rate)
dist(x, D) = min_{d in D} edit_distance(x, d)   or   1 - max_{d in D} cos_sim(embed(x), embed(d))
S_c = c * p_mem + (1-c) * p_clean   (contaminated score)
delta_c = S_c - p_clean = c * (p_mem - p_clean)   (contamination bias)
p_mem = accuracy on contaminated questions (near-perfect if memorised)
p_clean = accuracy on uncontaminated questions (true capability)
Worst case: c = 0.30, p_mem = 0.95, p_clean = 0.70: delta_c = 0.30 * 0.25 = +7.5pp — invisible in the headline score.
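The decomposition is a two-line computation; a minimal sketch reproducing the worst case above:

```python
def contamination_bias(c, p_mem, p_clean):
    """S_c = c*p_mem + (1-c)*p_clean; delta_c = S_c - p_clean = c*(p_mem - p_clean)."""
    s_c = c * p_mem + (1 - c) * p_clean
    return s_c, s_c - p_clean

# Worst case from the block above: 30% contamination, near-perfect memorisation.
s_c, delta_c = contamination_bias(c=0.30, p_mem=0.95, p_clean=0.70)
```

The headline score of 0.775 looks like a capable model; 7.5 of those points are memorisation.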
Exhibit 2 — Contamination Detection Methods and Statistical Power four approaches
N-Gram Overlap (Exact)
c_exact = |{i : |ngrams(x_i) ∩ ngrams(D_train)| >= theta}| / |B|
Fast, deterministic. 13-gram match threshold is standard.
Misses semantic paraphrases — underestimates contamination.
Used by: OpenAI (GPT-4 report), most open model releases.
Limitation: same question with different phrasing evades detection entirely.
Embedding Similarity (Semantic)
c_sem = |{i : max_{d in D} cos(embed(x_i), embed(d)) >= tau}| / |B|
Catches paraphrases. tau in [0.85, 0.95] typical.
O(|B| x |D|) comparisons — expensive at web-corpus scale.
Embedding space may not capture answer-level similarity.
Better than n-gram but misses concept-level contamination.
Canary Insertion
Insert synthetic Q_canary into train; P(memorised) = P(model answers Q_canary correctly)
Prospective — must be done before training begins.
Cannot be applied retroactively to deployed models.
Provides controlled estimate of model's memorisation rate.
Gold standard for controlled studies; impractical for external evaluators.
Membership Inference Attack
LR(x) = log p_model(x) - log p_ref(x) >= lambda → x in train. Min-K% Prob: score only the lowest-probability tokens
Black-box: requires only API access. Retroactive.
High false positive rate — requires per-model calibration.
Min-K% Prob (Shi et al. 2024) more robust than mean perplexity.
Only available retroactive method for closed-weight models.
None of these methods achieves simultaneously high precision and high recall on the general contamination problem. The structural solution is dynamic benchmarks: LiveBench, LMSYS Chatbot Arena live evaluations, and similar approaches generate evaluation data after the model's training cutoff — making contamination structurally impossible.
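The exact-match detector in Exhibit 2 reduces to a set intersection over n-grams. A toy sketch — whitespace tokenisation for brevity; production pipelines hash 13-grams into scalable structures such as Bloom filters or suffix arrays rather than materialising the corpus set:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def exact_contamination_rate(benchmark, train_docs, n=13):
    """Fraction of benchmark items sharing at least one n-gram with training text."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc.split(), n)
    flagged = sum(1 for q in benchmark if ngrams(q.split(), n) & train_grams)
    return flagged / len(benchmark)
```

As the exhibit notes, this estimator is a lower bound: a paraphrase with zero shared 13-grams evades it entirely.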
§ 4
Inter-Rater Reliability and Human Evaluation
Cohen's kappa, Krippendorff's alpha, and where annotation breaks down

For alignment, safety, and preference collection, there is no deterministic ground truth — humans must judge. Human annotation introduces agreement noise that must be quantified before a label can be treated as a signal.

Cohen's kappa — agreement beyond chance, two raters, categorical labels
kappa = (P_o - P_e) / (1 - P_e)
P_o = observed agreement = sum_i p_ii
P_e = expected chance agreement = sum_i p_i. * p_.i   (product of row and column marginals)
Interpretation thresholds (Landis and Koch 1977):
kappa < 0.20: slight | [0.20, 0.40): fair | [0.40, 0.60): moderate | [0.60, 0.80): substantial | >= 0.80: near-perfect
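Cohen's kappa is computed directly from a two-rater confusion matrix of label counts. A minimal sketch (the example matrix is hypothetical):

```python
def cohens_kappa(counts):
    """kappa from a square matrix: counts[i][j] = items rater 1 labelled i, rater 2 labelled j."""
    n = sum(sum(row) for row in counts)
    k = len(counts)
    p_o = sum(counts[i][i] for i in range(k)) / n              # observed agreement
    row_marg = [sum(counts[i]) / n for i in range(k)]
    col_marg = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))       # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Binary labels: 85% raw agreement, but 50% expected by chance alone.
kappa = cohens_kappa([[40, 10], [5, 45]])
```

Raw agreement of 0.85 collapses to kappa = 0.70 once chance agreement is removed — "substantial" on the Landis and Koch scale, not near-perfect.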
Krippendorff's alpha — generalised IRR for k raters, any scale type
alpha = 1 - D_o / D_e
D_o = observed disagreement
D_e = expected disagreement by chance from the marginal label distribution
Handles missing data, ordinal/interval scales, and k > 2 raters.
Krippendorff's minima: alpha >= 0.667 tentative, alpha >= 0.800 reliable.
Exhibit 3 — Observed kappa/alpha Values Across LLM Evaluation Tasks where agreement lands in practice
Toxicity / harm (borderline cases)
kappa ~0.20-0.38
Fair at best. Disagreement concentrates on borderline cases — exactly the cases that matter most for safety decisions.
Helpfulness (user preference)
kappa ~0.40-0.55
Moderate. Raters agree on clearly good/bad responses but diverge on length, formality, depth vs. conciseness.
Factual accuracy (verifiable claims)
kappa ~0.60-0.75
Substantial for unambiguous factual claims. Degrades for domain expertise or compound statements.
Code correctness (pass/fail execution)
kappa ~0.85-0.95
Near-perfect — because raters are replaced by test runners. The lesson: replace human annotation with deterministic oracles wherever possible.
Math / logic (exact answer)
kappa ~0.88-0.97
Near-perfect at final-answer level. Degrades substantially when checking intermediate reasoning steps.
IRR is highest precisely when evaluation is least needed (problems with known answers) and lowest when evaluation matters most (safety, alignment, nuanced preference).
§ 5
LLM-as-Judge: Bias Structure and Validity Conditions
The mathematics of using a model to evaluate another model

Using a strong LLM as an automated judge has become the dominant approach for alignment and quality evaluation, replacing costly human annotation. The method has measurable failure modes that must be explicitly corrected.

Judge Score Decomposition — full bias structure
score_judge(A vs B) = theta_true + bias_position + bias_verbosity + bias_self_pref + bias_format + epsilon
bias_position ~ 10-20pp systematic preference for the response shown in position A
bias_verbosity = preference for longer responses independent of quality
bias_self_pref = judge prefers outputs stylistically similar to its own generations
bias_format = preference for markdown, headers, bullet points
Calibration: rho(score_judge, score_human) ~= 0.60-0.80 (Spearman) for a GPT-4 judge

Position Bias Correction

Compare A-then-B and B-then-A orderings; the debiased win rate is the geometric mean sqrt(p * q), where p is the win rate in position 1 and q the win rate in position 2. Cost: 2× evaluations. Verbosity bias is corrected separately (AlpacaEval 2.0 LC) by regressing response length out of the judge score: score_corrected = score - beta * length.
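Both corrections are one-liners. A sketch — beta is assumed to have been fitted beforehand on a calibration set, and the example win rates are hypothetical:

```python
import math

def debiased_win_rate(p_pos1, p_pos2):
    """Geometric mean of A's win rate shown first and shown second."""
    return math.sqrt(p_pos1 * p_pos2)

def length_corrected_score(raw_score, response_length, beta):
    """Verbosity correction: subtract the fitted length effect from the judge score."""
    return raw_score - beta * response_length

# A wins 64% when shown first but only 36% when shown second:
win = debiased_win_rate(0.64, 0.36)
```

The debiased rate of 0.48 reveals that the apparent 64% preference was mostly position bias: under the swap, A is actually slightly below parity.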

Self-Preference Validity Threat

A GPT-4 judge evaluating GPT-4 outputs has a circular validity problem: the judge's preferences are not independent of the evaluated model's generation distribution. Solution: use a judge from a different model family, or use multiple diverse judges.

§ 6
Calibration
Expected calibration error, reliability diagrams, and what RLHF does to uncertainty

A model is calibrated if its stated confidence equals its empirical accuracy: when it says it is 80% confident, it should be correct 80% of the time. Calibration is a property of uncertainty estimates, separate from accuracy.

Calibration Error Measures — ECE, MCE, and proper scoring rules
Expected Calibration Error (ECE):
ECE = sum_{b=1}^{B} (|B_b| / n) * |acc(B_b) - conf(B_b)|
B_b = samples with predicted confidence in bin b
ECE in [0,1]. Well-calibrated: ECE ~= 0.02-0.05.
Maximum Calibration Error (MCE):
MCE = max_b |acc(B_b) - conf(B_b)|
Proper scoring rules:
NLL = -(1/n) sum_i [ y_i * log(p_i) + (1-y_i) * log(1-p_i) ]
BS = (1/n) sum_i (p_i - y_i)^2   (Brier score)
Both are uniquely minimised by p_i = P(y_i=1|x_i). ECE is NOT a proper scoring rule.
Post-hoc calibration — temperature scaling:
T* = arg min_{T>0} NLL( sigma(logits / T), y )
A single parameter T fitted on a held-out calibration set; does not change accuracy.
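The binned ECE estimator above is short enough to write out in full. A minimal sketch with equal-width bins (the example confidences are hypothetical):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_b (|B_b|/n) * |acc(B_b) - conf(B_b)| over equal-width confidence bins."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        avg_acc = sum(correct[i] for i in idx) / len(idx)
        total += (len(idx) / n) * abs(avg_acc - avg_conf)
    return total

# Two populated bins: conf 0.95 with acc 1.0, and conf 0.55 with acc 0.5.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
```

Note the estimator is sensitive to `n_bins` — a known weakness of binned ECE, and one reason it is not a proper scoring rule.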
Exhibit 4 — Reliability Diagrams: Three Calibration Pathologies confidence vs empirical accuracy
Overconfident (common in LLMs): accuracy bars fall below the diagonal; ECE high.
Underconfident (post-RLHF typical): bars sit above the diagonal; the model hedges.
Well-calibrated (target state): bars straddle the diagonal; ECE < 0.05.
RLHF training systematically degrades calibration toward underconfidence. Temperature scaling corrects this without changing accuracy: find T* = arg min NLL(sigma(logits/T), y) on a held-out calibration set.
§ 7
Goodhart's Law — The Mathematics of Metric Collapse
When a measure becomes a target, the correlation to the underlying construct collapses

Goodhart's Law (1975), formalised for ML by Krakovna et al. (2020) and Gao et al. (2022), describes the failure mode of optimising for a proxy metric: as optimisation pressure increases, the proxy decouples from the underlying construct it was designed to measure.

Goodhart's Law — formal statement and overoptimisation model
Let M : Theta → R be a metric (e.g., MMLU accuracy, reward model score)
Let U : Theta → R be true utility (actual capability, alignment, safety)
Goodhart's Law: as M(theta) → M_max under optimisation pressure, Corr(M(theta), U(theta)) → 0
Formalised as overoptimisation (Gao et al. 2022):
U(theta) ~= alpha * sqrt( KL(theta || theta_0) ) - beta * KL(theta || theta_0)
True utility increases as sqrt(KL) (sublinear), then the linear -beta*KL term dominates and utility declines.
This is the theoretical basis for KL penalties in RLHF.
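The stated functional form has a closed-form utility peak: setting dU/dKL = alpha/(2*sqrt(KL)) - beta = 0 gives KL* = (alpha/(2*beta))^2. A sketch using this simplified form (the coefficients are illustrative, not fitted values from Gao et al.):

```python
import math

def true_utility(kl, alpha, beta):
    """U(theta) ~= alpha * sqrt(KL) - beta * KL (simplified overoptimisation form)."""
    return alpha * math.sqrt(kl) - beta * kl

def utility_peak_kl(alpha, beta):
    """dU/dKL = alpha/(2*sqrt(KL)) - beta = 0  =>  KL* = (alpha / (2*beta))**2."""
    return (alpha / (2 * beta)) ** 2

kl_star = utility_peak_kl(alpha=2.0, beta=1.0)  # utility peaks at KL* = 1.0
```

This is the quantity a well-tuned KL penalty targets: past KL*, the proxy metric keeps rising while true utility falls.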
Exhibit 5 — The Overoptimisation Curve: Metric vs. Utility Under Increasing KL Divergence Gao et al. 2022
Figure: value vs. KL divergence from the base policy. M(theta), the proxy metric, rises monotonically; U(theta), true utility, peaks at moderate KL (where a KL penalty keeps the policy) and then declines in the overoptimisation regime, where M stays high while U falls (reward hacking).
Four empirical LLM instances: (1) MMLU saturation; (2) sycophancy from RLHF; (3) HumanEval gaming; (4) TruthfulQA/RLHF confident falsehoods. The operational fix is regular benchmark rotation.
§ 8
Construct Validity
The chain from observable behavior to inferred meaning — four links that each can break

Construct validity (Cronbach and Meehl, 1955) is the degree to which an instrument measures the theoretical construct it purports to measure. In LLM evaluation, this is the core problem: does a score on benchmark B actually measure capability or alignment property C?

Exhibit 6 — Construct Validity Chain: Four Inference Levels from string output to claimed value
Observable behavior — the literal token sequence output
The string generated by the model for input x. Not the model's "reasoning," not its "beliefs" — only the token sequence.
MEASURED DIRECTLY
Surface task performance — accuracy on the specific benchmark format
Does the output match the expected answer format? Failure modes: MCQ vs. free-form format sensitivity. Evidence required: cross-format replication.
INFERRED — STEP 1
Capability — the underlying cognitive skill the benchmark probes
Threats: contamination, format gaming, shortcut learning. Evidence required: cross-benchmark correlation, transfer to novel instances.
INFERRED — STEP 2
Alignment — the normative property (safety, helpfulness, honesty)
MMLU measures knowledge breadth, not whether the model uses that knowledge helpfully. IRR is lowest at this level because it requires normative agreement.
INFERRED — STEP 3
Values / character — what the model does under adversarial pressure
No static benchmark can fully evaluate this level — it requires ongoing adversarial red-teaming, behavioral monitoring, and mechanistic interpretability.
INFERRED — STEP 4
Each inference step multiplies uncertainty. A high MMLU score (steps 1-2) says little about deployment alignment (step 4). Never report a capability score as evidence for an alignment claim without explicit justification of the inference steps.

Convergent and discriminant validity. A valid measure of construct C should: (a) correlate with other measures of the same construct (convergent validity) and (b) not correlate with measures of different constructs (discriminant validity). MMLU, ARC, and HellaSwag correlate at r ≈ 0.85–0.95 across models, supporting convergent validity for "language understanding." But helpfulness and harmlessness should be approximately independent — yet RLHF-trained models show a confounded tradeoff driven by sycophancy.
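Convergent validity reduces to a correlation check between models' scores on benchmarks claiming the same construct. A minimal sketch with hypothetical per-model scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two score vectors (one entry per model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores of four models on two "language understanding" benchmarks:
mmlu_scores = [0.45, 0.58, 0.70, 0.86]
arc_scores  = [0.50, 0.60, 0.74, 0.88]
r = pearson_r(mmlu_scores, arc_scores)  # high r supports convergent validity
```

The same function applied to a capability score and an alignment score should yield a much lower r; a high correlation there would itself be a validity red flag (the two measures are not discriminating).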

§ 9
When to Trust a Score — The Trust Matrix
Decision framework across five validity dimensions, with benchmark audit

A benchmark score is trustworthy only when all five validity dimensions are controlled. A score that falls in the low-trust column on three or more dimensions should not be used to make comparative capability claims.

Exhibit 7 — Five-Dimension Trust Matrix for Evaluation Scores decision framework
Dimension | High Trust | Medium Trust | Low Trust | Diagnostic
1. Sampling variance | n > 1000, SE < 1.5pp | n = 200–1000, SE 2–4pp | n < 200, SE > 4pp | SE = sqrt(S(1-S)/n); an improvement must exceed 2×SE
2. Prompt sensitivity | sigma_prompt < 1pp across 5+ templates | sigma_prompt 1–5pp, reported | sigma_prompt unreported or > 5pp | Run 5+ prompt variations; report mean and std
3. Contamination | Benchmark post-dates training cutoff | N-gram overlap tested; c < 5% | No contamination analysis performed | Min-K% Prob or n-gram overlap; a gap > 5pp indicates inflation
4. Annotation reliability | Deterministic oracle or kappa ≥ 0.80 | Human annotation, kappa in [0.60, 0.80) | Single annotator or kappa < 0.40 | For LLM-as-judge, report rho(judge, human) on ≥ 100 items
5. Construct validity | 3+ benchmarks of the same construct agree | Single benchmark; limitations documented | Capability score cited as alignment evidence | Cross-benchmark correlation for convergent validity
Reporting minimum for a trustworthy comparative claim: score ± SE(n), sigma_prompt over k≥5 templates, contamination rate c, IRR measure, and construct validity inference level.
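The decision rule can be mechanised as a first-pass filter. A sketch — the three-level labels and the "three or more low" rule are the matrix's; the function and key names are illustrative:

```python
def trust_verdict(dims):
    """dims: mapping of the five trust dimensions to 'high' | 'medium' | 'low'."""
    assert len(dims) == 5, "all five dimensions must be assessed"
    lows = sum(1 for v in dims.values() if v == "low")
    if lows >= 3:
        return "untrusted: do not use for comparative capability claims"
    if all(v == "high" for v in dims.values()):
        return "high trust"
    return "qualified: report SE, sigma_prompt, c, IRR, and inference level"

verdict = trust_verdict({
    "sampling": "high", "prompt": "low", "contamination": "low",
    "annotation": "low", "construct": "medium",
})
```

Applied to the example, three low-trust dimensions trigger the untrusted verdict regardless of the large n — exactly the pattern of a widely cited but poorly controlled benchmark.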
Exhibit 8 — Major Benchmarks Audited on All Five Dimensions current state of the field
Benchmark | n (test) | Sampling SE | Contamination | Ground truth / IRR | Construct claim | Primary failure mode
MMLU | 14,042 total; ~150/subcategory | ±2–4pp per subcategory | High: 2021 data, extensively crawled | MCQ, automated, deterministic | Knowledge breadth → claimed "general intelligence" | Contamination + construct leap from MCQ to reasoning
HumanEval | 164 | ±4–6pp on pass@1 | Moderate: GitHub in training data | Unit-test execution; near-perfect IRR | Code generation (narrow scope) | n=164 is very small; SWE-bench is harder and more realistic
MT-Bench | 80 questions | Large on 80 items | Low: multi-turn, harder to contaminate | GPT-4 judge; rho ~= 0.77 to human | Conversational quality → alignment | 80 questions yields wide CIs; judge biases
Chatbot Arena | 100K+ battles | Tiny on Elo ratings | Live data, post-training | Human preference; selection bias in user population | Human preference on diverse natural prompts | User-population selection bias; verbosity inflates win rates
TruthfulQA | 817 | ±2–3pp | Adversarially designed to resist memorisation | GPT-4 judge for free-form answers | Honesty on adversarial Qs → general honesty | Models learn the TruthfulQA distribution; generalisation untested
MATH / GSM8K | 5000 / 1319 | Small at n=5000 | Moderate: math in web text | Deterministic answer matching | Mathematical reasoning (specified domain) | Final-answer matching misses incorrect reasoning chains
SWE-bench | 2,294 (verified: 500) | < 2pp on verified set | GitHub history: partial overlap | Test-suite pass rate; deterministic | Real-world software engineering | Harness complexity: environment errors can mask failures
No widely-used benchmark achieves high trust on all five dimensions simultaneously. The best-performing are Chatbot Arena (live data, large n, diverse prompts) and SWE-bench (deterministic oracle, real engineering tasks). MMLU — the most commonly cited — has high contamination risk and a large construct validity gap.