Technical Memorandum
Optimisation Theory · Probabilistic Methods
February 2026
Mathematical Deep Dive

Bayesian Optimisation: A Complete Mathematical Treatment — surrogate models, acquisition functions,
Gaussian processes, convergence, and the geometry of sequential decisions

Abstract. Bayesian optimisation is a sequential, model-based strategy for the global optimisation of an expensive black-box function f : 𝒳 → ℝ. It maintains a probabilistic surrogate model — typically a Gaussian process — over f, and uses this model to define an acquisition function that directs sampling toward regions of high expected improvement. We derive the core mathematics: the Gaussian process posterior, the three canonical acquisition functions (EI, PI, UCB), kernel design and its relationship to function priors, marginalisation of hyperparameters, convergence in terms of cumulative regret, and the precise conditions under which BO outperforms gradient-based methods.
§ 1
The Problem Formulation
What BO solves and why derivatives are unavailable

We seek the global maximiser of a function f over a compact domain 𝒳 ⊆ ℝᵈ:

Problem statement x* = arg max_{x ∈ 𝒳} f(x)

The function f has three properties that make gradient descent inapplicable:

Property 1
Expensive to evaluate. Each query f(x) carries a significant cost — training a neural network, running a physics simulation, performing a laboratory experiment. We have a budget of T evaluations, typically T ∈ [10, 500] — many orders of magnitude fewer than the iteration counts available to gradient descent.
Property 2
Black-box and non-differentiable. We cannot compute ∂f/∂x. The function may be computed by an external process, involve discrete operations, or simply not be expressed in closed form. Gradient information is unavailable by definition.
Property 3
Possibly multi-modal. f may have multiple local maxima. Methods that follow local curvature will get trapped. We require a strategy with global coverage guarantees.

Bayesian optimisation addresses all three by building a probabilistic model of f from past evaluations, then using this model to choose the next query point — the one that best balances exploring unknown regions against exploiting regions known to be promising.

Exhibit 1 — The Bayesian Optimisation Loop (a four-component sequential cycle)

  Initialise: D₀ from a Latin hypercube design.
  1. Surrogate model — Gaussian process posterior p(f | x₁..xₙ, y₁..yₙ), giving posterior mean μₙ(x) and posterior variance σ²ₙ(x).
  2. Acquisition function — α(x) (EI, UCB, or PI), cheap to optimise; x_{n+1} = arg max α(x) via a gradient-free inner optimisation.
  3. True evaluation — y_{n+1} = f(x_{n+1}) + ε, ε ~ 𝒩(0, σ²_noise); the expensive oracle call (budget of T calls in total).
  4. Posterior update — Bayes' rule update Dₙ ← Dₙ ∪ {(x_{n+1}, y_{n+1})}; O(n³) GP refit; n ← n + 1.
  Repeat until the budget is exhausted. Output: x̂* = arg max f(xᵢ) over all observed points.
The loop has two optimisations: the outer loop (expensive, querying f) and an inner loop (cheap, maximising α(x) over the surrogate). The inner optimisation uses gradient-based or evolutionary methods — it is cheap because α is analytic and differentiable even when f is not.
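A minimal end-to-end sketch of this loop in NumPy, under illustrative assumptions: a 1-D toy objective stands in for the expensive oracle, an RBF surrogate with fixed hyperparameters replaces hyperparameter fitting, and a dense candidate grid replaces the inner optimiser (the GP posterior and EI formulas used here are the ones derived in §2 and §4):

```python
import math
import numpy as np

def rbf(a, b, ell=0.2):
    # RBF kernel matrix between 1-D point sets a and b (prior variance 1)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(xq, X, y, jitter=1e-4):
    # closed-form GP posterior mean/variance at query points xq
    K = rbf(X, X) + jitter * np.eye(len(X))
    Ks = rbf(xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, f_best):
    # EI closed form: (mu - f*) Phi(Z) + sigma phi(Z)
    s = np.sqrt(var)
    z = (mu - f_best) / s
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mu - f_best) * Phi + s * phi

def f(x):
    # toy multi-modal objective standing in for the expensive oracle
    return np.sin(3.0 * x) + 0.5 * np.cos(7.0 * x)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0, 4)           # D0: small random initial design
y = f(X)
grid = np.linspace(0.0, 2.0, 400)      # candidate set for the cheap inner optimisation
for _ in range(15):                    # outer loop: expensive evaluations
    mu, var = gp_posterior(grid, X, y)
    x_next = grid[np.argmax(expected_improvement(mu, var, y.max()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
x_hat = X[np.argmax(y)]                # recommend the best observed point
```

Note the asymmetry the text describes: the inner arg max runs over 400 cheap surrogate evaluations per round, while f itself is called only 19 times in total.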
§ 2
The Gaussian Process Surrogate
Prior specification, posterior update, and the predictive distribution

A Gaussian process (GP) is a distribution over functions. It is the canonical surrogate for Bayesian optimisation because it provides a principled, closed-form posterior that quantifies both the predicted value and uncertainty at every unobserved point — precisely what the acquisition function requires.

Definition
Gaussian Process. We say f ~ GP(m, k) if any finite collection of function values [f(x₁), …, f(xₙ)] is jointly Gaussian distributed with mean vector [m(x₁), …, m(xₙ)] and covariance matrix [k(xᵢ, xⱼ)]. The mean function m(x) = 𝔼[f(x)] and kernel k(x, x') = Cov(f(x), f(x')) completely specify the distribution.

In practice we set m(x) = 0 (absorbed into the kernel's variance parameter) and focus on kernel design. Given n observations Dₙ = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ with yᵢ = f(xᵢ) + εᵢ, εᵢ ~ 𝒩(0, σ²_noise), the posterior over f at any test point x* is Gaussian with closed-form parameters:

GP Posterior — Predictive Distribution at x*

  f(x*) | Dₙ ~ 𝒩( μₙ(x*), σ²ₙ(x*) )

  μₙ(x*)  = k(x*, X) [K(X,X) + σ²_noise I]⁻¹ y                        ← posterior mean (prediction)
  σ²ₙ(x*) = k(x*, x*) − k(x*, X) [K(X,X) + σ²_noise I]⁻¹ k(X, x*)     ← posterior variance (uncertainty)

where:
  K(X,X)ᵢⱼ = k(xᵢ, xⱼ) is the n×n kernel matrix (Gram matrix)
  k(x*, X) is the 1×n cross-covariance vector
  y = [y₁, …, yₙ]ᵀ is the vector of observations

The posterior mean μₙ(x*) is the GP's best prediction of f(x*) given all data seen so far. The posterior variance σ²ₙ(x*) quantifies uncertainty: it equals the prior variance at x* minus the information gained from observations. Points near observed data have low variance; unexplored regions retain high variance.
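To make the two formulas concrete, a from-scratch sketch in NumPy (the RBF kernel, the specific data points, and the noise level are illustrative assumptions):

```python
import numpy as np

def k(a, b, ell=0.5):
    # RBF kernel with unit prior variance: k(x, x) = 1
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.array([0.2, 0.5, 0.9])           # observed inputs
y = np.array([0.1, 0.6, 0.3])           # noisy observations
noise = 1e-4                             # sigma^2_noise

# [K(X,X) + sigma^2_noise I]^-1, the n x n solve shared by both formulas
Kinv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))

def posterior(xs):
    Ks = k(xs, X)                                        # cross-covariance k(x*, X)
    mu = Ks @ Kinv @ y                                   # posterior mean
    var = 1.0 - np.einsum("ij,jk,ik->i", Ks, Kinv, Ks)   # posterior variance
    return mu, var

mu, var = posterior(np.array([0.5, 3.0]))
# var[0]: at an observed point, uncertainty collapses toward the noise level
# var[1]: far from all data, the variance returns to the prior variance of 1
```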

Notebook 1 — GP Surrogate: From Prior to Posterior Runnable Python

Implements a Gaussian Process from scratch (NumPy only). Visualises the prior → posterior progression as observations collapse uncertainty, and demonstrates the effect of noise on the posterior.

Computational cost. The critical bottleneck is the linear solve against [K(X,X) + σ²_noise I]: performed via a dense factorisation, it costs O(n³) time and O(n²) space. This is acceptable for small n (say, n ≤ 10⁴), but becomes prohibitive as the evaluation budget grows. Sparse GPs, inducing-point methods (SGPR), and Cholesky factorisations with rank-1 updates mitigate this in practice.
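A sketch of the standard mitigation within a single refit, in NumPy (the data here are synthetic): factor once with Cholesky, then reuse the factor for both the mean and the variance via cheap triangular solves, instead of forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.sort(rng.uniform(0.0, 5.0, n))
y = np.sin(X)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-3 * np.eye(n)

# naive: explicit inverse — O(n^3) and numerically fragile
alpha_inv = np.linalg.inv(K) @ y

# preferred: one O(n^3) Cholesky factorisation K = L L^T,
# then O(n^2) triangular solves, reusable for every prediction
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
```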

§ 3
Kernel Design and Function Priors
The kernel encodes all prior beliefs about f — smoothness, periodicity, scale

The kernel function k(x, x') is the only place where prior knowledge about f enters the model. It must be positive semi-definite (so that K(X,X) is always a valid covariance matrix). Every valid kernel corresponds to a specific class of functions that the GP prior places mass on.
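The four kernels covered in Notebook 2 can be sketched directly from their textbook formulas (NumPy only; the length-scales shown are arbitrary). Each produces a positive semi-definite Gram matrix, which the final computation checks numerically via eigenvalues:

```python
import numpy as np

def rbf(r, ell=1.0):
    # squared-exponential: infinitely differentiable sample paths
    return np.exp(-0.5 * (r / ell) ** 2)

def matern32(r, ell=1.0):
    # Matérn nu = 3/2: once-differentiable sample paths
    a = np.sqrt(3.0) * np.abs(r) / ell
    return (1.0 + a) * np.exp(-a)

def matern52(r, ell=1.0):
    # Matérn nu = 5/2: twice-differentiable sample paths
    a = np.sqrt(5.0) * np.abs(r) / ell
    return (1.0 + a + a ** 2 / 3.0) * np.exp(-a)

def periodic(r, ell=1.0, p=1.0):
    # exp-sine-squared: exactly periodic sample paths with period p
    return np.exp(-2.0 * np.sin(np.pi * np.abs(r) / p) ** 2 / ell ** 2)

X = np.linspace(0.0, 3.0, 40)
R = X[:, None] - X[None, :]              # pairwise differences
min_eig = {fn.__name__: np.linalg.eigvalsh(fn(R)).min()
           for fn in (rbf, matern32, matern52, periodic)}
# every minimum eigenvalue is >= 0, up to floating-point round-off
```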

Notebook 2 — Kernel Design & Function Priors Runnable Python

Implements RBF, Matérn 5/2, Matérn 3/2, and Periodic kernels from scratch. Visualises how each kernel shapes the GP prior, how length-scale controls smoothness, and how kernels compose.

§ 4
Acquisition Functions
The mathematical encoding of exploration vs. exploitation

The acquisition function α : 𝒳 → ℝ is a function of the GP posterior — cheap to evaluate and to optimise — that quantifies how useful querying f at a candidate point x would be. All canonical acquisition functions are derived from the posterior predictive distribution 𝒩(μₙ(x), σ²ₙ(x)).

Define f* = max{y₁,…,yₙ} as the current best observed value. Then:

The three canonical acquisition functions — full derivations

1. Probability of Improvement (PI) — Kushner, 1964

   α_PI(x) = P( f(x) > f* + ξ ) = Φ( (μₙ(x) − f* − ξ) / σₙ(x) )

   Φ is the standard normal CDF; ξ ≥ 0 is a trade-off parameter. PI measures only the probability of improvement, ignoring its magnitude. Greedy (ξ = 0) → exploits; large ξ → forces exploration.

2. Expected Improvement (EI) — Mockus, 1978

   α_EI(x) = 𝔼[ max(f(x) − f*, 0) ] = (μₙ(x) − f*) · Φ(Z) + σₙ(x) · φ(Z),   where Z = (μₙ(x) − f*) / σₙ(x)

   φ is the standard normal PDF. The closed form arises from integrating max(f − f*, 0) against the Gaussian predictive distribution 𝒩(μₙ(x), σ²ₙ(x)). The first term rewards exploitation (high mean); the second rewards exploration (high σ). EI vanishes as σₙ(x) → 0 (the point is already observed exactly) or as μₙ(x) falls far below f*.

3. Upper Confidence Bound (UCB) — Srinivas et al., 2010

   α_UCB(x) = μₙ(x) + β^{1/2} · σₙ(x)

   β > 0 is the exploration parameter, controlling the confidence width. β → 0: pure exploitation (follow the mean). β → ∞: pure exploration (follow σ). Optimal schedule (Srinivas et al., 2010): β_t = 2 log(|𝒳| t² π² / 6δ) for finite 𝒳, or β_t = 2 log(t^{d/2+2} π² / 3δ) for continuous 𝒳 ⊆ ℝᵈ. UCB has the strongest theoretical convergence guarantees of the three.
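The three closed forms translate directly into code (scalar versions, Python standard library only; Φ is computed via the error function):

```python
import math

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    # standard normal PDF
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def alpha_pi(mu, sigma, f_best, xi=0.0):
    # Probability of Improvement (Kushner, 1964)
    return Phi((mu - f_best - xi) / sigma)

def alpha_ei(mu, sigma, f_best):
    # Expected Improvement (Mockus, 1978)
    z = (mu - f_best) / sigma
    return (mu - f_best) * Phi(z) + sigma * phi(z)

def alpha_ucb(mu, sigma, beta=2.0):
    # Upper Confidence Bound (Srinivas et al., 2010)
    return mu + math.sqrt(beta) * sigma

# e.g. mu = 1, sigma = 1, f* = 0 gives EI = Phi(1) + phi(1), roughly 1.083
```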
Notebook 3 — Acquisition Functions & Full BO Loop Runnable Python

Implements EI, PI, and UCB from scratch. Visualises all three on the same GP posterior, runs a complete BO loop on a multi-modal function, compares BO vs random search, and demonstrates the exploration–exploitation spectrum.

§ 5
The Exploration–Exploitation Spectrum
How β in UCB and ξ in EI control the trade-off — and the optimal schedule

Every acquisition function embeds a trade-off between querying where the mean is high (exploitation) and where uncertainty is high (exploration). The key insight is that both are necessary: a purely exploitative strategy converges to a local optimum near the initial observations; a purely exploratory strategy wastes evaluations uniformly.

Exhibit 5 — Exploration–Exploitation Spectrum via the UCB β Parameter (β controls the confidence width)

β → 0 — pure exploitation
  α_UCB ≈ μₙ(x): always query where the posterior mean is highest. Greedy. Converges quickly to a local optimum near the initial observations and will miss the global maximum if the initialisation is poor.

β = 0.1–1.0 — mild exploration (exploit / explore)
  Slight preference for unexplored regions when mean values are comparable. Good for smooth, well-behaved functions where the cost of exploration is low relative to the improvement gained.

β ≈ 2 log(t) — ★ optimal schedule (balanced; increases with t)
  Srinivas et al. (2010): setting β_t = 2 log(t^{d/2+2} π² / 3δ) yields sublinear cumulative regret R_T = O(√(T · β_T · γ_T)) with probability 1 − δ. β grows with time to maintain coverage as 𝒳 is increasingly explored.

β = 5–10 — heavy exploration
  Strongly prefers uncertain regions. Useful in early iterations (small n, poor coverage) or when f is highly multi-modal and missing the global optimum is catastrophic.

β → ∞ — pure exploration
  The arg max of α_UCB approaches the arg max of σₙ(x): always query the most uncertain point, i.e. maximum-variance sampling. This wastes evaluations in regions already confirmed to have low f; exploration without exploitation degenerates toward uniform space-filling.

The EI analogue is ξ, via α_EI(x; ξ) = (μₙ(x) − f* − ξ) Φ(Z) + σₙ(x) φ(Z) with Z = (μₙ(x) − f* − ξ) / σₙ(x). A large ξ demands that f(x) exceed f* by more than ξ before improvement counts, which forces exploration of uncertain regions. In practice, ξ = 0.01 is the most common default for EI (Lizotte 2008).
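The continuous-domain schedule from the exhibit can be evaluated directly; a short sketch (the d and δ values are arbitrary choices):

```python
import math

def beta_t(t, d=2, delta=0.1):
    # Srinivas et al. (2010), continuous domain in R^d:
    # beta_t = 2 log( t^(d/2 + 2) * pi^2 / (3 * delta) )
    return 2.0 * math.log(t ** (d / 2.0 + 2.0) * math.pi ** 2 / (3.0 * delta))

# the confidence width sqrt(beta_t) grows only logarithmically with t,
# so exploration pressure rises slowly as the domain gets covered
widths = [math.sqrt(beta_t(t)) for t in (1, 10, 100, 1000)]
```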
§ 6
Hyperparameter Marginalisation
Why integrating out θ is correct — and how it's approximated in practice

The GP has hyperparameters θ = {ℓ, σ_f, σ_n, …} (length-scale, signal variance, noise variance, etc.). Most implementations fit θ by maximum marginal likelihood (type-II MLE) and treat it as fixed. This is a point estimate, which can be overconfident when data is sparse.

The fully Bayesian approach places a prior p(θ) over hyperparameters and integrates them out:

Fully Marginalised Acquisition — integrating over GP hyperparameters

  α(x | Dₙ) = ∫ α(x | Dₙ, θ) · p(θ | Dₙ) dθ

where p(θ | Dₙ) ∝ p(y | X, θ) · p(θ) is the posterior over hyperparameters. This integral has no closed form and is approximated by:
  (a) MCMC sampling: α(x) ≈ (1/S) Σₛ α(x | Dₙ, θ⁽ˢ⁾), θ⁽ˢ⁾ ~ p(θ | Dₙ), e.g. via HMC
  (b) Slice sampling (Spearmint; Snoek et al., 2012)
  (c) Laplace approximation: p(θ | Dₙ) ≈ 𝒩(θ_MAP, H⁻¹), with H = −∇²_θ log p(θ | Dₙ) evaluated at θ_MAP
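A sketch of approximation (a), with a deliberately simple stand-in for MCMC: the posterior over a single length-scale hyperparameter is computed on a grid under an assumed flat prior, and EI is averaged under those weights. The data, noise level, and grid range are all illustrative.

```python
import math
import numpy as np

X = np.array([0.1, 0.4, 0.8])            # observed inputs
y = np.array([0.0, 0.7, 0.2])            # observed values
noise = 1e-2                              # fixed noise variance; only ell is marginalised

def K(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def log_marginal(ell):
    # log p(y | X, ell) for a zero-mean GP
    Ky = K(X, X, ell) + noise * np.eye(len(X))
    _, logdet = np.linalg.slogdet(Ky)
    return -0.5 * (y @ np.linalg.solve(Ky, y) + logdet + len(X) * math.log(2 * math.pi))

def ei_given(ell, xq, f_best):
    # EI under the GP posterior conditioned on a fixed ell
    Ky = K(X, X, ell) + noise * np.eye(len(X))
    ks = K(xq, X, ell)
    mu = ks @ np.linalg.solve(Ky, y)
    var = np.maximum(1.0 - np.einsum("ij,ji->i", ks, np.linalg.solve(Ky, ks.T)), 1e-12)
    s = np.sqrt(var)
    z = (mu - f_best) / s
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return (mu - f_best) * Phi + s * np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)

# grid posterior p(ell | D) under a flat prior — a toy stand-in for HMC samples
ells = np.linspace(0.05, 1.0, 40)
logw = np.array([log_marginal(l) for l in ells])
w = np.exp(logw - logw.max())
w /= w.sum()

xq = np.linspace(0.0, 1.0, 50)
alpha = sum(wi * ei_given(li, xq, y.max()) for wi, li in zip(w, ells))
```

The averaged α hedges across plausible length-scales instead of committing to a single (possibly overconfident) point estimate, which is exactly the failure mode described below for Type-II MLE.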

Point Estimate (Type-II MLE)

θ* = arg max_θ log p(y | X, θ). Fast: single L-BFGS run. Used in BoTorch, GPyOpt defaults. Overconfident: the acquisition function behaves as if θ* is known exactly, which underestimates uncertainty in θ — especially dangerous with few observations (n < 20). Can lead to convergence to a suboptimal local maximum.

Full Marginalisation (MCMC)

Treats GP hyperparameters as random variables with a prior. Averages the acquisition function over the posterior p(θ | Dₙ). More expensive but provides correct uncertainty quantification. Particularly important when the length-scale is poorly identified from few points. SPEARMINT used this; GPflowOpt and BoTorch support it via NUTS/HMC.

When marginalisation matters most: in very low-data regimes (n ≤ 10), when the kernel structure is uncertain, or when the acquisition function is used to make high-stakes decisions. In typical hyperparameter tuning with budgets of T = 50–200, Type-II MLE performs comparably to full marginalisation and is far cheaper.

§ 7
Convergence Theory — Regret Analysis
What theoretical guarantees actually say — and what they do not

Convergence of BO is analysed through the lens of regret — the gap between the values queried and the true optimum. There are two distinct notions:

Simple Regret r_T
r_T = f(x*) − max_{t≤T} f(x_t)
The gap between the true maximum and the best point found after T evaluations. Measures final quality. A Bayesian optimiser has converged when r_T → 0. This is the quantity optimised in the recommendation step: return x̂* = arg max_{t≤T} f(x_t).
Cumulative Regret R_T
R_T = Σ_{t=1}^T [ f(x*) − f(x_t) ]
The total suboptimality accumulated over all T queries. Used to bound simple regret: r_T ≤ R_T / T. An algorithm is no-regret if R_T / T → 0. This requires that the algorithm continues to improve — not just one lucky query.
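Both quantities, and the inequality linking them, fit in a few lines (the query values below are made up for illustration):

```python
import numpy as np

f_star = 1.0                                         # true optimum f(x*) (known for this toy)
f_queries = np.array([0.2, 0.5, 0.9, 0.85, 0.99])    # f(x_t) for t = 1..T

T = len(f_queries)
simple_r = f_star - np.maximum.accumulate(f_queries)   # r_t: gap of the best-so-far point
cumulative_R = np.cumsum(f_star - f_queries)           # R_t: total accumulated gap

# r_T <= R_T / T: the smallest single-query gap is at most the average gap
```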
UCB Regret Bound — Srinivas, Krause, Kakade, Seeger (2010), ICML Best Paper

With probability ≥ 1 − δ, UCB-BO with β_t = 2 log(|𝒳| t² π² / 6δ) satisfies

  R_T ≤ √( C₁ · T · β_T · γ_T )

where γ_T is the maximum information gain achievable by T points about f, and C₁ = 8 / log(1 + σ⁻²_n) is a constant depending on the noise level. γ_T depends on the kernel:

  RBF kernel:      γ_T = O( (log T)^{d+1} )               ← sublinear; fast convergence
  Matérn-ν kernel: γ_T = O( T^{d/(2ν+d)} · (log T)^s )    ← depends on the smoothness ν
  Linear kernel:   γ_T = O( d log T )                     ← d-dimensional linear functions

For the RBF kernel this gives R_T = O( √T · (log T)^{d+1} ), so R_T / T → 0 (no-regret), and the simple regret satisfies r_T ≤ R_T / T = O( (log T)^{d+1} / √T ) → 0 as T → ∞.

What the theorem does not say: it does not provide useful bounds for the finite budgets typical in practice (T = 50). The constants hidden in the O(·) notation are large and problem-dependent. The bound holds for the exact GP posterior with known hyperparameters — in practice, hyperparameters are estimated, which violates the theoretical assumptions. Convergence guarantees are asymptotic results; practical performance is empirical.

Maximum Information Gain γ_T — the kernel-specific bottleneck

  γ_T = max_{A ⊂ 𝒳, |A| = T} I(y_A ; f),   I(y_A ; f) = ½ log | I + σ⁻²_n K(A,A) |

I(y_A ; f) is the mutual information between the T observations y_A and the function f; γ_T measures how quickly observations "fill in" the function. Smoother kernels (RBF) have low γ_T — a few points reveal a lot — while rougher kernels have high γ_T. The rapid growth of γ_T with the dimension d is why BO degrades in high dimensions.
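The log-determinant form makes the information gain computable for any fixed design A; a sketch comparing a smooth and a rough kernel on the same 30-point design (the length-scales and noise level are chosen for illustration):

```python
import numpy as np

def info_gain(X, ell, noise_var=0.01):
    # 0.5 * log det( I + sigma_n^-2 * K(A,A) ) for the fixed design A = X
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell ** 2)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise_var)
    return 0.5 * logdet

A = np.linspace(0.0, 5.0, 30)
gain_smooth = info_gain(A, ell=2.0)   # long length-scale: observations are redundant
gain_rough = info_gain(A, ell=0.1)    # short length-scale: each observation is informative
# gain_rough exceeds gain_smooth: rougher priors need more samples to pin f down
```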
§ 8
When BO Dominates Gradient Descent — and When It Doesn't
The decision boundary is a function of dimensionality, budget, and differentiability

This is a mathematical decision, not a heuristic one. The crossover point between BO and gradient-based methods is determined by three quantities: the evaluation budget T, the input dimension d, and gradient availability.

Exhibit 6 — The BO vs. GD Decision Boundary, as a function of evaluation budget and dimensionality
[Chart: input dimension d (2 → 50+) against evaluation budget T (10 → 1000+). BO dominates at low d with expensive f and no gradient available (T ≤ ~500 realistic): hyperparameter tuning (d ≤ 10), NAS, drug discovery. GD dominates at high d with cheap f and gradients available (T ≫ 10⁴ feasible): neural-network weight training. Approximate crossover boundary: BO effective when d ≤ 20 and T ≤ 500.]
The curse of dimensionality strikes BO through γ_T: in high dimensions, the number of observations needed to adequately cover 𝒳 grows exponentially. The practical threshold is approximately d ≤ 20 for standard GP-BO; for d > 20, random embeddings (REMBO), additive decompositions, or gradient-enhanced GPs become necessary.
Regime | Use BO? | Mathematical reason
d ≤ 20, T ≤ 500, ∇f unavailable | ✓ Yes, GP-BO | γ_T is sublinear in T; the GP posterior is informative; the acquisition function is far cheaper to optimise than f is to query
d ≤ 5, T ≤ 30 | ✓ Yes, strongly | Low d → the GP posterior is tight with few points; EI efficiently concentrates evaluations near x*
20 < d ≤ 100, T ≤ 1000 | ⚠ BO with tricks | REMBO (random embeddings), HeSBO, or additive GP decompositions extend BO to moderate d; vanilla GP-BO fails
d > 100, ∇f available | ✗ Use GD | GD amortises per-iteration cost over the full gradient; BO's O(n³) GP cost cannot compete at scale
f differentiable, T > 10⁴ feasible | ✗ Use GD | Each GD step is O(d) versus BO's O(n³) plus an inner optimisation; the gradient is a far more informative signal per query
f noisy (σ_n large), d ≤ 10 | ✓ BO handles noise | The GP naturally models observation noise σ²_n; the posterior mean is an automatic smoothed estimate. GD on noisy f is ill-posed without careful variance reduction (SVRG, SAG)
f is a neural network (d = 10⁶+) | ✗ Definitively GD | A GP over 10⁶-dimensional inputs is computationally infeasible; Adam + backprop is the only viable strategy. BO applies to the network's hyperparameter space, not its weights
§ 9
Extensions and Active Research Directions
Beyond the standard formulation — the mathematical frontier
Exhibit 7 — Insights Pyramid: From the Standard Formulation to Open Research (layered from known to frontier)

STANDARD — Standard GP-BO: GP surrogate + EI/UCB acquisition + MLE hyperparameters
  The formulation of §2–5. Works for d ≤ 10, T ≤ 200, and noise-free or mildly noisy f.

KNOWN EXTENSION — Batch BO: selecting q > 1 points per iteration (parallel evaluations)
  q-EI = 𝔼[ max( maxᵢ f(xᵢ) − f*, 0 ) ] is analytically intractable for q > 1; it is approximated by Monte Carlo or by fantasising outcomes. Used when evaluations can run in parallel (e.g., a k8s job farm).

ACTIVE AREA — Constrained BO: optimise f(x) subject to cᵢ(x) ≥ 0
  Model each constraint with a separate GP: p(cᵢ(x) ≥ 0 | Dₙ). The acquisition becomes α(x) · P(feasible | x). SCBO (scalable constrained BO) adds trust regions for stability.

ACTIVE AREA — High-dimensional BO: REMBO, ALEBO, additive BO, TuRBO (trust regions)
  REMBO optimises in a random low-dimensional subspace, mapping z ∈ ℝᵏ (k ≪ d) up to 𝒳 ⊆ ℝᵈ via x = Az with a random matrix A ∈ ℝ^{d×k}; effective if f has low intrinsic dimensionality. TuRBO restricts BO to a trust region that shrinks or grows with success, avoiding the exploration–exploitation pathology in high d.

FRONTIER — Non-GP surrogates: random forests (SMAC), TPE, deep kernel learning
  SMAC (Hutter et al., 2011) uses random forests as the surrogate, handling categorical inputs and conditional hyperparameters natively. TPE (Bergstra et al., 2011) models p(x | y) rather than p(y | x). DKL (Wilson et al., 2016) feeds a deep-network feature extractor into a GP, combining scalability with posterior quality. These abandon closed-form posteriors for scalability.

OPEN FRONTIER — BO for LLMs: gradient-free fine-tuning, prompt optimisation, RLHF loop design
  As LLM evaluation becomes the expensive oracle — an A/B test, human evaluation, downstream task performance — BO over the prompt/adapter/configuration space is an active area. Key challenge: the response variable is itself a distribution, not a scalar. Multi-objective BO (Pareto-front methods) handles simultaneous optimisation of capability, safety, and efficiency metrics without reduction to a single scalar.
The deepest open question in BO theory: for what function classes does the cumulative regret of EI satisfy the same sublinear bound as UCB? UCB enjoys O(√(T · γ_T · log T)) guarantees under mild conditions. EI's convergence is harder to prove because it does not control the exploration–exploitation trade-off through an explicit parameter — yet empirically EI often outperforms UCB. The gap between theory and practice here is substantial.