Technical Memorandum
Optimisation Theory · Probabilistic Methods
February 2026
Mathematical Deep Dive
Bayesian Optimisation
A complete mathematical treatment — surrogate models, acquisition functions, Gaussian processes, convergence, and the geometry of sequential decisions
Abstract. Bayesian optimisation is a sequential, model-based strategy for the global optimisation of an expensive black-box function f : 𝒳 → ℝ. It maintains a probabilistic surrogate model — typically a Gaussian process — over f, and uses this model to define an acquisition function that directs sampling toward regions of high expected improvement. We derive the core mathematics: the Gaussian process posterior, the three canonical acquisition functions (EI, PI, UCB), kernel design and its relationship to function priors, marginalisation of hyperparameters, convergence in terms of cumulative regret, and the precise conditions under which BO outperforms gradient-based methods.
§ 1
The Problem Formulation
What BO solves and why derivatives are unavailable
We seek the global maximiser of a function f over a compact domain 𝒳 ⊆ ℝᵈ:
Problem statement:
x* = arg max_{x ∈ 𝒳} f(x)
The function f has three properties that make gradient descent inapplicable:
Property 1
Expensive to evaluate. Each query f(x) is costly: training a neural network, running a physics simulation, conducting a laboratory experiment. We have a budget of T evaluations, typically T ∈ [10, 500], many orders of magnitude fewer than the iterations available to gradient descent.
Property 2
Black-box and non-differentiable. We cannot compute ∂f/∂x. The function may be computed by an external process, involve discrete operations, or simply not be expressed in closed form. Gradient information is unavailable by definition.
Property 3
Possibly multi-modal. f may have multiple local maxima. Methods that follow local curvature will get trapped. We require a strategy with global coverage guarantees.
Bayesian optimisation addresses all three by building a probabilistic model of f from past evaluations, then using this model to choose the next query point — the one that best balances exploring unknown regions against exploiting regions known to be promising.
Exhibit 1 — The Bayesian Optimisation Loop (4-component sequential cycle)
The loop has two optimisations: the outer loop (expensive, querying f) and an inner loop (cheap, maximising α(x) over the surrogate). The inner optimisation uses gradient-based or evolutionary methods — it is cheap because α is analytic and differentiable even when f is not.
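The two nested optimisations can be sketched as a generic loop. This is a minimal structural sketch, not a full implementation: `fit_surrogate` and `acquisition` here are placeholder stand-ins (the real versions, a GP posterior and EI/UCB, are derived in the sections below), and the cheap inner argmax is a dense grid search over a 1-D domain.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                          # the expensive black-box oracle (a cheap stand-in here)
    return -(x - 0.3) ** 2

def bo_loop(f, bounds, T, fit_surrogate, acquisition, n_init=3):
    """Outer loop: T expensive queries of f. Inner loop: cheap argmax of alpha."""
    lo, hi = bounds
    X = list(rng.uniform(lo, hi, n_init))      # random initial design
    y = [f(x) for x in X]
    for _ in range(T - n_init):
        model = fit_surrogate(np.array(X), np.array(y))
        cand = np.linspace(lo, hi, 1001)       # inner optimisation: dense grid (1-D only)
        x_next = cand[np.argmax(acquisition(model, cand))]
        X.append(x_next)
        y.append(f(x_next))                    # the only expensive call per iteration
    best = int(np.argmax(y))
    return X[best], y[best]

# Placeholder components, clearly not real BO: the "surrogate" is just the data,
# and the "acquisition" is pure exploration (distance to the nearest observation).
def fit_surrogate(X, y):
    return (X, y)

def acquisition(model, cand):
    X_obs, _ = model
    return np.min(np.abs(cand[:, None] - X_obs[None, :]), axis=1)

x_best, y_best = bo_loop(f, (0.0, 1.0), T=15,
                         fit_surrogate=fit_surrogate, acquisition=acquisition)
```

Even with these trivial placeholders the skeleton shows the structure: every iteration pays one expensive call to f and arbitrarily many cheap evaluations of α.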
§ 2
The Gaussian Process Surrogate
Prior specification, posterior update, and the predictive distribution
A Gaussian process (GP) is a distribution over functions. It is the canonical surrogate for Bayesian optimisation because it provides a principled, closed-form posterior that quantifies both the predicted value and uncertainty at every unobserved point — precisely what the acquisition function requires.
Definition
Gaussian Process. We say f ~ GP(m, k) if any finite collection of function values [f(x₁), …, f(xₙ)] is jointly Gaussian distributed with mean vector [m(x₁), …, m(xₙ)] and covariance matrix [k(xᵢ, xⱼ)]. The mean function m(x) = 𝔼[f(x)] and kernel k(x, x') = Cov(f(x), f(x')) completely specify the distribution.
In practice we set m(x) = 0 (absorbed into the kernel's variance parameter) and focus on kernel design. Given n observations Dₙ = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ with yᵢ = f(xᵢ) + εᵢ, εᵢ ~ 𝒩(0, σ²ₙ), the posterior over f at any test point x* is Gaussian with closed-form parameters:
GP Posterior — Predictive Distribution at x*

f(x*) | Dₙ ~ 𝒩( μₙ(x*), σ²ₙ(x*) )

μₙ(x*)  = k(x*, X) [K(X,X) + σ²ₙ I]⁻¹ y                       ← posterior mean (prediction)
σ²ₙ(x*) = k(x*, x*) − k(x*, X) [K(X,X) + σ²ₙ I]⁻¹ k(X, x*)    ← posterior variance (uncertainty)

where:
  K(X,X)ᵢⱼ = k(xᵢ, xⱼ) is the n×n kernel matrix (Gram matrix)
  k(x*, X) is the 1×n cross-covariance vector
  y = [y₁, …, yₙ]ᵀ is the vector of observations
The posterior mean μₙ(x*) is the GP's best prediction of f(x*) given all data seen so far. The posterior variance σ²ₙ(x*) quantifies uncertainty: it equals the prior variance at x* minus the information gained from observations. Points near observed data have low variance; unexplored regions retain high variance.
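The closed-form posterior translates to a few lines of NumPy. A minimal sketch assuming an RBF kernel with fixed hyperparameters (length-scale ℓ = 0.2, unit signal variance); the names `rbf` and `gp_posterior` are ours, not a library API.

```python
import numpy as np

def rbf(A, B, ell=0.2, sf=1.0):
    """RBF kernel k(x, x') = sf^2 · exp(-(x - x')^2 / (2 ell^2)), 1-D inputs."""
    d = A[:, None] - B[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(Xs, X, y, noise=1e-4):
    """Closed-form posterior mean and variance at test points Xs given data (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))      # K(X,X) + sigma_n^2 I
    Ks = rbf(Xs, X)                              # cross-covariance k(x*, X)
    mu = Ks @ np.linalg.solve(K, y)              # posterior mean
    var = rbf(Xs, Xs).diagonal() - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 0.0)              # clamp tiny negatives from round-off

X = np.array([0.1, 0.4, 0.8])
y = np.sin(6 * X)
Xs = np.array([0.4, 3.0])                        # one observed point, one far away
mu, var = gp_posterior(Xs, X, y)
```

At the observed point x* = 0.4 the variance collapses to roughly the noise level and the mean interpolates the observation; at x* = 3.0, far from all data, the variance returns to the prior value of 1.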
Notebook 1 — GP Surrogate: From Prior to Posterior (Runnable Python)
Implements a Gaussian Process from scratch (NumPy only). Visualises the prior → posterior progression as observations collapse uncertainty, and demonstrates the effect of noise on the posterior.
RBF kernel implementation and Cholesky sampling from the GP prior
Closed-form posterior: μₙ(x*) and σ²ₙ(x*) with 0, 3, and 7 observations
Effect of noise variance σ²ₙ on posterior smoothness
Computational cost. The critical bottleneck is matrix inversion: computing [K(X,X) + σ²ₙ I]⁻¹ costs O(n³) time and O(n²) space. This is acceptable for small n (say, n ≤ 10⁴), but becomes prohibitive as the evaluation budget grows. Sparse GPs, inducing point methods (SGPR), and Cholesky decompositions with rank-1 updates mitigate this in practice.
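A minimal illustration of the Cholesky route: instead of forming [K(X,X) + σ²ₙ I]⁻¹ explicitly, factor once and reuse the factor for every solve; the log-determinant needed by the marginal likelihood falls out of the same factorisation for free. Toy data and fixed RBF hyperparameters.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 50))
y = np.sin(6 * X)

# Gram matrix with noise jitter on the diagonal
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / 0.2) ** 2) + 1e-4 * np.eye(50)

# Naive route: explicit inverse — O(n^3) and numerically fragile
alpha_naive = np.linalg.inv(K) @ y

# Preferred route: one Cholesky factorisation, reused for every right-hand side
c, low = cho_factor(K)
alpha_chol = cho_solve((c, low), y)

# log|K| for the marginal likelihood comes directly from the factor's diagonal
logdet = 2.0 * np.sum(np.log(np.diag(c)))
```

Both routes agree here, but the Cholesky version is roughly twice as fast, more stable, and amortises across the repeated solves a BO loop performs each iteration.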
§ 3
Kernel Design and Function Priors
The kernel encodes all prior beliefs about f — smoothness, periodicity, scale
The kernel function k(x, x') is the only place where prior knowledge about f enters the model. It must be positive semi-definite (so that K(X,X) is always a valid covariance matrix). Every valid kernel corresponds to a specific class of functions that the GP prior places mass on.
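Positive semi-definiteness can be sanity-checked numerically. A sketch with three stationary kernels written as functions of the distance r (length-scales fixed at illustrative values): the Gram matrix of each on a grid should have no significantly negative eigenvalues.

```python
import numpy as np

def rbf(r, ell=1.0):
    """Squared-exponential: infinitely differentiable sample paths."""
    return np.exp(-0.5 * (r / ell) ** 2)

def matern32(r, ell=1.0):
    """Matern 3/2: once-differentiable sample paths."""
    a = np.sqrt(3.0) * np.abs(r) / ell
    return (1.0 + a) * np.exp(-a)

def matern52(r, ell=1.0):
    """Matern 5/2: twice-differentiable sample paths (the BO default)."""
    a = np.sqrt(5.0) * np.abs(r) / ell
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

X = np.linspace(0, 5, 40)
R = np.abs(X[:, None] - X[None, :])
min_eigs = {}
for k in (rbf, matern32, matern52):
    # eigvalsh: all eigenvalues of the symmetric Gram matrix
    min_eigs[k.__name__] = np.linalg.eigvalsh(k(R)).min()
```

Sums and products of valid kernels are again valid kernels, which is how composite priors (e.g. periodic × RBF for locally periodic functions) are built.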
Notebook 2 — Kernel Design & Function Priors (Runnable Python)
Implements RBF, Matérn 5/2, Matérn 3/2, and Periodic kernels from scratch. Visualises how each kernel shapes the GP prior, how length-scale controls smoothness, and how kernels compose.
Kernel correlation profiles: k(r) vs distance for all four kernels
GP prior samples under each kernel — same randomness, different function classes
§ 4
Acquisition Functions
The mathematical encoding of exploration vs. exploitation
The acquisition function α : 𝒳 → ℝ is a function of the GP posterior — cheap to evaluate and to optimise — that quantifies how useful querying f at a candidate point x would be. All canonical acquisition functions are derived from the posterior predictive distribution 𝒩(μₙ(x), σ²ₙ(x)).
Define f* = max{y₁,…,yₙ} as the current best observed value. Then:
The three canonical acquisition functions — full derivations

1. Probability of Improvement (PI) — Kushner, 1964
   α_PI(x) = P( f(x) > f* + ξ ) = Φ( (μₙ(x) − f* − ξ) / σₙ(x) )
   Φ is the CDF of the standard normal; ξ ≥ 0 is a trade-off parameter.
   PI measures only the probability of improvement, ignoring its magnitude.
   Greedy (ξ = 0) exploits; large ξ forces exploration.

2. Expected Improvement (EI) — Mockus, 1978
   α_EI(x) = 𝔼[ max(f(x) − f*, 0) ] = (μₙ(x) − f*) · Φ(Z) + σₙ(x) · φ(Z),   Z = (μₙ(x) − f*) / σₙ(x)
   φ is the PDF of the standard normal. The closed form arises from integrating max(f − f*, 0) against the Gaussian predictive distribution 𝒩(μₙ, σ²ₙ).
   First term: exploitation (high mean → high reward). Second term: exploration (high σ → high reward).
   α_EI(x) = 0 only where σₙ(x) = 0 (already observed exactly); it decays to 0 as μₙ(x) falls far below f*.

3. Upper Confidence Bound (UCB) — Srinivas et al., 2010
   α_UCB(x) = μₙ(x) + β^{1/2} · σₙ(x)
   β > 0 is the exploration parameter (it controls the confidence width).
   β → 0: pure exploitation (follow the mean). β → ∞: pure exploration (follow σ).
   Schedule with guarantees (Srinivas 2010): β_t = 2 log(|𝒳| t² π² / 6δ) for finite 𝒳, or β_t = 2 log(t^{d/2+2} π² / 3δ) for continuous 𝒳 ⊆ ℝᵈ.
   UCB has the strongest theoretical convergence guarantees of the three.
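The three closed forms translate directly to code via the standard-normal CDF Φ and PDF φ from scipy.stats. A sketch with hypothetical posterior values at three candidate points; the PI/EI functions include the optional ξ offset from the definitions above.

```python
import numpy as np
from scipy.stats import norm

def pi(mu, sigma, f_best, xi=0.0):
    """Probability of Improvement: Phi((mu - f* - xi) / sigma)."""
    return norm.cdf((mu - f_best - xi) / sigma)

def ei(mu, sigma, f_best, xi=0.0):
    """Expected Improvement, closed form: (mu - f* - xi) Phi(Z) + sigma phi(Z)."""
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def ucb(mu, sigma, beta=4.0):
    """Upper Confidence Bound: mu + sqrt(beta) * sigma."""
    return mu + np.sqrt(beta) * sigma

# Hypothetical posterior at three candidates: one uncertain, one promising, one "known"
mu = np.array([0.0, 0.5, 0.9])
sigma = np.array([1.0, 0.3, 0.05])
f_best = 1.0
```

Note how EI rewards both a high mean and a high standard deviation: at fixed mean, more uncertainty always increases EI.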
Notebook 3 — Acquisition Functions & Full BO Loop (Runnable Python)
Implements EI, PI, and UCB from scratch. Visualises all three on the same GP posterior, runs a complete BO loop on a multi-modal function, compares BO vs random search, and demonstrates the exploration–exploitation spectrum.
PI, EI, UCB: same GP posterior → three different next points
Full Bayesian Optimisation loop with EI (step-by-step visualization)
BO vs Random Search convergence (20 runs, shaded IQR)
UCB β spectrum: β=0.01 (exploit) to β=20 (explore)
§ 5
The Exploration–Exploitation Trade-off
How β in UCB and ξ in EI control the trade-off — and the optimal schedule
Every acquisition function embeds a trade-off between querying where the mean is high (exploitation) and where uncertainty is high (exploration). The key insight is that both are necessary: a purely exploitative strategy converges to a local optimum near the initial observations; a purely exploratory strategy wastes evaluations uniformly.
Exhibit 5 — Exploration–Exploitation Spectrum via the UCB β Parameter (β controls the confidence width)

β → 0 · pure exploitation
α_UCB ≈ μₙ(x): always query where the posterior mean is highest. Greedy. Converges quickly to a local optimum near the initial observations. Will miss the global maximum if initialisation is poor.

β = 0.1–1.0 · mild exploration
A slight preference for unexplored regions when mean values are comparable. Good for smooth, well-behaved functions where exploration cost is low relative to the improvement gained.

β ≈ 2 log(t) · ★ optimal schedule (balanced; increases with t)
Srinivas et al. (2010): setting β_t = 2 log(t^{d/2+2} π²/3δ) yields sublinear cumulative regret R_T = O(√(T γ_T log T)) with probability 1 − δ. β grows with time to maintain coverage as 𝒳 is increasingly explored.

β = 5–10 · heavy exploration
Strongly prefers uncertain regions. Useful in early iterations (small n, poor coverage) or when f is highly multi-modal and missing the global optimum is catastrophic.

β → ∞ · pure exploration
α_UCB ∝ σₙ(x): always query the most uncertain point, i.e. maximum-variance sampling. Wastes evaluations in regions already confirmed to have low f. Exploration without exploitation is little better than random search.
The EI analogue: ξ in α_EI(x; ξ) = (μₙ(x) − f* − ξ) Φ(Z) + σₙ(x) φ(Z), with Z = (μₙ(x) − f* − ξ)/σₙ(x). Large ξ demands that f(x) exceed f* by more than ξ before improvement counts, which shifts weight toward uncertain regions and forces exploration. In practice, ξ = 0.01 is the most common default for EI (Lizotte 2008).
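The effect of β can be seen on a stylised posterior whose mean peaks in one place and whose uncertainty peaks in another (both curves are fabricated for illustration, not derived from data): tiny β follows the mean, large β follows the uncertainty.

```python
import numpy as np

x = np.linspace(0, 1, 501)
# Stylised posterior: mean peaks at x = 0.2, uncertainty peaks at x = 0.8
mu = np.exp(-0.5 * ((x - 0.2) / 0.1) ** 2)
sigma = 0.05 + 0.95 * np.exp(-0.5 * ((x - 0.8) / 0.1) ** 2)

def ucb_argmax(beta):
    """Location the UCB rule would query next for a given beta."""
    return x[np.argmax(mu + np.sqrt(beta) * sigma)]

x_exploit = ucb_argmax(1e-6)   # beta -> 0: follows the mean peak
x_explore = ucb_argmax(100.0)  # large beta: follows the uncertainty peak
```

The same two-peak picture explains the β ≈ 2 log(t) schedule: as t grows, the rule drifts from the mean peak toward whatever uncertainty remains.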
§ 6
Hyperparameter Marginalisation
Why integrating out θ is correct — and how it's approximated in practice
The GP has hyperparameters θ = {ℓ, σ_f, σ_n, …} (length-scale, signal variance, noise variance, etc.). Most implementations fit θ by maximum marginal likelihood (type-II MLE) and treat it as fixed. This is a point estimate, which can be overconfident when data is sparse.
The fully Bayesian approach places a prior p(θ) over hyperparameters and integrates them out:
Fully Marginalised Acquisition — integrating over GP hyperparameters

α(x | Dₙ) = ∫ α(x | Dₙ, θ) · p(θ | Dₙ) dθ

where p(θ | Dₙ) ∝ p(y | X, θ) · p(θ)   ← posterior over hyperparameters

This integral has no closed form. It is approximated by:
 (a) MCMC sampling: α(x) ≈ (1/S) Σₛ α(x | Dₙ, θ⁽ˢ⁾), with θ⁽ˢ⁾ ~ p(θ | Dₙ) via HMC
 (b) Slice sampling (SPEARMINT, Snoek et al. 2012)
 (c) Laplace approximation: p(θ | Dₙ) ≈ 𝒩(θ_MAP, [−∇² log p]⁻¹)
Point Estimate (Type-II MLE)
θ* = arg max_θ log p(y | X, θ). Fast: single L-BFGS run. Used in BoTorch, GPyOpt defaults. Overconfident: the acquisition function behaves as if θ* is known exactly, which underestimates uncertainty in θ — especially dangerous with few observations (n < 20). Can lead to convergence to a suboptimal local maximum.
Full Marginalisation (MCMC)
Treats GP hyperparameters as random variables with a prior. Averages the acquisition function over the posterior p(θ | Dₙ). More expensive but provides correct uncertainty quantification. Particularly important when the length-scale is poorly identified from few points. SPEARMINT used this; GPflowOpt and BoTorch support it via NUTS/HMC.
When marginalisation matters most: in very low-data regimes (n ≤ 10), when the kernel structure is uncertain, or when the acquisition function is used to make high-stakes decisions. In typical hyperparameter tuning with budgets of T = 50–200, Type-II MLE performs comparably to full marginalisation and is far cheaper.
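The MCMC estimate (a) above can be sketched under the simplifying assumption that posterior hyperparameter draws are already available: here three hand-picked length-scales stand in for samples θ⁽ˢ⁾ ~ p(θ | Dₙ) that a real implementation would obtain via HMC or slice sampling.

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B, ell):
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)

def posterior(Xs, X, y, ell, noise=1e-4):
    """GP posterior mean and std at Xs for a given length-scale ell."""
    K = rbf(X, X, ell) + noise * np.eye(len(X))
    Ks = rbf(Xs, X, ell)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, np.sqrt(np.maximum(var, 1e-12))

def ei(mu, s, f_best):
    z = (mu - f_best) / s
    return (mu - f_best) * norm.cdf(z) + s * norm.pdf(z)

X = np.array([0.1, 0.5, 0.9])
y = np.sin(3 * X)
Xs = np.linspace(0, 1, 101)
f_best = y.max()

# Stand-ins for posterior draws theta^(s); a fixed point estimate would use one value
ell_samples = [0.1, 0.2, 0.4]
alpha = np.mean([ei(*posterior(Xs, X, y, l), f_best) for l in ell_samples], axis=0)
```

The averaged acquisition stays broad wherever the length-scale is ambiguous, which is exactly the extra caution the point estimate lacks.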
§ 7
Convergence Theory — Regret Analysis
What theoretical guarantees actually say — and what they do not
Convergence of BO is analysed through the lens of regret — the gap between the values queried and the true optimum. There are two distinct notions:
Simple Regret r_T
r_T = f(x*) − max_{t≤T} f(x_t)
The gap between the true maximum and the best point found after T evaluations. Measures final quality. A Bayesian optimiser has converged when r_T → 0. This is the quantity optimised in the recommendation step: return x̂* = arg max_{t≤T} f(x_t).
Cumulative Regret R_T
R_T = Σ_{t=1}^T [ f(x*) − f(x_t) ]
The total suboptimality accumulated over all T queries. Used to bound simple regret: r_T ≤ R_T / T. An algorithm is no-regret if R_T / T → 0. This requires that the algorithm continues to improve — not just one lucky query.
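Both regret notions are one-liners given the trace of queried values (f(x*) is assumed known here, which is only possible on synthetic benchmarks):

```python
import numpy as np

def regrets(f_star, y_queries):
    """Simple and cumulative regret after each step of a BO run."""
    y = np.asarray(y_queries, dtype=float)
    simple = f_star - np.maximum.accumulate(y)   # r_t = f(x*) - best-so-far
    cumulative = np.cumsum(f_star - y)           # R_t = sum of per-step gaps
    return simple, cumulative

f_star = 1.0
y_trace = [0.2, 0.7, 0.5, 0.9, 0.95]             # hypothetical queried values
r, R = regrets(f_star, y_trace)
```

Note that simple regret is non-increasing (one lucky query suffices), while cumulative regret always grows; the bound r_T ≤ R_T / T connects the two.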
UCB Regret Bound — Srinivas, Krause, Kakade, Seeger (2010), ICML Best Paper

With probability ≥ 1 − δ, UCB-BO with β_t = 2 log(|𝒳| t² π² / 6δ) satisfies:

R_T ≤ √( C₁ · T · β_T · γ_T )

where:
  γ_T = maximum information gain achievable by T points about f
  C₁ = 8 / log(1 + σ⁻²ₙ)   (a constant depending on the noise level)

γ_T depends on the kernel:
  RBF kernel:      γ_T = O( (log T)^{d+1} )                     ← sublinear, fast convergence
  Matérn-ν kernel: γ_T = O( T^{d(d+1)/(2ν + d(d+1))} · log T )  ← depends on smoothness ν
  Linear kernel:   γ_T = O( d log T )                           ← d-dimensional linear function

For RBF: R_T = O( √T · (log T)^{d+1} ), so R_T / T → 0 (no-regret).
Simple regret: r_T ≤ R_T / T = O( (log T)^{d+1} / √T ) → 0 as T → ∞.
What the theorem does not say: it does not provide useful bounds for the finite budgets typical in practice (T = 50). The constants hidden in the O(·) notation are large and problem-dependent. The bound holds for the exact GP posterior with known hyperparameters — in practice, hyperparameters are estimated, which violates the theoretical assumptions. Convergence guarantees are asymptotic results; practical performance is empirical.
Maximum Information Gain γ_T — the kernel-specific bottleneck

γ_T = max_{A ⊂ 𝒳, |A| = T} I(y_A ; f) = max_A ½ log | I + σ⁻²ₙ K(A,A) |

I(y_A ; f) is the mutual information between the T observations y_A and the function f. γ_T measures how quickly observations "fill in" the function: smoother kernels (RBF) have low γ_T, since a few points reveal a lot, while rough kernels have high γ_T. Its growth with the dimension d (e.g. (log T)^{d+1} for the RBF kernel) is why BO degrades in high dimensions.
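The information-gain formula can be evaluated directly for a fixed design A (here a uniform grid rather than the maximising set, so this is a lower bound on γ_T): a rough kernel yields a larger gain than a smooth one, matching the intuition above.

```python
import numpy as np

def info_gain(X, kernel, noise_var=0.1):
    """I(y_A; f) = 1/2 log|I + sigma_n^-2 K(A,A)| for a fixed design X."""
    K = kernel(X[:, None] - X[None, :])
    n = len(X)
    sign, logdet = np.linalg.slogdet(np.eye(n) + K / noise_var)
    return 0.5 * logdet

# Smooth vs. rough priors at the same length-scale (0.5)
rbf      = lambda d: np.exp(-0.5 * (d / 0.5) ** 2)   # infinitely smooth
matern12 = lambda d: np.exp(-np.abs(d) / 0.5)        # Ornstein-Uhlenbeck, very rough

X = np.linspace(0, 1, 20)                            # fixed 20-point design
g_smooth = info_gain(X, rbf)
g_rough = info_gain(X, matern12)
```

The rough (Matérn-1/2) kernel extracts strictly more information per design because its samples are harder to pin down, which is precisely why its regret bound is worse.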
§ 8
When BO Dominates Gradient Descent — and When It Doesn't
The decision boundary is a function of dimensionality, budget, and differentiability
This is a mathematical decision, not a heuristic one. The crossover point between BO and gradient-based methods is determined by three quantities: the evaluation budget T, the input dimension d, and gradient availability.
Exhibit 6 — The BO vs. GD Decision Boundary as a function of evaluation budget and dimensionality
The curse of dimensionality strikes BO through γ_T: in high dimensions, the number of observations needed to adequately cover 𝒳 grows exponentially. The practical threshold is approximately d ≤ 20 for standard GP-BO; for d > 20, random embeddings (REMBO), additive decompositions, or gradient-enhanced GPs become necessary.
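The random-embedding idea behind REMBO is tiny in code. A sketch: run BO over z ∈ ℝᵏ and lift each candidate through a fixed Gaussian matrix A before the expensive call. Clipping to the box is one of several conventions for points that land outside 𝒳, and the dimensions here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 4                        # ambient and embedding dimensions (illustrative)
A = rng.normal(size=(d, k))          # fixed random embedding matrix

def embed(z, lo=-1.0, hi=1.0):
    """Lift a low-dimensional candidate z in R^k into the ambient box [lo, hi]^d."""
    return np.clip(A @ z, lo, hi)    # clip coordinates that fall outside the box

# The (cheap) inner optimisation now searches R^k; only embed(z) is passed
# to the expensive oracle f. Effective when f varies only along a
# low-dimensional subspace (low intrinsic dimensionality).
z = rng.normal(size=k)
x = embed(z)
```

The GP is then fit over the k-dimensional z-space, sidestepping the γ_T blow-up in d.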
Regime | Use BO? | Mathematical Reason
d ≤ 20, T ≤ 500, ∇f unavailable | ✓ Yes, GP-BO | γ_T sublinear in T; GP posterior informative; the acquisition function is far cheaper to optimise than f is to query
d ≤ 5, T ≤ 30 | ✓ Yes, strong | Low d → GP posterior tight with few points; EI efficiently concentrates evaluations near x*
20 < d ≤ 100, T ≤ 1000 | ⚠ BO with tricks | REMBO (random embeddings), HESBO, or additive GP decompositions extend BO to moderate d; vanilla GP-BO fails
d > 100, ∇f available | ✗ Use GD | GD amortises per-iteration cost over the full gradient; BO's O(n³) GP cost cannot compete at scale
f differentiable, T > 10⁴ feasible | ✗ Use GD | Each GD step is O(d) vs. BO's O(n³) plus an inner optimisation; per query, the gradient is a far more informative signal
f noisy (σₙ large), d ≤ 10 | ✓ BO handles noise | The GP models observation noise σ²ₙ natively; the posterior mean is an automatic smoothed estimate. GD on noisy f is ill-posed without careful variance reduction (SVRG, SAG)
f is a neural network's weight space (d = 10⁶+) | ✗ Definitively GD | Fitting a GP over 10⁶-dimensional inputs is computationally infeasible. Adam + backprop is the only viable strategy; BO is restricted to the network's hyperparameter space, not its weights
§ 9
Extensions and Active Research Directions
Beyond the standard formulation — the mathematical frontier
Exhibit 7 — Insights Pyramid: Standard Formulation to Open Research (layered from known to frontier)
Standard GP-BO: GP surrogate + EI/UCB acquisition + MLE hyperparameters
The formulation in §2–5. Works for d ≤ 10, T ≤ 200, noise-free or mildly noisy f.

KNOWN EXTENSION
Parallel (batch) BO: q-EI
q-EI = 𝔼[ max( maxᵢ f(xᵢ) − f*, 0 ) ] over a batch {x₁, …, x_q} — analytically intractable for q > 1; approximated by Monte Carlo or fantasisation. Used when evaluations can run in parallel (e.g., a k8s job farm).

KNOWN EXTENSION
Constrained BO: optimise f(x) subject to cᵢ(x) ≥ 0
Model each constraint with a separate GP: p(cᵢ(x) ≥ 0 | Dₙ). The acquisition becomes α(x) · P(feasible | x). SCBO (scalable constrained BO) adds trust regions for stability.

ACTIVE AREA
High-dimensional BO: random embeddings and trust regions
REMBO: embed 𝒳 ⊆ ℝᵈ into a low-dimensional subspace ℝᵏ (k ≪ d) via a random matrix A ∈ ℝ^{d×k}. Effective if f has low intrinsic dimensionality. TuRBO: restricts BO to a trust region that shrinks or grows based on success — avoiding the exploration-exploitation pathology in high d.

ACTIVE AREA
Non-GP surrogates: Random Forests (SMAC), TPE, Deep Kernel Learning
SMAC (Hutter 2011) uses random forests as the surrogate — it handles categorical inputs and conditional hyperparameters natively. TPE (Bergstra 2011) models p(x|y) rather than p(y|x). DKL (Wilson 2016) combines a deep neural network feature extractor with a GP, trading closed-form posteriors for scalability.

FRONTIER
BO for LLMs: gradient-free fine-tuning, prompt optimisation, RLHF loop design
As LLM evaluation becomes the expensive oracle — A/B tests, human evaluation, downstream task performance — BO over the prompt/adapter/configuration space is an active area. Key challenge: the response variable is itself a distribution, not a scalar. Multi-objective BO (Pareto front) handles simultaneous optimisation of capability, safety, and efficiency metrics without reduction to a single scalar.

OPEN FRONTIER
Convergence theory for EI
The deepest open question in BO theory: for what function classes does the cumulative regret of EI satisfy the same sublinear bound as UCB? UCB has O(√(T γ_T log T)) guarantees under mild conditions. EI's convergence is harder to prove because it does not control the exploration-exploitation trade-off through an explicit parameter — yet empirically EI often outperforms UCB. The gap between theory and practice here is substantial.