GP
Gaussian Process
Gaussian Process (GP) is a Bayesian optimization method, like TPE. As introduced in the TPE section, both use Bayesian optimization but differ in how they model the relationship between parameters and objective values.
- GP models
p(y|x)directly — the distribution of objective values given parameter values. - TPE models
p(x|y)andp(y)separately.
What is a Gaussian Process?
A Gaussian Process defines a probability distribution over functions. Given a set of observed trials, GP fits a surrogate model that predicts both the mean (expected value) and uncertainty (variance) at any unobserved point in the search space.
This is the key advantage over TPE: GP knows not only what value it expects at an unobserved point, but also how confident it is in that prediction.
How GP-based Bayesian Optimization Works
Optimization by GP is performed by repeatedly executing STEP2 to STEP3.
STEP1
Random sampling is first performed to gather initial observations, just like TPE. The number of initial trials is controlled by the "Number of startup trials" setting.
STEP2
Given the observed data, GP fits a surrogate model. For each candidate point x
in the search space, the model provides:
- A predicted mean
μ(x): the expected objective value atx - A predicted variance
σ²(x): the uncertainty of the prediction atx
STEP3
An acquisition function selects the next point to evaluate. Tunny uses Expected Improvement (EI), which balances:
- Exploitation: searching near known good points (low
μ(x)) - Exploration: searching in uncertain regions (high
σ(x))
The next trial is run at the point that maximizes EI.
GP Optuna vs GP BoTorch
Tunny provides two GP implementations:
| GP Optuna | GP BoTorch | |
|---|---|---|
| Backend | Optuna native | BoTorch library |
| Multi-objective | Not supported | Supported |
| Constraints | Supported | Supported |
| Speed | Faster | Slower |
If you are not doing multi-objective optimization, GP Optuna is recommended. GP BoTorch is required for multi-objective Bayesian optimization.
Time Complexity
GP requires computing the inverse of the kernel matrix at each step, which has
time complexity O(n³) where n is the number of observations. This makes GP
slower than TPE (O(dn log n)) as the number of trials grows. For a large
number of trials, TPE is often more practical.
When to Use GP
- Each trial is expensive and you want maximum sample efficiency
- The objective function is smooth and continuous
- The number of variables is relatively small (GP scales poorly with dimension)