Bayesian Optimization


Setup: https://github.com/fmfn/BayesianOptimization

YouTube link: https://www.youtube.com/watch?v=BQ4kVn-Rt84

Guide link: https://machinelearningmastery.com/what-is-bayesian-optimization/

 Install Jupyter Notebook

I will guide you through installing Jupyter Notebook on Windows. First, make sure you have Python installed on your machine.

You can check by going to Terminal or Command Prompt and entering the following command:
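For example (on Windows, the py --version launcher command also works if Python was installed from python.org):

```
python --version
```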

If it shows the version of Python you are using, you have it installed. Otherwise, go to the Microsoft Store or the official Python website to download and install it.

Next, to install Jupyter Notebook, you just need to run the following pip command:
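For example (this installs the classic Notebook package; installing the jupyter metapackage works as well):

```
pip install notebook
```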

Once the installation is complete, pip will print a message confirming that the packages were installed successfully.

How to use Jupyter Notebook?

After installing Jupyter Notebook using the steps above, we can start it by running the following command in Terminal or Command Prompt:
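For the classic Notebook interface installed above, the command is:

```
jupyter notebook
```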

The terminal will then display startup messages, including the address of the local Notebook server.

In addition, your browser will open automatically at the address http://localhost:8888/tree.

Using Jupyter Basics

  1. Create a basic notebook

On the home page, click New and select the type of document you want to create, such as a Python notebook, a text file, or a folder.

A newly created notebook has the default name Untitled. You can click the word “Untitled” at the top of the page, next to the Jupyter logo, to rename it to whatever you like. For example, here I renamed it to Hello World.

Switch back to Jupyter’s file manager tab and you will see a new file named Hello World.ipynb with the status Running, because this notebook is open. You can also stop a notebook by clicking Shutdown.

  2. Working with notebooks

A notebook consists of many cells. When you create a new notebook, it starts with a single empty cell.

The above cell is of type “Code”, which means you can type Python code into it and execute it right away. To execute the code, you can press the Run cell button or press Ctrl + Enter.

The result is displayed immediately below the cell, and a new empty cell is created after you execute the code. Let’s type some more Python code to test it:
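For example, a tiny snippet like the following (any code will do; this one just adds two numbers):

```python
# add two numbers and print the sum; the output appears below the cell
a = 5
b = 7
print(a + b)
```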

Once again, the result appears directly below the cell.

How to Perform Bayesian Optimization

In this section, we will explore how Bayesian Optimization works by developing an implementation from scratch for a simple one-dimensional test function.

First, we will define the test problem, then how to model the mapping of inputs to outputs with a surrogate function. Next, we will see how the surrogate function can be searched efficiently with an acquisition function before tying all of these elements together into the Bayesian Optimization procedure.

Test Problem

The first step is to define a test problem.

We will use a multimodal problem with five peaks, calculated as:

  • y = x^2 * sin(5 * PI * x)^6

Where x is a real value in the range [0,1] and PI is the value of pi.

We will augment this function by adding Gaussian noise with a mean of zero and a standard deviation of 0.1. This will mean that the real evaluation will have a positive or negative random value added to it, making the function challenging to optimize.

The objective() function below implements this.
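A minimal sketch of such an objective() function, following the formula above, might look like this:

```python
from math import sin, pi
from numpy.random import normal

# objective: x^2 * sin(5 * pi * x)^6 plus zero-mean gaussian noise
def objective(x, noise=0.1):
    # draw a noise value with the given standard deviation
    noise = normal(loc=0, scale=noise)
    return (x ** 2 * sin(5 * pi * x) ** 6.0) + noise
```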

We can test this function by first defining a grid-based sample of inputs from 0 to 1 with a step size of 0.01 across the domain.

We can then evaluate these samples using the target function without any noise to see what the real objective function looks like.

We can then evaluate these same points with noise to see what the objective function will look like when we are optimizing it.

We can look at all of the non-noisy objective function values to find the input that resulted in the best score and report it. This will be the optima, in this case, maxima, as we are maximizing the output of the objective function.

We would not know this in practice, but for our test problem, it is good to know the real best input and output of the function to see if the Bayesian Optimization algorithm can locate it.

Finally, we can create a plot, first showing the noisy evaluation as a scatter plot with input on the x-axis and score on the y-axis, then a line plot of the scores without any noise.

The complete example of reviewing the test function that we wish to optimize is listed below.
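A possible version of this complete example, reusing the objective() function above (variable names are illustrative), is sketched here:

```python
from math import sin, pi
from numpy import arange, argmax
from numpy.random import normal
from matplotlib import pyplot

# objective: x^2 * sin(5 * pi * x)^6 plus zero-mean gaussian noise
def objective(x, noise=0.1):
    noise = normal(loc=0, scale=noise)
    return (x ** 2 * sin(5 * pi * x) ** 6.0) + noise

# grid-based sample of the domain [0, 1] with a step size of 0.01
X = arange(0, 1, 0.01)
# evaluate the samples without noise (the true objective)
y = [objective(x, 0) for x in X]
# evaluate the same samples with noise
ynoise = [objective(x) for x in X]
# locate and report the best non-noisy result (the maxima)
ix = argmax(y)
print('Optima: x=%.3f, y=%.3f' % (X[ix], y[ix]))
# scatter plot of the noisy evaluations and a line plot of the true function
pyplot.scatter(X, ynoise)
pyplot.plot(X, y)
pyplot.show()
```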

Running the example first reports the global optima as an input with the value 0.9 that gives the score 0.81.

A plot is then created showing the noisy evaluation of the samples (dots) and the non-noisy and true shape of the objective function (line).

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Now that we have a test problem, let’s review how to train a surrogate function.

Surrogate Function

The surrogate function is a technique used to best approximate the mapping of input examples to an output score.

Probabilistically, it summarizes the conditional probability of an objective function (f), given the available data (D) or P(f|D).

A number of techniques can be used for this, although the most popular is to treat the problem as a regression predictive modeling problem with the data representing the input and the score representing the output to the model. This is often best modeled using a random forest or a Gaussian Process.

A Gaussian Process, or GP, is a model that constructs a joint probability distribution over the variables, assuming a multivariate Gaussian distribution. As such, it is capable of efficient and effective summarization of a large number of functions and smooth transition as more observations are made available to the model.

This smooth structure and smooth transition to new functions based on data are desirable properties as we sample the domain, and the multivariate Gaussian basis to the model means that an estimate from the model will be a mean of a distribution with a standard deviation; that will be helpful later in the acquisition function.

As such, using a GP regression model is often preferred.

We can fit a GP regression model using the GaussianProcessRegressor scikit-learn implementation from a sample of inputs (X) and noisy evaluations from the objective function (y).

First, the model must be defined. An important aspect in defining the GP model is the kernel. This controls the shape of the function at specific points based on distance measures between actual data observations. Many different kernel functions can be used, and some may offer better performance for specific datasets.

By default, a Radial Basis Function, or RBF, is used that can work well.

Once defined, the model can be fit on the training dataset directly by calling the fit() function.
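For example, a minimal sketch of defining and fitting the model, assuming X is a 2D array of input samples and y holds their noisy objective scores:

```python
from sklearn.gaussian_process import GaussianProcessRegressor

# define the model; with no kernel specified, an RBF-based default kernel is used
model = GaussianProcessRegressor()
# fit the model on the available data
model.fit(X, y)
```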

The defined model can be fit again at any time with updated data concatenated to the existing data by another call to fit().

The model will estimate the cost for one or more samples provided to it.

The model is used by calling the predict() function. The result for a given sample will be a mean of the distribution at that point. We can also get the standard deviation of the distribution at that point in the function by specifying the argument return_std=True; for example:
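For example (Xsamples here stands for a 2D array of candidate inputs):

```python
# estimate the mean and standard deviation of the surrogate at the given samples
yhat, std = model.predict(Xsamples, return_std=True)
```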

This function can result in warnings if the distribution is thin at a given point we are interested in sampling.

Therefore, we can silence all of the warnings when making a prediction. The surrogate() function below takes the fit model and one or more samples and returns the mean and standard deviation estimated costs whilst not printing any warnings.
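A sketch of such a surrogate() function might look like this:

```python
from warnings import catch_warnings, simplefilter

# surrogate (approximation) of the objective function, with warnings silenced
def surrogate(model, X):
    # catch and ignore any warnings generated when making a prediction
    with catch_warnings():
        simplefilter('ignore')
        return model.predict(X, return_std=True)
```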

We can call this function any time to estimate the cost of one or more samples, such as when we want to optimize the acquisition function in the next section.

For now, it is interesting to see what the surrogate function looks like across the domain after it is trained on a random sample.

We can achieve this by first fitting the GP model on a random sample of 100 data points and their real objective function values with noise. We can then plot a scatter plot of these points. Next, we can perform a grid-based sample across the input domain and estimate the cost at each point using the surrogate function and plot the result as a line.

We would expect the surrogate function to have a crude approximation of the true non-noisy objective function.

The plot() function below creates this plot, given the random data sample of the real noisy objective function and the fit model.
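A sketch of this plot() function, reusing the surrogate() function above, might look like this:

```python
from numpy import arange, asarray
from matplotlib import pyplot

# plot the noisy observations as dots and the surrogate estimate as a line
def plot(X, y, model):
    # scatter plot of the noisy data sample
    pyplot.scatter(X, y)
    # line plot of the surrogate estimate over a dense grid of the domain
    Xsamples = asarray(arange(0, 1, 0.001)).reshape(-1, 1)
    ysamples, _ = surrogate(model, Xsamples)
    pyplot.plot(Xsamples, ysamples)
    pyplot.show()
```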

Tying this together, the complete example of fitting a Gaussian Process regression model on noisy samples and plotting the sample vs. the surrogate function is listed below.
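A possible version of this complete example is sketched below:

```python
from math import sin, pi
from warnings import catch_warnings, simplefilter
from numpy import arange, asarray
from numpy.random import normal, random
from matplotlib import pyplot
from sklearn.gaussian_process import GaussianProcessRegressor

# noisy 1D objective function
def objective(x, noise=0.1):
    noise = normal(loc=0, scale=noise)
    return (x ** 2 * sin(5 * pi * x) ** 6.0) + noise

# surrogate (approximation) of the objective, with prediction warnings silenced
def surrogate(model, X):
    with catch_warnings():
        simplefilter('ignore')
        return model.predict(X, return_std=True)

# plot the noisy data sample as dots and the surrogate estimate as a line
def plot(X, y, model):
    pyplot.scatter(X, y)
    Xsamples = asarray(arange(0, 1, 0.001)).reshape(-1, 1)
    ysamples, _ = surrogate(model, Xsamples)
    pyplot.plot(Xsamples, ysamples)
    pyplot.show()

# random sample of 100 points from the domain, evaluated with noise
X = random(100).reshape(-1, 1)
y = asarray([objective(x) for x in X[:, 0]])
# define and fit the GP model
model = GaussianProcessRegressor()
model.fit(X, y)
# plot the data sample vs. the surrogate function
plot(X, y, model)
```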

Running the example first draws the random sample, evaluates it with the noisy objective function, then fits the GP model.

The data sample and a grid of points across the domain evaluated via the surrogate function are then plotted as dots and a line respectively.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, as we expected, the plot resembles a crude version of the underlying non-noisy objective function, importantly with a peak around 0.9 where we know the true maxima is located.

Next, we must define a strategy for sampling the surrogate function.

Acquisition Function

The surrogate function is used to test a range of candidate samples in the domain.

From these results, one or more candidates can be selected and evaluated with the real, and in normal practice, computationally expensive cost function.

This involves two pieces: the search strategy used to navigate the domain in response to the surrogate function and the acquisition function that is used to interpret and score the response from the surrogate function.

A simple search strategy, such as a random sample or grid-based sample, can be used, although it is more common to use a local search strategy, such as the popular BFGS algorithm. In this case, we will use a random search or random sample of the domain in order to keep the example simple.

This involves first drawing a random sample of candidate samples from the domain, evaluating them with the acquisition function, then maximizing the acquisition function or choosing the candidate sample that gives the best score. The opt_acquisition() function below implements this.
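A sketch of this opt_acquisition() function, which relies on the acquisition() function defined later in this section, might look like this:

```python
from numpy import argmax
from numpy.random import random

# optimize the acquisition function with a simple random search of the domain
def opt_acquisition(X, y, model):
    # random candidate samples from the domain [0, 1]
    Xsamples = random(100).reshape(-1, 1)
    # score each candidate with the acquisition function
    scores = acquisition(X, Xsamples, model)
    # choose the candidate with the largest score
    ix = argmax(scores)
    return Xsamples[ix, 0]
```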

The acquisition function is responsible for scoring or estimating the likelihood that a given candidate sample (input) is worth evaluating with the real objective function.

We could just use the surrogate score directly. Alternately, given that we have chosen a Gaussian Process model as the surrogate function, we can use the probabilistic information from this model in the acquisition function to calculate the probability that a given sample is worth evaluating.

There are many different types of probabilistic acquisition functions that can be used, each providing a different trade-off for how exploitative (greedy) and explorative they are.

Three common examples include:

  • Probability of Improvement (PI).
  • Expected Improvement (EI).
  • Lower Confidence Bound (LCB).

The Probability of Improvement method is the simplest, whereas the Expected Improvement method is the most commonly used.

In this case, we will use the simpler Probability of Improvement method, which is calculated as the normal cumulative probability of the normalized expected improvement, calculated as follows:

  • PI = cdf((mu - best_mu) / stdev)

Where PI is the probability of improvement, cdf() is the normal cumulative distribution function, mu is the mean of the surrogate function for a given sample x, stdev is the standard deviation of the surrogate function for a given sample x, and best_mu is the mean of the surrogate function for the best sample found so far.

We can add a very small number to the standard deviation to avoid a divide by zero error.

The acquisition() function below implements this given the current training dataset of input samples, an array of new candidate samples, and the fit GP model.
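A sketch of this acquisition() function, reusing the surrogate() function from the previous section, might look like this:

```python
from scipy.stats import norm

# probability of improvement acquisition function
def acquisition(X, Xsamples, model):
    # best surrogate score found so far on the observed samples
    yhat, _ = surrogate(model, X)
    best = max(yhat)
    # mean and standard deviation of the surrogate at the candidate samples
    mu, std = surrogate(model, Xsamples)
    # probability of improvement; the small constant avoids division by zero
    return norm.cdf((mu - best) / (std + 1E-9))
```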

Complete Bayesian Optimization Algorithm

We can tie all of this together into the Bayesian Optimization algorithm.

The main algorithm involves cycles of selecting candidate samples, evaluating them with the objective function, then updating the GP model.

The complete example is listed below.
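A possible version of the complete algorithm, combining the pieces above with an initial random sample of 100 points and 100 optimization cycles, is sketched below:

```python
from math import sin, pi
from warnings import catch_warnings, simplefilter
from numpy import arange, asarray, argmax, vstack
from numpy.random import normal, random
from scipy.stats import norm
from matplotlib import pyplot
from sklearn.gaussian_process import GaussianProcessRegressor

# noisy 1D objective function
def objective(x, noise=0.1):
    noise = normal(loc=0, scale=noise)
    return (x ** 2 * sin(5 * pi * x) ** 6.0) + noise

# surrogate (approximation) of the objective, with prediction warnings silenced
def surrogate(model, X):
    with catch_warnings():
        simplefilter('ignore')
        return model.predict(X, return_std=True)

# probability of improvement acquisition function
def acquisition(X, Xsamples, model):
    yhat, _ = surrogate(model, X)
    best = max(yhat)
    mu, std = surrogate(model, Xsamples)
    return norm.cdf((mu - best) / (std + 1E-9))

# optimize the acquisition function with a random search of the domain
def opt_acquisition(X, y, model):
    Xsamples = random(100).reshape(-1, 1)
    scores = acquisition(X, Xsamples, model)
    ix = argmax(scores)
    return Xsamples[ix, 0]

# plot the noisy data sample as dots and the surrogate estimate as a line
def plot(X, y, model):
    pyplot.scatter(X, y)
    Xsamples = asarray(arange(0, 1, 0.001)).reshape(-1, 1)
    ysamples, _ = surrogate(model, Xsamples)
    pyplot.plot(Xsamples, ysamples)
    pyplot.show()

# initial random sample of the domain, evaluated with noise
X = random(100).reshape(-1, 1)
y = asarray([objective(x) for x in X[:, 0]])
# fit the GP model on the initial sample and plot it
model = GaussianProcessRegressor()
model.fit(X, y)
plot(X, y, model)
# optimization loop
for i in range(100):
    # select the next point to sample by optimizing the acquisition function
    x = opt_acquisition(X, y, model)
    # evaluate the point with the noisy objective function
    actual = objective(x)
    # report the surrogate's estimate vs. the actual score
    est, _ = surrogate(model, [[x]])
    print('>x=%.3f, f()=%.3f, actual=%.3f' % (x, est[0], actual))
    # add the new observation to the dataset and refit the model
    X = vstack((X, [[x]]))
    y = asarray(list(y) + [actual])
    model.fit(X, y)
# plot all samples and the final surrogate, then report the best result
plot(X, y, model)
ix = argmax(y)
print('Best Result: x=%.3f, y=%.3f' % (X[ix, 0], y[ix]))
```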

Running the example first creates an initial random sample of the search space and evaluates it with the objective function. Then a GP model is fit on this data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

A plot is created showing the raw observations as dots and the surrogate function across the entire domain. In this case, the initial sample has a good spread across the domain and the surrogate function has a bias towards the part of the domain where we know the optima is located.

The algorithm then iterates for 100 cycles, selecting samples, evaluating them, and adding them to the dataset to update the surrogate function, over and over again.

Each cycle reports the selected input value, the estimated score from the surrogate function, and the actual score. Ideally, these scores would get closer and closer as the algorithm converges on one area of the search space.

Next, a final plot is created with the same form as the prior plot.

This time, all 200 samples evaluated during the optimization task are plotted. We would expect an overabundance of sampling around the known optima, and this is what we see, with many dots around 0.9. We also see that the surrogate function has a stronger representation of the underlying target domain.

Finally, the best input and its objective function score are reported.

We know the optima has an input of 0.9 and an output of 0.810 when there is no sampling noise.

Given the sampling noise, the optimization algorithm gets close in this case, suggesting an input of 0.905.

Hyperparameter Tuning With Bayesian Optimization

It can be a useful exercise to implement Bayesian Optimization to learn how it works.

In practice, when using Bayesian Optimization on a project, it is a good idea to use a standard implementation provided in an open-source library. This is to both avoid bugs and to leverage a wider range of configuration options and speed improvements.

Two popular libraries for Bayesian Optimization include Scikit-Optimize and HyperOpt. In machine learning, these libraries are often used to tune the hyperparameters of algorithms.

Hyperparameter tuning is a good fit for Bayesian Optimization because the evaluation function is computationally expensive (e.g. training models for each set of hyperparameters) and noisy (e.g. noise in training data and stochastic learning algorithms).

In this section, we will take a brief look at how to use the Scikit-Optimize library to optimize the hyperparameters of a k-nearest neighbor classifier for a simple test classification problem. This will provide a useful template that you can use on your own projects.

The Scikit-Optimize project is designed to provide access to Bayesian Optimization for applications that use SciPy and NumPy, or applications that use scikit-learn machine learning algorithms.

First, the library must be installed, which can be achieved easily using pip; for example:
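For example:

```
pip install scikit-optimize
```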

It is also assumed that you have scikit-learn installed for this example.

Once installed, there are two ways that scikit-optimize can be used to optimize the hyperparameters of a scikit-learn algorithm. The first is to perform the optimization directly on a search space, and the second is to use the BayesSearchCV class, a sibling of the scikit-learn native classes for random and grid searching.

In this example, we will use the simpler approach of optimizing the hyperparameters directly.

The first step is to prepare the data and define the model. We will use a simple test classification problem via the make_blobs() function with 500 examples, each with two features and three class labels. We will then use a KNeighborsClassifier algorithm.
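A sketch of this step might look like the following:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# test classification problem: 500 examples, 2 features, 3 class labels
X, y = make_blobs(n_samples=500, centers=3, n_features=2)
# the model whose hyperparameters we will tune
model = KNeighborsClassifier()
```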

Next, we must define the search space.

In this case, we will tune the number of neighbors (n_neighbors) and the power of the Minkowski distance metric (p), which controls the shape of the neighborhood. This requires that a range be defined for each hyperparameter with a given data type. In this case, they are Integers, defined with the min, max, and the name of the parameter to the scikit-learn model. For your algorithm, you can just as easily optimize Real() and Categorical() data types.
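For example (the ranges shown here are illustrative):

```python
from skopt.space import Integer

# search space: number of neighbors and the Minkowski distance power parameter
search_space = [Integer(1, 5, name='n_neighbors'), Integer(1, 2, name='p')]
```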

Next, we need to define a function that will be used to evaluate a given set of hyperparameters. We want to minimize this function, therefore smaller values returned must indicate a better performing model.

We can use the use_named_args() decorator from the scikit-optimize project on the function definition; it allows the function to be called directly with a specific set of parameters from the search space.

As such, our custom function will take the hyperparameter values as arguments, which can be provided to the model directly in order to configure it. We can define these arguments generically in Python using the **params argument to the function, then pass them to the model via the set_params(**params) function.

Now that the model is configured, we can evaluate it. In this case, we will use 5-fold cross-validation on our dataset and evaluate the accuracy for each fold. We can then report the performance of the model as one minus the mean accuracy across these folds. This means that a perfect model with an accuracy of 1.0 will return a value of 0.0 (1.0 – mean accuracy).

This function is defined after we have loaded the dataset and defined the model so that both the dataset and model are in scope and can be used directly.
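A sketch of this evaluation function (the name evaluate_model is illustrative) might look like this:

```python
from numpy import mean
from sklearn.model_selection import cross_val_score
from skopt.utils import use_named_args

# evaluate one hyperparameter configuration; smaller return values are better
@use_named_args(search_space)
def evaluate_model(**params):
    # configure the model with the given hyperparameters
    model.set_params(**params)
    # estimate accuracy with 5-fold cross-validation
    acc = mean(cross_val_score(model, X, y, cv=5, scoring='accuracy'))
    # return an error (1.0 - mean accuracy) so the optimizer can minimize it
    return 1.0 - acc
```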

Next, we can perform the optimization.

This is achieved by calling the gp_minimize() function with the name of the objective function and the defined search space.

By default, this function will use a ‘gp_hedge‘ acquisition function that tries to figure out the best strategy, but this can be configured via the acq_func argument. The optimization will also run for 100 iterations by default, but this can be controlled via the n_calls argument.

Once run, we can access the best score via the “fun” property and the best set of hyperparameters via the “x” array property.
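For example:

```python
from skopt import gp_minimize

# run the optimization with the defaults (100 calls, 'gp_hedge' acquisition)
result = gp_minimize(evaluate_model, search_space)
# report the best score and the best hyperparameters found
print('Best Accuracy: %.3f' % (1.0 - result.fun))
print('Best Parameters: n_neighbors=%d, p=%d' % (result.x[0], result.x[1]))
```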

 

Tying this all together, the complete example is listed below.
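A possible version of this complete example is sketched below (the search ranges and the evaluate_model name are illustrative):

```python
from numpy import mean
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from skopt.space import Integer
from skopt.utils import use_named_args
from skopt import gp_minimize

# test classification problem: 500 examples, 2 features, 3 class labels
X, y = make_blobs(n_samples=500, centers=3, n_features=2)
# the model whose hyperparameters we will tune
model = KNeighborsClassifier()
# search space: number of neighbors and the Minkowski distance power parameter
search_space = [Integer(1, 5, name='n_neighbors'), Integer(1, 2, name='p')]

# evaluate one hyperparameter configuration; smaller return values are better
@use_named_args(search_space)
def evaluate_model(**params):
    model.set_params(**params)
    acc = mean(cross_val_score(model, X, y, cv=5, scoring='accuracy'))
    return 1.0 - acc

# perform the optimization
result = gp_minimize(evaluate_model, search_space)
# report the best score and the best hyperparameters found
print('Best Accuracy: %.3f' % (1.0 - result.fun))
print('Best Parameters: n_neighbors=%d, p=%d' % (result.x[0], result.x[1]))
```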

 

Running the example executes the hyperparameter tuning using Bayesian Optimization.

The code may report many warning messages, such as:
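For instance, scikit-optimize may print a warning along these lines (the exact wording depends on the library version):

```
UserWarning: The objective has been evaluated at this point before.
```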

This is to be expected and is caused by the same hyperparameter configuration being evaluated more than once.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieved about 97% accuracy via mean 5-fold cross-validation with n_neighbors set to 3 and p set to 2.

