Getting started in applied machine learning can be confusing. There are many terms to learn, and a lot of them are not used consistently: the same term can mean different things in different fields of study.
Parameter vs. Hyperparameter: What’s the difference?
Let’s start with the most basic distinction, that between a parameter and a hyperparameter.
A model parameter is a property internal to the model that is learned from the data during training and that the model needs in order to make predictions. The coefficients of a linear regression and the weights of a neural network are typical examples.
A model hyperparameter, on the other hand, is not learned during training; it is set beforehand. Its value might be chosen using a rule of thumb or by trial and error, but we generally cannot know the best hyperparameter values for a specific problem in advance. So we tune the hyperparameters: each setting leads training to a different set of model parameters, and we look for the setting that produces the model with the best performance.
In a way, hyperparameters are the knobs we can use to tune our model.
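To make the distinction concrete, here is a minimal sketch in plain Python. The function fit_line, the toy data, and the chosen values are assumptions made for illustration only. The learning rate and the number of epochs are hyperparameters, fixed before training starts; the slope w and the intercept b are parameters, learned from the data.

```python
def fit_line(xs, ys, learning_rate=0.05, n_epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error.

    learning_rate and n_epochs are hyperparameters: we choose them up front.
    w and b are parameters: training learns them from the data.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative, not from a real dataset).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys, learning_rate=0.05, n_epochs=2000)
print(f"learned parameters: w={w:.3f}, b={b:.3f}")
```

Changing learning_rate or n_epochs changes how (and whether) training reaches good values of w and b, which is exactly why hyperparameters are worth tuning.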
Hyperparameter tuning, also called hyperparameter optimization, is the problem of finding the set of hyperparameters that gives the best performance for a specific learning algorithm.
This is an important step because the right hyperparameters lead training to the model parameters that produce the most skillful predictions, which is what we want in the end.
An optimization problem of this nature has three basic components:
(1) an objective function, the quantity we want to either maximize or minimize (for example, a validation score or a validation loss);
(2) a set of variables that control the objective function (here, the hyperparameters);
(3) a set of constraints that allow the variables to take certain values while excluding others.
It then follows that the optimization problem is to find the set of values for the variables that maximize or minimize the objective function whilst satisfying the set of constraints.
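The three components can be sketched in a few lines of Python. Everything here is hypothetical: the objective function is a made-up formula standing in for "train a model and measure its validation loss", and the hyperparameter names, candidate values, and bounds are assumptions chosen for illustration.

```python
def objective(learning_rate, max_depth):
    """Objective function: a hypothetical validation loss that we want to minimize."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 4) ** 2


def satisfies_constraints(learning_rate, max_depth):
    """Constraints: the values the variables are allowed to take."""
    return 0.0 < learning_rate <= 1.0 and 1 <= max_depth <= 6


# Variables: candidate (learning_rate, max_depth) pairs.
candidates = [(0.01, 2), (0.1, 4), (0.1, 8), (0.5, 3), (1.5, 4)]

# Keep only candidates that satisfy the constraints, then minimize the objective.
feasible = [c for c in candidates if satisfies_constraints(*c)]
best = min(feasible, key=lambda c: objective(*c))
print("feasible candidates:", feasible)
print("best candidate:", best)
```

The approaches discussed in this article differ only in how they choose which candidates to evaluate.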
There are different approaches to solving the optimization problem, as well as many open-source software options that implement these approaches. In this article, we’ll explore Grid Search, Random Search, and Bayesian Optimization.
Grid Search, also called parameter sweep, is considered the simplest algorithm for hyperparameter optimization. It consists of exhaustively searching through a manually specified set of hyperparameter values, that is, training and evaluating the model for every possible combination of the specified values.
This approach can be a good choice if the model trains quickly. Otherwise the search takes too long, because the number of combinations multiplies with every hyperparameter added to the grid. This is why using Grid Search to tune the hyperparameters of a neural network is not considered a best practice.
A popular implementation is Scikit-Learn’s GridSearchCV. Other implementations are Talos, a library that includes Grid Search for Keras, and H2O, a platform that provides a Grid Search implementation for different machine learning algorithms.
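The exhaustive search itself is short enough to sketch with the standard library. This is not how GridSearchCV, Talos, or H2O are implemented; validation_loss is a hypothetical stand-in for training and scoring a model, and the grid values are assumptions for illustration.

```python
from itertools import product


def validation_loss(learning_rate, max_depth):
    """Hypothetical stand-in for training a model and measuring its validation loss."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 4) ** 2


def grid_search(objective, grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((objective(**params), params))
    best_loss, best_params = min(results, key=lambda r: r[0])
    return best_params, best_loss, len(results)


grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [2, 4, 6, 8],
}
best_params, best_loss, n_evaluations = grid_search(validation_loss, grid)
print(best_params, best_loss, n_evaluations)
```

With 3 learning rates and 4 depths the search runs 3 x 4 = 12 evaluations. Adding a third hyperparameter with 5 values would raise that to 60, which illustrates why the cost becomes a problem for models that are slow to train.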
The idea behind Random Search is quite similar to Grid Search, except that instead of exhaustively searching through all the possible combinations of the manually specified set of hyperparameter values, it selects a random subset of those combinations to try.
This approach considerably reduces the time needed to run the hyperparameter optimization, but there is a caveat: there is no guarantee that it will find the optimal set of hyperparameters.
Some popular implementations are Scikit-Learn’s RandomizedSearchCV, Hyperopt, and Talos.
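A minimal sketch of the idea, again using only the standard library. It is not the implementation used by RandomizedSearchCV, Hyperopt, or Talos; validation_loss, the search space, and the trial budget are assumptions for illustration.

```python
import random


def validation_loss(learning_rate, max_depth):
    """Hypothetical stand-in for training a model and measuring its validation loss."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 4) ** 2


def random_search(objective, space, n_trials, seed=0):
    """Evaluate n_trials randomly drawn combinations and keep the best one."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


space = {
    "learning_rate": [0.001, 0.01, 0.05, 0.1, 0.5, 1.0],
    "max_depth": list(range(1, 11)),
}
# The full grid has 6 x 10 = 60 combinations; we only pay for 15 evaluations.
best_params, best_loss = random_search(validation_loss, space, n_trials=15)
print(best_params, best_loss)
```

The budget is set by n_trials rather than by the size of the grid, which is where the time savings come from. The price is the caveat mentioned above: the best combination may simply never be drawn.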
We’ve already established that Random Search and Grid Search are built upon a similar idea. Another thing they have in common is that they do not use previous results to decide which hyperparameters to evaluate next, so they spend time evaluating less-than-optimal options.
In contrast, Bayesian Optimization does keep track of past evaluation results, making it an adaptive approach to optimization. It is a powerful strategy for finding the values of the variables that optimize objective functions that are expensive to evaluate. In a way, Bayesian techniques are most useful when we need to minimize the number of evaluations we make in our attempt to find the global optimum.
Bayesian optimization incorporates prior belief about an objective function (f) and updates the prior with samples drawn from (f) to get a posterior that better approximates (f) — Bayesian Optimization by Martin Krasser (2018).
To achieve this, Bayesian Optimization makes use of two more concepts: a surrogate model and an acquisition function. The surrogate model is a probabilistic model used to approximate the objective function. The acquisition function is used to decide where in the domain to sample next, favoring locations where an improvement over the current best evaluation result is most likely. These are two key factors behind the efficiency of Bayesian Optimization.
There are a few different choices for surrogate models; the most common ones are Gaussian Processes, Random Forest regression, and Tree-structured Parzen Estimators (TPE). As for acquisition functions, the most common ones are Expected Improvement, Maximum Probability of Improvement, and Upper Confidence Bound.
Some popular implementations are Scikit-Optimize’s Bayesian Optimization, SMAC, Spearmint, MOE, and Hyperopt.
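To show how the surrogate model and the acquisition function work together, here is a deliberately small sketch for a single hyperparameter in [0, 1]. It is not how the libraries above are implemented. The assumptions are: a hand-written zero-mean Gaussian Process with an RBF kernel (the length scale of 0.3 is arbitrary), Expected Improvement as the acquisition function, a fixed grid of 51 candidate points, and a made-up objective standing in for an expensive validation loss.

```python
import math
import random


def objective(x):
    """Hypothetical stand-in for an expensive validation loss, with x in [0, 1]."""
    return (x - 0.7) ** 2


def rbf_kernel(a, b, length_scale=0.3):
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def gp_posterior(xs, ys, x_new, jitter=1e-6):
    """Surrogate model: Gaussian Process posterior mean and std at x_new."""
    k = [[rbf_kernel(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf_kernel(a, x_new) for a in xs]
    mean = sum(ks * w for ks, w in zip(k_star, solve(k, ys)))
    var = 1.0 - sum(ks * w for ks, w in zip(k_star, solve(k, k_star)))
    return mean, math.sqrt(max(var, 1e-12))


def expected_improvement(mean, std, best):
    """Acquisition function: expected amount by which a point beats the best loss."""
    z = (best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    return (best - mean) * cdf + std * pdf


def bayesian_optimization(n_initial=3, n_iterations=8, seed=0):
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_initial)]  # a few random starting points
    ys = [objective(x) for x in xs]
    candidates = [i / 50 for i in range(51)]
    for _ in range(n_iterations):
        best = min(ys)
        # Use all past results to pick the most promising next point.
        x_next = max(candidates,
                     key=lambda c: expected_improvement(*gp_posterior(xs, ys, c), best))
        xs.append(x_next)
        ys.append(objective(x_next))
    best_index = min(range(len(ys)), key=ys.__getitem__)
    return xs[best_index], ys[best_index], xs


best_x, best_y, history = bayesian_optimization()
print(f"best x: {best_x:.3f}, best loss: {best_y:.5f}, evaluations: {len(history)}")
```

The key difference from Grid Search and Random Search is inside the loop: every new point is chosen by consulting the surrogate fitted to all previous evaluations, so the expensive objective is called only 11 times in total. In practice a maintained library is a better choice than a hand-rolled Gaussian Process like this one.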
Hyperparameter optimization is a big deal for machine learning tasks. It is an important step to improve model performance, but it isn’t the only strategy out there to this end.
Of all the approaches we explored, different practitioners and experts would advocate for different methods according to their experience, but many of them would also agree that the right choice depends on the data, the problem, and other project considerations. The most important takeaway is that hyperparameter optimization is a crucial step in finding your best-performing model.