In machine learning, hyper-parameters are the parameters that are not learned during training (unlike, say, weights) but that control the learning process itself and, consequently, model efficacy. For example, the regularization strength is one of the most common hyper-parameters, appearing in almost all types of ML algorithms, including deep learning, XGBoost, and Lasso, Ridge, and Elastic Net regression.
The time required to train and test a model can depend heavily on the choice of its hyper-parameters; e.g., a very small learning rate (say 0.0001) can dramatically increase training time. Hyper-parameters come in both continuous and integer types, which implies that their best combination can lie anywhere in an effectively infinite search space. The existence of some hyper-parameters can also be conditional on the value of others; e.g., the size of each hidden layer in a neural network only exists once the number of layers has been fixed.
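For illustration, the short sketch below (plain Python, with a purely hypothetical search space) samples such a conditional configuration: the layer sizes only come into existence after the number of layers has been drawn.

```python
import random

# A minimal sketch with a hypothetical search space: the sizes of the
# hidden layers only exist once the number of layers has been sampled.
def sample_configuration():
    config = {"learning_rate": 10 ** random.uniform(-5, -1)}  # continuous
    n_layers = random.randint(1, 4)                           # integer
    config["n_layers"] = n_layers
    # Conditional hyper-parameters: one size per sampled layer.
    config["layer_sizes"] = [random.choice([32, 64, 128, 256])
                             for _ in range(n_layers)]
    return config

print(sample_configuration())
```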
In some models, performance can be attributed to just a few hyper-parameters, e.g., in Lasso, Ridge, and Elastic Net regression, whereas in other cases the number of hyper-parameters can grow arbitrarily large, e.g., with the number of layers in a neural network. Models such as Deep Belief Networks (DBNs), stacked denoising autoencoders, convolutional networks, and XGBoost have from ten to perhaps fifty hyper-parameters, depending on how the experimenter chooses to parametrize the model and how many hyper-parameters the experimenter chooses to fix at a reasonable default. In practice, substantial performance boosts are achieved by hyper-parameter tuning, sometimes alternatively referred to as model tuning.
Hyper-parameter search can be exhaustive. There exist three main approaches for searching for the best hyper-parameters in the search space:
- Grid Search
- Random Search
- Bayesian Optimization
There is yet another approach, particle swarm optimization, which we will not touch on in this post.
Grid Search
This is the most basic hyper-parameter tuning method, in which values are chosen over a certain range for each hyper-parameter and arranged in a grid. The tuning algorithm exhaustively searches this space in a sequential manner and trains a model for every possible combination of hyper-parameter values. This method guarantees finding the best set of hyper-parameter values within the grid, but it can take prohibitively long, as every possible combination of values is tested and the number of combinations grows exponentially with the number of hyper-parameters.
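As a concrete illustration, here is a minimal sketch using scikit-learn's GridSearchCV on a synthetic regression problem; the estimator, grid values, and dataset are chosen only for demonstration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Every combination in the grid is trained and cross-validated:
# 4 alphas x 2 intercept settings = 8 candidates per CV fold.
param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],   # regularization strength
    "fit_intercept": [True, False],
}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```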
Random Search
Unlike grid search, random search does not require an explicit set of possible values for each hyper-parameter. Instead, a statistical distribution is provided for each hyper-parameter, from which values are randomly sampled; i.e., a sampling distribution is defined for each hyper-parameter to carry out a randomized search. This drastically reduces the search time, but the results may not be as good as with the grid search approach.
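A comparable sketch with scikit-learn's RandomizedSearchCV is given below; it assumes a recent scipy for the loguniform distribution, and the distribution and budget are illustrative.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Instead of a fixed grid, each hyper-parameter gets a sampling
# distribution; n_iter controls the budget independently of how
# many hyper-parameters there are.
param_distributions = {"alpha": loguniform(1e-4, 1e2)}
search = RandomizedSearchCV(Ridge(), param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```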
Bayesian Optimization
The above two approaches have their limitations and benefits: grid search provides better-tuned parameters but a long convergence time, whereas random search converges early at the expense of mediocre results. The Bayesian optimization approach meets in the middle of the two by providing good parameter values with reduced convergence time. In the above methods, the parameter values of two consecutive iterations are not connected, i.e., the combination of parameter values in one iteration gives no information about the next choice of parameters. Since all iterations are independent of each other, it is not possible to use the information from one iteration to improve the next.
This is exactly where the Bayesian approach comes in handy. Since model selection and hyper-parameter optimization are crucial in applying machine learning to a novel dataset, researchers have proposed an approach for solving this called Sequential Model-Based Optimization (SMBO). SMBO iterates between fitting models and using them to make choices about which configurations to investigate.
Bayesian optimization is a sequential model-based optimization (SMBO) approach which, based on the results of previous iterations, decides the hyper-parameter values for the next iteration such that model performance is improved.
Before we proceed to the implementation part, let us first lay down the theoretical foundations. Bayesian optimization is a powerful tool for optimizing objective functions that are very costly or slow to evaluate. Specifically, we seek the maximum of an expensive function \( f: \Theta \rightarrow \mathbb{R} \) over parameters \( \theta \): \begin{equation} \theta_{opt} = \mathop{\arg\,\max}\limits_{\theta \in \Theta} f(\theta) \end{equation}
with parameter domain \( \Theta \subset \mathbb{R}^d \). Here, \(d\) is typically small, although the cost of evaluating model variations is high. A relevant post on loss functions can be read at BLOG - WHAT ARE LOSS FUNCTIONS.
Bayesian optimization methods can be differentiated at a high level by their regression models and acquisition functions. There exist several open-source Bayesian optimization software packages, differentiated by the types of regression models and acquisition functions they offer/implement. We will give a brief overview of those packages, but first a brief introduction to surrogate and acquisition functions.
A surrogate function is a function that approximates another function. This comes in handy, as the surrogate takes relatively little effort to evaluate and can therefore be used as a replacement for an expensive black-box function. For example, to find a point that minimizes an objective function that is expensive to evaluate, simply evaluate its surrogate on thousands of points and take the best value as an approximation to the minimizer of the objective function.
In applications where the true fitness function \(f : \Theta \rightarrow \mathbb{R} \) is costly to evaluate, SMBO algorithms approximate \(f\) with a surrogate that is cheaper to evaluate. The surrogate is modeled as the conditional probability \( P(f \mid D) \) of the objective function \(f\), given the available data \(D\). Typically, a probabilistic regression model such as a Gaussian Process or a Random Forest is used.
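To make this concrete, the following sketch fits a Gaussian Process surrogate (via scikit-learn) to a handful of evaluations of a stand-in objective; the objective, kernel choice, and sample points are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical "expensive" objective f(theta); here a cheap stand-in.
def f(theta):
    return np.sin(3 * theta) + 0.5 * theta

# A handful of expensive evaluations form the data D.
theta_obs = np.array([[-2.0], [-0.5], [1.0], [2.5]])
y_obs = f(theta_obs).ravel()

# The GP regression model plays the role of P(f | D): it is cheap to
# query and reports predictive uncertainty alongside the mean.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(theta_obs, y_obs)

# Querying the surrogate on a dense grid costs almost nothing.
theta_grid = np.linspace(-3, 3, 200).reshape(-1, 1)
mean, std = gp.predict(theta_grid, return_std=True)
```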
Acquisition functions are mathematical techniques that dictate how the hyper-parameter space should be searched. They balance the exploration-exploitation trade-off, i.e., they determine which areas in the domain of \(f(\theta)\) are worth exploiting and which areas are worth exploring. Exploration means trying totally different things in areas that have not yet been probed, whereas exploitation means trying slight variations of things that have already been proven to be good solutions. Common choices are Expected Improvement (EI), Lower Confidence Bound (LCB), and Probability of Improvement (PI).
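For instance, Expected Improvement can be written in a few lines on top of a surrogate's predictive mean and standard deviation. The sketch below assumes a maximization problem; the xi parameter is a common but optional exploration knob.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, y_best, xi=0.01):
    """EI for maximization: expected amount by which we beat y_best.

    mean, std -- the surrogate's predictive mean and standard deviation
    y_best    -- best objective value observed so far
    xi        -- exploration parameter; larger values favor exploration
    """
    std = np.maximum(std, 1e-9)            # avoid division by zero
    z = (mean - y_best - xi) / std
    return (mean - y_best - xi) * norm.cdf(z) + std * norm.pdf(z)
```

Points with a high predicted mean (exploitation) and points with high predictive uncertainty (exploration) both score well under EI, which is exactly the trade-off described above.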
\begin{algorithm}
\caption{Sequential Model-Based Optimization (SMBO)}
\begin{algorithmic}
\STATE INPUT: $f, \Theta, \mathcal{S}, \mathcal{M}$
\STATE $\mathcal{D} \leftarrow$ INITSAMPLES$(f, \mathcal{M})$
\FOR{$i \leftarrow |\mathcal{D}|$ \TO $\mathcal{T}$}
    \STATE $p\,(y \mid \theta, \mathcal{D}) \leftarrow$ FITMODEL$(\mathcal{M}, \mathcal{D})$
    \STATE $\theta_i \leftarrow \mathop{\arg\,\max}\limits_{\theta \in \Theta} \mathcal{S}(\theta, p\,(y \mid \theta, \mathcal{D}))$
    \STATE $y_i \leftarrow f(\theta_i)$ $\quad$ /* EXPENSIVE STEP */
    \STATE $\mathcal{D} \leftarrow \mathcal{D} \cup \{(\theta_i, y_i)\}$
\ENDFOR
\end{algorithmic}
\end{algorithm}
Each suggested function evaluation produces an observation $y_i \leftarrow f(\theta_i)$, which is then appended to the historical set $\mathcal{D} = \{(\theta_1, y_1), \ldots, (\theta_i, y_i)\}$. This dataset $\mathcal{D}$ is then used to update the regression model for generating the suggestion in the next iteration. As this is a heuristic approach, practice demands a quota on the total time or resources available, which imposes a limit $\mathcal{T}$ on the total number of function evaluations.
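Putting the pieces together, here is a minimal, self-contained sketch of the SMBO loop above on a toy one-dimensional objective, with a Gaussian Process as $\mathcal{M}$ and EI as $\mathcal{S}$; a fixed candidate grid stands in for the inner arg max, and the objective is hypothetical.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(theta):                               # hypothetical expensive objective
    return -(theta - 1.0) ** 2 + 2.0

def expected_improvement(mean, std, y_best, xi=0.01):
    std = np.maximum(std, 1e-9)
    z = (mean - y_best - xi) / std
    return (mean - y_best - xi) * norm.cdf(z) + std * norm.pdf(z)

# INITSAMPLES: a few random evaluations seed the dataset D.
rng = np.random.default_rng(0)
theta_D = rng.uniform(-3, 3, size=(3, 1))
y_D = np.array([f(t[0]) for t in theta_D])

T = 15                                      # budget on expensive evaluations
candidates = np.linspace(-3, 3, 500).reshape(-1, 1)

for i in range(len(y_D), T):
    # FITMODEL: refit the surrogate p(y | theta, D) on all data so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(theta_D, y_D)
    # arg max of the acquisition S over a candidate grid (cheap).
    mean, std = gp.predict(candidates, return_std=True)
    theta_i = candidates[np.argmax(expected_improvement(mean, std, y_D.max()))]
    y_i = f(theta_i[0])                     # EXPENSIVE STEP
    # D <- D union {(theta_i, y_i)}
    theta_D = np.vstack([theta_D, theta_i])
    y_D = np.append(y_D, y_i)

print("best theta:", theta_D[np.argmax(y_D)], "best y:", y_D.max())
```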
There exist several SMBO libraries, each offering a different flavor of either $\mathcal{M}$ or $\mathcal{S}$, or both. E.g., both Spearmint and MOE use a Gaussian Process as the probabilistic regression model, Hyperopt uses Tree-structured Parzen Estimators, whereas SMAC is based on Random Forests. All of the above libraries use EI as their acquisition function.
The two most commonly used SMBO libraries are scikit-optimize and Optuna. A hands-on example with scikit-optimize can be read at BLOG - HYPER-PARAMETER OPTIMIZATION USING SCIKIT-OPTIMIZE ON TELCO CHURN DATA and one for Optuna at BLOG - HYPER-PARAMETER OPTIMIZATION USING OPTUNA.
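As a minimal taste of such a library, the sketch below uses scikit-optimize's gp_minimize on a toy objective; note that skopt minimizes, so a maximization problem would be negated. The bounds, budget, and objective are illustrative.

```python
from skopt import gp_minimize
from skopt.space import Real

# Toy objective: minimized at theta = 1.
def objective(params):
    theta, = params
    return (theta - 1.0) ** 2

result = gp_minimize(objective,
                     [Real(-3.0, 3.0, name="theta")],  # search space
                     acq_func="EI",                    # acquisition function
                     n_calls=20,                       # budget T
                     random_state=0)
print(result.x, result.fun)
```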