Deep Learning Fundamentals: Hyperparameter Tuning Techniques

Hyperparameter tuning is one of the most important (and time-consuming) steps in building a successful neural network. A good understanding of the possible approaches lets you make efficient, informed decisions when tuning your network. Let’s take a quick look at why this is a problem in the first place, and then review the techniques you can use on your projects.

Neural networks are very flexible models. They have many degrees of freedom, which lets them fit complex patterns in the data. But this flexibility comes at a cost: the difficulty of hyperparameter tuning. Neural networks have hyperparameters such as:

  • Number of hidden layers
  • Number of neurons per layer
  • Learning rate
  • Optimizer
  • Batch size
  • Activation function
  • Regularization

and more...

Some of the choices we make for these hyperparameters come with hyperparameters of their own. For example, if you choose to apply L1 regularization to your network, you also need to set its alpha parameter.

Moreover, many hyperparameters range over large, sometimes continuous, sets of values. For example, the number of neurons per layer can be anything from 1 to 1000 (or more, really). To simplify, we mostly try increments of 100, but even then there are many possible values this hyperparameter can take.

Even if we only had the seven hyperparameters listed above and tried just three values for each, there would be 3^7 = 2187 possible combinations. Trying all of them would mean setting up, training and evaluating 2187 networks. This exhaustive approach is called grid search.
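
To make the combinatorics concrete, here is a minimal sketch that enumerates such a grid. The candidate values are made up for illustration; any three per hyperparameter give the same count:

```python
from itertools import product

# Hypothetical search space: three candidate values per hyperparameter.
search_space = {
    "hidden_layers": [1, 2, 3],
    "neurons_per_layer": [100, 200, 300],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "optimizer": ["sgd", "adam", "rmsprop"],
    "batch_size": [32, 64, 128],
    "activation": ["relu", "tanh", "sigmoid"],
    "regularization": [None, "l1", "l2"],
}

# Grid search enumerates every combination; each one is a full training run.
grid = [dict(zip(search_space, values)) for values in product(*search_space.values())]
print(len(grid))  # 3**7 = 2187
```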

But for a model with this many hyperparameters to tune, exhaustive search is simply not feasible.

There are other ways to approach this problem. The first is random search. As the name suggests, this approach tries random values for each hyperparameter within its respective range.
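
A minimal sketch of what that sampling might look like, again with made-up ranges; instead of all 2187 grid points, you train and evaluate a fixed budget of random configurations:

```python
import random

def sample_config():
    # Draw one configuration at random from (hypothetical) ranges.
    return {
        "hidden_layers": random.randint(1, 5),
        "neurons_per_layer": random.randint(1, 1000),
        "learning_rate": 10 ** random.uniform(-5, -1),  # log-uniform draw
        "batch_size": random.choice([16, 32, 64, 128, 256]),
    }

# A fixed budget of, say, 50 training runs instead of an exhaustive grid.
candidates = [sample_config() for _ in range(50)]
```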

Although it is a good technique for simpler models, random search cannot guarantee finding the best settings for your model. If you have a bigger model that takes a long time to train, you might need a different approach.

One way to go about it is an iterative random search: after the first round of random search, evaluate the results and determine the best-performing hyperparameters. Then run another round with the ranges tightened around the values that did best in the previous round, and continue in this manner. Of course, this would be very time-consuming to do manually.
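
Here is a rough sketch of the idea for a single hyperparameter, the learning rate. `run_trial` is a hypothetical placeholder for a real training run; the dummy formula just makes the sketch runnable on its own:

```python
import math
import random

def run_trial(learning_rate):
    # Hypothetical stand-in for training a network and returning its
    # validation loss. Replace with a real training run in practice.
    return (math.log10(learning_rate) + 3) ** 2 + random.uniform(0.0, 0.1)

lo, hi = -5.0, -1.0  # search the exponent, i.e. learning rates 1e-5 .. 1e-1
for round_number in range(3):
    samples = [10 ** random.uniform(lo, hi) for _ in range(20)]
    best_lr = min(samples, key=run_trial)
    # Tighten the search range around the winner for the next round.
    center = math.log10(best_lr)
    width = (hi - lo) / 4
    lo, hi = center - width, center + width
    print(f"round {round_number}: best lr ~ {best_lr:.2g}")
```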

There are techniques out there that take care of this iteration for you. Bayesian optimizers, for example, do a good job of modeling the relationship between hyperparameter values and the loss, and use that model to decide which configuration to try next. In my upcoming course we will go into further detail on how to use these techniques to arrive at better solutions faster.
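
As a taste, here is a minimal sketch using Hyperopt’s TPE algorithm, a sequential model-based (Bayesian-style) optimizer. The search space and the `objective` stand-in are made up for the example:

```python
import math
from hyperopt import fmin, tpe, hp

# Hypothetical search space; hp.loguniform draws e**uniform(-12, -2),
# i.e. learning rates from roughly 6e-6 to 0.14.
space = {
    "learning_rate": hp.loguniform("learning_rate", -12, -2),
    "batch_size": hp.choice("batch_size", [32, 64, 128]),
}

def objective(params):
    # Stand-in for training a network and returning its validation loss;
    # the dummy quadratic keeps the sketch self-contained.
    return (math.log(params["learning_rate"]) + 7) ** 2

# TPE models loss as a function of the hyperparameters and proposes
# the most promising configuration to evaluate next.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```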

On top of grid search, random search and Bayesian optimization, there is still active research going on in the field of hyperparameter tuning:

  1. There is promising work on gradient-based optimization algorithms, which treat hyperparameter tuning itself as a learning problem and optimize hyperparameters by gradient descent.
  2. Evolutionary algorithms rely on randomization, natural selection and survival of the fittest to optimize network performance.
  3. Early-stopping based algorithms abandon unpromising hyperparameter settings quickly and focus the training budget on the promising ones (see the sketch after this list).

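To give a flavor of the early-stopping idea, here is a minimal successive-halving sketch. `train_for` is a hypothetical placeholder for a partial training run returning a validation loss:

```python
import random

def train_for(config, epochs):
    # Hypothetical stand-in: train `config` for `epochs` epochs and return
    # the validation loss. The dummy value is for illustration only.
    return random.random() / epochs

# Successive halving: start many configurations on a tiny budget, keep the
# better half, double the budget, and repeat until one configuration is left.
configs = [{"learning_rate": 10 ** random.uniform(-5, -1)} for _ in range(16)]
budget = 1
while len(configs) > 1:
    ranked = sorted(configs, key=lambda cfg: train_for(cfg, budget))
    configs = ranked[: len(ranked) // 2]  # abandon the unpromising half
    budget *= 2

print(configs[0])
```
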
Even though they have been shown to perform well, these state-of-the-art optimization algorithms are not yet widely adopted, mainly because they are hard to scale to big real-world problems.

For now, random search may well be the best strategy for your small projects. But if you ever want to learn how to use more sophisticated optimizers, check out the free libraries out there such as Hyperopt, Keras Tuner, Scikit-Optimize or Sklearn-Deap, or sign up for my upcoming course and I’ll show you how.