The mystery of regularization solved

Previously, we talked about bias and variance. And we learned that high variance means overfitting.

But what can we do to deal with overfitting once it happens?

A go-to technique is to use regularization. This is not specific to deep learning; regularization is used across all types of machine learning models.

It sounds like a big, complex term but its logic is very simple. Regularization is a way to limit the flexibility of your model.

It makes sense that it would help with overfitting, right? We saw that models with a lot of flexibility (like decision trees) tend to have high variance. And we all know what happens when variance is high.

Okay okay, sounds cool and all, but what does it mean to limit the flexibility of your model? Take a look at this small network.

If you remember, models learn through their parameters. Parameters are first initialized to random numbers, and as examples are fed to the model, it adjusts the parameters to produce more accurate predictions. Our parameters in this example are w1, w2, w3 and w4. It is not a very complex model.
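Here is a minimal sketch of what a network this small does with its four parameters. The exact wiring (one input, two hidden neurons, one output, no biases) is my assumption, since the image isn't reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Parameters are first initialized to random numbers...
w1, w2, w3, w4 = rng.normal(size=4)

def predict(x):
    # ...and they are what turns an input into a prediction.
    h1 = np.tanh(w1 * x)      # first hidden neuron
    h2 = np.tanh(w2 * x)      # second hidden neuron
    return w3 * h1 + w4 * h2  # output neuron

print(predict(0.7))  # an (initially random) prediction for one example
```

Training is just the process of nudging w1 through w4 until those predictions line up with the examples.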

The more hidden layers and neurons we have, the more complex our model will be. This gives the network the flexibility to learn more complex data patterns. Regularization limits this flexibility by keeping the network's parameters small.

How exactly this is done depends on the approach. Some commonly used ones are L1, L2 and dropout regularization. (Early stopping is sometimes used as a form of regularization too, but it may be less preferable since it interrupts the network's training process.)
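As a rough sketch of how these approaches look in Keras (we'll cover the details later in the course): the layer sizes, the 0.01 and 0.5 values, and the `X_train` / `y_train` names are placeholders, not recommendations.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(10,)),  # 10 input features, an arbitrary placeholder
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),  # L2: penalize large weights
    layers.Dropout(0.5),  # dropout: randomly drop neurons during training
    layers.Dense(1),
])

# Early stopping halts training when the validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, validation_split=0.2, callbacks=[early_stop])
```

L1 works the same way as L2 here; you would just swap in `regularizers.l1(...)`.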

One thing that confused me when I was learning regularization was the connection between keeping the parameters small and overfitting. I couldn't understand why high weights meant the model had overfit.

But it's actually quite straightforward.

Below is what an input-to-output graph would look like if the model has overfit. The models we train of course have more than one input feature, and many parameters contribute to the output, but let's take this small example for simplicity.

The red line is the pattern that the model learned.

Recalling the fundamentals: the weight of an input feature reflects how much importance the model has assigned to it.

Overfitting means the model is exaggerating the importance of certain input features. High weights are exactly that: a feature's importance being exaggerated. That's why keeping the weights small helps battle overfitting.
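To make that concrete, here is a back-of-the-envelope sketch of how an L2 penalty is added to the loss. The weight values, the data loss and the lambda are made up purely for illustration.

```python
import numpy as np

weights = np.array([0.3, 8.0, 0.5])  # one weight is suspiciously large
data_loss = 0.10                     # pretend this is the loss on the data (e.g. MSE)
lam = 0.01                           # regularization strength (lambda)

l2_penalty = lam * np.sum(weights ** 2)  # the large weight dominates this term
total_loss = data_loss + l2_penalty

print(l2_penalty, total_loss)  # penalty (~0.64) dwarfs the 0.10 data loss
```

Since training minimizes the total loss, the exaggerated weight gets pushed back down unless it really earns its keep by reducing the data loss.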

We will of course go into detail in the course on how each of the regularization approaches works and how to implement them in Python using Keras. But for now, knowing that:

  • high variance means overfitting,
  • regularization helps reduce overfitting,
  • regularization works by limiting the flexibility of the model

is a good start!

Speaking of the course, I updated the table of contents today. Take a look here.