Understanding and optimizing neural network hyperparameters part 1: The learning rate


Introduction to the series

When first trying to understand a neural network, one of its most debated and perhaps mysterious aspects is the set of parameters that contribute to its success. These parameters are ultimately yours to decide. As it stands, they tend to be chosen by rough tuning and trial and error. In the following articles, I will attempt to explain why they behave the way they do, and give you a better understanding so you can make more informed choices and use better methods when choosing the hyperparameters of your network. I will assume you have a good understanding of how a neural network works, but perhaps still question the purpose and effect of certain parameters. In the first parts of the series I will discuss these parameters, and in the last I will cover the optimization process. This series is not going to cover general neural network behaviour. The main aim of this article is to give an overview of the learning rate: a few abstract explanations, what it looks like, and how it is applied at a basic level.

The learning rate

The learning rate (often denoted $\eta$) is very simple at heart, and is perhaps one of the most important things to get right when designing a neural net. I have always seen the learning rate as the amount of significance to attach to an observed error. It is used when backpropagating through a neural network, where each weight needs to be adjusted according to how it contributed to the feed-forward error. A neural network's success relies entirely on how it responds to error, so that it can distinguish right from wrong. Assume we are training a network and, as expected, it at first gets most predictions wrong. When its weights are adjusted during backpropagation, the strength of these adjustments is determined by the learning rate. This idea maps well onto real-life mistakes. When someone makes a mistake, they can either think little of it and not learn from it very much, or be hugely struck by it and consider it deeply. However, it is possible to think too deeply about a mistake, or not deeply enough. Treating what might be a small mistake as a big one can lead to an overshoot of consideration, and an exaggerated understanding; conversely, not thinking deeply enough will of course cause a lack of understanding.

The learning rate therefore needs to be somewhere in the middle, but where the middle lies depends on the data and the problem you want to solve. Here are a few basic visual examples of how the learning rate affects overall success, using my tool Perceptron.

Firstly, let’s set a learning rate that I know is too low for the particular problem.

The results show a clear decrease in error. This is of course good: our neural network is learning, and within two epochs the error is down to 20%. However, it does not decrease at a very impressive rate. Let's try increasing the learning rate.

Having increased the learning rate too much, we can see the result (in red) shows no success at all. This is because the neural network has essentially over-learnt, assuming too much significance in each error. Bear in mind that as you increase the learning rate, you should expect a steeper decrease in error, up to a point beyond which it overshoots and fails completely, as it has here.

Finally, let's try a learning rate that is somewhere between our past attempts.

We can see a huge improvement over both past attempts. The green line decreases at a very efficient rate, then settles into a shallower decline. This is the shape of a successful training run. It is important to understand how the learning rate affects success visually.
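The three behaviours above can be reproduced without a full network. Here is a minimal sketch (not my Perceptron tool, just plain gradient descent on the one-dimensional function f(w) = w², whose gradient is 2w) showing a too-low, a reasonable, and a too-high learning rate:

```python
# Minimise f(w) = w^2 by gradient descent: w = w - eta * 2w.
# The function and learning rates here are illustrative choices.
def descend(eta, steps=20, w=1.0):
    for _ in range(steps):
        w -= eta * 2 * w   # gradient of w^2 is 2w
    return abs(w)          # distance from the minimum at w = 0

print(descend(0.01))   # too low: barely moves after 20 steps
print(descend(0.45))   # well chosen: essentially converged
print(descend(1.1))    # too high: overshoots and diverges
```

With eta = 0.01 the error only falls to about 0.67; with eta = 0.45 it collapses to roughly 1e-20; with eta = 1.1 each step overshoots the minimum and the error grows, mirroring the red curve above.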

How the learning rate is applied

Usually, because a learning rate is essentially a multiplier on individual weight adjustments, it is applied like this:

$$w = w - (\eta \cdot \Delta w)$$

where $w$ is the weight, $\eta$ is the learning rate, and $\Delta w$ is the gradient determined by backpropagation. However, there are much more interesting ways of applying a learning rate, such as Adagrad.
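The update rule above can be written directly in code. This is a minimal sketch with illustrative names and values, not an excerpt from any particular library:

```python
import numpy as np

def sgd_update(w, delta_w, eta=0.01):
    """Apply w = w - (eta * delta_w): scale each gradient by the
    learning rate and step the weights against it."""
    return w - eta * delta_w

w = np.array([0.5, -0.3])        # current weights
delta_w = np.array([0.2, 0.1])   # gradients from backpropagation
w = sgd_update(w, delta_w, eta=0.1)
# each weight moves by eta times its own gradient:
# 0.5 - 0.1*0.2 = 0.48, and -0.3 - 0.1*0.1 = -0.31
```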

Adagrad is a method for ongoing learning-rate adaptation. Instead of keeping the learning rate constant, it changes it based on the frequency of certain values. If many similar inputs are fed forward, the learning rate should ideally remain small, so the network does not find huge differences between things that are meant to be similar. However, if an infrequent input is fed forward, it is important that the network adjusts well to it. To understand the frequency of a parameter, we must store previous gradient changes. In the case of Adagrad, this is done by recording the sum of squares of past gradient values, $G$.

$$w = w - \left( \frac{\eta}{\sqrt{G + \epsilon}} \cdot \Delta w \right)$$

Also notice the $\epsilon$, a smoothing term that avoids division-by-zero problems. Because Adagrad is in some ways automated, you do not need to manually optimize a global $\eta$ value; it can simply start at something close to 0.01. A big problem with this method is that $G$, an accumulation of past squared gradients, continues to increase without ever being reset. This results in a continuously shrinking learning rate, even though that rate does correspond better and more dynamically to the data (in proportion to the frequency of data patterns). Another method, known as Adadelta, does attempt to limit the recorded gradient history, restricting it to a window of recent updates.