# Understanding and Optimizing Neural Network Hyperparameters Part 2: The Neurons

## Intro

In part 2 we will discuss how neural nets are structured, how that structure can affect the success of a model, and why. The design of a neural network is largely determined by the data it is supplied. You will already know how the outside layers are formed: each field of data is usually fed to an individual input neuron, and the output layer is formed from the target. It is the hidden layers that are flexible, and that is where you can contribute to success. You must choose the best number of hidden layers, and the best number of neurons on each hidden layer. Bear in mind this construction can be represented as a simple list of layer sizes, where [10, 20, 5] would be a network of 3 layers: 10 input neurons, 20 neurons on the hidden layer, and 5 on the output.
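The layer-size list described above (10, 20, 5) can be sketched in a few lines of Python. The variable names here are hypothetical, used only to show how each part of the list maps to a part of the network:

```python
# A network's shape captured as a simple list of layer sizes:
# 10 inputs, one hidden layer of 20 neurons, 5 outputs.
layer_sizes = [10, 20, 5]

n_inputs = layer_sizes[0]      # fixed by the data's input fields
n_outputs = layer_sizes[-1]    # fixed by the target values
hidden_layers = layer_sizes[1:-1]  # the flexible part you optimize

print(n_inputs, hidden_layers, n_outputs)  # 10 [20] 5
```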


**The outside layers**

With many problems, a dataset will have a consistent number of numeric input values. The number of fields decides the number of neurons on the input layer. However, some cases involve data with non-numeric, categorical values. For example, if the data used terms such as “Big”, “Small”, and “Medium” instead of a numeric size, we could no longer use a single input neuron to represent the size field. Instead we would use a method known as one-hot encoding. Using this method, we simply create a vector whose length is the number of classes in the field. In this example, that would be 3. Then, instead of referring to a class by its original alphabetic phrase, we refer to it as a simple binary vector, where “Big” could be **[1,0,0]**, “Small” would be **[0,1,0]**, and “Medium” would be **[0,0,1]**. Each value of the vector gets its own input neuron. What we have essentially done is give a neuron to each class of a field, rather than the field itself, all because we need numeric values.
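A minimal sketch of this encoding, using the article's “Big”/“Small”/“Medium” example (the position of each class in the vector is one possible assignment; any consistent ordering works):

```python
# Classes of the hypothetical "size" field, in the order used above.
classes = ["Big", "Small", "Medium"]

def one_hot(value, classes):
    """Return a binary vector with a 1 at the position of `value`."""
    vector = [0] * len(classes)
    vector[classes.index(value)] = 1
    return vector

print(one_hot("Small", classes))  # [0, 1, 0]
```

Each position in the returned vector then becomes its own input neuron.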

Using the preprocessing feature of my tool Perceptron, here are a few example rows of data that use alpha values as classifications.

If I then apply one-hot encoding, the program will find all classifications per field and populate binary vectors for each item. The result looks like this:

Do not mistake this for an efficient way to handle language processing; it is really intended for problems involving simple classification fields. That said, there have been vast NLP neural networks based entirely on one-hot encoding!

The basic idea of this, combined with even simpler one-field-one-neuron inputs, will make up an input layer. There may be manipulations that happen to the data along the way, but structure-wise, it is this simple for all common problems.

Just like the input layer, the output layer and its neurons are determined by the data being used, in this case by the target values. Most simply, again like the input layer, the number of target values (or values to predict) decides the number of neurons on the output layer. However, the targets could also be a classification type, so the same one-hot encoding is applicable. You could have three output neurons, each representing an item of a binary vector, where one will be *hot* (or on) for each data row. After thousands of feedforward passes, neuron prediction values could look like **[0.0034, 0.984, 0.011]**, where the actual target is **[0,1,0]**. Of course, there is an obvious threshold for recognising which neuron is firing as true.
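The "which neuron is firing" step can be sketched as a simple argmax over the output activations. The prediction values below are the article's example; the comparison against the one-hot target is an assumed but standard way to check correctness:

```python
# Output-neuron activations after training (from the article's example)
# and the corresponding one-hot target.
predictions = [0.0034, 0.984, 0.011]
target = [0, 1, 0]

# The "firing" neuron is simply the one with the highest activation.
predicted_index = predictions.index(max(predictions))
target_index = target.index(1)

print(predicted_index == target_index)  # True
```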

**The hidden layers**

The hidden layers are the flexible part of a neural network. They are all the layers in between the outside layers, and have no direct relationship to the shape of your data, giving you the liberty to optimize the number of hidden layers and the number of neurons on each. I have always considered these neurons to affect the *nonlinear mass of observation*. Like the learning rate, there can be too much or too little. However, instead of putting stress on an error of observation, the number of hidden neurons determines what is actually observed in the data, and by how much.

Assume we have a network designed to solve the MNIST problem, and at first we decide to have only 5 neurons on the hidden layer. For now, disregard why the outside layers are designed this way, or read up on the MNIST problem.

As a result of using 5 neurons, we get the following error rate.

This clearly remains much too high. Naturally, we can assume that for a neural network with 784 input neurons, the hidden layer's count should be much higher. Let's try 200 hidden neurons. Note that the image below is obviously not literally showing all 784 input and 200 hidden neurons.

As a result of the 200 neurons, we get much better success.

It is worth mentioning that the colour of the lines does not reflect their success at all; it is only used to distinguish the models they represent. The error is still very high at 40%. Finally, we can try what I know to be the best for this problem: about 40 neurons.
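To make the 784-40-10 shape concrete, here is a forward-pass sketch in NumPy. The random weights are purely illustrative of the layer dimensions; this is not a trained model, and sigmoid is an assumed choice of activation:

```python
import numpy as np

rng = np.random.default_rng(0)

# MNIST-shaped network: 784 inputs (28x28 pixels), 40 hidden, 10 outputs.
n_in, n_hidden, n_out = 784, 40, 10
W1 = rng.normal(size=(n_in, n_hidden))   # input -> hidden weights
W2 = rng.normal(size=(n_hidden, n_out))  # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(n_in)           # one flattened image (random stand-in)
hidden = sigmoid(x @ W1)       # 40 hidden activations
output = sigmoid(hidden @ W2)  # 10 output activations, one per digit

print(output.shape)  # (10,)
```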

You will notice that the different hidden layer designs seem to vaguely affect the shape of the lines on the graph. If you have too many or too few, you could end up with the same amount of error. This makes manual optimization more difficult.

There are some good rules of thumb for coming up with your first hidden layer design. The simplest is multiplying the input neuron count by the output neuron count, then taking the square root of the result. However, a more interesting way takes the dataset size into account:

\(N_{h} = \frac{ N_{s} }{ \alpha \cdot (N_{i} + N_{o}) }\)

Where \(N_{h}\) is the number of hidden neurons, \(N_{s}\) is the number of samples in your dataset, \(N_{i}\) is the number of input neurons, and \(N_{o}\) is the number of output neurons. \(\alpha\) is a scale factor that should be optimized; normally it is between 4 and 9.
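Both rules of thumb are easy to compute. The sketch below applies them to MNIST-like sizes (784 inputs, 10 outputs, 60,000 samples) with \(\alpha = 5\), an arbitrary value from the 4 to 9 range; these specific numbers are assumptions for illustration:

```python
import math

n_inputs, n_outputs, n_samples = 784, 10, 60_000
alpha = 5  # scale factor, normally between 4 and 9

# Rule 1: square root of (inputs * outputs).
rule1 = math.sqrt(n_inputs * n_outputs)

# Rule 2: Nh = Ns / (alpha * (Ni + No)).
rule2 = n_samples / (alpha * (n_inputs + n_outputs))

print(round(rule1), round(rule2))  # 89 15
```

Note that the two rules can disagree substantially, which is why either result should be treated as a starting point to tune from, not a final answer.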

You may also be wondering about using multiple hidden layers instead of just one. The vast majority of the time, only one hidden layer should be used. In fact, multiple layers can cause a loss of understanding by essentially overcomplicating the problem. More specifically, during backpropagation, if the path back to each weight is too long, it becomes difficult to compute accurate error responsibilities. This is known as the vanishing gradient problem. When you hear about deep learning, which by definition refers to neural networks with many hidden layers, it tends to involve other, more complex types of networks, such as convolutional or recurrent ones. Standard feedforward nets rarely contain more than one hidden layer.

©ODSC2017