
Consider the multi-layer neural network below, where h1, h2, o1, o2 are threshold units. The current weights on the network are shown in red. The training example inputs and outputs are shown in blue under the i1, i2, o1, o2 units. Assume that the threshold units all use the logistic activation function 1/(1 + exp(-x)). For the given training example (0.5, 0.1), what is the half-squared error at the units o1 and o2? (Do this by showing the activations at h1, h2, o1 and o2; you may use a calculator.) Clearly the weights w5, w6, w7 and w8 can be modified using the gradient update rule, which is stated in terms of the total error at o1 and o2. To update the weights w1, w2, w3, w4, however, we need to know how much error is made at h1 and h2. The problem is that we don't know what the true values are supposed to be at h1 and h2 for the current input. How do neural network training algorithms handle this problem?
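A minimal sketch of the forward-pass computation is given below. It assumes the usual 2-2-2 wiring (w1..w4 from the inputs to h1 and h2, w5..w8 from h1 and h2 to o1 and o2, no bias terms); the actual weight values and target outputs must be read off the figure and are passed in as arguments rather than assumed here:

function [E, h1, h2, o1, o2] = halfSquaredError(w, i1, i2, t1, t2)
% w = [w1 w2 w3 w4 w5 w6 w7 w8], the red weights from the figure
sigma = @(x) 1 ./ (1 + exp(-x));        % logistic activation 1/(1 + exp(-x))
h1 = sigma(w(1)*i1 + w(2)*i2);          % activation at hidden unit h1
h2 = sigma(w(3)*i1 + w(4)*i2);          % activation at hidden unit h2
o1 = sigma(w(5)*h1 + w(6)*h2);          % activation at output unit o1
o2 = sigma(w(7)*h1 + w(8)*h2);          % activation at output unit o2
E  = 0.5*((t1 - o1)^2 + (t2 - o2)^2);   % half-squared error at o1 and o2
end

Calling halfSquaredError with the red weights, the inputs 0.5 and 0.1, and the blue target outputs from the figure returns the requested activations and the half-squared error.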

Solution

One of the problems that occur during neural network training is called overfitting. The error on the training set is driven to a very small value, but when new data is presented to the network the error is large. The network has memorized the training examples, but it has not learned to generalize to new situations.

One method for improving network generalization is to use a network that is just large enough to provide an adequate fit. The larger the network you use, the more complex the functions it can create. If you use a small enough network, it will not have enough power to overfit the data. Run the Neural Network Design example nnd11gn [HDB96] to investigate how reducing the size of a network can prevent overfitting.

Unfortunately, it is difficult to know beforehand how large a network should be for a specific application. There are two other methods for improving generalization that are implemented in Neural Network Toolbox™ software: regularization and early stopping. The next sections describe these two techniques and the routines to implement them.

Note that if the number of parameters in the network is much smaller than the total number of points in the training set, then there is little or no chance of overfitting. If you can easily collect more data and increase the size of the training set, then there is no need to worry about the following techniques to prevent overfitting. The rest of this section only applies to those situations in which you want to make the most of a limited supply of data.

Retraining Neural Networks

Typically each backpropagation training session starts with different initial weights and biases, and different divisions of data into training, validation, and test sets. These different conditions can lead to very different solutions for the same problem.

It is a good idea to train several networks to ensure that a network with good generalization is found.

Here a dataset is loaded and divided into two parts: 90% for designing networks and 10% for testing them all.
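A sketch of this split, assuming the toolbox's bodyfat_dataset as the example data (any input/target matrices in the same column-per-sample format would work the same way):

[x, t] = bodyfat_dataset;
Q  = size(x, 2);                 % total number of samples
Q1 = floor(Q * 0.90);            % 90% for designing networks
Q2 = Q - Q1;                     % 10% held out to test all networks
ind  = randperm(Q);              % random permutation of sample indices
ind1 = ind(1:Q1);
ind2 = ind(Q1 + (1:Q2));
x1 = x(:, ind1);  t1 = t(:, ind1);   % design set
x2 = x(:, ind2);  t2 = t(:, ind2);   % independent test set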

Next a network architecture is chosen and trained ten times on the first part of the dataset, with each network's mean squared error then measured on the second part of the dataset.
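One way the repeated training could look, assuming a ten-hidden-neuron feedforwardnet and the x1/t1, x2/t2 split from above (the architecture and the number of repetitions are illustrative choices):

net = feedforwardnet(10);        % network architecture to be retrained
numNN = 10;
NN    = cell(1, numNN);          % the trained networks
perfs = zeros(1, numNN);         % each network's MSE on the held-out set
for i = 1:numNN
    fprintf('Training %d/%d\n', i, numNN);
    NN{i}    = train(net, x1, t1);     % fresh initial weights and data division
    y2       = NN{i}(x2);              % outputs on the held-out second part
    perfs(i) = mse(net, t2, y2);       % mean squared error on that part
end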

Each network will be trained starting from different initial weights and biases, and with a different division of the first dataset into training, validation, and test sets. Note that the test sets are a good measure of generalization for each respective network, but not for all the networks, because data that is a test set for one network will likely be used for training or validation by other neural networks. This is why the original dataset was divided into two parts, to ensure that a completely independent test set is preserved.

The neural network with the lowest performance value (the smallest mean squared error on the second part of the dataset) is the one that generalized best.
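Continuing the sketch above, that network could be picked out of perfs with something like:

[bestPerf, k] = min(perfs);   % smallest MSE on the independent second part
bestNet = NN{k};              % the network that generalized best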

Multiple Neural Networks

Another simple way to improve generalization, especially when poor generalization is caused by noisy data or a small dataset, is to train multiple neural networks and average their outputs.

For instance, here 10 neural networks are trained on a small problem and their mean squared errors are compared with the mean squared error of their averaged output.

First, the dataset is loaded and divided into a design and test set.

Then, ten neural networks are trained.

Next, each network is tested on the second dataset, and both the individual performances and the performance of the averaged output are calculated.
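A sketch of these steps, assuming the same design/test split (x1, t1, x2, t2) as in the retraining example above and ten feedforwardnet(10) networks:

numNN = 10;
nets  = cell(1, numNN);
for i = 1:numNN
    nets{i} = train(feedforwardnet(10), x1, t1);   % each starts from a fresh initialization
end

perfs   = zeros(1, numNN);   % individual mean squared errors on the test part
y2Total = 0;
for i = 1:numNN
    y2       = nets{i}(x2);             % this network's outputs on the test part
    perfs(i) = mse(nets{i}, t2, y2);    % its individual performance
    y2Total  = y2Total + y2;            % accumulate outputs for averaging
end
y2Average   = y2Total / numNN;                 % average of the ten networks' outputs
perfAverage = mse(nets{1}, t2, y2Average);     % performance of the averaged output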

The mean squared error of the averaged output is likely to be lower than most, though perhaps not all, of the individual performances. The averaged output is also likely to generalize better to additional new data.

For some very difficult problems, a hundred networks can be trained and the average of their outputs taken for any input. This is especially helpful for a small, noisy dataset in conjunction with the Bayesian Regularization training function trainbr, described below.

Early Stopping

The default method for improving generalization is called early stopping. This technique is automatically provided for all of the supervised network creation functions, including the backpropagation network creation functions such as feedforwardnet.

In this technique the available data is divided into three subsets. The first subset is the training set, which is used for computing the gradient and updating the network weights and biases. The second subset is the validation set. The error on the validation set is monitored during the training process. The validation error normally decreases during the initial phase of training, as does the training set error. However, when the network begins to overfit the data, the error on the validation set typically begins to rise. When the validation error increases for a specified number of iterations (net.trainParam.max_fail), the training is stopped, and the weights and biases at the minimum of the validation error are returned.
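As a sketch, the early stopping setup can be made explicit as follows; the 70/15/15 ratios and the max_fail value of 6 are typical defaults, stated here only for illustration:

net = feedforwardnet(10);
net.divideFcn = 'dividerand';          % random division into the three subsets
net.divideParam.trainRatio = 0.70;     % training set: gradient and weight updates
net.divideParam.valRatio   = 0.15;     % validation set: monitored for early stopping
net.divideParam.testRatio  = 0.15;     % test set: not used during training
net.trainParam.max_fail = 6;           % stop after this many consecutive validation-error increases
[net, tr] = train(net, x, t);          % tr.best_epoch records where validation error was lowest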

The test set error is not used during training, but it is used to compare different models. It is also useful to plot the test set error during the training process. If the error in the test set reaches a minimum at a significantly different iteration number than the validation set error, this might indicate a poor division of the data set.

There are four functions provided for dividing data into training, validation and test sets. They are dividerand (the default), divideblock, divideint, and divideind. You can access or change the division function for your network with this property:
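For example (assuming net is an existing network object):

net.divideFcn                    % current division function, e.g. 'dividerand'
net.divideFcn = 'divideind';     % switch to index-based division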

Each of these functions takes parameters that customize its behavior. These values are stored and can be changed with the following network property:
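For example, for the default dividerand the relevant fields are the three ratios:

net.divideParam                       % parameters of the current division function
net.divideParam.trainRatio = 0.70;    % fraction of samples used for training
net.divideParam.valRatio   = 0.15;    % fraction used for validation
net.divideParam.testRatio  = 0.15;    % fraction used for testing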

Index Data Division (divideind)

Create a simple test problem. For the full data set, generate a noisy sine wave with 201 input points ranging from -1 to 1 at steps of 0.01:
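A sketch of this data set; the use of one full sine period, sin(2*pi*p), is an assumption, and any sine wave over the interval would serve the same purpose:

p = -1:0.01:1;                           % 201 input points from -1 to 1
t = sin(2*pi*p) + 0.1*randn(size(p));    % sine wave plus Gaussian noise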

Divide the data by index so that successive samples are assigned to the training set, validation set, and test set successively:
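A sketch using divideind, where samples 1, 4, 7, ... go to training, 2, 5, 8, ... to validation, and 3, 6, 9, ... to testing:

trainInd = 1:3:201;    % every third sample, starting at 1, for training
valInd   = 2:3:201;    % starting at 2, for validation
testInd  = 3:3:201;    % starting at 3, for testing
[trainP, valP, testP] = divideind(p, trainInd, valInd, testInd);
[trainT, valT, testT] = divideind(t, trainInd, valInd, testInd);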
