One of the problems that occur during neural network training is called overfitting. The error on the training set is driven to a very small value, but when new data is presented to the network the error is large. The network has memorized the training examples, but it has not learned to generalize to new situations.
One method for improving network generalization is to use a network that is just large enough to provide an adequate fit. The larger network you use, the more complex the functions the network can create. If you use a small enough network, it will not have enough power to overfit the data. Run the Neural Network Design example nnd11gn [HDB96] to investigate how reducing the size of a network can prevent overfitting.
Unfortunately, it is difficult to know beforehand how large a network should be for a specific application. There are two other methods for improving generalization that are implemented in Neural Network Toolbox™ software: regularization and early stopping. The next sections describe these two techniques and the routines to implement them.
Note that if the number of parameters in the network is much smaller than the total number of points in the training set, then there is little or no chance of overfitting. If you can easily collect more data and increase the size of the training set, then there is no need to worry about the following techniques to prevent overfitting. The rest of this section only applies to those situations in which you want to make the most of a limited supply of data.
Retraining Neural Networks
Typically each backpropagation training session starts with different initial weights and biases, and different divisions of data into training, validation, and test sets. These different conditions can lead to very different solutions for the same problem.
It is a good idea to train several networks to ensure that a network with good generalization is found.
Here a dataset is loaded and divided into two parts: 90% for designing networks and 10% for testing them all.
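A minimal sketch of this step, assuming the bodyfat_dataset sample data set that ships with the toolbox (any data set arranged one sample per column would work the same way):

    [x, t] = bodyfat_dataset;            % sample inputs x and targets t, one sample per column
    Q = size(x, 2);                      % total number of samples
    Q1 = floor(Q * 0.90);                % 90% of the samples for designing the networks
    Q2 = Q - Q1;                         % remaining 10% held back to test all of them
    ind = randperm(Q);                   % random shuffle of the sample indices
    ind1 = ind(1:Q1);
    ind2 = ind(Q1 + (1:Q2));
    x1 = x(:, ind1);  t1 = t(:, ind1);   % design set
    x2 = x(:, ind2);  t2 = t(:, ind2);   % independent test set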
Next a network architecture is chosen and trained ten times on the first part of the dataset, and each network's mean squared error is then measured on the second part of the dataset.
Each network will be trained starting from different initial weights and biases, and with a different division of the first dataset into training, validation, and test sets. Note that the test sets are a good measure of generalization for each respective network, but not for all the networks, because data that is a test set for one network will likely be used for training or validation by other neural networks. This is why the original dataset was divided into two parts, to ensure that a completely independent test set is preserved.
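A sketch of the training loop, reusing the x1/t1 and x2/t2 split from above; the single hidden layer of 10 neurons is an illustrative choice, not a recommendation:

    net = feedforwardnet(10);            % one hidden layer with 10 neurons (illustrative)
    numNN = 10;                          % number of networks to train
    NN = cell(1, numNN);
    perfs = zeros(1, numNN);
    for i = 1:numNN
        fprintf('Training %d/%d\n', i, numNN);
        NN{i} = train(net, x1, t1);      % new initial weights and data division each run
        y2 = NN{i}(x2);                  % evaluate on the held-out second part
        perfs(i) = mse(NN{i}, t2, y2);   % mean squared error on the independent data
    end
    perfs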
The neural network with the lowest performance value (mean squared error) is the one that generalized best to the second part of the dataset.
Multiple Neural Networks
Another simple way to improve generalization, especially when poor generalization is caused by noisy data or a small dataset, is to train multiple neural networks and average their outputs.
For instance, here 10 neural networks are trained on a small problem and their mean squared errors are compared to the mean squared error of their averaged output.
First, the dataset is loaded and divided into a design and test set.
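The split can be made exactly as in the previous section; a sketch, again assuming the bodyfat_dataset sample data:

    [x, t] = bodyfat_dataset;                            % assumed sample data set
    Q = size(x, 2);
    Q1 = floor(Q * 0.90);                                % design set size (90%)
    ind = randperm(Q);
    x1 = x(:, ind(1:Q1));      t1 = t(:, ind(1:Q1));     % design set
    x2 = x(:, ind(Q1+1:end));  t2 = t(:, ind(Q1+1:end)); % test set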
Then, ten neural networks are trained.
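A sketch of that step, reusing the design set x1/t1 and an illustrative 10-neuron architecture:

    net = feedforwardnet(10);
    numNN = 10;
    nets = cell(1, numNN);
    for i = 1:numNN
        fprintf('Training %d/%d\n', i, numNN);
        nets{i} = train(net, x1, t1);    % each run starts from different initial weights
    end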
Next, each network is tested on the second dataset with both individual performances and the performance for the average output calculated.
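One way this comparison could look, assuming the nets cell array and the x2/t2 test set from the sketches above:

    perfs = zeros(1, numNN);
    y2Total = 0;
    for i = 1:numNN
        neti = nets{i};
        y2 = neti(x2);                   % output of the i-th network on the test inputs
        perfs(i) = mse(neti, t2, y2);    % individual test performance
        y2Total = y2Total + y2;          % accumulate outputs for averaging
    end
    perfs                                % individual mean squared errors
    y2AverageOutput = y2Total / numNN;   % averaged (ensemble) output
    perfAveragedOutputs = mse(nets{1}, t2, y2AverageOutput)
                                         % mse uses its net argument only for performance
                                         % parameters, so any of the trained nets will do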
The mean squared error for the average output is likely to be lower than most, though perhaps not all, of the individual performances. The averaged output is also likely to generalize better to additional new data.
For some very difficult problems, a hundred networks can be trained and the average of their outputs taken for any input. This is especially helpful for a small, noisy dataset in conjunction with the Bayesian Regularization training function trainbr, described below.
Early Stopping
The default method for improving generalization is called early stopping. This technique is automatically provided for all of the supervised network creation functions, including the backpropagation network creation functions such as feedforwardnet.
In this technique the available data is divided into three subsets. The first subset is the training set, which is used for computing the gradient and updating the network weights and biases. The second subset is the validation set. The error on the validation set is monitored during the training process. The validation error normally decreases during the initial phase of training, as does the training set error. However, when the network begins to overfit the data, the error on the validation set typically begins to rise. When the validation error increases for a specified number of iterations (net.trainParam.max_fail), the training is stopped, and the weights and biases at the minimum of the validation error are returned.
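As a rough illustration (again assuming the bodyfat_dataset sample data), the following sketch lets train apply early stopping through the default data division and then plots the training record:

    [x, t] = bodyfat_dataset;          % assumed sample data set
    net = feedforwardnet(10);
    net.trainParam.max_fail = 6;       % consecutive validation failures allowed before stopping
    [net, tr] = train(net, x, t);      % training stops when validation error keeps rising
    plotperform(tr)                    % training, validation, and test error versus epoch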
The test set error is not used during training, but it is used to compare different models. It is also useful to plot the test set error during the training process. If the error in the test set reaches a minimum at a significantly different iteration number than the validation set error, this might indicate a poor division of the data set.
There are four functions provided for dividing data into training, validation and test sets. They are dividerand (the default), divideblock, divideint, and divideind. You can access or change the division function for your network with this property:
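    net.divideFcn                      % query the current data division function
    net.divideFcn = 'divideind';       % for example, switch to index-based division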
Each of these functions takes parameters that customize its behavior. These values are stored and can be changed with the following network property:
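    net.divideParam                    % parameters used by the current division function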
Index Data Division (divideind)
Create a simple test problem. For the full data set, generate a noisy sine wave with 201 input points ranging from -1 to 1 at steps of 0.01:
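    p = -1:0.01:1;                          % 201 input points from -1 to 1
    t = sin(2*pi*p) + 0.1*randn(size(p));   % noisy sine-wave targets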
Divide the data by index so that successive samples are assigned in turn to the training set, validation set, and test set:
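    trainInd = 1:3:201;                % samples 1, 4, 7, ... for training
    valInd   = 2:3:201;                % samples 2, 5, 8, ... for validation
    testInd  = 3:3:201;                % samples 3, 6, 9, ... for testing
    [trainP, valP, testP] = divideind(p, trainInd, valInd, testInd);
    [trainT, valT, testT] = divideind(t, trainInd, valInd, testInd);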

