

After feeding the inputs for a customer into the network, the network calculates
three values. Given all these outputs, how can the department store determine
the right promotion or promotions to offer the customer? Some common
methods used when working with multiple model outputs are:
- Take the department corresponding to the output with the maximum value.
- Take departments corresponding to the outputs with the top three values.
- Take all departments corresponding to the outputs that exceed some threshold value.
- Take all departments corresponding to units whose output is at least some percentage of the maximum value.
All of these possibilities work well and have their strengths and weaknesses
in different situations. There is no one right answer that always works. In practice,
you want to try several of these possibilities on the test set in order to
determine which works best in a particular situation.
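The four methods above are easy to compare side by side. The following Python sketch is only an illustration: the department names, the output values, the threshold, and the percentage are all made-up assumptions, not values from the department store example.

```python
def pick_max(scores):
    """Department with the single highest output."""
    return [max(scores, key=scores.get)]

def pick_top_n(scores, n=3):
    """Departments with the n highest outputs, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def pick_above_threshold(scores, threshold):
    """All departments whose output exceeds a fixed threshold."""
    return [dept for dept, value in scores.items() if value > threshold]

def pick_within_fraction_of_max(scores, fraction):
    """All departments whose output is at least `fraction` of the maximum.

    Assumes the outputs are positive, as with a logistic output layer.
    """
    cutoff = fraction * max(scores.values())
    return [dept for dept, value in scores.items() if value >= cutoff]

# Hypothetical network outputs for one customer.
scores = {"womens_apparel": 0.72, "furniture": 0.31, "electronics": 0.65}

single = pick_max(scores)
top_two = pick_top_n(scores, n=2)
over_half = pick_above_threshold(scores, 0.5)
near_max = pick_within_fraction_of_max(scores, 0.9)
```

Running each function over the test set and comparing the resulting promotions against what customers actually bought is one way to choose among them.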
There are other variations on the topology of feed-forward neural networks.
Sometimes, the input layer is connected directly to the output layer. In this
case, the network has two components. These direct connections behave like a
standard regression (linear or logistic, depending on the activation function in
the output layer), which is useful for building more standard statistical models. The
hidden layer then acts as an adjustment to the statistical model.
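To make the two components concrete, here is a minimal Python sketch of such a network with one output. It assumes a logistic output unit and tanh hidden units; the function name, argument names, and weight values are hypothetical. When the weights from the hidden layer to the output are all zero, the network reduces to an ordinary logistic regression on the inputs.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_w, hidden_b, out_hidden_w, out_direct_w, out_b):
    """Output of a network whose inputs feed both a hidden layer and the output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    z = out_b
    z += sum(w * xi for w, xi in zip(out_direct_w, x))      # regression component
    z += sum(w * h for w, h in zip(out_hidden_w, hidden))   # hidden layer adjustment
    return logistic(z)

x = [0.5, -1.0]                      # made-up inputs
hidden_w = [[0.3, 0.8], [-0.6, 0.1]]
hidden_b = [0.0, 0.2]
direct_w = [2.0, 1.0]

# With no contribution from the hidden layer, this is plain logistic regression.
regression_only = forward(x, hidden_w, hidden_b, [0.0, 0.0], direct_w, 0.1)
# With nonzero hidden-to-output weights, the hidden layer adjusts that result.
adjusted = forward(x, hidden_w, hidden_b, [0.7, -0.4], direct_w, 0.1)
```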

How Does a Neural Network Learn Using Back Propagation?
Training a neural network is the process of setting the best weights on the
edges connecting all the units in the network. The goal is to use the training set
to calculate weights where the output of the network is as close to the desired
output as possible for as many of the examples in the training set as possible.
Although back propagation is no longer the preferred method for adjusting
the weights, it provides insight into how training works and it was the
original method for training feed-forward networks. At the heart of back
propagation are the following three steps:
1. The network gets a training example and, using the existing weights in
the network, it calculates the output or outputs.
2. Back propagation then calculates the error by taking the difference
between the calculated result and the expected (actual) result.
3. The error is fed back through the network and the weights are adjusted
to minimize the error. Hence the name back propagation: the
errors are sent back through the network.
The back propagation algorithm measures the overall error of the network
by comparing the values produced on each training example to the actual
value. It then adjusts the weights of the output layer to reduce, but not eliminate,
the error. However, the algorithm has not finished. It then assigns the
blame to earlier nodes in the network and adjusts the weights connecting those
nodes, further reducing overall error. The specific mechanism for assigning
blame is not important. Suffice it to say that back propagation uses a complicated
mathematical procedure that requires taking partial derivatives of the
activation function.
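For readers who do want to see the mechanism, the three steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation in any particular tool: a network with two inputs, two hidden units, and one output, logistic activation functions, no bias weights, and made-up starting weights and learning rate.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_on_example(x, target, w_hidden, w_out, rate=0.5):
    """One back propagation pass for a single training example."""
    # Step 1: calculate the output using the existing weights.
    hidden = [logistic(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = logistic(sum(w * h for w, h in zip(w_out, hidden)))

    # Step 2: the error is the expected result minus the calculated result.
    error = target - output

    # Step 3: feed the error back. Each delta is the error reaching a unit,
    # scaled by the derivative of the logistic function at that unit.
    delta_out = error * output * (1.0 - output)
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]

    # Nudge each weight a little in the direction that reduces the error.
    new_w_out = [w + rate * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w + rate * d * xi for w, xi in zip(ws, x)]
                    for ws, d in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out, error

w_hidden = [[0.1, -0.2], [0.3, 0.4]]   # made-up starting weights
w_out = [0.2, -0.1]
errors = []
for _ in range(200):
    w_hidden, w_out, error = train_on_example([1.0, 0.5], 1.0, w_hidden, w_out)
    errors.append(abs(error))
```

Showing the same example repeatedly shrinks the error a little on each pass. A real training run cycles through the whole training set rather than a single example.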
Given the error, how does a unit adjust its weights? It estimates whether
changing the weight on each input would increase or decrease the error. The
unit then adjusts each weight to reduce, but not eliminate, the error. The adjustments
for each example in the training set slowly nudge the weights toward
their optimal values. Remember, the goal is to generalize and identify patterns
in the input, not to memorize the training set. Adjusting the weights is like a
leisurely walk instead of a mad-dash sprint. After being shown enough training
examples during enough generations, the weights on the network no longer
change significantly and the error no longer decreases. This is the point where
training stops; the network has learned to recognize patterns in the input.
This technique for adjusting the weights is called the generalized delta rule.
There are two important parameters associated with using the generalized
delta rule. The first is momentum, which governs how readily the weights
inside each unit change the “direction” they are heading in. That is, each
weight remembers if it has been getting bigger or smaller, and momentum tries
to keep it going in the same direction. A network with high momentum
responds slowly to new training examples that want to reverse the weights. If
momentum is low, then the weights are allowed to oscillate more freely.
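A common way to write this kind of update, shown here as a hedged sketch rather than the exact rule used by any package, adds a fraction of the previous change to the new error-driven change. The function name, the learning rate that scales the error-driven step, and the numeric values are assumptions for illustration.

```python
def momentum_update(weight, gradient, previous_change, learning_rate, momentum):
    """Return the new weight and the change that was applied.

    `gradient` is positive when increasing the weight would increase the error.
    """
    change = -learning_rate * gradient + momentum * previous_change
    return weight + change, change

# The weight has been getting bigger (previous change of +0.1), but the
# current training example wants it to get smaller (positive gradient).
low_weight, low_change = momentum_update(
    weight=0.0, gradient=0.5, previous_change=0.1, learning_rate=0.1, momentum=0.0)
high_weight, high_change = momentum_update(
    weight=0.0, gradient=0.5, previous_change=0.1, learning_rate=0.1, momentum=0.9)
```

With no momentum the weight reverses immediately. With a momentum of 0.9 it keeps heading in its old direction, only more slowly.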

Training As Optimization
Although it was the first practical algorithm for training networks, back propagation is
an inefficient way to train networks. The goal of training is to find the set of
weights that minimizes the error on the training and/or validation set. This type
of problem is an optimization problem, and there are several different approaches.
It is worth noting that this is a hard problem. First, there are many weights in
the network, so there are many, many different possibilities of weights to
consider. Consider a network that has 28 weights (say, seven inputs and three
nodes in the hidden layer). Trying every combination of just two values for each
weight requires testing 2^28 combinations of values, or over 250 million
combinations. Trying out all combinations of 10 values for each weight would
be prohibitively expensive.
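The arithmetic is easy to check. This short Python snippet simply counts the combinations for the 28-weight network described above:

```python
weights = 28

two_values_each = 2 ** weights    # 268,435,456 combinations
ten_values_each = 10 ** weights   # a 1 followed by 28 zeros

print(f"{two_values_each:,}")
print(f"{ten_values_each:,}")
```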
A second problem is one of symmetry. In general, there is no single best
set of weights. In fact, with neural networks that have more than one unit in the hidden
layer, there are always multiple optima, because the weights on one hidden
unit could be entirely swapped with the weights on another. This problem of
having multiple optima complicates finding the best solution.
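The symmetry is easy to demonstrate. In this Python sketch (a toy network with tanh hidden units, no bias weights, and made-up weights), swapping the two hidden units, along with their weights into the output, produces exactly the same result:

```python
import math

def network_output(x, hidden_weights, output_weights):
    """Output of a tiny network: tanh hidden units feeding a linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)))
              for ws in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

x = [0.2, -0.7, 1.5]                                  # made-up inputs
hidden_weights = [[0.5, -1.0, 0.3], [-0.4, 0.8, 0.1]]
output_weights = [1.2, -0.6]

original = network_output(x, hidden_weights, output_weights)
# Swap the two hidden units, keeping each unit paired with its output weight.
swapped = network_output(x, hidden_weights[::-1], output_weights[::-1])
```

The two sets of weights are different points in the search space, yet they describe the same model.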
One approach to finding optima is called hill climbing. Start with a random
set of weights. Then, consider taking a single step in each direction by making a
small change in each of the weights. Choose whichever small step does the
best job of reducing the error and repeat the process. This is like finding the
highest point somewhere by only taking steps uphill. In many cases, you end up
on top of a small hill instead of a tall mountain.
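Here is a minimal Python sketch of hill climbing, written in terms of reducing an error rather than gaining height. The error function is a made-up one-weight surface with two valleys, chosen only to show that the starting point determines which optimum is found. It is not a real network error.

```python
def hill_climb(error, weights, step=0.1, max_iterations=1000):
    """Repeatedly take the single small step that reduces the error most."""
    weights = list(weights)
    best = error(weights)
    for _ in range(max_iterations):
        candidates = []
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = list(weights)
                trial[i] += delta
                candidates.append((error(trial), trial))
        trial_error, trial = min(candidates, key=lambda c: c[0])
        if trial_error >= best:
            break                     # no small step improves things
        best, weights = trial_error, trial
    return weights, best

def toy_error(weights):
    """Hypothetical error surface: a deep valley near -1, a shallow one near +1."""
    w = weights[0]
    return (w * w - 1.0) ** 2 + 0.3 * w + 1.0

right_weights, right_error = hill_climb(toy_error, [2.0])
left_weights, left_error = hill_climb(toy_error, [-2.0])
```

Starting at 2.0 the search settles in the shallow valley. Starting at -2.0 it finds the deeper one.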
One variation on hill climbing is to start with big steps and gradually reduce
the step size (the Jolly Green Giant will do a better job of finding the top of
the nearest mountain than an ant). A related algorithm, called simulated
annealing, injects a bit of randomness in the hill climbing. The randomness is
based on physical theories having to do with how crystals form when liquids
cool into solids (the crystalline formation is an example of optimization in the
physical world). Both simulated annealing and hill climbing require many, many
iterations, and these iterations are expensive computationally because they
require running a network on the entire training set and then repeating again
and again for each step.
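A rough Python sketch of simulated annealing follows. It uses one common formulation (accept a worse step with probability exp(-increase / temperature), then cool the temperature), which is an assumption here rather than something specified above. The parameter values and the toy error surface are hypothetical.

```python
import math
import random

def simulated_annealing(error, weights, step=0.5, temperature=1.0,
                        cooling=0.95, iterations=500, seed=0):
    """Random steps that sometimes go the wrong way, less often as things cool."""
    rng = random.Random(seed)
    current = list(weights)
    current_error = error(current)
    best, best_error = list(current), current_error
    for _ in range(iterations):
        trial = [w + rng.uniform(-step, step) for w in current]
        trial_error = error(trial)
        worse_by = trial_error - current_error
        # Always accept an improvement; accept a worse step only by chance.
        if worse_by <= 0 or rng.random() < math.exp(-worse_by / temperature):
            current, current_error = trial, trial_error
            if current_error < best_error:
                best, best_error = list(current), current_error
        temperature *= cooling        # cool down: fewer wrong-way steps
    return best, best_error

def toy_error(weights):
    """Hypothetical error surface with a shallow and a deep valley."""
    w = weights[0]
    return (w * w - 1.0) ** 2 + 0.3 * w + 1.0

start = [2.0]
best_weights, best_error = simulated_annealing(toy_error, start)
```

Each call to the error function here is cheap. With a real network, each call means scoring the entire training set, which is why so many iterations become expensive.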
A better algorithm for training is the conjugate gradient algorithm. This
algorithm tests a few different sets of weights and then guesses where the
optimum is, using some ideas from multidimensional geometry. Each set of
weights is considered to be a single point in a multidimensional space. After
trying several different sets, the algorithm fits a multidimensional parabola to
the points. A parabola is a U-shaped curve that has a single minimum (or
maximum). Conjugate gradient then continues with a new set of weights in this
region. This process still needs to be repeated; however, conjugate gradient
produces better values more quickly than back propagation or the various hill
climbing methods. Conjugate gradient (or some variation of it) is the preferred
method of training neural networks in most data mining tools.
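The parabola idea can be illustrated in one dimension. The Python sketch below is not the conjugate gradient algorithm itself. It only shows, for a single weight, how three tested values and their errors determine a parabola and where that parabola's turning point lies. The function name and the sample points are made up.

```python
def parabola_vertex(p1, p2, p3):
    """Weight value at the turning point of the parabola through three points.

    Each point is a (weight, error) pair. The three points must not lie on a
    straight line, or the denominator below is zero.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    numerator = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    denominator = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    return x2 - 0.5 * numerator / denominator

# Errors measured at three trial weights; here error = (w - 3)^2 + 1.
guess = parabola_vertex((0.0, 10.0), (1.0, 5.0), (2.0, 2.0))
```

Because the made-up error really is a parabola, a single guess lands on its minimum at w = 3. On a real error surface the fit is only approximate, which is why the process has to be repeated.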
The second parameter is the learning rate, which controls how quickly the weights
change. The best approach
for the learning rate is to start big and decrease it slowly as the network is being
trained. Initially, the weights are random, so large oscillations are useful to get
in the vicinity of the best weights. However, as the network gets closer to the
optimal solution, the learning rate should decrease so the network can fine-tune
to the optimal weights.
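One simple way to start big and decrease slowly is to divide the initial rate by a factor that grows with each generation. This schedule, its name, and the numbers below are just one assumed possibility among many:

```python
def decayed_learning_rate(initial_rate, decay, generation):
    """Learning rate that shrinks smoothly as training proceeds."""
    return initial_rate / (1.0 + decay * generation)

# Hypothetical settings: start at 0.5 and decay by 0.1 per generation.
rates = [decayed_learning_rate(0.5, 0.1, g) for g in range(5)]
```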
Researchers have invented hundreds of variations for training neural networks
(see the sidebar “Training As Optimization”). Each of these approaches
has its advantages and disadvantages. In all cases, they are looking for a technique
that trains networks quickly to arrive at an optimal solution. Some
neural network packages offer multiple training methods, allowing users to
experiment with the best solution for their problems.
One of the dangers with any of the training techniques is falling into something
called a local optimum. This happens when the network produces okay
results for the training set and adjusting the weights no longer improves the
performance of the network. However, there is some other combination of
weights, significantly different from those in the network, that yields a
much better solution. This is analogous to trying to climb to the top of a mountain
by choosing the steepest path at every turn and finding that you have only
climbed to the top of a nearby hill. There is a tension between finding the local
best solution and the global best solution. Controlling the learning rate and
momentum helps to find the best solution.

Heuristics for Using Feed-Forward, Back Propagation Networks
Even with sophisticated neural network packages, getting the best results
from a neural network takes some effort. This section covers some heuristics
for setting up a network to obtain good results.
Probably the biggest decision is the number of units in the hidden layer. The
more units, the more patterns the network can recognize. This would argue for
a very large hidden layer. However, there is a drawback. The network might
end up memorizing the training set instead of generalizing from it. In this case,
more is not better. Fortunately, you can detect when a network is overtrained. If
the network performs very well on the training set, but does much worse on the
validation set, then this is an indication that it has memorized the training set.
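The check itself is a simple comparison. In this Python sketch, the function name and the size of the gap that counts as "much worse" are assumptions; an appropriate tolerance depends on the problem and on how the error is measured.

```python
def looks_overtrained(training_error, validation_error, tolerance=0.05):
    """True when the validation error is much worse than the training error."""
    return validation_error - training_error > tolerance

# Hypothetical error rates for two networks.
memorized = looks_overtrained(training_error=0.02, validation_error=0.20)
generalized = looks_overtrained(training_error=0.10, validation_error=0.12)
```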
How large should the hidden layer be? The real answer is that no one
knows. It depends on the data, the patterns being detected, and the type of network.
Since overfitting is a major concern with networks using customer data,
we generally do not use hidden layers larger than the number of inputs. A
good place to start for many problems is to experiment with one, two, and
three nodes in the hidden layer. This is feasible, especially since training neural
networks now takes seconds or minutes, instead of hours. If adding more
nodes improves the performance of the network, then larger may be better.
When the network is overtraining, reduce the size of the layer. If it is not sufficiently
accurate, increase its size. When using a network for classification,
however, it can be useful to start with one hidden node for each class.
Another decision is the size of the training set. The training set must be sufficiently
large to cover the ranges of inputs available for each feature. In addition,
you want several training examples for each weight in the network. For a network
with s input units, h hidden units, and 1 output, there are h * (s + 1) + h + 1
weights in the network (each hidden layer node has a weight for each connection
to the input layer, an additional weight for the bias, and then a connection
to the output layer and its bias). For instance, if there are 15 input features and
10 units in the hidden layer, then there are 171 weights in the network.
There should be at least 30 examples for each weight, but a better minimum is
100. For this example, the training set should have at least 17,100 rows.
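The formula and the rule of thumb can be wrapped in two small Python helpers. The function names are our own; the arithmetic is exactly the calculation above.

```python
def count_weights(inputs, hidden):
    """Weights in a network with one hidden layer and a single output.

    Each hidden unit has one weight per input plus a bias; the output unit
    has one weight per hidden unit plus its own bias.
    """
    return hidden * (inputs + 1) + hidden + 1

def minimum_training_rows(inputs, hidden, examples_per_weight=100):
    """Rule-of-thumb training set size: a fixed number of examples per weight."""
    return count_weights(inputs, hidden) * examples_per_weight

weights = count_weights(inputs=15, hidden=10)                        # 171
rows_preferred = minimum_training_rows(15, 10)                       # 17,100
rows_bare_minimum = minimum_training_rows(15, 10, examples_per_weight=30)
```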
Finally, the learning rate and momentum parameters are very important for
getting good results out of a network using the back propagation training
algorithm (although it is better to use conjugate gradient or a similar approach). Initially,
the learning rate should be set high to make large adjustments to the weights.
As the training proceeds, the learning rate should decrease in order to fine-tune
the network. The momentum parameter allows the network to move
toward a solution more rapidly, preventing oscillation around less useful
weights.

Choosing the Training Set
The training set consists of records whose prediction or classification values
are already known. Choosing a good training set is critical for all data mining
modeling. A poor training set dooms the network, regardless of any other
work that goes into creating it. Fortunately, there are only a few things to consider
in choosing a good one.

Coverage of Values for All Features
The most important of these considerations is that the training set needs to
cover the full range of values for all features that the network might encounter,
including the output. In the real estate appraisal example, this means including
inexpensive houses and expensive houses, big houses and little houses,
and houses with and without garages. In general, it is a good idea to have several
examples in the training set for each value of a categorical feature and for
values throughout the ranges of ordered discrete and continuous features.

This is true regardless of whether the features are actually used as inputs
into the network. For instance, lot size might not be chosen as an input variable
in the network. However, the training set should still have examples from
all different lot sizes. A network trained on smaller lot sizes (some of which
might be low priced and some high priced) is probably not going to do a good
job on mansions.
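One rough way to look for gaps in coverage is to compare the training set with the records the network will be asked to score. The Python sketch below is only an illustration; the function, the field names, and the sample records are hypothetical.

```python
def coverage_gaps(training_rows, scoring_rows, numeric=(), categorical=()):
    """Values in the records to be scored that the training set never covers."""
    gaps = {}
    for name in numeric:
        low = min(row[name] for row in training_rows)
        high = max(row[name] for row in training_rows)
        outside = [row[name] for row in scoring_rows
                   if not low <= row[name] <= high]
        if outside:
            gaps[name] = outside
    for name in categorical:
        seen = {row[name] for row in training_rows}
        unseen = sorted({row[name] for row in scoring_rows} - seen)
        if unseen:
            gaps[name] = unseen
    return gaps

# Made-up houses: the training set has no large houses or detached garages.
training = [{"sqft": 1200, "garage": "none"},
            {"sqft": 2500, "garage": "attached"}]
to_score = [{"sqft": 1800, "garage": "attached"},
            {"sqft": 6000, "garage": "detached"}]

gaps = coverage_gaps(training, to_score, numeric=["sqft"], categorical=["garage"])
```

This checks only the extremes of each numeric feature. Having examples throughout the range, as recommended above, takes a closer look at the distribution.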

Number of Features
The number of input features affects neural networks in two ways. First, the
more features used as inputs into the network, the larger the network needs to
be, increasing the risk of overfitting and increasing the required size of the training set.
Second, the more features, the longer it takes the network to converge to a set of
weights. And, with too many features, the weights are less likely to be optimal.
This variable selection problem is a common problem for statisticians. In
practice, we find that decision trees (discussed in Chapter 6) provide a good
method for choosing the best variables. Figure 7.8 shows a nice feature of SAS
Enterprise Miner. By connecting a neural network node to a decision tree
node, the neural network only uses the variables chosen by the decision tree.
An alternative method is to use intuition. Start with a handful of variables
that make sense. Experiment by trying other variables to see which ones
improve the model. In many cases, it is useful to calculate new variables that
represent particular aspects of the business problem. In the real estate example,
for instance, we might subtract the square footage of the house from the
lot size to get an idea of how large the yard is.
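As a small illustration, the yard-size variable could be calculated like this in Python. The field names and the sample records are made up.

```python
def yard_size(lot_square_feet, house_square_feet):
    """Derived variable: lot size minus the square footage of the house."""
    return lot_square_feet - house_square_feet

# Hypothetical houses; add the derived variable to each record.
houses = [{"lot": 8000, "house": 2200},
          {"lot": 5000, "house": 3100}]
for house in houses:
    house["yard"] = yard_size(house["lot"], house["house"])
```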

