Figure 7.4 The unit of an artificial neural network is modeled on the biological neuron.
The output of the unit is a nonlinear combination of its inputs.
The second part of the activation function is the transfer function, which gets
its name from the fact that it transfers the value of the combination function to
the output of the unit. Figure 7.5 compares three typical transfer functions: the
sigmoid (logistic), linear, and hyperbolic tangent functions. The specific values
that the transfer function takes on are not as important as the general form of
the function. From our perspective, the linear transfer function is the least
interesting. A feed-forward neural network consisting only of units with linear
transfer functions and a weighted sum combination function is really just doing
a linear regression. Sigmoid functions are S-shaped functions, of which the two
most common for neural networks are the logistic and the hyperbolic tangent.
The major difference between them is the range of their outputs: between 0 and
1 for the logistic and between –1 and 1 for the hyperbolic tangent.
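The difference in output ranges is easy to see by evaluating both functions at a few points. Here is a small Python sketch (not part of the original text; `math.tanh` is the standard library's hyperbolic tangent):

```python
import math

def logistic(x):
    # S-shaped curve whose outputs always fall between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    # logistic stays inside (0, 1); tanh stays inside (-1, 1)
    print(f"x={x:+.1f}  logistic={logistic(x):.3f}  tanh={math.tanh(x):+.3f}")
```

Both curves pass through their midpoint at x = 0 (0.5 for the logistic, 0 for the hyperbolic tangent) and flatten out toward their bounds as x grows in either direction.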
The logistic and hyperbolic tangent transfer functions behave in a similar
way. Even though they are not linear, their behavior is appealing to
statisticians. When the weighted sum of all the inputs is near 0, then these
functions are a close approximation of a linear function. Statisticians
appreciate linear systems, and almost-linear systems are almost as well
appreciated. As the
224 Chapter 7
magnitude of the weighted sum gets larger, these transfer functions gradually
saturate (to 0 and 1 in the case of the logistic; to –1 and 1 in the case of the
hyperbolic tangent). This behavior corresponds to a gradual movement from a
linear model of the input to a nonlinear model. In short, neural networks have
the ability to do a good job of modeling on three types of problems: linear
problems, near-linear problems, and nonlinear problems. There is also a
relationship between the activation function and the range of input values, as
discussed in the sidebar, "Sigmoid Functions and Ranges for Input Values."
A network can contain units with different transfer functions, a subject
we'll return to later when discussing network topology. Sophisticated tools
sometimes allow experimentation with other combination and transfer
functions. Other functions have significantly different behavior from the standard
units. It may be fun and even helpful to play with different types of activation
functions. If you do not want to bother, though, you can have confidence in the
standard functions that have proven successful for many neural network
applications.
Figure 7.5 Three common transfer functions are the sigmoid, linear, and hyperbolic tangent
functions.
Artificial Neural Networks 225
SIGMOID FUNCTIONS AND RANGES FOR INPUT VALUES
The sigmoid activation functions are S-shaped curves that fall within bounds.
For instance, the logistic function produces values between 0 and 1, and the
hyperbolic tangent produces values between –1 and 1, for all possible outputs
of the summation function. The formulas for these functions are:

logistic(x) = 1 / (1 + e^(-x))

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
When used in a neural network, the x is the result of the combination
function, typically the weighted sum of the inputs into the unit.
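These two formulas can be written directly in Python. The sketch below (ours, not from any particular tool) also checks a useful fact: the hyperbolic tangent is just a logistic curve rescaled from the (0, 1) range to the (–1, 1) range.

```python
import math

def logistic(x):
    # logistic(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    ex, emx = math.exp(x), math.exp(-x)
    return (ex - emx) / (ex + emx)

# tanh is a rescaled logistic: tanh(x) = 2 * logistic(2x) - 1
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(tanh(x) - (2.0 * logistic(2.0 * x) - 1.0)) < 1e-12
```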
Since these functions are defined for all values of x, why do we recommend
that the inputs to a network be in a small range, typically from –1 to 1? The
reason has to do with how these functions behave near 0. In this range, they
behave in an almost linear way. That is, small changes in x result in small
changes in the output; changing x by half as much results in about half the effect
on the output. The relationship is not exact, but it is a close approximation.
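This quasi-linear behavior near 0, and the saturation far from 0, are easy to check numerically. A small sketch using the logistic function:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Near zero, halving the step in x roughly halves the change in output.
step = logistic(0.2) - logistic(0.0)        # change for a step of 0.2
half_step = logistic(0.1) - logistic(0.0)   # change for a step of 0.1
print(step / half_step)                     # close to 2, so nearly linear

# Far from zero, the function saturates: the same step barely matters.
print(logistic(5.2) - logistic(5.0))        # much smaller than `step`
```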
For training purposes, it is a good idea to start out in this quasi-linear area.
As the neural network trains, nodes may find linear relationships in the data.
These nodes adjust their weights so the resulting value falls in this linear range.
Other nodes may find nonlinear relationships. Their adjusted weights are likely
to fall in a larger range.
Requiring that all inputs be in the same range also prevents one set of
inputs, such as the price of a house (a big number in the tens of thousands),
from dominating other inputs, such as the number of bedrooms. After all, the
combination function is a weighted sum of the inputs, and when some values
are very large, they will dominate the weighted sum. When x is large, small
adjustments to the weights on the inputs have almost no effect on the output
of the unit, making it difficult to train. That is, the sigmoid function can take
advantage of the difference between one and two bedrooms, but a house that
costs $50,000 and one that costs $1,000,000 would be hard for it to distinguish,
and it can take many generations of training the network for the weights
associated with this feature to adjust. Keeping the inputs relatively small
enables adjustments to the weights to have a bigger impact. This aid to training
is the strongest reason for insisting that inputs stay in a small range.
In fact, even when a feature naturally falls into a range smaller than –1 to 1,
such as 0.5 to 0.75, it is desirable to scale the feature so the input to the
network uses the entire range from –1 to 1. Using the full range of values from
–1 to 1 ensures the best results.
Although we recommend that inputs be in the range from –1 to 1, this
should be taken as a guideline, not a strict rule. For instance, standardizing
variables (subtracting the mean and dividing by the standard deviation) is a
common transformation on variables. This results in small enough values to be
useful for neural networks.
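Both transformations can be sketched in a few lines of Python (the price values below are made up for illustration):

```python
def scale_to_range(values, lo=-1.0, hi=1.0):
    # Map values linearly so the smallest becomes lo and the largest hi.
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [lo + (v - vmin) * (hi - lo) / span for v in values]

def standardize(values):
    # Subtract the mean and divide by the standard deviation.
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

prices = [50_000, 120_000, 176_000, 1_000_000]
print(scale_to_range(prices))   # spread across the full -1 to 1 range
print(standardize(prices))      # small values centered on 0
```

Note that min-max scaling guarantees the full –1 to 1 range, while standardization only guarantees small values centered on 0; either keeps big-dollar fields from swamping fields like the number of bedrooms.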
Feed-Forward Neural Networks
A feed-forward neural network calculates output values from input values, as
shown in Figure 7.6. The topology, or structure, of this network is typical of
networks used for prediction and classification. The units are organized into
three layers. The layer on the left is connected to the inputs and called the input
layer. Each unit in the input layer is connected to exactly one source field,
which has typically been mapped to the range –1 to 1. In this example, the
input layer does not actually do any work. Each input layer unit copies
its input value to its output. If this is the case, why do we even bother to
mention it here? It is an important part of the vocabulary of neural networks. In
practical terms, the input layer represents the process for mapping values into
a reasonable range. For this reason alone, it is worth including, because it is
a reminder of a very important aspect of using neural networks successfully.
[Figure 7.6 shows the network diagram: input fields (Num_Apartments,
Year_Built, Plumbing_Fixtures, Heating_Type, Basement_Garage,
Attached_Garage, Living_Area, Deck_Area, Porch_Area, Recroom_Area,
Basement_Area), each mapped into a small numeric range, feed weighted
connections through a hidden layer to a single output unit, which produces
the appraised value of $176,228.]
Figure 7.6 The real estate training example shown here provides the input into a
feed-forward neural network and illustrates that a network is filled with seemingly
meaningless weights.
The next layer is called the hidden layer because it is connected neither to the
inputs nor to the output of the network. Each unit in the hidden layer is
typically fully connected to all the units in the input layer. Since this network
contains standard units, the units in the hidden layer calculate their output by
multiplying the value of each input by its corresponding weight, adding these
up, and applying the transfer function. A neural network can have any
number of hidden layers, but in general, one hidden layer is sufficient. The wider
the layer (that is, the more units it contains) the greater the capacity of the
network to recognize patterns. This greater capacity has a drawback, though,
because the neural network can memorize idiosyncratic patterns in the training
examples. We want the network to generalize on the training set, not to
memorize it. To achieve this, the hidden layer should not be too wide.
Notice that the units in Figure 7.6 each have an additional input coming
down from the top. This is the constant input, sometimes called a bias, and is
always set to 1. Like other inputs, it has a weight and is included in the
combination function. The bias acts as a global offset that helps the network
better fit patterns. The training phase adjusts the weights on constant
inputs just as it does on the other weights in the network.
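The calculation a standard unit performs (a weighted sum of the inputs plus the bias weight, passed through a transfer function) can be sketched in Python. The weights below are made up for illustration; a real network learns them during training. This sketch uses the hyperbolic tangent as the transfer function:

```python
import math

def unit(inputs, weights, bias_weight):
    # Combination function: weighted sum of the inputs plus the
    # constant (bias) input, which is always 1.
    total = sum(x * w for x, w in zip(inputs, weights)) + 1.0 * bias_weight
    # Transfer function: hyperbolic tangent, squashing to (-1, 1).
    return math.tanh(total)

def feed_forward(inputs, hidden_layer, output_unit):
    # The input layer just passes values through; the hidden units and
    # the single output unit are standard units as described above.
    hidden = [unit(inputs, w, b) for (w, b) in hidden_layer]
    out_w, out_b = output_unit
    return unit(hidden, out_w, out_b)

# Illustrative weights only; training would adjust all of these.
hidden_layer = [([0.5, -0.2, 0.1], 0.3),
                ([-0.4, 0.6, 0.2], -0.1)]
output_unit = ([0.7, -0.3], 0.05)

print(feed_forward([0.0, 0.5328, 0.3333], hidden_layer, output_unit))
```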
The last unit on the right is the output layer because it is connected to the
output of the neural network. It is fully connected to all the units in the hidden
layer. Most of the time, the neural network is being used to calculate a single
value, so there is only one unit in the output layer, and its value must be
mapped back to the original range to be understood. For the network in Figure
7.6, we have to convert 0.49815 back into a value between $103,000 and $250,000.
It corresponds to $176,228, which is quite close to the actual value of $171,000. In
some implementations, the output layer uses a simple linear transfer function,
so the output is a weighted linear combination of inputs. This eliminates the
need to map the outputs.
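The conversion back to dollars is a simple linear rescaling. Assuming the price range was mapped onto the 0 to 1 output range of the logistic function (consistent with the figures quoted above), a sketch:

```python
def unscale(output, lo, hi):
    # Map a transfer-function output in [0, 1] back to the original range.
    return lo + output * (hi - lo)

price = unscale(0.49815, 103_000, 250_000)
print(round(price))  # 176228
```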
It is possible for the output layer to have more than one unit. For instance, a
department store chain wants to predict the likelihood that customers will be
purchasing products from various departments, such as women's apparel,
furniture, and entertainment. The stores want to use this information to plan
promotions and direct target mailings.
To make this prediction, they might set up the neural network shown in
Figure 7.7. This network has three outputs, one for each department. The
outputs are a propensity for the customer described in the inputs to make his or
her next purchase from the associated department.
[Figure 7.7 shows the network diagram: inputs such as last purchase, age,
gender, avg balance, and so on feed into three output units labeled "propensity
to purchase women's apparel," "propensity to purchase furniture," and
"propensity to purchase entertainment."]
Figure 7.7 This network has more than one output and is used to predict the
department where department store customers will make their next purchase.