

Figure 7.8 SAS Enterprise Miner provides a simple mechanism for choosing variables for
a neural network: just connect a neural network node to a decision tree node.
234 Chapter 7

Size of Training Set
The more features there are in the network, the more training examples are
needed to get good coverage of patterns in the data. Unfortunately, there
is no simple rule to express a relationship between the number of features and
the size of the training set. However, typically a minimum of a few hundred
examples is needed to support each feature with adequate coverage; having
several thousand is not unreasonable. The authors have worked with neural
networks that have only six or seven inputs, but whose training set contained
hundreds of thousands of rows.
When the training set is not sufficiently large, neural networks tend to overfit
the data. Overfitting is guaranteed to happen when there are fewer training
examples than there are weights in the network. This poses a problem, because
the network will work very, very well on the training set, but it will fail
spectacularly on unseen data.
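
The comparison between training examples and weights can be checked before any training starts. The following is a minimal sketch, assuming a fully connected network with a single hidden layer and one bias weight per hidden and output unit; the function names are hypothetical and the layer sizes are made up for illustration.

```python
def count_weights(n_inputs, n_hidden, n_outputs):
    """Count the weights in a fully connected network with one hidden layer.

    Assumption: every hidden and output unit also has one bias weight.
    """
    hidden_weights = (n_inputs + 1) * n_hidden
    output_weights = (n_hidden + 1) * n_outputs
    return hidden_weights + output_weights


def too_few_examples(n_examples, n_inputs, n_hidden, n_outputs):
    """Return True when the training set has fewer rows than the network has weights."""
    return n_examples < count_weights(n_inputs, n_hidden, n_outputs)


# Illustrative sizes only: 7 inputs, 5 hidden units, 1 output.
print(count_weights(7, 5, 1))         # 46
print(too_few_examples(40, 7, 5, 1))  # True
```

Passing this check does not mean the training set is large enough; it only flags the case where the text says overfitting is certain.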
Of course, the downside of a really large training set is that it takes the neural
network longer to train. In a given amount of time, you may get better models
by using fewer input features and a smaller training set and experimenting
with different combinations of features and network topologies rather than
using the largest possible training set that leaves no time for experimentation.

Number of Outputs
In most training examples, there are typically many more inputs going in than
there are outputs going out, so good coverage of the inputs results in good
coverage of the outputs. However, it is very important that there be many
examples for all possible output values from the network. In addition, the
number of training examples for each possible output should be about the
same. This can be critical when deciding what to use as the training set.
For instance, if the neural network is going to be used to detect rare but
important events (failures in diesel engines, fraudulent use of a credit
card, or responses to an offer for a home equity line of credit), then the
training set must have a sufficient number of examples of these rare events. A
random sample of available data may not be sufficient, since common examples
will swamp the rare examples. To get around this, the training set needs
to be balanced by oversampling the rare cases. For this type of problem, a
training set consisting of 10,000 “good” examples and 10,000 “bad” examples
gives better results than a randomly selected training set of 100,000 good
examples and 1,000 bad examples. After all, using the randomly sampled
training set, the neural network would probably assign “good” regardless of
the input and be right 99 percent of the time. This is an exception to the
general rule that a larger training set is better.
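
One way to build such a balanced training set is sketched below. It is an illustration under stated assumptions, not a prescribed procedure: it keeps every rare example and draws an equally sized random sample of the common ones. The function name, the `is_rare` predicate, and the row layout are hypothetical.

```python
import random


def balance_training_set(rows, is_rare, seed=0):
    """Keep every rare row and an equal-sized random sample of the common rows.

    `is_rare` is a caller-supplied function that returns True for rare rows.
    """
    rng = random.Random(seed)
    rare = [row for row in rows if is_rare(row)]
    common = [row for row in rows if not is_rare(row)]
    # Never ask for more common rows than exist.
    sampled_common = rng.sample(common, min(len(common), len(rare)))
    balanced = rare + sampled_common
    rng.shuffle(balanced)  # avoid presenting all rare rows first
    return balanced


# Made-up data with the same 100-to-1 ratio as the example in the text.
rows = [("good", i) for i in range(1000)] + [("bad", i) for i in range(10)]
balanced = balance_training_set(rows, lambda row: row[0] == "bad")
print(len(balanced))  # 20
```

Because the balanced set no longer reflects the true frequency of the rare event, scores from a network trained on it should not be read as probabilities without adjustment.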
Artificial Neural Networks 235

TIP The training set for a neural network has to be large enough to cover all
the values taken on by all the features. You want to have at least a dozen, if not
hundreds or thousands, of examples for each input feature. For the outputs of
the network, you want to be sure that there is an even distribution of values.
This is a case where fewer examples in the training set can actually improve
results, by not swamping the network with “good” examples when you want to
train it to recognize “bad” examples. The size of the training set is also
influenced by the power of the machine running the model. A neural network
needs more time to train when the training set is very large. That time could
perhaps better be used to experiment with different features, input mapping
functions, and parameters of the network.

Preparing the Data
Preparing the input data is often the most complicated part of using a neural
network. Part of the complication is the normal problem of choosing the right
data and the right examples for a data mining endeavor. Another part is
mapping each field to an appropriate range; remember, using a limited range
of inputs helps networks better recognize patterns. Some neural network
packages facilitate this translation using friendly, graphical interfaces. Since
the format of the data going into the network has a big effect on how well
the network performs, this section reviews the common ways to map data.
Chapter 17 contains additional material on data preparation.

Features with Continuous Values
Some features take on continuous values, generally ranging between known
minimum and maximum bounds. Examples of such features are:
- Dollar amounts (sales price, monthly balance, weekly sales, income, and so on)
- Averages (average monthly balance, average sales volume, and so on)
- Ratios (debt-to-income, price-to-earnings, and so on)
- Physical measurements (area of living space, temperature, and so on)

The real estate appraisal example showed a good way to handle continuous
features. When these features fall into a predefined range between a minimum
value and a maximum value, the values can be scaled to be in a reasonable
range, using a calculation such as:
mapped_value = 2 * (original_value - min) / (max - min + 1) - 1

This transformation (subtract the min, divide by the range, double and
subtract 1) produces a value in the range from -1 to 1 that follows the same
distribution as the original value. This works well in many cases, but there are
some additional considerations.
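
The scaling calculation can be written as a small function. This is a minimal sketch with a hypothetical function name; it uses the range of living areas from the real estate appraisal example. Note that the + 1 in the denominator means the maximum maps to a value just below 1 rather than exactly 1.

```python
def scale_to_unit_range(value, min_value, max_value):
    """Map value from [min_value, max_value] into roughly [-1, 1]."""
    return 2 * (value - min_value) / (max_value - min_value + 1) - 1


# Living areas in the training set ranged from 714 to 4,185 square feet.
print(scale_to_unit_range(714, 714, 4185))   # -1.0
print(scale_to_unit_range(4185, 714, 4185))  # slightly less than 1
```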
The first is that the range a variable takes in the training set may be different
from the range in the data being scored. Of course, we try to avoid this situation
by ensuring that all variable values are represented in the training set.
However, this ideal situation is not always possible. Someone could build a
new house in the neighborhood with 5,000 square feet of living space, perhaps
rendering the real estate appraisal network useless. There are several ways to
approach this:
- Plan for a larger range. The range of living areas in the training set was
  from 714 square feet to 4,185 square feet. Instead of using these values
  for the minimum and maximum value of the range, allow for some
  growth, using, say, 500 and 5,000 instead.
- Reject out-of-range values. Once we start extrapolating beyond the
  ranges of values in the training set, we have much less confidence in the
  results. Only use the network for predefined ranges of input values.
  This is particularly important when using a network for controlling a
  manufacturing process; wildly incorrect results can lead to disasters.
- Peg values lower than the minimum to the minimum and higher than
  the maximum to the maximum. So, houses larger than 4,185 square feet
  would all be treated the same. This is a poor fit for the appraisal
  example, because we suspect that the price of a house is highly
  correlated with the living area: a house with 20 percent more living
  area than the maximum house size (all other things being equal) would
  cost about 20 percent more. In other situations, pegging the values can
  work quite well.
- Map the minimum value to -0.9 and the maximum value to 0.9 instead
  of -1 and 1.
- Or, most likely, don't worry about it. It is important that most values are
  near 0; a few exceptions probably will not have a significant impact.
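
The first four approaches above can be sketched as small helpers. The function names and the `limit` parameter are hypothetical, introduced only for illustration; the square-footage figures come from the appraisal example.

```python
def peg(value, min_value, max_value):
    """Peg values below the minimum to the minimum and above the maximum to the maximum."""
    return max(min_value, min(value, max_value))


def reject_out_of_range(value, min_value, max_value):
    """Refuse to score a value that falls outside the training range."""
    if not min_value <= value <= max_value:
        raise ValueError(f"{value} is outside [{min_value}, {max_value}]")
    return value


def scale(value, min_value, max_value, limit=1.0):
    """Scale into roughly [-limit, limit]; limit=0.9 leaves headroom at both ends."""
    return limit * (2 * (value - min_value) / (max_value - min_value + 1) - 1)


new_house = 5000  # square feet, larger than anything in the training set

print(scale(new_house, 500, 5000))          # planned larger range: still below 1
print(peg(new_house, 714, 4185))            # 4185
print(scale(714, 714, 4185, limit=0.9))     # -0.9
```

Which helper to use depends on how costly a wrong answer is; rejecting is the cautious choice, while pegging and the wider range keep the network usable at the price of less reliable results near the edges.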
Figure 7.9 illustrates another problem that arises with continuous features:
skewed distribution of values. In this data, almost all incomes are under
$100,000, but the range goes from $10,000 to $1,000,000. Scaling the values as
suggested maps a $30,000 income to -0.96 and a $65,000 income to -0.89,
hardly any difference at all, although this income differential might be very
significant for a marketing application. On the other hand, $250,000 and
$800,000 become -0.51 and +0.60, respectively, a very large difference,
though this income differential might be much less significant. The incomes
are highly skewed toward the low end, and this can make it difficult for the
neural network to take advantage of the income field. Skewed distributions
can prevent a network from effectively using an important field. They
affect neural networks but not decision trees because neural networks
actually use the values for calculations; decision trees only use the
ordering (rank) of the values.
There are several ways to resolve this. The most common is to split a feature
like income into ranges. This is called discretizing or binning the field. Figure 7.9
illustrates breaking the incomes into 10 equal-width ranges, but this is not
useful at all. Virtually all the values fall in the first two ranges. Equal-sized
quintiles provide a better choice of ranges:
$10,000-$17,999      very low    (-1.0)
$18,000-$31,999      low         (-0.5)
$32,000-$63,999      middle      (0.0)
$64,000-$99,999      high        (+0.5)
$100,000 and above   very high   (+1.0)
Information is being lost by this transformation. A household with an
income of $65,000 now looks exactly like a household with an income of
$98,000. On the other hand, the sheer magnitude of the larger values does not
confuse the neural network.
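
The quintile mapping can be expressed as a lookup on the bin boundaries. This is a sketch with hypothetical names; the boundaries and mapped values are the ones listed above, and it assumes that incomes below $10,000 should also fall in the lowest bin.

```python
from bisect import bisect_right

# Lower bounds of the second through fifth bins, in dollars.
BIN_LOWER_BOUNDS = [18_000, 32_000, 64_000, 100_000]
BIN_VALUES = [-1.0, -0.5, 0.0, 0.5, 1.0]


def bin_income(income):
    """Return the mapped value for the income bin that contains `income`."""
    return BIN_VALUES[bisect_right(BIN_LOWER_BOUNDS, income)]


print(bin_income(65_000))   # 0.5
print(bin_income(98_000))   # 0.5, indistinguishable from $65,000
print(bin_income(800_000))  # 1.0
```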
There are other methods as well. For instance, taking the logarithm is a good
way of handling values that have wide ranges. Another approach is to standardize
the variable by subtracting the mean and dividing by the standard
deviation. The standardized value will very often be between -2 and +2
(that is, for most variables, almost all values fall within two standard deviations
of the mean). Standardizing variables is often a good approach for neural
networks. However, it must be used with care, since big outliers make the
standard deviation big. So, when there are big outliers, many of the standardized
values will fall into a very small range, making it hard for the network to
distinguish them from each other.
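
Both transformations take only a few lines. The sketch below uses hypothetical function names and made-up incomes, and it assumes the population standard deviation and a base-10 logarithm; other choices would work as well. The last lines illustrate the outlier problem: one very large value squeezes the standardized values of the others into a narrow band.

```python
import math
import statistics


def log_transform(values):
    """Take the base-10 logarithm of each (strictly positive) value."""
    return [math.log10(v) for v in values]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up incomes: four ordinary values and one big outlier.
incomes = [30_000, 40_000, 50_000, 60_000, 10_000_000]

z = standardize(incomes)
print([round(v, 3) for v in z])  # the first four values are nearly identical

logs = log_transform(incomes)
print([round(v, 2) for v in logs])  # the first four remain clearly distinct
```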


[Histogram: number of people (vertical axis) by household income (horizontal axis, $0 to $1,000,000), divided into 10 equal-width regions of $100,000 each.]

Figure 7.9 Household income provides an example of a skewed distribution. Almost all
the values are in the first 10 percent of the range (income of less than $100,000).

Features with Ordered, Discrete (Integer) Values
Continuous features can be binned into ordered, discrete values. Other examples
of features with ordered values include:
- Counts (number of children, number of items purchased, months since sale, and so on)
- Ordered categories (low, medium, high)

Like the continuous features, these have a maximum and minimum value.
For instance, age usually ranges from 0 to about 100, but the exact range may
depend on the data used. The number of children may go from 0 to 4, with anything
over 4 considered to be 4. Preparing such fields is simple. First, count the
number of different values and assign each a proportional fraction in some
range, say from 0 to 1. For instance, if there are five distinct values, then these get
mapped to 0, 0.25, 0.50, 0.75, and 1, as shown in Figure 7.10. Notice that mapping
the values onto the unit interval like this preserves the ordering; this is an important
aspect of this method and means that information is not being lost.
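
The proportional mapping can be sketched as follows. The function names are hypothetical; the cap at four children follows the example in the text.

```python
def map_ordered_values(ordered_values):
    """Assign each distinct value, given in order, an evenly spaced point in [0, 1]."""
    n = len(ordered_values)
    if n < 2:
        raise ValueError("need at least two distinct values")
    return {value: i / (n - 1) for i, value in enumerate(ordered_values)}


CHILDREN_MAP = map_ordered_values([0, 1, 2, 3, 4])


def encode_children(count):
    """Treat anything over 4 children as 4, then look up the mapped value."""
    return CHILDREN_MAP[min(count, 4)]


print(CHILDREN_MAP)        # {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}
print(encode_children(7))  # 1.0
```

The same function handles ordered categories: passing `["low", "medium", "high"]` yields 0.0, 0.5, and 1.0.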
It is also possible to break a range into unequal parts. One example is called
thermometer codes:
0 -> 0000 = 0/16 = 0.0000
1 -> 1000 = 8/16 = 0.5000
2 -> 1100 = 12/16 = 0.7500
3 -> 1110 = 14/16 = 0.8750
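
The thermometer codes listed above can be generated rather than typed by hand. This sketch uses hypothetical function names and assumes a 4-bit code, matching the sixteenths in the table: the value n becomes n ones followed by zeros, and the bit string is then read as a binary fraction.

```python
def thermometer_code(value, width=4):
    """Return `value` ones followed by zeros, e.g. 2 -> '1100' for width 4."""
    if not 0 <= value <= width:
        raise ValueError(f"value must be between 0 and {width}")
    return "1" * value + "0" * (width - value)


def thermometer_value(value, width=4):
    """Read the thermometer code as a binary fraction of 2**width."""
    return int(thermometer_code(value, width), 2) / 2 ** width


for n in range(4):
    print(n, thermometer_code(n), thermometer_value(n))
```

The gaps between successive values shrink (0.5, 0.25, 0.125), so the difference between 0 and 1 counts for more than the difference between 2 and 3.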

[Number line from -1.0 to 1.0 labeled "Number of Children", with points marked for no children, 1 child, 2 children, 3 children, and 4 or more children.]

Figure 7.10 When codes have an inherent order, they can be mapped onto the unit interval.