The name arises because the sequence of 1s starts on one side and rises to
some value, like the mercury in a thermometer; this sequence is then
interpreted as a decimal written in binary notation. Thermometer codes are good
for things like academic grades and bond ratings, where differences at one
end of the scale are less significant than differences at the other end.
For instance, for many marketing applications, having no children is quite
different from having one child. However, the difference between three
children and four is rather negligible. Using a thermometer code, the number of
children variable might be mapped as follows: 0 (for 0 children), 0.5 (for one
child), 0.75 (for two children), 0.875 (for three children), and so on. For
categorical variables, it is often easier to keep mapped values in the range from 0
to 1. This is reasonable. However, to extend the range so that it runs from -1 to 1,
double the value and subtract 1.
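The mapping above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the formula assumes the code for n children is n ones after the binary point, which reproduces the values 0, 0.5, 0.75, and 0.875 given in the text.

```python
def thermometer_code(count: int) -> float:
    """Value of `count` ones written after the binary point (0.111... in binary)."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # n ones after the binary point sum to 1 - 2**-n
    return 1.0 - 0.5 ** count


def to_symmetric_range(value: float) -> float:
    """Stretch a value from the range 0..1 to the range -1..1."""
    return 2.0 * value - 1.0


# Codes for 0, 1, 2, and 3 children
codes = [thermometer_code(n) for n in range(4)]
symmetric_codes = [to_symmetric_range(c) for c in codes]
```

Notice how the gap between successive codes halves each time, so the network sees a large difference between zero and one child and a small one between three and four.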
Thermometer codes are one way of including prior information in the
coding scheme. They keep certain code values close together because you
have a sense that these code values should be close. This type of knowledge
can improve the results from a neural network: don't make it discover what
you already know. Feel free to map values onto the unit interval so that codes
close to each other match your intuitive notions of how close they should be.

Features with Categorical Values
Features with categories are unordered lists of values. These are different from
ordered lists, because there is no ordering to preserve and introducing an
order is inappropriate. There are typically many examples of categorical
values in data, such as:
Gender, marital status

Status codes

Product codes

Zip codes

Although zip codes look like numbers in the United States, they really
represent discrete geographic areas, and the codes themselves give little
geographic information. There is no reason to think that 10014 is more like 02116
than it is like 94117, even though the numbers are much closer. The numbers
are just discrete names attached to geographical areas.
There are three fundamentally different ways of handling categorical features.
The first is to treat the codes as discrete, ordered values, mapping them using the
methods discussed in the previous section. Unfortunately, the neural network
does not understand that the codes are unordered. So, five codes for marital
status (“single,” “divorced,” “married,” “widowed,” and “unknown”) would
be mapped to -1.0, -0.5, 0.0, +0.5, +1.0, respectively. From the perspective of the
network, “single” and “unknown” are very far apart, whereas “divorced” and
“married” are quite close. For some input fields, this implicit ordering might not
have much of an effect. In other cases, the values have some relationship to each
other and the implicit ordering confuses the network.
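To make the problem with this first approach concrete, here is a minimal sketch that spreads a list of codes evenly over the range from -1 to 1. The function name and the choice of even spacing are assumptions made for illustration; the point is only that the resulting distances reflect list position, not meaning.

```python
def evenly_spaced_codes(categories):
    """Map each category to an evenly spaced value between -1.0 and +1.0."""
    if len(categories) < 2:
        raise ValueError("need at least two categories")
    step = 2.0 / (len(categories) - 1)
    return {name: -1.0 + i * step for i, name in enumerate(categories)}


MARITAL_STATUS = ["single", "divorced", "married", "widowed", "unknown"]
codes = evenly_spaced_codes(MARITAL_STATUS)

# Distances the network "sees" are purely an artifact of the list order
far_apart = abs(codes["single"] - codes["unknown"])      # 2.0
close_by = abs(codes["divorced"] - codes["married"])     # 0.5
```

Reordering the list changes every distance, which is a good sign that the distances carry no real information.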

WARNING When working with categorical variables in neural networks, be
very careful when mapping the variables to numbers. The mapping introduces
an ordering of the variables, which the neural network takes into account, even
if the ordering does not make any sense.

The second way of handling categorical features is to break the categories
into flags, one for each value. Assume that there are three values for gender
(male, female, and unknown). Table 7.3 shows how three flags can be used to
code these values using a method called 1 of N Coding. It is possible to reduce
the number of flags by eliminating the flag for the unknown gender; this
approach is called 1 of N - 1 Coding.
Why would we want to do this? We have now multiplied the number of
input variables and this is generally a bad thing for a neural network.
However, these coding schemes are the only way to eliminate implicit ordering
among the values.
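Both flag-based schemes can be sketched with one small function. The name `one_of_n` and the `drop_last` parameter are invented for this example; the +1.0 and -1.0 flag values follow Table 7.3.

```python
def one_of_n(value, categories, drop_last=False):
    """Return one flag per category: +1.0 for the match, -1.0 otherwise.

    With drop_last=True the flag for the final category is omitted
    (1 of N - 1 coding), so that category is coded as all -1.0 flags.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    kept = categories[:-1] if drop_last else categories
    return [1.0 if value == name else -1.0 for name in kept]


GENDERS = ["male", "female", "unknown"]

full_coding = {g: one_of_n(g, GENDERS) for g in GENDERS}
reduced_coding = {g: one_of_n(g, GENDERS, drop_last=True) for g in GENDERS}
```

In the reduced coding, "unknown" is represented by both remaining flags being -1.0, so no information is lost even though one input has been removed.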
The third way is to replace the code itself with numerical data about the
code. Instead of including zip codes in a model, for instance, include various
census fields, such as the median income or the proportion of households with
children. Another possibility is to include historical information summarized
at the level of the categorical variable. An example would be including the
historical churn rate by zip code for a model that is predicting churn.
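As a rough sketch of this third approach, the following code summarizes historical churn at the level of a categorical code and uses the resulting rate as the input feature. The function name and the handful of records are hypothetical, invented purely to show the calculation.

```python
from collections import defaultdict


def churn_rate_by_code(records):
    """records: iterable of (code, churned) pairs -> {code: churn rate}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for code, did_churn in records:
        totals[code] += 1
        churned[code] += int(did_churn)
    return {code: churned[code] / totals[code] for code in totals}


# Hypothetical history: (zip code, whether the customer churned)
history = [
    ("10014", True), ("10014", False),
    ("02116", False), ("02116", False),
    ("94117", True),
]
rates = churn_rate_by_code(history)
```

The zip code itself never reaches the network; only the numeric rate does, and that number has a meaningful ordering. The rates should be computed from historical data that is kept separate from the outcomes being predicted.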

TIP When using categorical variables in a neural network, try to replace them
with some numeric variable that describes them, such as the average income in
a census block, the proportion of customers in a zip code (penetration), the
historical churn rate for a handset, or the base cost of a pricing plan.

Table 7.3 Handling Gender Using 1 of N Coding and 1 of N - 1 Coding

           1 of N Coding                           1 of N - 1 Coding
Gender     Male Flag   Female Flag   Unknown Flag  Male Flag   Female Flag
Male       +1.0        -1.0          -1.0          +1.0        -1.0
Female     -1.0        +1.0          -1.0          -1.0        +1.0
Unknown    -1.0        -1.0          +1.0          -1.0        -1.0
Artificial Neural Networks 241

Other Types of Features
Some input features might not fit directly into any of these three categories.
For complicated features, it is necessary to extract meaningful information and
use one of the above techniques to represent the result. Remember, the inputs
to a neural network should generally have values that fall between -1 and 1.
Dates are a good example of data that you may want to handle in special
ways. Any date or time can be represented as the number of days or seconds
since a fixed point in time, allowing them to be mapped and fed directly into
the network. However, if the date is for a transaction, then the day of the week
and month of the year may be more important than the actual date. For
instance, the month would be important for detecting seasonal trends in data.
You might want to extract this information from the date and feed it into the
network instead of, or in addition to, the actual date.
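A minimal sketch of this kind of extraction, using Python's standard `datetime` module, might look like the following. The function name and the choice of January 1, 2000 as the fixed reference point are assumptions for illustration only.

```python
from datetime import date


def date_features(d: date, epoch: date = date(2000, 1, 1)) -> dict:
    """Derive several candidate input features from a single date."""
    return {
        "days_since_epoch": (d - epoch).days,  # the "actual date" as a number
        "day_of_week": d.weekday(),            # Monday = 0 ... Sunday = 6
        "month": d.month,                      # 1 = January ... 12 = December
    }


features = date_features(date(2004, 3, 15))
```

Day of week and month are themselves categorical or cyclic values, so they still need to be coded with one of the techniques described earlier before being fed into the network.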
The address field, like any text field, is similarly complicated. Generally,
addresses are useless to feed into a network, even if you could figure out a
good way to map the entire field into a single value. However, the address
may contain a zip code, city name, state, and apartment number. All of these
may be useful features, even though the address field taken as a whole is
usually useless.

Interpreting the Results
Neural network tools take the work out of interpreting the results. When
estimating a continuous value, often the output needs to be scaled back to the
correct range. For instance, the network might be used to calculate the value of a
house and, in the training set, the output value is set up so that $103,000 maps
to -1 and $250,000 maps to 1. If the model is later applied to another house and
the output is 0.0, then we can figure out that this corresponds to $176,500,
halfway between the minimum and the maximum values. This inverse
transformation makes neural networks particularly easy to use for estimating
continuous values. Often, though, this step is not necessary, particularly when
the output layer is using a linear transfer function.
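The house-value example can be worked through with a pair of helper functions. The names `scale` and `unscale` are made up for this sketch, which assumes a simple linear mapping between the training range and the range from -1 to 1.

```python
def scale(value, lo, hi):
    """Map a value in the range lo..hi linearly onto -1..1."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def unscale(output, lo, hi):
    """Inverse of scale: map a network output in -1..1 back onto lo..hi."""
    return lo + (output + 1.0) * (hi - lo) / 2.0


MIN_PRICE, MAX_PRICE = 103_000, 250_000

# A network output of 0.0 sits halfway between the two extremes
estimate = unscale(0.0, MIN_PRICE, MAX_PRICE)
```

Here `estimate` works out to 176,500, matching the figure in the text. Most tools apply this inverse step automatically, but it is worth knowing what they are doing.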
For binary or categorical output variables, the approach is still to take the
inverse of the transformation used for training the network. So, if “churn” is
given a value of 1 and “no-churn” a value of -1, then values near 1 represent
churn, and those near -1 represent no churn. When there are two outcomes,
the meaning of the output depends on the training set used to train the
network. Because the network learns to minimize the error, the average value
produced by the network during training is usually going to be close to the
average value in the training set. One way to think of this is that the first
pattern the network finds is the average value. So, if the original training set
had 50 percent churn and 50 percent no-churn, then the average value the
network will produce for the training set examples is going to be close to 0.0.
Values higher than 0.0 are more like churn and those less than 0.0, less like churn.
If the original training set had 10 percent churn, then the cutoff would more
reasonably be -0.8 rather than 0.0 (-0.8 is 10 percent of the way from -1 to 1).
So, the output of the network does look a lot like a probability in this case.
However, the probability depends on the distribution of the output variable in
the training set.
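The arithmetic behind these cutoffs is just a weighted average of the two target values. The sketch below uses a made-up function name and assumes churn is coded as 1 and no-churn as -1, as in the text.

```python
def expected_average_output(churn_fraction, churn_value=1.0, no_churn_value=-1.0):
    """Average target value in a training set with the given share of churners."""
    if not 0.0 <= churn_fraction <= 1.0:
        raise ValueError("churn_fraction must be between 0 and 1")
    return churn_fraction * churn_value + (1.0 - churn_fraction) * no_churn_value


balanced_cutoff = expected_average_output(0.5)   # 50 percent churn
skewed_cutoff = expected_average_output(0.1)     # 10 percent churn
```

With 50 percent churn the average is 0.0, and with 10 percent churn it is about -0.8, which is why the same network output can mean different things depending on how the training set was built.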
Yet another approach is to assign a confidence level along with the value.
This confidence level would treat the actual output of the network as a
propensity to churn, as shown in Table 7.4.
For binary values, it is also possible to create a network that produces two
outputs, one for each value. In this case, each output represents the strength of
evidence that that category is the correct one. The chosen category would then
be the one with the higher value, with confidence based on some function of
the strengths of the two outputs. This approach is particularly valuable when
the two outcomes are not exclusive.

TIP Because neural networks produce continuous values, the output from a
network can be difficult to interpret for categorical results (used in classification).
The best way to calibrate the output is to run the network over a validation set,
entirely separate from the training set, and to use the results from the validation
set to calibrate the output of the network to categories. In many cases, the
network can have a separate output for each category; that is, a propensity for
that category. Even with separate outputs, the validation set is still needed to
calibrate the outputs.
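One simple way to calibrate a two-category output against a validation set is to try each observed output as a cutoff and keep the one that classifies the most validation rows correctly. This is only one possible calibration rule, not necessarily what any particular tool does, and the function name and validation figures below are invented for illustration.

```python
def best_cutoff(outputs, is_churn):
    """Pick the cutoff that classifies the most validation rows correctly.

    A row is predicted as churn when its network output is >= the cutoff.
    """
    if len(outputs) != len(is_churn) or not outputs:
        raise ValueError("need equally long, non-empty lists")

    def correct_count(cutoff):
        return sum((out >= cutoff) == actual
                   for out, actual in zip(outputs, is_churn))

    # Only observed outputs need to be tried as candidate cutoffs
    return max(sorted(set(outputs)), key=correct_count)


# Hypothetical validation set: network outputs and actual outcomes
validation_outputs = [-0.9, -0.7, -0.85, -0.6, -0.2, 0.3]
validation_churn = [False, False, False, True, True, True]

cutoff = best_cutoff(validation_outputs, validation_churn)
```

In this made-up data the best cutoff is -0.6, well below 0.0, which illustrates why the midpoint of the output range is not automatically the right dividing line.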

Table 7.4 Categories and Confidence Levels for NN Output

Output Value   Category   Confidence
-1.0           A          100%
-0.6           A          80%
-0.02          A          51%
+0.02          B          51%
+0.6           B          80%
+1.0           B          100%
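The rows of Table 7.4 are consistent with a simple linear rule: the sign of the output picks the category, and the confidence rises from 50 percent at 0.0 to 100 percent at either extreme. The sketch below assumes that rule; it is inferred from the table values rather than stated in the text, and the function name is made up.

```python
def categorize(output):
    """Return (category, confidence) for a network output in the range -1..1.

    Assumed rule: negative outputs are category A, others category B, and
    confidence grows linearly from 0.5 at 0.0 to 1.0 at -1.0 or +1.0.
    """
    clipped = max(-1.0, min(1.0, output))   # guard against out-of-range outputs
    category = "A" if clipped < 0 else "B"  # an output of exactly 0.0 is a toss-up
    confidence = 0.5 + 0.5 * abs(clipped)
    return category, confidence


results = [categorize(x) for x in (-1.0, -0.6, -0.02, 0.02, 0.6, 1.0)]
```

In practice, the confidence levels should come from measuring how often each range of outputs is actually correct on a validation set, not from a formula chosen in advance.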

The approach is similar when there are more than two options under
consideration. For example, consider a long distance carrier trying to target a new
set of customers with three targeted service offerings:
Discounts on all international calls

Discounts on all long-distance calls that are not international

Discounts on calls to a predefined set of customers

The carrier is going to offer incentives to customers for each of the three
packages. Since the incentives are expensive, the carrier needs to choose the
right service for the right customers in order for the campaign to be profitable.
Offering all three products to all the customers is expensive and, even worse,
may confuse the recipients, reducing the response rate.
The carrier test markets the products to a small subset of customers who
receive all three offers but are only allowed to respond to one of them. It
intends to use this information to build a model for predicting customer
affinity for each offer. The training set uses the data collected from the test
marketing campaign, and codes the propensity as follows: no response → -1.00,
international → -0.33, national → +0.33, and specific numbers → +1.00. After
training a neural network with information about the customers, the carrier
starts applying the model.
But, applying the model does not go as well as planned. Many customers
cluster around the four values used for training the network. However, apart
from the nonresponders (who are the majority), there are many instances
when the network returns intermediate values like 0.0 and 0.5. What can be
done?
First, the carrier should use a validation set to understand the output values.
By interpreting the results of the network based on what happens in the
validation set, it can find the right ranges to use for transforming the results of
the network back into marketing segments. This is the same process shown in
Figure 7.11.
Another observation in this case is that the network is really being used to
predict three different things: whether a recipient will respond to each of the
campaigns. This strongly suggests that a better structure for the network is to
have three outputs: a propensity to respond to the international plan, to the
long-distance plan, and to the specific numbers plan. The test set would then
be used to determine where the cutoff is for nonrespondents. Alternatively,
each outcome could be modeled separately, and the model results combined to
select the appropriate campaign.
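With one output per offer, selecting a campaign becomes a matter of picking the strongest propensity and checking it against the nonrespondent cutoff. The sketch below is illustrative only: the function name, the offer labels, and the cutoff of 0.0 are assumptions, and a real cutoff would come from the test set as described above.

```python
def choose_offer(propensities, cutoff):
    """Return the offer with the highest propensity, or None for a likely nonrespondent.

    propensities: {offer name: network output for that offer}
    cutoff: minimum output worth acting on, determined from held-out data
    """
    best = max(propensities, key=propensities.get)
    return best if propensities[best] >= cutoff else None


# Hypothetical outputs for two customers
likely_responder = {"international": 0.7, "long_distance": 0.2, "specific_numbers": -0.4}
likely_nonresponder = {"international": -0.9, "long_distance": -0.6, "specific_numbers": -0.8}

first_choice = choose_offer(likely_responder, cutoff=0.0)
second_choice = choose_offer(likely_nonresponder, cutoff=0.0)
```

The same selection logic applies when each outcome is modeled separately; the dictionary is then filled from three models instead of three outputs of one network.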