some value, like the mercury in a thermometer; this sequence is then interpreted as a decimal written in binary notation. Thermometer codes are good for things like academic grades and bond ratings, where differences at one end of the scale are less significant than differences at the other end.

For instance, for many marketing applications, having no children is quite different from having one child. However, the difference between three children and four is rather negligible. Using a thermometer code, the number-of-children variable might be mapped as follows: 0 (for 0 children), 0.5 (for one child), 0.75 (for two children), 0.875 (for three children), and so on. For categorical variables, it is often easier to keep mapped values in the range from 0 to 1. This is reasonable. However, to extend the range to -1 to 1, double the value and subtract 1.
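As a sketch, the mapping just described (each additional child closes half the remaining distance to 1) and the stretch to the -1 to 1 range might look like the following; the function names are illustrative, not from any particular tool:

```python
def thermometer_code(n: int) -> float:
    """Map a count onto [0, 1] so that differences at the low end
    matter more: 0 -> 0.0, 1 -> 0.5, 2 -> 0.75, 3 -> 0.875, ..."""
    return 1.0 - 2.0 ** (-n)

def to_signed_range(x: float) -> float:
    """Stretch a [0, 1] code to [-1, 1]: double the value and subtract 1."""
    return 2.0 * x - 1.0
```

For example, three children map to 0.875 on the unit interval, or 0.75 on the -1 to 1 scale.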

Thermometer codes are one way of including prior information in the coding scheme. They keep certain code values close together because you have a sense that these code values should be close. This type of knowledge can improve the results from a neural network; don't make it discover what you already know. Feel free to map values onto the unit interval so that codes close to each other match your intuitive notions of how close they should be.

Features with Categorical Values

Features with categories are unordered lists of values. These are different from ordered lists, because there is no ordering to preserve and introducing an order is inappropriate. There are typically many examples of categorical values in data, such as:

- Gender, marital status
- Status codes
- Product codes
- Zip codes

Although zip codes look like numbers in the United States, they really represent discrete geographic areas, and the codes themselves give little geographic information. There is no reason to think that 10014 is more like 02116 than it is like 94117, even though the numbers are much closer. The numbers are just discrete names attached to geographical areas.

240 Chapter 7

There are three fundamentally different ways of handling categorical features. The first is to treat the codes as discrete, ordered values, mapping them using the methods discussed in the previous section. Unfortunately, the neural network does not understand that the codes are unordered. So, five codes for marital status ("single," "divorced," "married," "widowed," and "unknown") would be mapped to -1.0, -0.5, 0.0, +0.5, +1.0, respectively. From the perspective of the network, "single" and "unknown" are very far apart, whereas "divorced" and "married" are quite close. For some input fields, this implicit ordering might not have much of an effect. In other cases, the values have some relationship to each other and the implicit ordering confuses the network.

WARNING When working with categorical variables in neural networks, be very careful when mapping the variables to numbers. The mapping introduces an ordering of the variables, which the neural network takes into account, even if the ordering does not make any sense.

The second way of handling categorical features is to break the categories into flags, one for each value. Assume that there are three values for gender (male, female, and unknown). Table 7.3 shows how three flags can be used to code these values using a method called 1 of N Coding. It is possible to reduce the number of flags by eliminating the flag for the unknown gender; this approach is called 1 of N - 1 Coding.

Why would we want to do this? We have now multiplied the number of input variables, and this is generally a bad thing for a neural network. However, these coding schemes are the only way to eliminate implicit ordering among the values.
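A minimal sketch of both schemes, using the +1.0/-1.0 flag convention of Table 7.3 (the category list and function names are assumptions for illustration):

```python
GENDERS = ["male", "female", "unknown"]  # assumed category order

def one_of_n(value: str) -> list[float]:
    """1 of N Coding: one flag per category, +1.0 for the match."""
    return [1.0 if value == g else -1.0 for g in GENDERS]

def one_of_n_minus_1(value: str) -> list[float]:
    """1 of N - 1 Coding: drop the 'unknown' flag; that category
    is represented by all remaining flags set to -1.0."""
    return [1.0 if value == g else -1.0 for g in GENDERS[:-1]]
```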

The third way is to replace the code itself with numerical data about the code. Instead of including zip codes in a model, for instance, include various census fields, such as the median income or the proportion of households with children. Another possibility is to include historical information summarized at the level of the categorical variable. An example would be including the historical churn rate by zip code for a model that is predicting churn.
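A sketch of this substitution, with made-up churn rates keyed by zip code; a real model would draw these values from historical data:

```python
# Hypothetical historical churn rates by zip code (illustrative numbers).
churn_rate_by_zip = {"10014": 0.12, "02116": 0.08, "94117": 0.15}
OVERALL_RATE = 0.10  # fallback for zip codes with no history

def zip_feature(zip_code: str) -> float:
    """Replace the categorical zip code with a number that
    actually describes it."""
    return churn_rate_by_zip.get(zip_code, OVERALL_RATE)
```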

TIP When using categorical variables in a neural network, try to replace them with some numeric variable that describes them, such as the average income in a census block, the proportion of customers in a zip code (penetration), the historical churn rate for a handset, or the base cost of a pricing plan.

Table 7.3 Handling Gender Using 1 of N Coding and 1 of N - 1 Coding

                       N CODING                            N - 1 CODING
GENDER      MALE FLAG   FEMALE FLAG   UNKNOWN FLAG   MALE FLAG   FEMALE FLAG
Male        +1.0        -1.0          -1.0           +1.0        -1.0
Female      -1.0        +1.0          -1.0           -1.0        +1.0
Unknown     -1.0        -1.0          +1.0           -1.0        -1.0

Artificial Neural Networks 241

Other Types of Features

Some input features might not fit directly into any of these three categories. For complicated features, it is necessary to extract meaningful information and use one of the above techniques to represent the result. Remember, the input to a neural network consists of inputs whose values should generally fall between -1 and 1.

Dates are a good example of data that you may want to handle in special ways. Any date or time can be represented as the number of days or seconds since a fixed point in time, allowing them to be mapped and fed directly into the network. However, if the date is for a transaction, then the day of the week and month of the year may be more important than the actual date. For instance, the month would be important for detecting seasonal trends in data. You might want to extract this information from the date and feed it into the network instead of, or in addition to, the actual date.
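Extracting these pieces from a date is straightforward; a sketch using Python's standard library (the epoch and dictionary keys are arbitrary choices):

```python
from datetime import date

EPOCH = date(1970, 1, 1)  # any fixed reference point will do

def date_features(d: date) -> dict:
    """Pull out the parts of a date that often matter more than
    the raw value, alongside the raw count of days."""
    return {
        "days_since_epoch": (d - EPOCH).days,
        "day_of_week": d.weekday(),  # 0 = Monday ... 6 = Sunday
        "month": d.month,            # 1 = January ... 12 = December
    }
```

Each extracted value would then be mapped to the -1 to 1 range like any other discrete, ordered feature.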

The address field, or any text field, is similarly complicated. Generally, addresses are useless to feed into a network, even if you could figure out a good way to map the entire field into a single value. However, the address may contain a zip code, city name, state, and apartment number. All of these may be useful features, even though the address field taken as a whole is usually useless.

Interpreting the Results

Neural network tools take the work out of interpreting the results. When estimating a continuous value, often the output needs to be scaled back to the correct range. For instance, the network might be used to calculate the value of a house and, in the training set, the output value is set up so that $103,000 maps to -1 and $250,000 maps to 1. If the model is later applied to another house and the output is 0.0, then we can figure out that this corresponds to $176,500, halfway between the minimum and the maximum values. This inverse transformation makes neural networks particularly easy to use for estimating continuous values. Often, though, this step is not necessary, particularly when the output layer is using a linear transfer function.
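The inverse transformation is simple arithmetic; a sketch using the house-price example above:

```python
def unscale(output: float, lo: float, hi: float) -> float:
    """Invert the [-1, 1] training scaling back to original units:
    -1 maps to lo, +1 maps to hi, 0 to the midpoint."""
    return lo + (output + 1.0) / 2.0 * (hi - lo)
```

With lo = 103,000 and hi = 250,000, an output of 0.0 comes back as $176,500, matching the example in the text.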

For binary or categorical output variables, the approach is still to take the inverse of the transformation used for training the network. So, if "churn" is given a value of 1 and "no-churn" a value of -1, then values near 1 represent churn, and those near -1 represent no churn.

When there are two outcomes, the meaning of the output depends on the training set used to train the network. Because the network learns to minimize the error, the average value produced by the network during training is usually going to be close to the average value in the training set. One way to think of this is that the first pattern the network finds is the average value. So, if the original training set had 50 percent churn and 50 percent no-churn, then the average value the network will produce for the training set examples is going to be close to 0.0. Values higher than 0.0 are more like churn and those less than 0.0, less like churn. If the original training set had 10 percent churn, then the cutoff would more reasonably be -0.8 rather than 0.0 (-0.8 is 10 percent of the way from -1 to 1). So, the output of the network does look a lot like a probability in this case. However, the probability depends on the distribution of the output variable in the training set.
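The cutoff described here just mirrors the class balance of the training set; as a sketch:

```python
def churn_cutoff(churn_fraction: float) -> float:
    """Cutoff placed where the average training target falls:
    p * (+1) + (1 - p) * (-1) = 2p - 1.
    50 percent churn gives 0.0; 10 percent churn gives -0.8."""
    return 2.0 * churn_fraction - 1.0
```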

Yet another approach is to assign a confidence level along with the value. This confidence level would treat the actual output of the network as a propensity to churn, as shown in Table 7.4.

For binary values, it is also possible to create a network that produces two outputs, one for each value. In this case, each output represents the strength of evidence that that category is the correct one. The chosen category would then be the one with the higher value, with confidence based on some function of the strengths of the two outputs. This approach is particularly valuable when the two outcomes are not exclusive.

TIP Because neural networks produce continuous values, the output from a network can be difficult to interpret for categorical results (used in classification). The best way to calibrate the output is to run the network over a validation set, entirely separate from the training set, and to use the results from the validation set to calibrate the output of the network to categories. In many cases, the network can have a separate output for each category; that is, a propensity for that category. Even with separate outputs, the validation set is still needed to calibrate the outputs.

Table 7.4 Categories and Confidence Levels for NN Output

OUTPUT VALUE   CATEGORY   CONFIDENCE
-1.0           A          100%
-0.6           A          80%
-0.02          A          51%
+0.02          B          51%
+0.6           B          80%
+1.0           B          100%
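The mapping in Table 7.4 treats distance from 0 as confidence; a minimal sketch, where the linear confidence formula is one reasonable reading of the table, not the only choice:

```python
def interpret(output: float) -> tuple[str, float]:
    """Map a single [-1, 1] output to (category, confidence),
    matching Table 7.4: -1.0 is 100% 'A', +1.0 is 100% 'B'."""
    category = "A" if output < 0 else "B"
    confidence = 0.5 + abs(output) / 2.0  # 0.0 -> 50%, +/-1.0 -> 100%
    return category, confidence
```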



The approach is similar when there are more than two options under consideration. For example, consider a long-distance carrier trying to target a new set of customers with three targeted service offerings:

- Discounts on all international calls
- Discounts on all long-distance calls that are not international
- Discounts on calls to a predefined set of customers

The carrier is going to offer incentives to customers for each of the three packages. Since the incentives are expensive, the carrier needs to choose the right service for the right customers in order for the campaign to be profitable. Offering all three products to all the customers is expensive and, even worse, may confuse the recipients, reducing the response rate.

The carrier test markets the products to a small subset of customers who receive all three offers but are only allowed to respond to one of them. It intends to use this information to build a model for predicting customer affinity for each offer. The training set uses the data collected from the test marketing campaign, and codes the propensity as follows: no response → -1.00, international → -0.33, national → +0.33, and specific numbers → +1.00. After training a neural network with information about the customers, the carrier starts applying the model.

But, applying the model does not go as well as planned. Many customers cluster around the four values used for training the network. However, apart from the nonresponders (who are the majority), there are many instances when the network returns intermediate values like 0.0 and 0.5. What can be done?

First, the carrier should use a validation set to understand the output values. By interpreting the results of the network based on what happens in the validation set, it can find the right ranges to use for transforming the results of the network back into marketing segments. This is the same process shown in Figure 7.11.

Another observation in this case is that the network is really being used to predict three different things: whether a recipient will respond to each of the campaigns. This strongly suggests that a better structure for the network is to have three outputs: a propensity to respond to the international plan, to the long-distance plan, and to the specific numbers plan. The test set would then be used to determine where the cutoff is for nonrespondents. Alternatively, each outcome could be modeled separately, and the model results combined to select the appropriate campaign.
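A sketch of the three-output structure, with a hypothetical nonresponse cutoff that would in practice be calibrated on the validation set:

```python
OFFERS = ["international", "long_distance", "specific_numbers"]
NONRESPONSE_CUTOFF = 0.0  # assumed value; find the real one on the validation set

def choose_offer(propensities: list[float]) -> str:
    """Pick the offer with the highest propensity, unless even the
    best score looks like a nonresponder."""
    best = max(range(len(OFFERS)), key=lambda i: propensities[i])
    if propensities[best] < NONRESPONSE_CUTOFF:
        return "no_offer"
    return OFFERS[best]
```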


[Figure: chart residue showing network output values on a scale up to 1.0, with points labeled A and B; the full figure was not captured.]