Figure 7.13 The self-organizing map is a special kind of neural network that can be used to detect clusters. (The figure shows an input layer connected to the inputs and an output layer laid out like a grid; each output unit is connected to all the input units, but not to each other, and the output units compete with each other.)

There is one more aspect to the training of the network. Not only are the weights for the winning unit adjusted, but the weights for units in its immediate neighborhood are also adjusted to strengthen their response to the inputs. This adjustment is controlled by a neighborliness parameter that controls the size of the neighborhood and the amount of adjustment. Initially, the neighborhood is rather large, and the adjustments are large. As the training continues, the neighborhoods and adjustments decrease in size. Neighborliness actually has several practical effects. One is that the output layer behaves more like a connected fabric, even though the units are not directly connected to each other. Clusters similar to each other should be closer together than more dissimilar clusters. More importantly, though, neighborliness allows a group of units to represent a single cluster. Without this neighborliness, the network would tend to find as many clusters in the data as there are units in the output layer, introducing bias into the cluster detection.
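This training procedure can be sketched in a few lines. The sketch below uses a Gaussian neighborliness function whose radius and learning rate both shrink as training continues; the grid size, decay schedules, and function shapes here are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def train_som(data, grid_rows=4, grid_cols=4, epochs=20, seed=0):
    """Toy self-organizing map: each output unit on a 2-D grid holds a
    weight vector; the winner and its grid neighbors move toward each input."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    weights = rng.uniform(-1, 1, size=(grid_rows, grid_cols, n_features))
    # Grid coordinates of every output unit, used to measure neighborhood distance.
    coords = np.dstack(np.mgrid[0:grid_rows, 0:grid_cols]).astype(float)

    for epoch in range(epochs):
        # Both the learning rate and the neighborhood radius shrink over time,
        # as described in the text.
        frac = epoch / epochs
        learn_rate = 0.5 * (1 - frac)
        radius = max(grid_rows, grid_cols) / 2 * (1 - frac) + 0.5
        for x in data:
            # Winning unit: the one whose weights are closest to the input.
            dists = np.linalg.norm(weights - x, axis=2)
            win = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborliness: units near the winner on the grid are
            # pulled toward the input too, by a smaller amount.
            grid_dist = np.linalg.norm(coords - np.array(win, dtype=float), axis=2)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights += learn_rate * influence[:, :, None] * (x - weights)
    return weights
```

Because the influence falls off smoothly with grid distance, neighboring units end up with similar weights, which is exactly the "connected fabric" effect described above.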
Artificial Neural Networks 251

Figure 7.14 An SOM finds the output unit that does the best job of recognizing a particular input. (The figure shows a grid of output-unit responses, with the winning output unit, the one responding most strongly at 0.9, and its path highlighted.)
Typically, an SOM identifies fewer clusters than it has output units. This is inefficient when using the network to assign new records to the clusters, since new inputs may be assigned to unused units in the output layer. To determine which units are actually used, we apply the SOM to the validation set. The members of the validation set are fed through the network, keeping track of the winning unit in each case. Units with no hits or with very few hits are discarded. Eliminating these units increases the run-time performance of the network by reducing the number of calculations needed for new records.
Once the final network is in place, with the output layer restricted to the units that identify specific clusters, it can be applied to new instances. An
unknown instance is fed into the network and is assigned to the cluster at the output unit with the largest weight. The network has identified clusters, but we do not know anything about them. We will return to the problem of identifying clusters a bit later.
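The pruning step described above, feeding the validation set through the network and keeping only output units that win often enough, could be sketched as follows. The `winning_unit` helper and the `min_hits` threshold are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def winning_unit(weights, x):
    """Return the grid position of the output unit closest to input x."""
    dists = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)

def active_units(weights, validation_set, min_hits=1):
    """Feed the validation set through the map and keep only output units
    that win at least min_hits times; the rest are discarded."""
    hits = Counter(winning_unit(weights, x) for x in validation_set)
    return {unit for unit, count in hits.items() if count >= min_hits}
```

Once the active set is known, new records are assigned only to clusters corresponding to units in that set.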
The original SOMs used two-dimensional grids for the output layer. This was an artifact of earlier research into recognizing features in images composed of a two-dimensional array of pixel values. The output layer can really have any structure, with neighborhoods defined in three dimensions, as a network of hexagons, or laid out in some other fashion.

Example: Finding Clusters
A large bank interested in increasing the number of home equity loans that it sells provides an illustration of the practical use of clustering. The bank decides that it needs to understand customers who currently have home equity loans to determine the best strategy for increasing its market share. To start this process, demographics are gathered on 5,000 customers who have home equity loans and 5,000 customers who do not. Even though the proportion of customers with home equity loans is less than 50 percent, it is a good idea to have equal representation of the two groups in the training set.

The data that is gathered has fields like the following:
Appraised value of house

Amount of credit available

Amount of credit granted


Marital status

Number of children

Household income

This data forms a good training set for clustering. The input values are mapped so they all lie between -1 and +1; these are used to train an SOM. The network identifies five clusters in the data, but it does not give any information about the clusters. What do these clusters mean?
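The mapping of raw input values into the -1 to +1 range can be sketched with simple min-max scaling; the income figures below are illustrative, not from the bank's actual data.

```python
import numpy as np

def scale_to_unit_range(column):
    """Linearly map a numeric column to [-1, +1] using its min and max."""
    lo, hi = column.min(), column.max()
    if hi == lo:                      # constant column: map everything to 0
        return np.zeros_like(column, dtype=float)
    return 2.0 * (column - lo) / (hi - lo) - 1.0
```

Each field (appraised value, income, and so on) is scaled independently before being fed to the SOM, so no single large-valued field dominates the distance calculations.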
A common technique to compare different clusters that works particularly well with neural network techniques is the average member technique. Find the most average member of each of the clusters, that is, the center of the cluster. This is similar to the approach used for sensitivity analysis. To do this, find the average value for each feature in each cluster. Since all the features are numbers, this is not a problem for neural networks.
For example, say that half the members of a cluster are male and half are female, and that male maps to -1.0 and female to +1.0. The average member for this cluster would have a value of 0.0 for this feature. In another cluster, there may be nine females for every male. For this cluster, the average member would have a value of 0.8. This averaging works very well with neural networks since all inputs have to be mapped into a numeric range.
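With inputs already coded into a numeric range, the average-member calculation is just a per-feature mean within each cluster. A minimal sketch, with hypothetical scaled data and cluster labels:

```python
import numpy as np

def cluster_centers(scaled_data, cluster_labels):
    """Average member of each cluster: the mean of every (already scaled)
    feature over the records assigned to that cluster."""
    centers = {}
    for label in np.unique(cluster_labels):
        members = scaled_data[cluster_labels == label]
        centers[label] = members.mean(axis=0)
    return centers
```

For the nine-females-to-one-male example above, a gender column coded +1.0/-1.0 averages to (9 - 1) / 10 = 0.8 for that cluster.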

T I P Self-organizing maps, a type of neural network, can identify clusters, but they do not identify what makes the members of a cluster similar to each other. A powerful technique for comparing clusters is to determine the center or average member of each cluster. Using the test set, calculate the average value for each feature in each cluster. These average values can then be displayed in the same graph to determine the features that make a cluster unique.

These average values can then be plotted using parallel coordinates as in Figure 7.15, which shows the centers of the five clusters identified in the banking example. In this case, the bank noted that one of the clusters was particularly interesting, consisting of married customers in their forties with children. A bit more investigation revealed that these customers also had children in their late teens. Members of this cluster had more home equity lines than members of other clusters.
Figure 7.15 The centers of five clusters are compared on the same graph. This simple visualization technique (called parallel coordinates) helps identify interesting clusters. (The axes are Available Credit, Credit Balance, Age, Marital Status, Num Children, and Income; an annotation marks one cluster as interesting: high-income customers with children in the middle age group who are taking out large loans.)

The story continues with the Marketing Department of the bank concluding that these people had taken out home equity loans to pay college tuition fees. The department arranged a marketing program designed specifically for this market, selling home equity loans as a means to pay for a college education. The results from this campaign were disappointing; the marketing program was not successful.
Since the marketing program failed, it may seem as though the clusters did not live up to their promise. In fact, the problem lay elsewhere. The bank had initially used only general customer information; it had not combined information from the many different systems servicing its customers. The bank returned to the problem of identifying customers, but this time it included more information, from the deposits system, the credit card system, and so on.
The basic methods remained the same, so we will not go into detail about the analysis. With the additional data, the bank discovered that the cluster of customers with college-age children did actually exist, but a fact had been overlooked. When the additional data was included, the bank learned that the customers in this cluster also tended to have business accounts as well as personal accounts. This led to a new line of thinking. When the children leave home to go to college, the parents now have the opportunity to start a new business by taking advantage of the equity in their home.
With this insight, the bank created a new marketing program targeted at the
parents, about starting a new business in their empty nest. This program succeeded, and the bank saw improved performance from its home equity loans
group. The lesson of this case study is that, although SOMs are powerful tools
for finding clusters, neural networks really are only as good as the data that
goes into them.

Lessons Learned
Neural networks are a versatile data mining tool. Across a large number of
industries and a large number of applications, neural networks have proven
themselves over and over again. These results come in complicated domains,
such as analyzing time series and detecting fraud, that are not easily amenable
to other techniques. The largest neural network developed for production is
probably the system that AT&T developed for reading numbers on checks. This
neural network has hundreds of thousands of units organized into seven layers.
Neural networks are based on biological models of how brains work. Although the basic ideas predate digital computers, they have proven useful. In biology, neurons fire after their inputs reach a certain threshold. This model can be implemented on a computer as well. The field has really taken off since the 1980s, when statisticians started to use neural networks and to understand them better.
A neural network consists of artificial neurons connected together. Each neuron mimics its biological counterpart, taking various inputs, combining them, and producing an output. Since artificial neurons process numbers, the neuron is characterized by its activation function. In most cases, this function takes the weighted sum of the neuron's inputs and applies an S-shaped function to it. The result is a node that sometimes behaves in a linear fashion and sometimes behaves in a nonlinear fashion, an improvement over standard statistical techniques.
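A single artificial neuron of the kind described above can be sketched in a few lines; this is a minimal illustration, using the logistic function as the S-shaped activation.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """One artificial neuron: a weighted sum of the inputs, squashed by the
    logistic (S-shaped) activation function into the range (0, 1)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))
```

Near a weighted sum of zero, the logistic curve is nearly a straight line; for large positive or negative sums it saturates toward 1 or 0, which is the mix of linear and nonlinear behavior noted above.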
The most common network is the feed-forward network for predictive modeling. Although originally a breakthrough, the back propagation training method has been replaced by other methods, notably conjugate gradient. These networks can be used for both categorical and continuous inputs. However, neural networks learn best when input fields have been mapped to the range between -1 and +1. This is a guideline to help train the network; neural networks still work when a small amount of data falls outside the range and for more limited ranges, such as 0 to 1.
Neural networks do have several drawbacks. First, they work best when there are only a few input variables, and the technique itself does not help choose which variables to use. Variable selection is an issue, although other techniques, such as decision trees, can come to the rescue. Also, when training a network, there is no guarantee that the resulting set of weights is optimal. To increase confidence in the result, build several networks and take the best one. Perhaps the biggest problem, though, is that a neural network cannot explain what it is doing. Decision trees are popular because they can provide a list of rules. There is no way to get an accurate set of rules from a neural network.

