tribute equally when the distance between two records is computed.

We suggest going farther. The whole point of automatic cluster detection is

to find clusters that make sense to you. If, for your purposes, whether people

have children is much more important than the number of credit cards they

carry, there is no reason not to bias the outcome of the clustering by multiplying the number of children field by a higher weight than the number of credit

cards field. After scaling to get rid of bias that is due to the units, use weights

to introduce bias based on knowledge of the business context.

Some clustering tools allow the user to attach weights to different dimensions, simplifying the process. Even for tools that don't have such functionality,

it is possible to have weights by adjusting the scaled values. That is, first scale

the values to a common range to eliminate range effects. Then multiply the

resulting values by a weight to introduce bias based on the business context.

Of course, if you want to evaluate the effects of different weighting strategies, you will have to add another outer loop to the clustering process.
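This two-step recipe (scale to a common range first, then multiply by business weights) can be sketched as follows; the field names and weight values here are hypothetical illustrations, not part of any particular tool.

```python
# Sketch of range-scaling followed by business-driven weighting.
# Field names and weight values are hypothetical.

def scale_to_unit_range(values):
    """Map a list of numbers onto [0, 1] to remove units and range effects."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant field: no spread to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

records = {
    "num_children":     [0, 1, 2, 3, 4],
    "num_credit_cards": [1, 2, 5, 8, 12],
}

# Weights express business judgment: here, children matter three times
# as much as credit cards when distances are computed.
weights = {"num_children": 3.0, "num_credit_cards": 1.0}

weighted = {
    field: [weights[field] * v for v in scale_to_unit_range(vals)]
    for field, vals in records.items()
}
```

After this transformation, a one-unit difference in the scaled children field contributes three times as much to any distance calculation as a one-unit difference in the scaled credit cards field.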

Other Approaches to Cluster Detection

The basic K-means algorithm has many variations. Many commercial software

tools that include automatic cluster detection incorporate some of these variations. Among the differences are alternate methods of choosing the initial seeds and the use of probability density rather than distance to associate records with clusters. This last variation merits additional discussion. In addition, there are several different approaches to clustering, including agglomerative clustering, divisive clustering, and self-organizing maps.

Gaussian Mixture Models

The K-means method as described has some drawbacks:

- It does not do well with overlapping clusters.
- The clusters are easily pulled off-center by outliers.
- Each record is either inside or outside of a given cluster.

366 Chapter 11

Gaussian mixture models are a probabilistic variant of K-means. The name

comes from the Gaussian distribution, a probability distribution often

assumed for high-dimensional problems. The Gaussian distribution generalizes the normal distribution to more than one variable. As before, the algorithm starts by choosing K seeds. This time, however, the seeds are considered

to be the means of Gaussian distributions. The algorithm proceeds by iterating

over two steps called the estimation step and the maximization step.

The estimation step calculates the responsibility that each Gaussian has for

each data point (see Figure 11.8). Each Gaussian has strong responsibility

for points that are close to its mean and weak responsibility for points that are

distant. The responsibilities are used as weights in the next step.

In the maximization step, a new centroid is calculated for each cluster

taking into account the newly calculated responsibilities. The centroid for a

given Gaussian is calculated by averaging all the points weighted by the responsibilities for that Gaussian, as illustrated in Figure 11.9.

Figure 11.8 In the estimation step, each Gaussian is assigned some responsibility for each point. Thicker lines indicate greater responsibility.

Automatic Cluster Detection 367

These steps are repeated until the Gaussians no longer move. The Gaussians

themselves can change in shape as well as move. However, each Gaussian is

constrained, so if it shows a very high responsibility for points close to its mean,

then there is a sharp drop off in responsibilities. If the Gaussian covers a larger

range of values, then it has smaller responsibilities for nearby points. Since the

distribution must always integrate to one, Gaussians always get weaker as

they get bigger.

The reason this is called a “mixture model” is that the probability at each

data point is the sum of a mixture of several distributions. At the end of the

process, each point is tied to the various clusters with higher or lower probability. This is sometimes called soft clustering, because points are not uniquely

identified with a single cluster.

One consequence of this method is that some points may have high probabilities of being in more than one cluster. Other points may have only very low

probabilities of being in any cluster. Each point can be assigned to the cluster

where its probability is highest, turning this soft clustering into hard clustering.

Figure 11.9 Each Gaussian mean is moved to the centroid of all the data points weighted by its responsibilities for each point. Thicker arrows indicate higher weights.
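A minimal one-dimensional sketch of the two steps follows. The data points and seed means are hypothetical, and for simplicity each Gaussian keeps a fixed spread rather than re-estimating its variance and mixing weight as a full implementation would.

```python
import math

def gaussian(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

data = [1.0, 1.2, 0.8, 4.9, 5.1, 5.3]   # hypothetical 1-D points
means = [0.0, 6.0]                        # K = 2 seed means
STD = 1.0                                 # fixed spread (a simplification)

for _ in range(25):
    # Estimation step: responsibility of each Gaussian for each point,
    # normalized so each point's responsibilities sum to one.
    resp = []
    for x in data:
        dens = [gaussian(x, m, STD) for m in means]
        total = sum(dens)
        resp.append([d / total for d in dens])

    # Maximization step: move each mean to the responsibility-weighted
    # average of all the points.
    for k in range(len(means)):
        weight_sum = sum(r[k] for r in resp)
        means[k] = sum(r[k] * x for r, x in zip(resp, data)) / weight_sum

# The soft clustering can be hardened by assigning each point to the
# Gaussian with the highest responsibility for it.
hard = [max(range(len(means)), key=lambda k: r[k]) for r in resp]
```

With this toy data the means settle near the two obvious groups of points, and the hard assignment splits the six points three and three.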


Agglomerative Clustering

The K-means approach to clustering starts out with a fixed number of clusters

and allocates all records into exactly that number of clusters. Another class of

methods works by agglomeration. These methods start out with each data point

forming its own cluster and gradually merge them into larger and larger clusters

until all points have been gathered together into one big cluster. Toward the

beginning of the process, the clusters are very small and very pure: the members of each cluster are few and closely related. Toward the end of the process, the clusters are large and not as well defined. The entire history is preserved, making it possible to choose the level of clustering that works best for a given application.

An Agglomerative Clustering Algorithm

The first step is to create a similarity matrix. The similarity matrix is a table of

all the pair-wise distances or degrees of similarity between clusters. Initially,

the similarity matrix contains the pair-wise distance between individual pairs

of records. As discussed earlier, there are many measures of similarity between

records, including the Euclidean distance, the angle between vectors, and the

ratio of matching to nonmatching categorical fields. The issues raised by the

choice of distance measures are exactly the same as those previously discussed

in relation to the K-means approach.

It might seem that with N initial clusters for N data points, N² measurement calculations are required to create the distance table. If the similarity measure is a true distance metric, only half that is needed because all true distance metrics follow the rule that Distance(X,Y) = Distance(Y,X). In the vocabulary of mathematics, the similarity matrix is lower triangular. The next step is to find

the smallest value in the similarity matrix. This identifies the two clusters that

are most similar to one another. Merge these two clusters into a new one and

update the similarity matrix by replacing the two rows that described the parent clusters with a new row that describes the distance between the merged cluster and the remaining clusters. There are now N - 1 clusters and N - 1 rows in the similarity matrix.

Repeat the merge step N - 1 times, so all records belong to the same large cluster. Each iteration remembers which clusters were merged and the distance between them. This information is used to decide which level of clustering to make use of.

Distance between Clusters

A bit more needs to be said about how to measure distance between clusters.

On the first trip through the merge step, the clusters consist of single records

so the distance between clusters is the same as the distance between records, a

subject that has already been covered in perhaps too much detail. Second and


subsequent trips through the loop need to update the similarity matrix with

the distances from the new, multirecord cluster to all the others. How do we

measure this distance?

As usual, there is a choice of approaches. Three common ones are:

- Single linkage
- Complete linkage
- Centroid distance

In the single linkage method, the distance between two clusters is given by the

distance between the closest members. This method produces clusters with the

property that every member of a cluster is more closely related to at least one

member of its cluster than to any point outside it.

Another approach is the complete linkage method, where the distance between

two clusters is given by the distance between their most distant members. This

method produces clusters with the property that all members lie within some

known maximum distance of one another.

The third method is centroid distance, where the distance between two clusters is measured between the centroids of each. The centroid of a cluster is its average element. Figure 11.10 gives a pictorial representation of these three methods.

Figure 11.10 Three methods of measuring the distance between clusters. (The figure marks which cluster pairs are closest under the centroid, complete linkage, and single linkage methods.)
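The three measures can be implemented and compared directly; the two small clusters below are hypothetical, chosen so each measure gives a different answer.

```python
import math

def dist(a, b):
    """Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    """Distance between the closest members of the two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Distance between the most distant members of the two clusters."""
    return max(dist(p, q) for p in c1 for q in c2)

def centroid_distance(c1, c2):
    """Distance between the average elements (centroids) of the clusters."""
    def centroid(c):
        return tuple(sum(col) / len(c) for col in zip(*c))
    return dist(centroid(c1), centroid(c2))

a = [(0.0, 0.0), (2.0, 0.0)]   # hypothetical cluster
b = [(5.0, 0.0), (9.0, 0.0)]   # hypothetical cluster

# single linkage: 3, complete linkage: 9, centroid distance: |7 - 1| = 6
```

The same pair of clusters is 3 units apart by single linkage, 9 by complete linkage, and 6 by centroid distance, which is why the choice of measure can change which clusters merge first.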


Clusters and Trees

The agglomeration algorithm creates hierarchical clusters. At each level in the

hierarchy, clusters are formed from the union of two clusters at the next level

down. A good way of visualizing these clusters is as a tree. Of course, such a

tree may look like the decision trees discussed in Chapter 6, but there are some

important differences. The most important is that the nodes of the cluster tree

do not embed rules describing why the clustering takes place; the nodes simply state the fact that the two children have the minimum distance of all possible cluster pairs. Another difference is that a decision tree is created to

maximize the leaf purity of a given target variable. There is no target for the

cluster tree, other than self-similarity within each cluster. Later in this chapter we'll discuss divisive clustering methods. These are similar to the agglomerative methods, except that agglomerative methods are built by starting from the leaves and working toward the root whereas divisive methods start at the root and work down to the leaves.

Clustering People by Age: An Example of Agglomerative Clustering

This illustration of agglomerative clustering uses an example in one dimension with the single linkage measure for distance between clusters. These choices make it possible to follow the algorithm through all its iterations without having to worry about calculating distances using squares and square roots.

The data consists of the ages of people at a family gathering. The goal is to

cluster the participants using their age, and the metric for the distance between

two people is simply the difference in their ages. The metric for the distance

between two clusters of people is the difference in age between the oldest

member of the younger cluster and the youngest member of the older cluster.

(This is the one-dimensional version of the single linkage measure.)

Because the distances are so easy to calculate, the example dispenses with

the similarity matrix. The procedure is to sort the participants by age, then

begin clustering by first merging clusters that are 1 year apart, then 2 years,

and so on until there is only one big cluster.

Figure 11.11 shows the clusters after six iterations, with three clusters

remaining. This is the level of clustering that seems the most useful. The

algorithm appears to have clustered the population into three generations:

children, parents, and grandparents.
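The procedure can be followed in code. The guest list below is hypothetical (the chapter does not give the actual ages); with sorted ages, single linkage in one dimension reduces to starting a new cluster wherever the gap to the previous age exceeds the current merge distance.

```python
# Hypothetical ages at the family gathering.
ages = sorted([1, 3, 5, 8, 9, 11, 12, 13, 37, 43, 45, 49, 51, 65, 70, 76, 82])

def cluster_by_gap(sorted_ages, max_gap):
    """Single-linkage clustering in 1-D: a new cluster starts wherever
    the gap to the previous age exceeds max_gap."""
    clusters = [[sorted_ages[0]]]
    for age in sorted_ages[1:]:
        if age - clusters[-1][-1] <= max_gap:
            clusters[-1].append(age)
        else:
            clusters.append([age])
    return clusters

# Grow the merge distance one year at a time, as in the text, and stop
# when three clusters remain.
for gap in range(1, 40):
    clusters = cluster_by_gap(ages, gap)
    if len(clusters) == 3:
        print(gap, clusters)
        break

# With these hypothetical ages, three clusters first appear at a merge
# distance of 6: children, parents, and grandparents.
```

Any other list of ages works the same way; the point is that the merge history lets you stop at whichever number of clusters is most interpretable.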

