

zero and a variance (and standard deviation) of one. That way, all fields con­
tribute equally when the distance between two records is computed.
We suggest going farther. The whole point of automatic cluster detection is
to find clusters that make sense to you. If, for your purposes, whether people
have children is much more important than the number of credit cards they
carry, there is no reason not to bias the outcome of the clustering by multiply­
ing the number of children field by a higher weight than the number of credit
cards field. After scaling to get rid of bias that is due to the units, use weights
to introduce bias based on knowledge of the business context.
Some clustering tools allow the user to attach weights to different dimen­
sions, simplifying the process. Even for tools that don't have such functionality,
it is possible to have weights by adjusting the scaled values. That is, first scale
the values to a common range to eliminate range effects. Then multiply the
resulting values by a weight to introduce bias based on the business context.
Of course, if you want to evaluate the effects of different weighting strate­
gies, you will have to add another outer loop to the clustering process.
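The scale-then-weight recipe can be sketched in a few lines; the field names and the weight values here are invented purely for illustration.

```python
import numpy as np

# Hypothetical customer records: columns are number of children
# and number of credit cards (field names are illustrative only).
records = np.array([[2, 5],
                    [0, 1],
                    [3, 8],
                    [1, 2]], dtype=float)

# Step 1: scale each field to zero mean and unit standard deviation
# so that units no longer bias the distance calculation.
scaled = (records - records.mean(axis=0)) / records.std(axis=0)

# Step 2: multiply by weights chosen from business knowledge --
# here, number of children counts three times as much as credit cards.
weights = np.array([3.0, 1.0])
weighted = scaled * weights
```

After this, the weighted values feed into whatever distance computation the clustering algorithm uses.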

Other Approaches to Cluster Detection
The basic K-means algorithm has many variations. Many commercial software
tools that include automatic cluster detection incorporate some of these varia­
tions. Among the differences are alternate methods of choosing the initial
seeds and the use of probability density rather than distance to associate
records with clusters. This last variation merits additional discussion. In addi­
tion, there are several different approaches to clustering, including agglomer­
ative clustering, divisive clustering, and self-organizing maps.

Gaussian Mixture Models
The K-means method as described has some drawbacks:

- It does not do well with overlapping clusters.
- The clusters are easily pulled off-center by outliers.
- Each record is either inside or outside of a given cluster.

366 Chapter 11

Gaussian mixture models are a probabilistic variant of K-means. The name
comes from the Gaussian distribution, a probability distribution often
assumed for high-dimensional problems. The multivariate Gaussian distribution
generalizes the familiar one-variable normal distribution. As before, the
algorithm starts by choosing K seeds. This time, however, the seeds are considered
to be the means of Gaussian distributions. The algorithm proceeds by iterating
over two steps called the estimation step and the maximization step.
The estimation step calculates the responsibility that each Gaussian has for
each data point (see Figure 11.8). Each Gaussian has strong responsibility
for points that are close to its mean and weak responsibility for points that are
distant. The responsibilities are used as weights in the next step.
In the maximization step, a new centroid is calculated for each cluster
taking into account the newly calculated responsibilities. The centroid for a
given Gaussian is calculated by averaging all the points weighted by the respon­
sibilities for that Gaussian, as illustrated in Figure 11.9.





Figure 11.8 In the estimation step, each Gaussian is assigned some responsibility for each
point. Thicker lines indicate greater responsibility.
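The two steps can be sketched in a few lines of Python. This one-dimensional version holds the standard deviations fixed and updates only the means, a simplification of the full algorithm, and the data points and seeds are invented for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def estimation_step(points, means, stds):
    """Each Gaussian's responsibility for each point, normalized so
    that the responsibilities for any one point sum to one."""
    dens = np.array([gaussian_pdf(points, m, s) for m, s in zip(means, stds)])
    return dens / dens.sum(axis=0)

def maximization_step(points, resp):
    """Move each mean to the responsibility-weighted centroid."""
    return (resp * points).sum(axis=1) / resp.sum(axis=1)

# Two well-separated groups of points and two seeds.
points = np.array([1.0, 1.2, 0.8, 9.0, 9.2, 8.8])
means, stds = np.array([0.0, 10.0]), np.array([1.0, 1.0])
for _ in range(20):
    resp = estimation_step(points, means, stds)
    means = maximization_step(points, resp)
# means converges close to [1.0, 9.0], the centers of the two groups
```

A full implementation would also re-estimate each Gaussian's spread (and, in more than one dimension, its covariance) in the maximization step.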
Automatic Cluster Detection 367

These steps are repeated until the Gaussians no longer move. The Gaussians
themselves can change in shape as well as move. However, each Gaussian is
constrained: if it shows a very high responsibility for points close to its mean,
then there is a sharp drop-off in responsibilities farther from the mean. If the
Gaussian covers a larger range of values, then it has smaller responsibilities for
nearby points. Since the distribution must always integrate to one, Gaussians
always get weaker as they get bigger.
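This trade-off follows directly from the one-dimensional density formula: the height of a Gaussian at its own mean is 1/(σ√2π), so a Gaussian with twice the spread has half the peak density. A small check:

```python
import math

def peak_height(std):
    """Density of a 1-D Gaussian at its own mean: 1 / (std * sqrt(2*pi))."""
    return 1.0 / (std * math.sqrt(2 * math.pi))

# A Gaussian that covers twice the range has half the peak density,
# so it has weaker responsibility for any single nearby point.
narrow, wide = peak_height(1.0), peak_height(2.0)
assert abs(narrow - 2 * wide) < 1e-12
```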
The reason this is called a “mixture model” is that the probability at each
data point is the sum of a mixture of several distributions. At the end of the
process, each point is tied to the various clusters with higher or lower proba­
bility. This is sometimes called soft clustering, because points are not uniquely
identified with a single cluster.
One consequence of this method is that some points may have high proba­
bilities of being in more than one cluster. Other points may have only very low
probabilities of being in any cluster. Each point can be assigned to the cluster
where its probability is highest, turning this soft clustering into hard clustering.


Figure 11.9 Each Gaussian mean is moved to the centroid of all the data points weighted
by its responsibilities for each point. Thicker arrows indicate higher weights.

Agglomerative Clustering
The K-means approach to clustering starts out with a fixed number of clusters
and allocates all records into exactly that number of clusters. Another class of
methods works by agglomeration. These methods start out with each data point
forming its own cluster and gradually merge them into larger and larger clusters
until all points have been gathered together into one big cluster. Toward the
beginning of the process, the clusters are very small and very pure; the members
of each cluster are few and closely related. Toward the end of the process, the
clusters are large and not as well defined. The entire history is preserved, making
it possible to choose the level of clustering that works best for a given application.

An Agglomerative Clustering Algorithm
The first step is to create a similarity matrix. The similarity matrix is a table of
all the pair-wise distances or degrees of similarity between clusters. Initially,
the similarity matrix contains the pair-wise distance between individual pairs
of records. As discussed earlier, there are many measures of similarity between
records, including the Euclidean distance, the angle between vectors, and the
ratio of matching to nonmatching categorical fields. The issues raised by the
choice of distance measures are exactly the same as those previously discussed
in relation to the K-means approach.
It might seem that with N initial clusters for N data points, N² measurement
calculations are required to create the distance table. If the similarity measure
is a true distance metric, only half that is needed because all true distance met­
rics follow the rule that Distance(X,Y) = Distance(Y,X). In the vocabulary of
mathematics, the similarity matrix is lower triangular. The next step is to find
the smallest value in the similarity matrix. This identifies the two clusters that
are most similar to one another. Merge these two clusters into a new one and
update the similarity matrix by replacing the two rows that described the par­
ent cluster with a new row that describes the distance between the merged
cluster and the remaining clusters. There are now N - 1 clusters and N - 1 rows
in the similarity matrix.
Repeat the merge step N - 1 times, so that all records belong to the same large
cluster. Each iteration remembers which clusters were merged and the dis­
tance between them. This information is used to decide which level of cluster­
ing to make use of.
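The merge loop can be sketched as follows. For brevity, this version recomputes cluster distances on each pass instead of maintaining the similarity matrix described above, and `single_linkage` is just one possible cluster distance measure (the minimum distance between members); the sample points are invented.

```python
import itertools

def agglomerate(points, cluster_distance):
    """Repeatedly merge the two closest clusters, recording the
    distance and the contents of each merge."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]]))
        d = cluster_distance(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        # Replace the two parent clusters with the merged cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((d, merged))
    return history

def single_linkage(a, b):
    """Distance between the closest members of two clusters (1-D)."""
    return min(abs(x - y) for x in a for y in b)

history = agglomerate([1, 2, 10, 11], single_linkage)
# history records each merge and its distance, so any level of
# clustering can be recovered afterward.
```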

Distance between Clusters
A bit more needs to be said about how to measure distance between clusters.
On the first trip through the merge step, the clusters consist of single records
so the distance between clusters is the same as the distance between records, a
subject that has already been covered in perhaps too much detail. Second and

subsequent trips through the loop need to update the similarity matrix with
the distances from the new, multirecord cluster to all the others. How do we
measure this distance?
As usual, there is a choice of approaches. Three common ones are:

- Single linkage
- Complete linkage
- Centroid distance
In the single linkage method, the distance between two clusters is given by the
distance between the closest members. This method produces clusters with the
property that every member of a cluster is more closely related to at least one
member of its cluster than to any point outside it.
Another approach is the complete linkage method, where the distance between
two clusters is given by the distance between their most distant members. This
method produces clusters with the property that all members lie within some
known maximum distance of one another.
The third method is the centroid distance, where the distance between two clusters
is measured between the centroids of each. The centroid of a cluster is its average
element. Figure 11.10 gives a pictorial representation of these three methods.

Figure 11.10 Three methods of measuring the distance between clusters, marking
the closest pair of clusters under the centroid, complete linkage, and single
linkage methods.
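In one dimension, the three measures can be written directly; the sample clusters are invented for illustration.

```python
def single_linkage(a, b):
    """Distance between the closest members of the two clusters."""
    return min(abs(x - y) for x in a for y in b)

def complete_linkage(a, b):
    """Distance between the most distant members."""
    return max(abs(x - y) for x in a for y in b)

def centroid_distance(a, b):
    """Distance between the clusters' average elements."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

a, b = [1.0, 2.0, 3.0], [5.0, 9.0]
# single: |3-5| = 2, complete: |1-9| = 8, centroid: |2-7| = 5
```

Note that the three measures can disagree about which pair of clusters is closest, which is why they can produce different cluster hierarchies.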

Clusters and Trees
The agglomeration algorithm creates hierarchical clusters. At each level in the
hierarchy, clusters are formed from the union of two clusters at the next level
down. A good way of visualizing these clusters is as a tree. Of course, such a
tree may look like the decision trees discussed in Chapter 6, but there are some
important differences. The most important is that the nodes of the cluster tree
do not embed rules describing why the clustering takes place; the nodes sim­
ply state the fact that the two children have the minimum distance of all pos­
sible clusters pairs. Another difference is that a decision tree is created to
maximize the leaf purity of a given target variable. There is no target for the
cluster tree, other than self-similarity within each cluster. Later in this chapter
we'll discuss divisive clustering methods. These are similar to the agglomerative
methods, except that agglomerative methods are built by starting from the
leaves and working toward the root, whereas divisive methods start at the
root and work down to the leaves.

Clustering People by Age: An Example of
Agglomerative Clustering
This illustration of agglomerative clustering uses an example in one dimen­
sion with the single linkage measure for distance between clusters. These
choices make it possible to follow the algorithm through all its iterations
without having to worry about calculating distances using squares and square roots.
The data consists of the ages of people at a family gathering. The goal is to
cluster the participants using their age, and the metric for the distance between
two people is simply the difference in their ages. The metric for the distance
between two clusters of people is the difference in age between the oldest
member of the younger cluster and the youngest member of the older cluster.
(This is the one-dimensional version of the single linkage measure.)
Because the distances are so easy to calculate, the example dispenses with
the similarity matrix. The procedure is to sort the participants by age, then
begin clustering by first merging clusters that are 1 year apart, then 2 years,
and so on until there is only one big cluster.
Figure 11.11 shows the clusters after six iterations, with three clusters
remaining. This is the level of clustering that seems the most useful. The
algorithm appears to have clustered the population into three generations:
children, parents, and grandparents.
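In one dimension with single linkage, stopping the merging at a given distance is equivalent to sorting the ages and splitting wherever the gap between neighbors exceeds that distance. A sketch, using an invented guest list since the actual ages appear only in the figure:

```python
def cluster_by_age(ages, max_gap):
    """Single-linkage clustering in one dimension: sort the ages and
    start a new cluster wherever the gap to the previous person
    exceeds max_gap."""
    ages = sorted(ages)
    clusters = [[ages[0]]]
    for prev, age in zip(ages, ages[1:]):
        if age - prev > max_gap:
            clusters.append([age])
        else:
            clusters[-1].append(age)
    return clusters

# A hypothetical family gathering (these ages are invented).
ages = [1, 3, 5, 8, 9, 12, 33, 35, 38, 40, 42, 62, 65, 67]
# Stopping after the distance-6 merges leaves three clusters,
# corresponding to children, parents, and grandparents.
generations = cluster_by_age(ages, max_gap=6)
```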
