
Saying that a problem has many dimensions is an invitation to analyze it geometrically. A dimension is any attribute that must be measured independently in order to describe something. In other words, if there are N variables, imagine a space in which the value of each variable represents a distance along the corresponding axis in an N-dimensional space. A single record containing a value for each of the N variables can be thought of as the vector that defines a particular point in that space. When there are two dimensions, this is easily plotted. The HR diagram was one such example. Figure 11.2 is another example that plots the height and weight of a group of teenagers as points on a graph. Notice the clustering of boys and girls.
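To make the mapping concrete, here is a minimal sketch in Python (the records and measurement values are invented for this illustration) that treats two records as points and measures the straight-line distance between them:

import math

# Each record becomes a vector with one value per measured dimension.
record_a = (65.0, 130.0)   # (height in inches, weight in pounds)
record_b = (72.0, 185.0)

def euclidean_distance(p, q):
    """Straight-line distance between two N-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance(record_a, record_b))   # roughly 55.4

Adding a third measurement simply adds a third component to each vector; the distance calculation itself is unchanged.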



The chart in Figure 11.2 begins to give a rough idea of people's shapes. But if the goal is to fit them for clothes, a few more measurements are needed!
In the 1990s, the U.S. army commissioned a study on how to redesign the uniforms of female soldiers. The army's goal was to reduce the number of different uniform sizes that had to be kept in inventory, while still providing each soldier with well-fitting uniforms.
As anyone who has ever shopped for women's clothing is aware, there is already a surfeit of classification systems (even sizes, odd sizes, plus sizes, junior, petite, and so on) for categorizing garments by size. None of these systems was designed with the needs of the U.S. military in mind. Susan Ashdown and Beatrix Paal, researchers at Cornell University, went back to the basics; they designed a new set of sizes based on the actual shapes of women in the army.1


Figure 11.2 Heights and weights of a group of teenagers. (The scatter plot shows weight in pounds, from about 100 to 200, on the horizontal axis and height in inches, from about 60 to 80, on the vertical axis.)



1. Ashdown, Susan P. 1998. “An Investigation of the Structure of Sizing Systems: A Comparison of Three Multidimensional Optimized Sizing Systems Generated from Anthropometric Data,” International Journal of Clothing Science and Technology, Vol. 10, No. 5, pp. 324-341.


Unlike the traditional clothing size systems, the one Ashdown and Paal came up with is not an ordered set of graduated sizes where all dimensions increase together. Instead, they came up with sizes that fit particular body types. Each body type corresponds to a cluster of records in a database of body measurements. One cluster might consist of short-legged, small-waisted, large-busted women with long torsos, average arms, broad shoulders, and skinny necks, while other clusters capture other constellations of measurements.
The database contained more than 100 measurements for each of nearly
3,000 women. The clustering technique employed was the K-means algorithm,
described in the next section. In the end, only a handful of the more than 100
measurements were needed to characterize the clusters. Finding this smaller
number of variables was another benefit of the clustering process.


K-Means Clustering
The K-means algorithm is one of the most commonly used clustering algorithms. The “K” in its name refers to the fact that the algorithm looks for a fixed number of clusters, defined in terms of the proximity of data points to one another. The version described here was first published by J. B. MacQueen in 1967. For ease of explanation, the technique is illustrated using two-dimensional diagrams. Bear in mind that in practice the algorithm usually handles many more than two independent variables. This means that instead of points corresponding to two-element vectors (x1, x2), the points correspond to n-element vectors (x1, x2, . . . , xn). The procedure itself is unchanged.


Three Steps of the K-Means Algorithm
In the first step, the algorithm randomly selects K data points to be the seeds.
MacQueen's algorithm simply takes the first K records. In cases where the
records have some meaningful order, it may be desirable to choose widely
spaced records, or a random selection of records. Each of the seeds is an
embryonic cluster with only one element. This example sets the number of
clusters to 3.
The second step assigns each record to the closest seed. One way to do this is by finding the boundaries between the clusters, as shown geometrically in Figure 11.3. The boundary between two clusters is the set of points that are equally close to each cluster. Recalling a lesson from high-school geometry makes this less difficult than it sounds: given any two points, A and B, all points that are equidistant from A and B fall along a line (called the perpendicular bisector) that is perpendicular to the one connecting A and B and halfway between them. In Figure 11.3, dashed lines connect the initial seeds; the resulting cluster boundaries, shown with solid lines, are at right angles to the dashed lines. Using these lines as guides, it is obvious which records are closest to which seeds. In three dimensions, these boundaries would be planes, and in N dimensions they would be hyperplanes of dimension N - 1. Fortunately, computer algorithms easily handle these situations. Finding the actual boundaries between clusters is useful for showing the process geometrically. In practice, though, the algorithm usually measures the distance of each record to each seed and chooses the minimum distance for this step.
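For readers who want the algebra behind that geometric fact, the derivation is short. Take two seeds A and B and ask which points X are equally far from both, using ordinary straight-line distance. Squaring both distances and expanding gives:

   ||X - A||^2 = ||X - B||^2
   X·X - 2A·X + A·A = X·X - 2B·X + B·B
   2(B - A)·X = B·B - A·A

The last line is a single linear equation in the coordinates of X, so the set of solutions is a line in two dimensions, a plane in three, and a hyperplane of dimension N - 1 in general; it passes through the midpoint of A and B, perpendicular to the segment joining them.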
For example, consider the record with the box drawn around it. On the basis
of the initial seeds, this record is assigned to the cluster controlled by seed
number 2 because it is closer to that seed than to either of the other two.
At this point, every point has been assigned to exactly one of the three clusters centered around the original seeds. The third step is to calculate the centroids of the clusters; these now do a better job of characterizing the clusters than the initial seeds did. Finding the centroids is simply a matter of taking the average value of each dimension over all the records in the cluster.
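Computing a centroid really is nothing more than averaging, dimension by dimension, over the records assigned to the cluster. A minimal sketch in Python, with made-up values:

def centroid(points):
    """Average each dimension over all the points in a cluster."""
    count = len(points)
    return tuple(sum(values) / count for values in zip(*points))

cluster = [(65.0, 130.0), (63.0, 120.0), (67.0, 140.0)]
print(centroid(cluster))   # (65.0, 130.0)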
In Figure 11.4, the new centroids are marked with a cross. The arrows show
the motion from the position of the original seeds to the new centroids of the
clusters formed from those seeds.




Figure 11.3 The initial seeds determine the initial cluster boundaries. (The plot shows seeds 1, 2, and 3 on axes X1 and X2.)




Figure 11.4 The centroids are calculated from the points that are assigned to each cluster.


The centroids become the seeds for the next iteration of the algorithm. Step 2 is repeated, and each point is once again assigned to the cluster with the closest centroid. Figure 11.5 shows the new cluster boundaries, formed as before by drawing lines equidistant between each pair of centroids. Notice that the point with the box around it, which was originally assigned to cluster number 2, has now been assigned to cluster number 1. The process of assigning points to clusters and then recalculating centroids continues until the cluster boundaries stop changing. In practice, the K-means algorithm usually finds a set of stable clusters after a few dozen iterations.
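Putting the three steps together, the sketch below shows one way the whole loop can be written in Python. It is an illustration written for this discussion rather than the code behind any particular product; it assumes purely numeric records, ordinary Euclidean distance, and initial seeds chosen at random:

import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(records, k, max_iterations=100):
    # Step 1: pick K records at random to serve as the initial seeds.
    seeds = random.sample(records, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every record to its closest seed.
        new_assignments = [min(range(k), key=lambda i: euclidean(r, seeds[i]))
                           for r in records]
        # Stop when no record changes cluster; the boundaries are stable.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each seed to the centroid of the records assigned to it.
        for i in range(k):
            members = [r for r, a in zip(records, assignments) if a == i]
            if members:
                seeds[i] = tuple(sum(values) / len(members)
                                 for values in zip(*members))
    return seeds, assignments

Stopping when no record changes cluster is the same as stopping when the boundaries stop moving, since the centroids, and therefore the boundaries, are recomputed only from the assignments.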


What K Means
Clusters describe underlying structure in data. However, there is no one right
description of that structure. For instance, someone not from New York City
may think that the whole city is “downtown.” Someone from Brooklyn or
Queens might apply this nomenclature to Manhattan. Within Manhattan, it
might only be neighborhoods south of 23rd Street. And even there, “downtown” might still be reserved only for the taller buildings at the southern tip of
the island. There is a similar problem with clustering; structures in data exist
at many different levels.




Figure 11.5 At each iteration, all cluster assignments are reevaluated.


Descriptions of K-means and related algorithms gloss over the selection of
K. But since, in many cases, there is no a priori reason to select a particular
value, there is really an outermost loop to these algorithms that occurs during
analysis rather than in the computer program. This outer loop consists of performing automatic cluster detection using one value of K, evaluating the
results, then trying again with another value of K or perhaps modifying the
data. After each trial, the strength of the resulting clusters can be evaluated by
comparing the average distance between records in a cluster with the average
distance between clusters, and by other procedures described later in this
chapter. These tests can be automated, but the clusters must also be evaluated
on a more subjective basis to determine their usefulness for a given application. As shown in Figure 11.6, different values of K may lead to very different
clusterings that are equally valid. The figure shows clusterings of a deck of
playing cards for K = 2 and K = 4. Is one better than the other? It depends on
the use to which the clusters will be put.
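Part of that evaluation can be automated. One simple figure of merit, and by no means the only one, compares the average distance from each record to its own cluster's centroid with the average distance between the centroids themselves; smaller ratios suggest tighter, better-separated clusters. The sketch below reuses the euclidean and k_means functions from the earlier sketch:

def cluster_strength(records, seeds, assignments):
    """Ratio of average within-cluster distance to average distance
    between centroids; smaller is better."""
    within = sum(euclidean(r, seeds[a])
                 for r, a in zip(records, assignments)) / len(records)
    pairs = [(i, j) for i in range(len(seeds))
             for j in range(i + 1, len(seeds))]
    between = sum(euclidean(seeds[i], seeds[j]) for i, j in pairs) / len(pairs)
    return within / between

# Try several values of K and compare the scores (records is assumed to be
# a list of numeric tuples prepared elsewhere):
# for k in range(2, 8):
#     seeds, assignments = k_means(records, k)
#     print(k, cluster_strength(records, seeds, assignments))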




Figure 11.6 These examples of clusters of size 2 and 4 in a deck of playing cards illustrate
that there is no one correct clustering.


Often the first time K-means clustering is run on a given set of data, most
of the data points fall in one giant central cluster and there are a number of
smaller clusters outside it. This is often because most records describe “normal” variations in the data, but there are enough outliers to confuse the clustering algorithm. This type of clustering may be valuable for applications such
as identifying fraud or manufacturing defects. In other applications, it may be
desirable to filter outliers from the data; more often, the solution is to massage
the data values. Later in this chapter, there is a section on data preparation for clustering, which describes how to work with variables to make it easier to find
meaningful clusters.


Similarity and Distance
Once records in a database have been mapped to points in space, automatic
cluster detection is really quite simple: a little geometry, some vector means, et voilà! The problem, of course, is that the databases encountered in marketing, sales, and customer support are not about points in space. They are about
purchases, phone calls, airplane trips, car registrations, and a thousand other
things that have no obvious connection to the dots in a cluster diagram.
Clustering records of this sort requires some notion of natural association;
that is, records in a given cluster are more similar to each other than to records
in another cluster. Since it is difficult to convey intuitive notions to a computer,


this vague concept of association must be translated into some sort of numeric
measure of the degree of similarity. The most common method, but by no
means the only one, is to translate all fields into numeric values so that the
records may be treated as points in space. Then, if two points are close in
the geometric sense, they represent similar records in the database. There are
two main problems with this approach:
- Many variable types, including all categorical variables and many numeric variables such as rankings, do not have the right behavior to properly be treated as components of a position vector.
- In geometry, the contributions of each dimension are of equal importance, but in databases, a small change in one field may be much more important than a large change in another field (a simple way to address this is sketched below).
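A common response to the second problem, discussed more fully in the data preparation section later in this chapter, is to rescale each numeric variable before computing distances, for example by converting every value to the number of standard deviations it lies from that variable's mean (a z-score). The sketch below illustrates the idea; it is not a complete treatment of data preparation:

import statistics

def z_score_columns(records):
    """Rescale every column to mean 0 and standard deviation 1 so that
    no single field dominates the distance calculation."""
    columns = list(zip(*records))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [tuple((value - m) / s
                  for value, m, s in zip(row, means, stdevs))
            for row in records]

After rescaling, a difference of one unit means “one standard deviation” in every dimension, so a field measured in dollars no longer overwhelms a field measured in years simply because its numbers are larger.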
