

[Dendrogram: clusters form and merge at distances 1 through 4 on the vertical axis.]
Figure 11.11 An example of agglomerative clustering.

Divisive Clustering
We have already noted some similarities between trees formed by the
agglomerative clustering techniques and ones formed by decision tree
algorithms. Although the agglomerative methods work from the leaves to the
root, while the decision tree algorithms work from the root to the leaves, they
both create a similar hierarchical structure. The hierarchical structure reflects
another similarity between the methods. Decisions made early on in the
process are never revisited, which means that some fairly simple clusters may
not be detected if an early split or agglomeration destroys the structure.
Seeing the similarity between the trees produced by the two methods, it is
natural to ask whether the algorithms used for decision trees may also be used
for clustering. The answer is yes. A decision tree algorithm starts with the entire
collection of records and looks for a way to split it into partitions that are purer,
in some sense defined by a purity function. In the standard decision tree
algorithms, the purity function uses a separate variable (the target variable) to
make this decision. All that is required to turn decision trees into a clustering
algorithm is to supply a purity function chosen to either minimize the average
intracluster distance or maximize the intercluster distances. An example of
such a purity function is the average distance from the centroid of the parent.
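The purity function just described can be sketched in a few lines. The code below is an illustrative implementation, not taken from any particular decision tree package: it scores a candidate split by how much it reduces the average distance of records from their cluster's centroid.

```python
import math

def centroid(records):
    """Mean point of a list of records (each record a tuple of numbers)."""
    n, dims = len(records), len(records[0])
    return tuple(sum(r[d] for r in records) / n for d in range(dims))

def avg_distance(records, center):
    """Average Euclidean distance of each record from a center point."""
    return sum(math.dist(r, center) for r in records) / len(records)

def split_purity_gain(parent, left, right):
    """Drop in average distance-to-centroid achieved by splitting the
    parent into two children; positive means a tighter partition."""
    before = avg_distance(parent, centroid(parent))
    n = len(parent)
    after = (len(left) / n) * avg_distance(left, centroid(left)) \
          + (len(right) / n) * avg_distance(right, centroid(right))
    return before - after
```

A split separating two genuinely distinct groups of records yields a large positive gain, while a split that mixes the groups yields little or none, which is exactly what lets the tree-growing machinery drive undirected clustering.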
With no change in the purity function, we might say that decision trees
provide directed clustering; that is, they create clusters of records that are
similar with respect to some target variable. For this reason, ordinary decision
trees are often a better choice for customer segmentation than the undirected
clustering algorithms discussed in this chapter. If the purpose of the customer
segmentation is to find customer segments that are loyal or profitable or likely
to respond to some particular offer, it makes sense to use one of those variables
(or a proxy) as the target for directed clustering. If, on the other hand, the
point of the customer segmentation is to stimulate discussion of new product
offerings geared to various naturally occurring clusters of customers, then an
undirected approach is more appropriate.

Self-Organizing Maps
Self-organizing maps are a variant of neural networks that have been used for
many years in applications such as feature detection in two-dimensional
images. More recently, they have been applied successfully to more general
clustering applications. There is a discussion of self-organizing networks in
Chapter 7.

Evaluating Clusters
When using the K-means approach to cluster detection, is there a way to
determine what value of K finds the best clusters? Similarly, when using a
hierarchical approach, is there a test for which level in the hierarchy contains
the best clusters? What does it mean to say that a cluster is good?
These questions are important when using clustering in practice. In general
terms, clusters should have members that have a high degree of similarity (or,
in geometric terms, are close to each other) and the clusters themselves
should be widely spaced.
A standard measure of within-cluster similarity is variance (the sum of the
squared differences of each element from the mean), so the best set of clusters
might be the set whose clusters have the lowest variance. However, this
measure does not take into account cluster size. A similar measure would be
the average variance, the total variance divided by the size of the cluster.
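In code, the two measures look like this; this is a minimal sketch of the definitions above for a single numeric variable:

```python
def cluster_variance(values):
    """Sum of squared differences of each element from the cluster mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def average_variance(values):
    """Total variance divided by the size of the cluster, so that larger
    clusters are not penalized simply for having more members."""
    return cluster_variance(values) / len(values)
```

Note that on these definitions a tight cluster scores lower than a loose one of the same size, and the averaging step removes the size effect.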
For agglomerative clustering, using variance as a measure does not make
sense, since this method always starts out with clusters of one, which, of
course, have variance zero. A good measure to use with agglomerative clusters
is the difference between the distance value at which a cluster was formed and
the distance value at which it is merged into the next level. This is a measure
of the durability of the cluster. Strong clusters, like the one linking 1 to
13-year-olds at distance 3 in Figure 11.11, last through many iterations of the
algorithm.
A general-purpose measure that works with any form of cluster detection is to
take whatever similarity measure or distance metric is used to form the clusters
and use it to compare the average distance between members of a cluster and
the cluster centroid to the average distance between cluster centroids. This can
be done for each cluster individually and for the entire collection of clusters.
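One way this general-purpose measure might be implemented is as the ratio of the average member-to-centroid distance to the average centroid-to-centroid distance; the function below is a hypothetical sketch, not a standard library routine, and a smaller score indicates tighter, better-separated clusters:

```python
import math

def centroid(records):
    """Mean point of a list of records (each record a tuple of numbers)."""
    n, dims = len(records), len(records[0])
    return tuple(sum(r[d] for r in records) / n for d in range(dims))

def cluster_separation_score(clusters):
    """Ratio of the average within-cluster distance (member to its own
    centroid) to the average between-cluster distance (centroid to
    centroid). Smaller is better."""
    centers = [centroid(c) for c in clusters]
    intra = sum(math.dist(r, ctr)
                for c, ctr in zip(clusters, centers)
                for r in c) / sum(len(c) for c in clusters)
    pairs = [(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]]
    inter = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return intra / inter
```

The same intra-to-centroid distance can also be reported per cluster to single out the weak clusters individually.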

TIP If there are one or two good clusters along with a number of weaker ones,
it may be possible to improve results by removing all members of the strong
clusters. The strong clusters are worthy of further analysis anyway, and removing
their strong pull may allow new clusters to be detected in the records left behind.

Inside the Cluster
Clustering often produces one or more strong clusters: rather large clusters
where the records are quite similar. The question is, what makes a strong
cluster special? What is it about the records in this cluster that causes them to
be lumped together? Even more importantly, is it possible to find rules and
patterns within this cluster now that the noise from the rest of the database
has been reduced?
The easiest way to approach these questions is to take the mean of each
variable within the cluster and compare it to the mean of the same variable in
the parent population. Rank order the variables by the magnitude of the
difference or, better yet, by the z-score. Looking at the variables that show the
largest difference between the cluster and the rest of the database goes a long
way toward explaining what makes the cluster special.
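This ranking procedure is easy to express in code. The sketch below assumes a simple data layout (a dictionary mapping each variable name to its list of values) chosen for illustration:

```python
import math

def zscore_profile(cluster, population):
    """Rank variables by how far the cluster mean sits from the parent
    population mean, in units of the population standard deviation.
    Both arguments map variable name -> list of values."""
    scores = {}
    for var, pop_values in population.items():
        mu = sum(pop_values) / len(pop_values)
        sd = math.sqrt(sum((v - mu) ** 2 for v in pop_values) / len(pop_values))
        cl_values = cluster[var]
        cl_mean = sum(cl_values) / len(cl_values)
        scores[var] = (cl_mean - mu) / sd if sd > 0 else 0.0
    # Largest absolute z-score first: these variables best explain the cluster.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Dividing by the population standard deviation puts variables measured on very different scales (age in years, income in dollars) on a common footing before they are compared.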

Outside the Cluster
Clustering can be useful even when only a single cluster is found. When
screening for a very rare defect, there may not be enough examples to train a
directed data mining model to detect it. One example is testing electric motors
at the factory that makes them. Cluster detection methods can be used on a
sample containing only good motors to determine the shape and size of the
“normal” cluster. When a motor comes along that falls outside the cluster for
any reason, it is suspect. This approach has been used in medicine to detect the
presence of abnormal cells in tissue samples and in telecommunications to
detect calling patterns indicative of fraud.
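The motor-testing idea can be sketched very simply: learn the centroid and extent of the "normal" cluster from known-good examples, then flag anything that falls outside it. The functions and the slack factor below are illustrative assumptions, not a description of any production system:

```python
import math

def fit_normal_cluster(good_examples, slack=1.1):
    """Learn the 'normal' cluster from known-good examples: its centroid
    and a radius covering all of them, with a little slack."""
    n, dims = len(good_examples), len(good_examples[0])
    center = tuple(sum(r[d] for r in good_examples) / n for d in range(dims))
    radius = max(math.dist(r, center) for r in good_examples) * slack
    return center, radius

def is_suspect(record, center, radius):
    """A record that falls outside the normal cluster, for any reason,
    is suspect."""
    return math.dist(record, center) > radius
```

The appeal of this approach is that it needs no examples of the defect itself, only examples of normality, which is exactly the situation when the defect is very rare.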

Case Study: Clustering Towns

The Boston Globe is one of two major dailies serving Boston and the
surrounding area of eastern Massachusetts and southern New Hampshire. The
Globe is the leading circulated newspaper in Boston, with daily circulation of
over 467,000 in 2003 compared to 243,000 for the Boston Herald, the other
major daily in town. On Sundays, the Globe has circulation of over 705,000.
Despite this leading position, in 2003 the Globe did not want to stand still. As
with many newspapers, it faced declining readership in its core Boston market
and strong competition from local papers in the suburban markets where some
of its readers have migrated.
In order to compete better with the suburban papers, the Globe introduced
geographically distinct versions of the paper with specialized editorial content
for each of 12 geographically defined zones. Two days a week, readers are
treated to a few pages of local coverage for their area. The editorial zones were
drawn up using data available to the Globe, common sense, and a map, but no
formal statistical analysis. There were some constraints on the composition of
the editorial zones:
- The zones had to be geographically contiguous so that the trucks
  carrying the localized editions from the central printing plant in Boston
  could take sensible routes.
- The zones had to be reasonably compact and contain sufficient
  population to justify specialized editorial content.
- The editorial zones had to be closely aligned with the geographic zones
  used to sell advertising.
Within these constraints, the Globe wished to design editorial zones that
would group similar towns together. Sounds sensible, but which towns are
similar? That is the question that The Boston Globe brought to us at Data Miners.

Creating Town Signatures
Before deciding which towns belonged together, there needed to be a way of
describing the towns: a town signature with a column for every feature that
might be useful for characterizing a town and comparing it with its neighbors.
As it happened, Data Miners had worked on an earlier project to find towns
with good prospects for future circulation growth that had already defined
town signatures. Those signatures, which had been developed for a regression
model to predict Globe home delivery penetration, turned out to be equally
useful for undirected clustering. This is a fairly common occurrence; once a
useful set of descriptive attributes has been collected it can be used for all sorts
of things. In another example, a long-distance company developed customer
signatures based on call detail data in order to predict fraud and later found
that the same variables were useful for distinguishing between business and
residential users.

TIP Although the time and effort it takes to create a good customer signature
can seem daunting, the effort is repaid over time because the same attributes
often turn out to be predictive for many different target variables. The
oft-quoted rule of thumb that 80 percent of the time spent on a data mining
project goes into data preparation becomes less true when the data
preparation effort can be amortized over several predictive modeling efforts.

The Data
The town signatures were derived from several sources, with most of the
variables coming from town-level U.S. Census data from 1990 and 2001. The
census data provides counts of the number of residents by age, race, ethnic
group, occupation, income, home value, average commute time, and many
other interesting variables. In addition, the Globe had household-level data on
its subscribers supplied by an outside data vendor as well as circulation
figures for each town and subscriber-level information on discount plans,
complaint calls, and type of subscription (daily, Sunday, or both).
There were four basic steps to creating the town signatures:
1. Aggregation
2. Normalization
3. Calculation of trends
4. Creation of derived variables
The first step in turning this data into a town signature was to aggregate
everything to the town level. For example, the subscriber data was aggregated
to produce the total number of subscribers and median subscriber household
income for each town.
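The aggregation step might look like the following sketch. The field names and the two summary statistics are illustrative, not the Globe's actual schema:

```python
from collections import defaultdict
from statistics import median

def aggregate_to_town(subscribers):
    """Roll household-level subscriber records up to the town level,
    producing a subscriber count and median household income per town.
    Each input record is a dict with (hypothetical) 'town' and
    'household_income' fields."""
    by_town = defaultdict(list)
    for rec in subscribers:
        by_town[rec["town"]].append(rec["household_income"])
    return {
        town: {"subscribers": len(incomes), "median_income": median(incomes)}
        for town, incomes in by_town.items()
    }
```

The same pattern (group by town, then summarize) applies to the census counts and circulation figures, so that every source contributes columns keyed by town.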

