

another offer. However, a second look reveals that if changing carriers would require her to change her telephone number it would be a big inconvenience for Jane. (In the United States, number portability has been a long time coming. It finally arrived in November 2003, shortly before this edition was published, perhaps invalidating many existing churn models.) By looking at the number of different people who call her, we see that Jane is quite dependent on her wireless telephone number; she uses features like voicemail and stores important numbers in her cell phone. The number of people she would have to notify is inertia that keeps her from changing providers. John has no such inertia and might have no allegiance to his wireless provider, as long as a competing provider can provide uninterrupted service for his 45-minute call on Wednesday mornings.
Jane also has a lot of influence. Since she talks to so many different people, they will all know if she is satisfied or dissatisfied with her service. She is a customer that the cellular company wants to keep happy. But she is not a customer that traditional methods of segmentation would have located.

The Power of Link Analysis
Link analysis played two roles in this analysis of cellular phone data. The first was visualization. The ability to see some of the graphs representing call patterns makes patterns for things like inertia or influence much more obvious. Visualizing the data makes it possible to see patterns that lead to further questions. For this example, we chose two profitable customers considered similar by previous segmentation techniques. Link analysis showed their specific calling patterns and suggested how the customers differ. On the other hand, looking at the call patterns for all customers at the same time would require drawing a graph with hundreds of thousands or millions of nodes and hundreds of millions of edges.

Second, link analysis can apply the concepts generated by visualization to
larger sets of customers. For instance, a churn reduction program might avoid
targeting customers who have high inertia or be sure to target customers with
high influence. This requires traversing the call graph to calculate the inertia or
influence for all customers. Such derived characteristics can play an important
role in marketing efforts.
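As a concrete sketch of such a derived characteristic, the number of distinct people a customer talks with (a simple proxy for both inertia and influence) can be computed directly from caller/callee pairs. The record format and names below are invented for illustration, not taken from any particular billing system:

```python
from collections import defaultdict

def contact_counts(calls):
    """Given (caller, callee) pairs, count distinct contacts per number.

    A customer who talks with many distinct people has high influence
    (word of mouth) and high inertia (many people to notify after a
    number change).
    """
    contacts = defaultdict(set)
    for caller, callee in calls:
        # Treat the call graph as undirected: either party is a contact.
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {number: len(people) for number, people in contacts.items()}

calls = [("jane", "a"), ("jane", "b"), ("c", "jane"), ("john", "x")]
print(contact_counts(calls))  # jane talks with 3 distinct people, john with 1
```

In practice this count would be computed over billions of call detail records and attached to each customer as a new attribute for downstream models.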
Different marketing programs might suggest looking for other features in the call graph. For instance, perhaps the ability to place a conference call would be desirable, but who would be the best prospects? One idea would be to look for groups of customers that all call each other. Stated as a graph problem, this group is a fully connected subgraph. In the telephone industry, these subgraphs are called “communities of interest.” A community of interest may represent a group of customers who would be interested in the ability to place conference calls.
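Stated that way, finding communities of interest is a clique-finding problem. Exhaustive clique search is exponential in general, so production systems use specialized algorithms, but for a small illustrative graph (with made-up edges) a brute-force sketch states the definition precisely:

```python
from itertools import combinations

def communities_of_interest(edges, size):
    """Brute-force search for fully connected subgraphs (cliques) of a
    given size. Practical only for small graphs, but it captures the
    definition: every pair in the group must be connected."""
    nodes = sorted({n for edge in edges for n in edge})
    edge_set = {frozenset(edge) for edge in edges}
    cliques = []
    for group in combinations(nodes, size):
        if all(frozenset(pair) in edge_set for pair in combinations(group, 2)):
            cliques.append(group)
    return cliques

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")]
print(communities_of_interest(edges, 3))  # [('a', 'b', 'c')]
```

Here a, b, and c all call each other, so they form a community of interest; d calls only a and does not qualify.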

Lessons Learned
Link analysis is an application of the mathematical field of graph theory. As a
data mining technique, link analysis has several strengths:
It capitalizes on relationships.

It is useful for visualization.

It creates derived characteristics that can be used for further mining.

Some data and data mining problems naturally involve links. As the two case studies about telephone data show, link analysis is very useful for telecommunications: a telephone call is a link between two people. Opportunities for link analysis are most obvious in fields where the links are obvious, such as telephony, transportation, and the World Wide Web. Link analysis is also appropriate in other areas where the connections do not have such a clear manifestation, such as physician referral patterns, retail sales data, and forensic analysis for crimes.
Links are a very natural way to visualize some types of data. Direct visualization of the links can be a big aid to knowledge discovery. Even when patterns are found automatically, visualization of the links helps to better understand what is happening. Link analysis offers an alternative way of looking at data, different from the formats of relational databases and OLAP tools. Links may suggest important patterns in the data, but the significance of the patterns requires a person for interpretation.
Link analysis can lead to new and useful data attributes. Examples include calculating an authority score for a page on the World Wide Web and calculating the sphere of influence for a telephone user.
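The authority score mentioned here can be sketched with a simplified version of the HITS idea: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. This is a toy power-iteration version, not any search engine's actual implementation, and the link data is invented:

```python
def hits(links, iterations=50):
    """Simplified HITS scoring over (source, destination) links.

    Alternates: authority score = sum of hub scores of pages linking in;
    hub score = sum of authority scores of pages linked to. Each vector
    is normalized so the scores stay comparable across iterations.
    """
    nodes = {n for edge in links for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {n: sum(hub[s] for s, d in links if d == n) for n in nodes}
        norm = sum(auth.values()) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        hub = {n: sum(auth[d] for s, d in links if s == n) for n in nodes}
        norm = sum(hub.values()) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return auth, hub

# Pages a and b both link to c; c links to d.
auth, hub = hits([("a", "c"), ("b", "c"), ("c", "d")])
```

Page c, endorsed by two hubs, ends up with the highest authority score, which could then be stored as a derived attribute for each page.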

Although link analysis is very powerful when applicable, it is not appropriate for all types of problems. It is not a prediction or classification tool like a neural network that takes data in and produces an answer. Many types of data are simply not appropriate for link analysis. Its strongest use is probably in finding specific patterns, such as the types of outgoing calls, which can then be applied to the data. These patterns can be turned into new features of the data, for use in conjunction with other directed data mining techniques.


Automatic Cluster Detection

The data mining techniques described in this book are used to find meaningful patterns in data. These patterns are not always immediately forthcoming. Sometimes this is because there are no patterns to be found. Other times, the problem is not the lack of patterns, but the excess. The data may contain so much complex structure that even the best data mining techniques are unable to coax out meaningful patterns. When mining such a database for the answer to some specific question, competing explanations tend to cancel each other out. As with radio reception, too many competing signals add up to noise. Clustering provides a way to learn about the structure of complex data, to break up the cacophony of competing signals into its components.
When human beings try to make sense of complex questions, our natural tendency is to break the subject into smaller pieces, each of which can be explained more simply. If someone were asked to describe the color of trees in the forest, the answer would probably make distinctions between deciduous trees and evergreens, and between winter, spring, summer, and fall. People know enough about woodland flora to predict that, of all the hundreds of variables associated with the forest, season and foliage type, rather than, say, age and height, are the best factors to use for forming clusters of trees that follow similar coloration rules.
Once the proper clusters have been defined, it is often possible to find simple patterns within each cluster. “In winter, deciduous trees have no leaves, so the trees tend to be brown” or “The leaves of deciduous trees change color in the autumn, typically to oranges, reds, and yellows.” In many cases, a very noisy dataset is actually composed of a number of better-behaved clusters. The question is: how can these be found? That is where techniques for automatic cluster detection come in: to help see the forest without getting lost in the trees.
This chapter begins with two examples of the usefulness of clustering, one drawn from astronomy, another from clothing design. It then introduces the K-Means clustering algorithm which, like the nearest neighbor techniques discussed in Chapter 8, depends on a geometric interpretation of data. The geometric ideas used in K-Means bring up the more general topic of measures of similarity, association, and distance. These distance measures are quite sensitive to variations in how data is represented, so the next topic addressed is data preparation for clustering, with special attention being paid to scaling and weighting. K-Means is not the only algorithm in common use for automatic cluster detection. This chapter contains brief discussions of several others: Gaussian mixture models, agglomerative clustering, and divisive clustering. (Another clustering technique, self-organizing maps, is covered in Chapter 7 because self-organizing maps are a form of neural network.) The chapter concludes with a case study in which automatic cluster detection is used to evaluate editorial zones for a major daily newspaper.
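Before the details arrive, the core K-Means loop is short enough to sketch. This toy pure-Python version (not any particular tool's implementation) alternates between assigning each point to its nearest centroid and moving each centroid to the mean of the points assigned to it:

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Toy K-Means: assign points to nearest centroid, then move each
    centroid to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k starting points at random
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated blobs in two dimensions.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
```

On data like this, the loop quickly settles with one centroid inside each blob, which is exactly the geometric interpretation the chapter develops.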

Searching for Islands of Simplicity
In Chapter 1, where data mining techniques are classified as directed or undirected, automatic cluster detection is described as a tool for undirected knowledge discovery. In the technical sense, that is true because the automatic cluster detection algorithms themselves are simply finding structure that exists in the data without regard to any particular target variable. Most data mining tasks start out with a preclassified training set, which is used to develop a model capable of scoring or classifying previously unseen records. In clustering, there is no preclassified data and no distinction between independent and dependent variables. Instead, clustering algorithms search for groups of records, the clusters, composed of records similar to each other. The algorithms discover these similarities. It is up to the people running the analysis to determine whether similar records represent something of interest to the business, or something inexplicable and perhaps unimportant.
In a broader sense, however, clustering can be a directed activity because clusters are sought for some business purpose. In marketing, clusters formed for a business purpose are usually called “segments,” and customer segmentation is a popular application of clustering.
Automatic cluster detection is a data mining technique that is rarely used in isolation because finding clusters is not often an end in itself. Once clusters have been detected, other methods must be applied in order to figure out what the clusters mean. When clustering is successful, the results can be dramatic: One famous early application of cluster detection led to our current understanding of stellar evolution.

Star Light, Star Bright
Early in the twentieth century, astronomers trying to understand the relationship between the luminosity (brightness) of stars and their temperatures made scatter plots like the one in Figure 11.1. The vertical scale measures luminosity in multiples of the brightness of our own sun. The horizontal scale measures surface temperature in degrees Kelvin (degrees centigrade above absolute zero, the theoretical coldest possible temperature).


[Figure: a scatter plot of luminosity (vertical axis, Sun = 1, down to 10^-4) against surface temperature (horizontal axis, 40,000 down to 2,500 degrees Kelvin), with three labeled clusters: Red Giants, the Main Sequence, and White Dwarfs.]
Figure 11.1 The Hertzsprung-Russell diagram clusters stars by temperature and luminosity.

Two different astronomers, Ejnar Hertzsprung in Denmark and Henry Norris Russell in the United States, thought of doing this at about the same time. They both observed that in the resulting scatter plot, the stars fall into three clusters. This observation led to further work and the understanding that these three clusters represent stars in very different phases of the stellar life cycle. The relationship between luminosity and temperature is consistent within each cluster, but the relationship is different between the clusters because fundamentally different processes are generating the heat and light. The 80 percent of stars that fall on the main sequence are generating energy by converting hydrogen to helium through nuclear fusion. This is how all stars spend most of their active life. After some number of billions of years, the hydrogen is used up. Depending on its mass, the star then begins fusing helium or the fusion stops. In the latter case, the core of the star collapses, generating a great deal of heat in the process. At the same time, the outer layer of gasses expands away from the core, and a red giant is formed. Eventually, the outer layer of gasses is stripped away, and the remaining core begins to cool. The star is now a white dwarf.
A recent search on Google using the phrase “Hertzsprung-Russell Diagram” returned thousands of pages of links to current astronomical research based on cluster detection of this kind. Even today, clusters based on the HR diagram are being used to hunt for brown dwarfs (starlike objects that lack sufficient mass to initiate nuclear fusion) and to understand pre-main-sequence stellar evolution.

Fitting the Troops
The Hertzsprung-Russell diagram is a good introductory example of clustering because with only two variables, it is easy to spot the clusters visually (and, incidentally, it is a good example of the importance of good data visualizations). Even in three dimensions, picking out clusters by eye from a scatter plot cube is not too difficult. If all problems had so few dimensions, there would be no need for automatic cluster detection algorithms. As the number of dimensions (independent variables) increases, it becomes increasingly difficult to visualize clusters. Our intuition about how close things are to each other also quickly breaks down with more dimensions.
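This breakdown of intuition can be demonstrated numerically. For random points in a unit hypercube, the gap between the nearest and farthest neighbor shrinks, relative to the farthest distance, as dimensions are added. The experiment below is a simple sketch of that effect:

```python
import math
import random

def relative_distance_spread(dim, n=200, seed=1):
    """Relative gap between the nearest and farthest neighbor of one
    reference point, for n random points in a dim-dimensional unit cube.

    Values near 1 mean distances are well spread out; values near 0 mean
    every neighbor looks almost equally far away.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    reference = points[0]
    dists = [math.dist(reference, p) for p in points[1:]]
    return (max(dists) - min(dists)) / max(dists)

# The spread shrinks sharply as dimensionality grows.
print(relative_distance_spread(2), relative_distance_spread(100))
```

In two dimensions the nearest neighbor is dramatically closer than the farthest; in a hundred dimensions the distances bunch together, which is one reason distance-based cluster detection requires careful data preparation.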

