The following section introduces several alternative measures of similarity.

Similarity Measures and Variable Type
Geometric distance works well as a similarity measure for well-behaved
numeric variables. A well-behaved numeric variable is one whose value
indicates its placement along the axis that corresponds to it in our geometric
model. Not all variables fall into this category. For this purpose, variables fall
into four classes, listed here in increasing order of suitability for the geometric
model:

- Categorical variables
- Ranks
- Intervals
- True measures

Categorical variables only describe which of several unordered categories a
thing belongs to. For instance, it is possible to label one ice cream pistachio and
another butter pecan, but it is not possible to say that one is greater than the
other or judge which one is closer to black cherry. In mathematical terms, it is
possible to tell that X ≠ Y, but not whether X > Y or X < Y.
Ranks put things in order, but don't say how much bigger one thing is than
another. The valedictorian has better grades than the salutatorian, but we
don't know by how much. If X, Y, and Z are ranked A, B, and C, we know that
X > Y > Z, but we cannot define X − Y or Y − Z.
Intervals measure the distance between two observations. If it is 56° in San
Francisco and 78° in San Jose, then it is 22 degrees warmer at one end of the
bay than the other.
360 Chapter 11

True measures are interval variables that measure from a meaningful zero
point. This trait is important because it means that the ratio of two values of
the variable is meaningful. The Fahrenheit temperature scale used in the
United States and the Celsius scale used in most of the rest of the world do not
have this property. In neither system does it make sense to say that a 30° day is
twice as warm as a 15° day. Similarly, a size 12 dress is not twice as large as a
size 6, and gypsum is not twice as hard as talc though they are 2 and 1 on the
hardness scale. It does make perfect sense, however, to say that a 50-year-old
is twice as old as a 25-year-old or that a 10-pound bag of sugar is twice as
heavy as a 5-pound one. Age, weight, length, customer tenure, and volume are
examples of true measures.
Geometric distance metrics are well-defined for interval variables and true
measures. In order to use categorical variables and rankings, it is necessary to
transform them into interval variables. Unfortunately, these transformations
may add spurious information. If ice cream flavors are assigned arbitrary
numbers 1 through 28, it will appear that flavors 5 and 6 are closely related
while flavors 1 and 28 are far apart.
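To make the pitfall concrete, here is a small Python sketch (the flavor names and code numbers are invented for illustration) contrasting arbitrary integer codes with one-hot encoding, which treats every pair of distinct categories as equally distant:

```python
# Arbitrary integer codes make some flavors look "closer" than others.
flavors = ["pistachio", "butter pecan", "black cherry"]
codes = {"pistachio": 5, "butter pecan": 6, "black cherry": 28}

print(abs(codes["pistachio"] - codes["butter pecan"]))   # 1: looks similar
print(abs(codes["pistachio"] - codes["black cherry"]))   # 23: looks distant

# One-hot encoding: each category gets its own 0/1 indicator dimension.
def one_hot(flavor, categories):
    return [1 if flavor == c else 0 for c in categories]

def sq_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = one_hot("pistachio", flavors)
b = one_hot("butter pecan", flavors)
c = one_hot("black cherry", flavors)
print(sq_distance(a, b) == sq_distance(a, c))  # True: all pairs equidistant
```

With one-hot encoding, no ordering is imposed on the categories: every pair of distinct flavors is exactly the same distance apart.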
These and other data transformation and preparation issues are discussed
extensively in Chapter 17.

Formal Measures of Similarity
There are dozens if not hundreds of published techniques for measuring the
similarity of two records. Some have been developed for specialized
applications such as comparing passages of text. Others are designed especially for
use with certain types of data such as binary variables or categorical variables.
Of the three presented here, the first two are suitable for use with interval
variables and true measures, while the third is suitable for categorical variables.

Geometric Distance between Two Points
When the fields in a record are numeric, the record represents a point in
n-dimensional space. The distance between the points represented by two
records is used as the measure of similarity between them. If two points are
close in distance, the corresponding records are similar.
There are many ways to measure the distance between two points, as
discussed in the sidebar “Distance Metrics”. The most common one is the
Euclidean distance familiar from high-school geometry. To find the Euclidean
distance between X and Y, first find the differences between the corresponding
elements of X and Y (the distance along each axis) and square them. The
distance is the square root of the sum of the squared differences.
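As a sketch, that recipe translates directly into code (the function name is ours):

```python
import math

def euclidean_distance(x, y):
    """Square the difference along each axis, sum, take the square root."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# A 3-4-5 right triangle in two dimensions:
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```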
Automatic Cluster Detection 361


DISTANCE METRICS

Any function that takes two points and produces a single number describing a
relationship between them is a candidate measure of similarity, but to be a true
distance metric, it must meet the following criteria:
- Distance(X,Y) = 0 if and only if X = Y
- Distance(X,Y) ≥ 0 for all X and all Y
- Distance(X,Y) = Distance(Y,X)
- Distance(X,Y) ≤ Distance(X,Z) + Distance(Z,Y)

These criteria are the formal definition of a distance metric in geometry.
A true distance is a good metric for clustering, but some of these conditions
can be relaxed. The most important conditions are the second and third (called
identity and commutativity by mathematicians): that the measure is 0 or
positive and is well-defined for any two points. If two records have a distance
of 0, that is okay, as long as they are very, very similar, since they will always
fall into the same cluster.
The last condition, the Triangle Inequality, is perhaps the most interesting
mathematically. In terms of clustering, it basically means that adding a new
cluster center will not make two distant points suddenly seem close together.
Fortunately, most metrics we could devise satisfy this condition.
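These axioms can also be spot-checked numerically on a finite set of points. A minimal sketch (the helper function and the sample points are ours):

```python
import itertools
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def is_metric_on(points, dist, tol=1e-9):
    """Spot-check the four metric axioms on a finite set of points."""
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if d < 0:                                   # non-negativity
            return False
        if (d == 0) != (x == y):                    # zero distance iff identical
            return False
        if abs(d - dist(y, x)) > tol:               # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:  # triangle inequality
            return False
    return True

pts = [(0, 0), (3, 4), (1, 1), (-2, 5)]
print(is_metric_on(pts, euclidean))  # True
```

Passing such a check on sample points does not prove a function is a metric, of course, but failing it proves it is not; squared Euclidean distance, for example, violates the Triangle Inequality.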

Angle between Two Vectors
Sometimes it makes more sense to consider two records closely associated
because of similarities in the way the fields within each record are related.
Minnows should cluster with sardines, cod, and tuna, while kittens cluster with
cougars, lions, and tigers, even though in a database of body-part lengths, the
sardine is closer to a kitten than it is to a catfish.
The solution is to use a different geometric interpretation of the same data.
Instead of thinking of X and Y as points in space and measuring the distance
between them, think of them as vectors and measure the angle between them.
In this context, a vector is the line segment connecting the origin of a
coordinate system to the point described by the vector values. A vector has both
magnitude (the distance from the origin to the point) and direction. For this
similarity measure, it is the direction that matters.
If you take the values for length of whiskers, length of tail, overall body length,
length of teeth, and length of claws for a lion and a house cat and plot them as
single points, they will be very far apart. But if the ratios of lengths of these
body parts to one another are similar in the two species, then the vectors will
be nearly collinear.

The angle between vectors provides a measure of association that is not
influenced by differences in magnitude between the two things being
compared (see Figure 11.7). Actually, the sine of the angle is a better measure since
it will range from 0 when the vectors are closest (most nearly parallel) to 1
when they are perpendicular. Using the sine ensures that an angle of 0 degrees
is treated the same as an angle of 180 degrees, which is as it should be since for
this measure, any two vectors that differ only by a constant factor are
considered similar, even if the constant factor is negative. Note that the cosine of the
angle measures correlation; it is 1 when the vectors are parallel (perfectly
correlated) and 0 when they are orthogonal.
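A rough sketch of this measure, using invented body-part measurements in which the lion is simply the house cat scaled up by a constant factor:

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors: dot product over magnitudes."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def angle_sine(x, y):
    """Sine of the angle: 0 when (anti)parallel, 1 when perpendicular."""
    c = cosine(x, y)
    return math.sqrt(max(0.0, 1.0 - c * c))

# Hypothetical lengths (whiskers, tail, body); the lion is exactly 3x the cat.
house_cat = (6.0, 28.0, 46.0)
lion      = (18.0, 84.0, 138.0)
sardine   = (0.0, 4.0, 15.0)

print(round(angle_sine(house_cat, lion), 6))  # 0.0: colinear, so "similar"
print(angle_sine(house_cat, sardine) > 0.1)   # True: different proportions
```

The cat and lion are far apart as points, but because their proportions match, the sine of the angle between their vectors is 0.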

Figure 11.7 The angle between vectors as a measure of similarity.


Manhattan Distance
Another common distance metric gets its name from the rectangular grid
pattern of streets in midtown Manhattan. It is simply the sum of the distances
traveled along each axis. This measure is sometimes preferred to the Euclidean
distance because, given that the distances along each axis are not squared, it
is less likely that a large difference in one dimension will dominate the total
distance.
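A minimal sketch of the metric (the function name is ours):

```python
def manhattan_distance(x, y):
    """Sum of the absolute differences along each axis."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Walking the midtown grid: 3 blocks east plus 4 blocks north.
print(manhattan_distance((0, 0), (3, 4)))  # 7 (Euclidean would give 5)
```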

Number of Features in Common
When the preponderance of fields in the records are categorical variables,
geometric measures are not the best choice. A better measure is based on the
degree of overlap between records. As with the geometric measures, there are
many variations on this idea. In all variations, the two records are compared
field by field to determine the number of fields that match and the number of
fields that don't match. The simplest measure is the ratio of matches to the
total number of fields.
In its simplest form, this measure counts two null or empty fields as
matching. This has the perhaps perverse result that everything with missing data
ends up in the same cluster. A simple improvement is to not include matches of
this sort in the match count. Another improvement is to weight the matches by
the prevalence of each class in the general population. After all, a match on
“Chevy Nomad” ought to count for more than a match on “Ford F-150 Pickup.”
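A sketch of the overlap measure with both refinements, skipping shared nulls and accepting optional field weights (the sample records and the particular weighting scheme are invented for illustration):

```python
def overlap_similarity(rec_a, rec_b, weights=None):
    """Fraction of fields that match, skipping fields missing in both records.

    `weights` optionally maps a field name to an importance weight, so a
    match on a rare value can count for more than a match on a common one.
    (This weighting scheme is one illustrative choice among many.)
    """
    matches = 0.0
    total = 0.0
    for field in rec_a:
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None and b is None:
            continue  # don't let shared missing values inflate similarity
        w = (weights or {}).get(field, 1.0)
        total += w
        if a == b:
            matches += w
    return matches / total if total else 0.0

a = {"car": "Chevy Nomad", "state": "MA", "segment": None}
b = {"car": "Chevy Nomad", "state": "NY", "segment": None}
print(overlap_similarity(a, b))                 # 0.5: one match in two usable fields
print(overlap_similarity(a, b, {"car": 3.0}))   # 0.75: the rare match counts more
```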

Data Preparation for Clustering
The notions of scaling and weighting each play important roles in clustering.
Although similar, and often confused with each other, the two notions are not
the same. Scaling adjusts the values of variables to take into account the fact
that different variables are measured in different units or over different ranges.
For instance, household income is measured in tens of thousands of dollars
and number of children in single digits. Weighting provides a relative
adjustment for a variable, because some variables are more important than others.

Scaling for Consistency
In geometry, all dimensions are equally important. Two points that differ by 2
in dimensions X and Y and by 1 in dimension Z are the same distance apart as
two other points that differ by 1 in dimension X and by 2 in dimensions Y and
Z. It doesn't matter what units X, Y, and Z are measured in, so long as they are
the same.

But what if X is measured in yards, Y is measured in centimeters, and Z is
measured in nautical miles? A difference of 1 in Z is now equivalent to a
difference of 185,200 in Y or 2,025 in X. Clearly, they must all be converted to a
common scale before distances will make any sense.
Unfortunately, in commercial data mining there is usually no common scale
available because the different units being used are measuring quite different
things. If variables include plot size, number of children, car ownership, and
family income, they cannot all be converted to a common unit. On the other
hand, it is misleading that a difference of 20 acres is indistinguishable from
a change of $20. One solution is to map all the variables to a common
range (often 0 to 1 or −1 to 1). That way, at least the ratios of change become
comparable: doubling the plot size has the same effect as doubling income.
Scaling solves this problem, in this case by remapping to a common range.

T I P It is very important to scale different variables so their values fall roughly
into the same range, by normalizing, indexing, or standardizing the values.

Here are three common ways of scaling variables to bring them all into
comparable ranges:

- Divide each variable by the range (the difference between the lowest
  and highest value it takes on) after subtracting the lowest value. This
  maps all values to the range 0 to 1, which is useful for some data
  mining algorithms.
- Divide each variable by the mean of all the values it takes on. This is
  often called "indexing a variable."
- Subtract the mean value from each variable and then divide it by the
  standard deviation. This is often called standardization or "converting to
  z-scores." A z-score tells you how many standard deviations away from
  the mean a value is.
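The three scaling methods just listed might be sketched as follows (the function names are ours; the z-score version uses the population standard deviation, a simplifying choice):

```python
def min_max(values):
    """Subtract the lowest value, then divide by the range: maps to 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def index_by_mean(values):
    """Divide each value by the mean ("indexing a variable")."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]

def z_scores(values):
    """Subtract the mean, divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]

incomes = [20_000, 40_000, 60_000, 80_000]
print(min_max(incomes))        # [0.0, 0.333..., 0.666..., 1.0]
print(index_by_mean(incomes))  # [0.4, 0.8, 1.2, 1.6]
print(z_scores(incomes))       # roughly [-1.34, -0.45, 0.45, 1.34]
```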
Normalizing a single variable simply changes its range. A closely related
concept is vector normalization, which scales all variables at once. This too has a
geometric interpretation. Consider the collection of values in a single record or
observation as a vector. Normalizing them scales each value so as to make the
length of the vector equal one. Transforming all the vectors to unit length
emphasizes the differences internal to each record rather than the differences
between records. As an example, consider two records with fields for debt and
equity. The first record contains debt of $200,000 and equity of $100,000; the
second, debt of $10,000 and equity of $5,000. After normalization, the two
records look the same since both have the same ratio of debt to equity.
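The debt-and-equity example can be sketched directly (the figures are taken from the text; the function name is ours):

```python
import math

def normalize(record):
    """Scale a record's values so the resulting vector has length one."""
    length = math.sqrt(sum(v * v for v in record))
    return [v / length for v in record]

# Two companies with the same debt-to-equity ratio but different sizes:
big   = [200_000, 100_000]   # debt, equity
small = [10_000, 5_000]

print(normalize(big))    # approximately [0.894, 0.447]
print(normalize(small))  # approximately [0.894, 0.447], the same direction
```

After normalization the two records are indistinguishable, because only the internal ratio of debt to equity survives, not the overall magnitude.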

Use Weights to Encode Outside Information
Scaling takes care of the problem that changes in one variable appear more
significant than changes in another simply because of differences in the
magnitudes of the values in the variable. What if we think that two families
with the same income have more in common than two families on the same
size plot, and we want that to be taken into consideration during clustering?
That is where weighting comes in. The purpose of weighting is to encode the
information that one variable is more (or less) important than others.
A good place to start is by standardizing all variables so each has a mean of

