Similarity Measures and Variable Type

Geometric distance works well as a similarity measure for well-behaved numeric variables. A well-behaved numeric variable is one whose value indicates its placement along the axis that corresponds to it in our geometric model. Not all variables fall into this category. For this purpose, variables fall into four classes, listed here in increasing order of suitability for the geometric model.

- Categorical variables
- Ranks
- Intervals
- True measures

Categorical variables only describe which of several unordered categories a thing belongs to. For instance, it is possible to label one ice cream pistachio and another butter pecan, but it is not possible to say that one is greater than the other or judge which one is closer to black cherry. In mathematical terms, it is possible to tell that X ≠ Y, but not whether X > Y or X < Y.

Ranks put things in order, but don't say how much bigger one thing is than another. The valedictorian has better grades than the salutatorian, but we don't know by how much. If X, Y, and Z are ranked A, B, and C, we know that X > Y > Z, but we cannot define X − Y or Y − Z.

Intervals measure the distance between two observations. If it is 56° in San Francisco and 78° in San Jose, then it is 22 degrees warmer at one end of the bay than the other.

360 Chapter 11

True measures are interval variables that measure from a meaningful zero point. This trait is important because it means that the ratio of two values of the variable is meaningful. The Fahrenheit temperature scale used in the United States and the Celsius scale used in most of the rest of the world do not have this property. In neither system does it make sense to say that a 30° day is twice as warm as a 15° day. Similarly, a size 12 dress is not twice as large as a size 6, and gypsum is not twice as hard as talc though they are 2 and 1 on the hardness scale. It does make perfect sense, however, to say that a 50-year-old is twice as old as a 25-year-old or that a 10-pound bag of sugar is twice as heavy as a 5-pound one. Age, weight, length, customer tenure, and volume are examples of true measures.

Geometric distance metrics are well-defined for interval variables and true measures. In order to use categorical variables and rankings, it is necessary to transform them into interval variables. Unfortunately, these transformations may add spurious information. If ice cream flavors are assigned arbitrary numbers 1 through 28, it will appear that flavors 5 and 6 are closely related while flavors 1 and 28 are far apart.
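One way to transform a categorical variable into numeric columns without introducing a spurious ordering (the details of such transformations are a Chapter 17 topic) is to give each category its own 0/1 indicator column. The sketch below is illustrative, not the book's prescribed method; the function and variable names are invented for the example:

```python
def one_hot(values, categories):
    """Map each categorical value to a vector of 0/1 indicator columns,
    one column per category, so no arbitrary numeric order is implied."""
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return encoded

flavors = ["pistachio", "butter pecan", "pistachio"]
all_flavors = ["pistachio", "butter pecan", "black cherry"]
print(one_hot(flavors, all_flavors))
# Every pair of distinct flavors is now equally far apart.
```

With this encoding, pistachio is no closer to butter pecan than it is to black cherry, which matches the unordered nature of the variable.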

These and other data transformation and preparation issues are discussed extensively in Chapter 17.

Formal Measures of Similarity

There are dozens if not hundreds of published techniques for measuring the similarity of two records. Some have been developed for specialized applications such as comparing passages of text. Others are designed especially for use with certain types of data such as binary variables or categorical variables. Of the three presented here, the first two are suitable for use with interval variables and true measures, while the third is suitable for categorical variables.

Geometric Distance between Two Points

When the fields in a record are numeric, the record represents a point in n-dimensional space. The distance between the points represented by two records is used as the measure of similarity between them. If two points are close in distance, the corresponding records are similar.

There are many ways to measure the distance between two points, as discussed in the sidebar “Distance Metrics.” The most common one is the Euclidean distance familiar from high-school geometry. To find the Euclidean distance between X and Y, first find the differences between the corresponding elements of X and Y (the distance along each axis) and square them. The distance is the square root of the sum of the squared differences.
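The calculation described above can be sketched in a few lines; this is a minimal illustration, with names chosen for the example:

```python
from math import sqrt

def euclidean_distance(x, y):
    """Square the difference along each axis, sum the squares,
    and take the square root of the total."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# The classic 3-4-5 right triangle: the points differ by 3 on one
# axis and 4 on the other, so the distance is 5.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```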

Automatic Cluster Detection 361

DISTANCE METRICS

Any function that takes two points and produces a single number describing a relationship between them is a candidate measure of similarity, but to be a true distance metric, it must meet the following criteria:

- Distance(X,Y) = 0 if and only if X = Y
- Distance(X,Y) ≥ 0 for all X and all Y
- Distance(X,Y) = Distance(Y,X)
- Distance(X,Y) ≤ Distance(X,Z) + Distance(Z,Y)

These criteria constitute the formal definition of a distance metric in geometry.

A true distance is a good metric for clustering, but some of these conditions can be relaxed. The most important conditions are the second and third (called identity and commutativity by mathematicians): that the measure is 0 or positive and is well-defined for any two points. If two records have a distance of 0, that is acceptable, as long as they are very, very similar, since they will always fall into the same cluster.

The last condition, the triangle inequality, is perhaps the most interesting mathematically. In terms of clustering, it basically means that adding a new cluster center will not make two distant points suddenly seem close together. Fortunately, most metrics we could devise satisfy this condition.
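The four conditions can be checked numerically for any candidate measure. This sketch tests Euclidean distance against all four on a small random sample; the helper names are illustrative:

```python
import random
from math import sqrt, isclose

def dist(x, y):
    """Euclidean distance, used here as the candidate metric."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(0)
points = [tuple(random.uniform(-10, 10) for _ in range(3)) for _ in range(20)]

for x in points:
    assert dist(x, x) == 0                       # condition 1: identity
    for y in points:
        assert dist(x, y) >= 0                   # condition 2: nonnegative
        assert isclose(dist(x, y), dist(y, x))   # condition 3: symmetric
        for z in points:
            # condition 4: triangle inequality (tiny slack for rounding)
            assert dist(x, y) <= dist(x, z) + dist(z, y) + 1e-9
print("all four metric conditions hold on this sample")
```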

Angle between Two Vectors

Sometimes it makes more sense to consider two records closely associated because of similarities in the way the fields within each record are related. Minnows should cluster with sardines, cod, and tuna, while kittens cluster with cougars, lions, and tigers, even though in a database of body-part lengths, the sardine is closer to a kitten than it is to a catfish.

The solution is to use a different geometric interpretation of the same data. Instead of thinking of X and Y as points in space and measuring the distance between them, think of them as vectors and measure the angle between them. In this context, a vector is the line segment connecting the origin of a coordinate system to the point described by the vector values. A vector has both magnitude (the distance from the origin to the point) and direction. For this similarity measure, it is the direction that matters.

If you take the values for length of whiskers, length of tail, overall body length, length of teeth, and length of claws for a lion and a house cat and plot them as single points, they will be very far apart. But if the ratios of the lengths of these body parts to one another are similar in the two species, then the vectors will be nearly collinear.


The angle between vectors provides a measure of association that is not influenced by differences in magnitude between the two things being compared (see Figure 11.7). Actually, the sine of the angle is a better measure since it will range from 0 when the vectors are closest (most nearly parallel) to 1 when they are perpendicular. Using the sine ensures that an angle of 0 degrees is treated the same as an angle of 180 degrees, which is as it should be since for this measure, any two vectors that differ only by a constant factor are considered similar, even if the constant factor is negative. Note that the cosine of the angle measures correlation; it is 1 when the vectors are parallel (perfectly correlated) and 0 when they are orthogonal.
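The sine-based measure described above can be sketched as follows. This is a minimal illustration; computing the sine as √(1 − cos²) treats 0° and 180° identically, as the text requires, and the big-cat/little-cat numbers are invented for the example:

```python
from math import sqrt

def cosine(x, y):
    """Cosine of the angle between two vectors: 1 when parallel
    (perfectly correlated), 0 when orthogonal."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def sine_dissimilarity(x, y):
    """Sine of the angle: 0 when the vectors differ only by a constant
    factor (even a negative one), 1 when they are perpendicular."""
    c = cosine(x, y)
    return sqrt(max(0.0, 1.0 - c * c))

big_cat = (10, 50, 100)   # whisker, tail, body lengths (illustrative units)
little_cat = (1, 5, 10)   # same proportions at one-tenth the size
print(sine_dissimilarity(big_cat, little_cat))  # 0.0 -- collinear vectors
```

Because the two vectors have identical ratios between their components, the angle between them is zero even though the points themselves are far apart.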

[Figure omitted: vectors for Big Fish, Little Fish, Big Cat, and Little Cat plotted from a common origin.]

Figure 11.7 The angle between vectors as a measure of similarity.


Manhattan Distance

Another common distance metric gets its name from the rectangular grid pattern of streets in midtown Manhattan. It is simply the sum of the distances traveled along each axis. This measure is sometimes preferred to the Euclidean distance because the distances along each axis are not squared, so a large difference in one dimension is less likely to dominate the total distance.
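The Manhattan distance is even simpler to sketch than the Euclidean; names here are chosen for the example:

```python
def manhattan_distance(x, y):
    """Sum of the absolute differences along each axis; because the
    differences are not squared, one large dimension is less likely
    to dominate the total."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Walking the grid from (0, 0) to (3, 4): 3 blocks east plus 4 blocks
# north, for a total of 7 (versus a Euclidean distance of 5).
print(manhattan_distance((0, 0), (3, 4)))  # 7
```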

Number of Features in Common

When the preponderance of fields in the records are categorical variables, geometric measures are not the best choice. A better measure is based on the degree of overlap between records. As with the geometric measures, there are many variations on this idea. In all variations, the two records are compared field by field to determine the number of fields that match and the number of fields that don't match. The simplest measure is the ratio of matches to the total number of fields.

In its simplest form, this measure counts two null or empty fields as matching. This has the perhaps perverse result that everything with missing data ends up in the same cluster. A simple improvement is to not include matches of this sort in the match count. Another improvement is to weight the matches by the prevalence of each class in the general population. After all, a match on “Chevy Nomad” ought to count for more than a match on “Ford F-150 Pickup.”
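The match-ratio measure with the first improvement (skipping shared nulls) can be sketched as below; the record values are invented for the example, and the prevalence-weighting refinement is left out for brevity:

```python
def overlap_similarity(rec1, rec2):
    """Ratio of matching fields to comparable fields, skipping pairs
    where both values are missing so that shared nulls do not count
    as agreement."""
    matches = total = 0
    for a, b in zip(rec1, rec2):
        if a is None and b is None:
            continue          # don't let missing data create matches
        total += 1
        if a == b:
            matches += 1
    return matches / total if total else 0.0

r1 = ("Chevy Nomad", "CA", None)   # illustrative records
r2 = ("Chevy Nomad", "NY", None)
print(overlap_similarity(r1, r2))  # 0.5 -- one match out of two comparable fields
```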

Data Preparation for Clustering

The notions of scaling and weighting each play important roles in clustering. Although similar, and often confused with each other, the two notions are not the same. Scaling adjusts the values of variables to take into account the fact that different variables are measured in different units or over different ranges. For instance, household income is measured in tens of thousands of dollars and number of children in single digits. Weighting provides a relative adjustment for a variable, because some variables are more important than others.

Scaling for Consistency

In geometry, all dimensions are equally important. Two points that differ by 2 in dimensions X and Y and by 1 in dimension Z are the same distance apart as two other points that differ by 1 in dimension X and by 2 in dimensions Y and Z. It doesn't matter what units X, Y, and Z are measured in, so long as they are the same.


But what if X is measured in yards, Y is measured in centimeters, and Z is measured in nautical miles? A difference of 1 in Z is now equivalent to a difference of 185,200 in Y or 2,025 in X. Clearly, they must all be converted to a common scale before distances will make any sense.

Unfortunately, in commercial data mining there is usually no common scale available because the different units being used are measuring quite different things. If variables include plot size, number of children, car ownership, and family income, they cannot all be converted to a common unit. On the other hand, it is misleading that a difference of 20 acres is indistinguishable from a change of $20. One solution is to map all the variables to a common range (often 0 to 1 or −1 to 1). That way, at least the ratios of change become comparable: doubling the plot size has the same effect as doubling income. Scaling solves this problem by remapping to a common range.

TIP: It is very important to scale different variables so their values fall roughly into the same range, by normalizing, indexing, or standardizing the values.

Here are three common ways of scaling variables to bring them all into comparable ranges:

- Divide each variable by the range (the difference between the lowest and highest value it takes on) after subtracting the lowest value. This maps all values to the range 0 to 1, which is useful for some data mining algorithms.

- Divide each variable by the mean of all the values it takes on. This is often called “indexing a variable.”

- Subtract the mean value from each variable and then divide it by the standard deviation. This is often called standardization or “converting to z-scores.” A z-score tells you how many standard deviations away from the mean a value is.
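The three scaling methods above can be sketched side by side; this is a minimal illustration with invented income values, using the population standard deviation for the z-scores:

```python
def scale_by_range(values):
    """Map values into 0..1: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def index_by_mean(values):
    """'Indexing': divide each value by the mean of all the values."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]

def z_scores(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

incomes = [20_000, 40_000, 60_000]      # illustrative household incomes
print(scale_by_range(incomes))          # [0.0, 0.5, 1.0]
print(index_by_mean(incomes))           # [0.5, 1.0, 1.5]
print(z_scores(incomes))                # roughly [-1.22, 0.0, 1.22]
```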

Normalizing a single variable simply changes its range. A closely related concept is vector normalization, which scales all variables at once. This too has a geometric interpretation. Consider the collection of values in a single record or observation as a vector. Normalizing them scales each value so as to make the length of the vector equal one. Transforming all the vectors to unit length emphasizes the differences internal to each record rather than the differences between records. As an example, consider two records with fields for debt and equity. The first record contains debt of $200,000 and equity of $100,000; the second, debt of $10,000 and equity of $5,000. After normalization, the two records look the same since both have the same ratio of debt to equity.
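The debt-and-equity example can be worked through in code; this is a minimal sketch of unit-length normalization:

```python
from math import sqrt

def to_unit_length(record):
    """Divide every field by the vector's length, so the result has
    length one and only the ratios between the fields survive."""
    length = sqrt(sum(v * v for v in record))
    return [v / length for v in record]

# Two records with the same 2:1 debt-to-equity ratio but very
# different magnitudes normalize to (essentially) the same vector.
print(to_unit_length([200_000, 100_000]))  # debt, equity
print(to_unit_length([10_000, 5_000]))
```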


Use Weights to Encode Outside Information

Scaling takes care of the problem that changes in one variable appear more significant than changes in another simply because of differences in the magnitudes of the values in the variable. What if we think that two families with the same income have more in common than two families on the same size plot, and we want that to be taken into consideration during clustering? That is where weighting comes in. The purpose of weighting is to encode the information that one variable is more (or less) important than others.

A good place to start is by standardizing all variables so each has a mean of