CODES BY MBR      CORRECT CODES   RECALL   PRECISION
A,B,C,D           A,B,C,D         100%     100%
A,B               A,B,C,D          50%     100%
A,B,C,D,E,F,G,H   A,B,C,D         100%      50%
E,F               A,B,C,D           0%       0%
A,B,E,F           A,B,C,D          50%      50%
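
The rows above can be reproduced with a few lines of code. The following is a minimal sketch, assuming each set of codes is represented as a Python set of single-letter codes; the function name `recall_precision` is illustrative and does not come from the original study.

```python
def recall_precision(assigned, correct):
    """Return (recall, precision) of the assigned codes against the correct codes."""
    assigned, correct = set(assigned), set(correct)
    hits = len(assigned & correct)  # codes that are both assigned and correct
    recall = hits / len(correct) if correct else 0.0       # share of correct codes found
    precision = hits / len(assigned) if assigned else 0.0  # share of assigned codes that are correct
    return recall, precision


correct_codes = "ABCD"
for assigned_codes in ["ABCD", "AB", "ABCDEFGH", "EF", "ABEF"]:
    recall, precision = recall_precision(assigned_codes, correct_codes)
    print(f"{','.join(assigned_codes):<16} recall={recall:.0%} precision={precision:.0%}")
```

Recall falls when correct codes are missed, and precision falls when extra codes are assigned, which is why the second and third rows move in opposite directions.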

The original codes assigned to the stories by individual editors had a recall

of 83 percent and a precision of 88 percent with respect to the validated set of

correct codes. For MBR, the recall was 80 percent and the precision 72 percent.

However, these figures are averages across all categories. As Table 8.6 shows,
MBR did significantly better in some of the categories.


274 Chapter 8


Table 8.6 Recall and Precision Measurements by Code Category

CATEGORY        RECALL   PRECISION
Government      85%      87%
Industry        91%      85%
Market Sector   93%      91%
Product         69%      89%
Region          86%      64%
Subject         72%      53%

The variation in the results by category suggests that the original stories

used for the training set may not have been coded consistently. The results

from MBR can only be as good as the examples chosen for the training set.

Even so, MBR performed as well as all but the most experienced editors.

Building a Distance Function One Field at a Time

It is easy to understand distance as a geometric concept, but how can distance

be defined for records consisting of many different fields of different types?

The answer is, one field at a time. Consider some sample records such as those

shown in Table 8.7.

Figure 8.6 illustrates a scatter graph in three dimensions. The records are a bit

complicated, with two numeric fields and one categorical. This example shows

how to define field distance functions for each field, then combine them into a

single record distance function that gives a distance between two records.

Table 8.7 Five Customers in a Marketing Database

RECNUM   GENDER   AGE   SALARY
1        female   27    $19,000
2        male     51    $64,000
3        male     52    $105,000
4        female   33    $55,000
5        male     45    $45,000

Memory-Based Reasoning and Collaborative Filtering 275

[Figure 8.6: scatter plot with Age (25 to 60) on the horizontal axis and Salary ($0 to $120,000) on the vertical axis; each point is marked Female or Male.]

Figure 8.6 This scatter plot shows the five records from Table 8.7 in three dimensions (age, salary, and gender) and suggests that standard distance is a good metric for nearest neighbors.

The four most common distance functions for numeric fields are:

- Absolute value of the difference: |A - B|
- Square of the difference: (A - B)^2
- Normalized absolute value: |A - B| / (maximum difference)
- Absolute value of difference of standardized values: |(A - mean)/(standard deviation) - (B - mean)/(standard deviation)|, which is equivalent to |(A - B)/(standard deviation)|
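
The four functions in the list above can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: the function names are hypothetical, the ages are taken from Table 8.7, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which form of standard deviation is meant.

```python
from statistics import pstdev


def d_abs(a, b):
    """Absolute value of the difference."""
    return abs(a - b)


def d_square(a, b):
    """Square of the difference."""
    return (a - b) ** 2


def d_norm_abs(a, b, max_diff):
    """Absolute difference scaled by the largest difference in the field."""
    return abs(a - b) / max_diff


def d_standardized(a, b, std):
    """Absolute difference of standardized values; the means cancel out."""
    return abs(a - b) / std


ages = [27, 51, 52, 33, 45]        # ages from Table 8.7
max_diff = max(ages) - min(ages)   # largest possible difference between two ages
mean = sum(ages) / len(ages)
std = pstdev(ages)                 # assumption: population standard deviation

print(d_abs(27, 51), d_square(27, 51))
print(round(d_norm_abs(27, 51, max_diff), 2))
print(round(d_standardized(27, 51, std), 2))
```

Only the normalized absolute value is guaranteed to stay between 0 and 1; the standardized version is expressed in units of standard deviations and can exceed 1.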

The advantage of the normalized absolute value is that it is always between
0 and 1. Since ages are much smaller than the salaries in this example, the
normalized absolute value is a good choice for both of them, so that neither
field will dominate the record distance function (difference of standardized
values is also a good choice). For the ages, the maximum difference is
52 - 27 = 25, and the distance matrix looks like Table 8.8.

Table 8.8 Distance Matrix Based on Ages of Customers

      27     51     52     33     45
27    0.00   0.96   1.00   0.24   0.72
51    0.96   0.00   0.04   0.72   0.24
52    1.00   0.04   0.00   0.76   0.28
33    0.24   0.72   0.76   0.00   0.48
45    0.72   0.24   0.28   0.48   0.00
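
The matrix in Table 8.8 can be rebuilt directly from the five ages. The sketch below assumes the maximum difference is taken over the five records themselves (52 - 27 = 25); the variable names are illustrative.

```python
ages = [27, 51, 52, 33, 45]
max_diff = max(ages) - min(ages)  # 25 for these five customers

# Normalized absolute difference for every pair of ages, rounded as in the table.
matrix = [[round(abs(a - b) / max_diff, 2) for b in ages] for a in ages]

print("     " + "   ".join(f"{age:>4}" for age in ages))
for age, row in zip(ages, matrix):
    print(f"{age:>4} " + "   ".join(f"{d:.2f}" for d in row))
```

The matrix is symmetric with zeros on the diagonal, as any distance matrix should be.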


Gender is an example of categorical data. The simplest distance function is
the “identical to” function, which is 0 when the genders are the same and 1
otherwise:

dgender(female, female) = 0
dgender(female, male) = 1
dgender(male, female) = 1
dgender(male, male) = 0

So far, so simple. There are now three field distance functions that need to
merge into a single record distance function. There are three common ways to
do this:

- Manhattan distance or summation: dsum(A,B) = dgender(A,B) + dage(A,B) + dsalary(A,B)
- Normalized summation: dnorm(A,B) = dsum(A,B) / max(dsum)
- Euclidean distance: dEuclid(A,B) = sqrt(dgender(A,B)^2 + dage(A,B)^2 + dsalary(A,B)^2)
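
The three combinations listed above might be written as follows. This is a sketch under stated assumptions: records are plain (gender, age, salary) tuples, age and salary use the normalized absolute value with ranges taken from Table 8.7, and max(dsum) is taken to be 3, the largest sum possible when each of the three field distances lies between 0 and 1. All function and constant names are illustrative.

```python
import math

AGE_RANGE = 52 - 27                # largest age difference in Table 8.7
SALARY_RANGE = 105_000 - 19_000    # largest salary difference in Table 8.7
MAX_DSUM = 3.0                     # assumption: three fields, each at most 1


def field_distances(rec_a, rec_b):
    """Return the (gender, age, salary) field distances between two records."""
    d_gender = 0.0 if rec_a[0] == rec_b[0] else 1.0
    d_age = abs(rec_a[1] - rec_b[1]) / AGE_RANGE
    d_salary = abs(rec_a[2] - rec_b[2]) / SALARY_RANGE
    return d_gender, d_age, d_salary


def d_sum(rec_a, rec_b):
    """Manhattan distance: the sum of the field distances."""
    return sum(field_distances(rec_a, rec_b))


def d_norm(rec_a, rec_b):
    """Summation rescaled to the range 0 to 1."""
    return d_sum(rec_a, rec_b) / MAX_DSUM


def d_euclid(rec_a, rec_b):
    """Euclidean distance: square root of the sum of squared field distances."""
    return math.sqrt(sum(d * d for d in field_distances(rec_a, rec_b)))


rec1 = ("female", 27, 19_000)
rec4 = ("female", 33, 55_000)
print(round(d_sum(rec1, rec4), 3), round(d_norm(rec1, rec4), 3), round(d_euclid(rec1, rec4), 3))
```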

Table 8.9 shows the nearest neighbors for each of the points using the three

functions.

In this case, the sets of nearest neighbors are exactly the same regardless of

how the component distances are combined. This is a coincidence, caused by

the fact that the five records fall into two well-defined clusters. One of the clusters
is lower-paid, younger females and the other is better-paid, older males.

These clusters imply that if two records are close to each other relative to one

field, then they are close on all fields, so the way the distances on each field are

combined is not important. This is not a very common situation.

Consider what happens when a new record (Table 8.10) is used for the

comparison.

Table 8.9 Set of Nearest Neighbors for Three Distance Functions, Ordered Nearest to Farthest

     dsum        dnorm       dEuclid
1    1,4,5,2,3   1,4,5,2,3   1,4,5,2,3
2    2,5,3,4,1   2,5,3,4,1   2,5,3,4,1
3    3,2,5,4,1   3,2,5,4,1   3,2,5,4,1
4    4,1,5,2,3   4,1,5,2,3   4,1,5,2,3
5    5,2,3,4,1   5,2,3,4,1   5,2,3,4,1
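
The neighbor orderings in Table 8.9 can be checked by ranking every record against every other under each combination rule. The sketch below assumes normalized absolute value for age and salary and the 0/1 distance for gender; the dictionary of combining functions and the helper names are illustrative.

```python
import math

records = {
    1: ("female", 27, 19_000),
    2: ("male", 51, 64_000),
    3: ("male", 52, 105_000),
    4: ("female", 33, 55_000),
    5: ("male", 45, 45_000),
}

age_range = max(r[1] for r in records.values()) - min(r[1] for r in records.values())
salary_range = max(r[2] for r in records.values()) - min(r[2] for r in records.values())


def field_distances(a, b):
    """Gender, age, and salary distances, each between 0 and 1."""
    return (
        0.0 if a[0] == b[0] else 1.0,
        abs(a[1] - b[1]) / age_range,
        abs(a[2] - b[2]) / salary_range,
    )


combiners = {
    "dsum": lambda ds: sum(ds),
    "dnorm": lambda ds: sum(ds) / len(ds),  # sum divided by its maximum of 3
    "dEuclid": lambda ds: math.sqrt(sum(d * d for d in ds)),
}


def neighbors(target, combine):
    """Record numbers ordered from nearest to farthest from the target record."""
    return sorted(records, key=lambda k: combine(field_distances(records[target], records[k])))


for recnum in records:
    print(recnum, {name: neighbors(recnum, fn) for name, fn in combiners.items()})
```

Each record is its own nearest neighbor at distance 0, which is why every row of the table starts with the record's own number.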


Table 8.10 New Customer

RECNUM   GENDER   AGE   SALARY
new      female   45    $100,000

This new record is not in either of the clusters. Table 8.11 shows her respective
distances from the training set with the list of her neighbors, from nearest
to farthest.

Now the set of neighbors depends on how the record distance function combines
the field distance functions. In fact, the second nearest neighbor using
the summation function is the farthest neighbor using the Euclidean and vice
versa. Compared to the summation or normalized metric, the Euclidean metric
tends to favor neighbors where all the fields are relatively close. It punishes
Record 3 because the genders are different and are maximally far apart (a distance
of 1.00). Correspondingly, it favors Record 1 because the genders are the
same. Note that the neighbors for dsum and dnorm are identical. The definition
of the normalized distance preserves the ordering of the summation
distance; the distance values are just rescaled to the range from 0 to 1.
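
The point about dsum and dnorm can be checked numerically. The sketch below covers only the summation and normalized rows of Table 8.11. It assumes the age and salary ranges come from the five training records (25 years and $86,000) and that max(dsum) is 3; the names are illustrative.

```python
training = {
    1: ("female", 27, 19_000),
    2: ("male", 51, 64_000),
    3: ("male", 52, 105_000),
    4: ("female", 33, 55_000),
    5: ("male", 45, 45_000),
}
new_customer = ("female", 45, 100_000)

AGE_RANGE = 52 - 27               # range of ages in the training set
SALARY_RANGE = 105_000 - 19_000   # range of salaries in the training set
MAX_DSUM = 3.0                    # assumption: three fields, each at most 1


def d_sum(a, b):
    """Sum of the gender, age, and salary field distances."""
    d_gender = 0.0 if a[0] == b[0] else 1.0
    d_age = abs(a[1] - b[1]) / AGE_RANGE
    d_salary = abs(a[2] - b[2]) / SALARY_RANGE
    return d_gender + d_age + d_salary


dsum = {k: d_sum(new_customer, rec) for k, rec in training.items()}
dnorm = {k: v / MAX_DSUM for k, v in dsum.items()}

print({k: round(v, 3) for k, v in dsum.items()}, sorted(dsum, key=dsum.get))
print({k: round(v, 3) for k, v in dnorm.items()}, sorted(dnorm, key=dnorm.get))
```

Dividing every distance by the same positive constant cannot change which record is closer, so the two orderings always agree.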

The summation, Euclidean, and normalized functions can also incorporate

weights so each field contributes a different amount to the record distance

function. MBR usually produces good results when all the weights are

equal to 1. However, sometimes weights can be used to incorporate a priori

knowledge, such as a particular field suspected of having a large effect on the

classification.
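
One way weights might enter the record distance is as a multiplier on each field distance. The sketch below is an assumption rather than a method taken from the text: the weight values are invented for illustration, and applying each weight to the squared field distance in the Euclidean case is only one of several possible conventions.

```python
import math


def weighted_sum(field_dists, weights):
    """Summation distance with one weight per field."""
    return sum(w * d for w, d in zip(weights, field_dists))


def weighted_euclid(field_dists, weights):
    """Euclidean distance with each squared field distance scaled by its weight."""
    return math.sqrt(sum(w * d * d for w, d in zip(weights, field_dists)))


# Illustrative (gender, age, salary) field distances between two records.
dists = (1.0, 0.24, 0.42)

equal = (1.0, 1.0, 1.0)        # all weights equal to 1: the unweighted case
less_gender = (0.5, 1.0, 1.0)  # hypothetical: gender believed to matter less

print(round(weighted_sum(dists, equal), 2), round(weighted_sum(dists, less_gender), 2))
print(round(weighted_euclid(dists, equal), 3), round(weighted_euclid(dists, less_gender), 3))
```

With all weights equal to 1, both functions reduce to the unweighted summation and Euclidean distances.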

Distance Functions for Other Data Types

A 5-digit American zip code is often represented as a simple number. Do any

of the default distance functions for numeric fields make any sense? No. The

difference between two randomly chosen zip codes has no meaning. Well,

almost no meaning; a zip code does encode location information. The first
three digits represent a postal zone; for instance, all zip codes in Manhattan
start with “100,” “101,” or “102.”

Table 8.11 Set of Nearest Neighbors for New Customer

          1       2       3       4       5       NEIGHBORS
dsum      1.662   1.659   1.338   1.003   1.640   4,3,5,2,1
dnorm     0.554   0.553   0.446   0.334   0.547   4,3,5,2,1
dEuclid   0.781   1.052   1.251   0.494   1.000   4,1,5,2,3


Furthermore, there is a general pattern of zip codes increasing from East to

West. Codes that start with 0 are in New England and Puerto Rico; those

beginning with 9 are on the west coast. This suggests a distance function that

approximates geographic distance by looking at the high order digits of the

zip code.

- dzip(A,B) = 0.0 if the zip codes are identical
- dzip(A,B) = 0.1 if the first three digits are identical (e.g., “20008” and “20015”)
- dzip(A,B) = 0.5 if the first digits are identical (e.g., “95050” and “98125”)
- dzip(A,B) = 1.0 if the first digits are not identical (e.g., “02138” and “94704”)
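
The four rules above translate almost directly into code. This minimal sketch assumes zip codes are stored as 5-character strings, not integers, so that leading zeros such as the one in 02138 are preserved; the function name `d_zip` is illustrative.

```python
def d_zip(a: str, b: str) -> float:
    """Rough geographic distance between two 5-digit zip codes given as strings."""
    if a == b:
        return 0.0      # identical zip codes
    if a[:3] == b[:3]:
        return 0.1      # same three-digit postal zone
    if a[0] == b[0]:
        return 0.5      # same first digit, so the same broad region
    return 1.0          # different first digits


print(d_zip("20008", "20015"))
print(d_zip("95050", "98125"))
print(d_zip("02138", "94704"))
```

The rules are checked from most specific to least specific, so each pair of zip codes falls into exactly one case.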

Of course, if geographic distance were truly of interest, a better approach

would be to look up the latitude and longitude of each zip code in a table and

calculate the distances that way (it is possible to get this information for the

United States from www.census.gov). For many purposes, however, geographic
proximity is not nearly as important as some other measure of similarity. 10011
and 10031 are both in Manhattan, but from a marketing point of view, they
don't have much else in common, because one is an upscale downtown neighborhood
and the other is a working class Harlem neighborhood. On the other

hand 02138 and 94704 are on opposite coasts, but are likely to respond very

similarly to direct mail from a political action committee, since they are for

Cambridge, MA and Berkeley, CA respectively.
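
For the case where geographic distance really is of interest, the latitude and longitude approach mentioned above could be sketched with the haversine great-circle formula. The coordinates below are rough approximations chosen for illustration, not values looked up in a census table, and the function name and the mean Earth radius of 6,371 km are assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate, illustrative coordinates (latitude, longitude) for two zip codes.
zip_coords = {
    "02138": (42.38, -71.13),    # Cambridge, MA (approximate)
    "94704": (37.87, -122.26),   # Berkeley, CA (approximate)
}

distance = haversine_km(*zip_coords["02138"], *zip_coords["94704"])
print(f"{distance:.0f} km")
```

A distance like this captures physical separation only; it says nothing about the marketing similarity that makes these two zip codes behave alike.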