<<

. 56
( 137 .)



>>

Table 8.5 Examples of Recall and Precision
CODES BY MBR CORRECT CODES RECALL PRECISION

A,B,C,D A,B,C,D 100% 100%

A,B A,B,C,D 50% 100%

A,B,C,D,E,F,G,H A,B,C,D 100% 50%

E,F A,B,C,D 0% 0%

A,B,E,F A,B,C,D 50% 50%


The original codes assigned to the stories by individual editors had a recall
of 83 percent and a precision of 88 percent with respect to the validated set of
correct codes. For MBR, the recall was 80 percent and the precision 72 percent.
However, Table 8.6 shows the average across all categories. MBR did
significantly better in some of the categories.
(continued)
274 Chapter 8


MEASURING THE EFFECTIVENESS OF ASSIGNING CODES:
RECALL AND PRECISION (continued)


Table 8.6 Recall and Precision Measurements by Code Category
CATEGORY RECALL PRECISION

Government 85% 87%

Industry 91% 85%

Market Sector 93% 91%

Product 69% 89%

Region 86% 64%

Subject 72% 53%


The variation in the results by category suggests that the original stories
used for the training set may not have been coded consistently. The results
from MBR can only be as good as the examples chosen for the training set.
Even so, MBR performed as well as all but the most experienced editors.



Building a Distance Function One Field at a Time
It is easy to understand distance as a geometric concept, but how can distance
be defined for records consisting of many different fields of different types?
The answer is, one field at a time. Consider some sample records such as those
shown in Table 8.7.
Figure 8.6 illustrates a scatter graph in three dimensions. The records are a bit
complicated, with two numeric fields and one categorical. This example shows
how to define field distance functions for each field, then combine them into a
single record distance function that gives a distance between two records.

Table 8.7 Five Customers in a Marketing Database

RECNUM GENDER AGE SALARY

1 female 27 $ 19,000

2 male 51 $ 64,000

3 male 52 $105,000

4 female 33 $ 55,000

5 male 45 $ 45,000
Memory-Based Reasoning and Collaborative Filtering 275



$120,000
$100,000
Salary
$80,000

$60,000

$40,000

Female
$20,000

Male
$0
25 30 35 40 45 50 55 60
Age
Figure 8.6 This scatter plot shows the five records from Table 8.7 in three dimensions”
age, salary, and gender”and suggests that standard distance is a good metric for nearest
neighbors.


The four most common distance functions for numeric fields are:
Absolute value of the difference: |A“B|

––

Square of the difference: (A“B)2

––

Normalized absolute value: |A“B|/(maximum difference)

––

Absolute value of difference of standardized values: |(A “ mean)/(stan-

––

dard deviation) “ (B “ mean)/(standard deviation)| which is equivalent
to |(A “ B)/(standard deviation)|
The advantage of the normalized absolute value is that it is always between
0 and 1. Since ages are much smaller than the salaries in this example, the nor­
malized absolute value is a good choice for both of them”so neither field will
dominate the record distance function (difference of standardized values is
also a good choice). For the ages, the distance matrix looks like Table 8.8.

Table 8.8 Distance Matrix Based on Ages of Customers
27 51 52 33 45

27 0.00 0.96 1.00 0.24 0.72

51 0.96 0.00 0.04 0.72 0.24

52 1.00 0.04 0.00 0.76 0.28

33 0.24 0.72 0.76 0.00 0.48

45 0.72 0.24 0.28 0.48 0.00
276 Chapter 8


Gender is an example of categorical data. The simplest distance function is
the “identical to” function, which is 1 when the genders are the same and 0
otherwise:
dgender(female, female) = 0
dgender(female, male) = 1
dgender(female, female) = 1
dgender(male, male) = 0
So far, so simple. There are now three field distance functions that need to
merge into a single record distance function. There are three common ways to
do this:
Manhattan distance or summation:

––

dsum(A,B) = dgender(A,B) + dage(A,B) + dsalary(A,B)

Normalized summation: dnorm(A,B) = dsum(A,B) / max(dsum)

––

Euclidean distance:

––

dEuclid(A,B) = sqrt(dgender(A,B)2 + dage(A,B)2 + dsalary(A,B)2)

Table 8.9 shows the nearest neighbors for each of the points using the three
functions.
In this case, the sets of nearest neighbors are exactly the same regardless of
how the component distances are combined. This is a coincidence, caused by
the fact that the five records fall into two well-defined clusters. One of the clus­
ters is lower-paid, younger females and the other is better-paid, older males.
These clusters imply that if two records are close to each other relative to one
field, then they are close on all fields, so the way the distances on each field are
combined is not important. This is not a very common situation.
Consider what happens when a new record (Table 8.10) is used for the
comparison.

Table 8.9 Set of Nearest Neighbors for Three Distance Functions, Ordered Nearest to
Farthest

DS U M DN O R M DE U C L I D

1 1,4,5,2,3 1,4,5,2,3 1,4,5,2,3

2 2,5,3,4,1 2,5,3,4,1 2,5,3,4,1
3 3,2,5,4,1 3,2,5,4,1 3,2,5,4,1

4 4,1,5,2,3 4,1,5,2,3 4,1,5,2,3

5 5,2,3,4,1 5,2,3,4,1 5,2,3,4,1
Memory-Based Reasoning and Collaborative Filtering 277


Table 8.10 New Customer

RECNUM GENDER AGE SALARY

new female 45 $100,000


This new record is not in either of the clusters. Table 8.11 shows her respec­
tive distances from the training set with the list of her neighbors, from nearest
to furthest.
Now the set of neighbors depends on how the record distance function com­
bines the field distance functions. In fact, the second nearest neighbor using
the summation function is the farthest neighbor using the Euclidean and vice
versa. Compared to the summation or normalized metric, the Euclidean met­
ric tends to favor neighbors where all the fields are relatively close. It punishes
Record 3 because the genders are different and are maximally far apart (a dis­
tance of 1.00). Correspondingly, it favors Record 1 because the genders are the
same. Note that the neighbors for dsum and dnorm are identical. The defini­
tion of the normalized distance preserves the ordering of the summation
distance”the distances values are just shifted to the range from 0 to 1.
The summation, Euclidean, and normalized functions can also incorporate
weights so each field contributes a different amount to the record distance
function. MBR usually produces good results when all the weights are
equal to 1. However, sometimes weights can be used to incorporate a priori
knowledge, such as a particular field suspected of having a large effect on the
classification.


Distance Functions for Other Data Types
A 5-digit American zip code is often represented as a simple number. Do any
of the default distance functions for numeric fields make any sense? No. The
difference between two randomly chosen zip codes has no meaning. Well,
almost no meaning; a zip code does encode location information. The first
three digits represent a postal zone”for instance, all zip codes on Manhattan
start with “100,” “101,” or “102.”

Table 8.11 Set of Nearest Neighbors for New Customer

1 2 3 4 5 NEIGHBORS

dsum 1.662 1.659 1.338 1.003 1.640 4,3,5,2,1

dnorm 0.554 0.553 0.446 0.334 0.547 4,3,5,2,1

dEuclid 0.781 1.052 1.251 0.494 1.000 4,1,5,2,3
278 Chapter 8


Furthermore, there is a general pattern of zip codes increasing from East to
West. Codes that start with 0 are in New England and Puerto Rico; those
beginning with 9 are on the west coast. This suggests a distance function that
approximates geographic distance by looking at the high order digits of the
zip code.
dzip(A,B) = 0.0 if the zip codes are identical
––

dzip(A,B) = 0.1 if the first three digits are identical (e.g., “20008” and
––

“20015”
dzip(A,B) = 0.5 if the first digits are identical (e.g., “95050” and “98125”)
––

dzip(A,B) = 1.0 if the first digits are not identical (e.g., “02138” and
––

“94704”)
Of course, if geographic distance were truly of interest, a better approach
would be to look up the latitude and longitude of each zip code in a table and
calculate the distances that way (it is possible to get this information for the
United States from www.census.gov). For many purposes however, geographic
proximity is not nearly as important as some other measure of similarity. 10011
and 10031 are both in Manhattan, but from a marketing point of view, they
don™t have much else in common, because one is an upscale downtown neigh­
borhood and the other is a working class Harlem neighborhood. On the other
hand 02138 and 94704 are on opposite coasts, but are likely to respond very
similarly to direct mail from a political action committee, since they are for
Cambridge, MA and Berkeley, CA respectively.

<<

. 56
( 137 .)



>>