. 57
( 137 .)


This is just one example of how the choice of a distance metric depends on
the data mining context. There are additional examples of distance and simi­
larity measures in Chapter 11 where they are applied to clustering.

When a Distance Metric Already Exists
There are some situations where a distance metric already exists, but is diffi­
cult to spot. These situations generally arise in one of two forms. Sometimes, a
function already exists that provides a distance measure that can be adapted
for use in MBR. The news story case study provides a good example of adapt­
ing an existing function, the relevance feedback score, for use as a distance
Other times, there are fields that do not appear to capture distance, but can
be pressed into service. An example of such a hidden distance field is solicita­
tion history. Two customers who were chosen for a particular solicitation in
the past are “close,” even though the reasons why they were chosen may no
longer be available; two who were not chosen, are close, but not as close; and
one that was chosen and one that was not are far apart. The advantage of this
metric is that it can incorporate previous decisions, even if the basis for the
Memory-Based Reasoning and Collaborative Filtering 279

decisions is no longer available. On the other hand, it does not work well for
customers who were not around during the original solicitation; so some sort
of neutral weighting must be applied to them.
Considering whether the original customers responded to the solicitation
can extend this function further, resulting in a solicitation metric like:
dsolicitation(A, B) = 0, when A and B both responded to the solicitation

dsolicitation(A, B) = 0.1, when A and B were both chosen but neither



dsolicitation(A, B) = 0.2, when neither A nor B was chosen, but both were

available in the data
dsolicitation(A, B) = 0.3, when A and B were both chosen, but only one



dsolicitation(A, B) = 0.3, when one or both were not considered

dsolicitation(A, B) = 1.0, when one was chosen and the other was not

Of course, the particular values are not sacrosanct; they are only meant as a
guide for measuring similarity and showing how previous information and
response histories can be incorporated into a distance function.

The Combination Function: Asking the
Neighbors for the Answer
The distance function is used to determine which records comprise the neigh­
borhood. This section presents different ways to combine data gathered from
those neighbors to make a prediction. At the beginning of this chapter, we
estimated the median rent in the town of Tuxedo, by taking an average
of the median rents in similar towns. In that example, averaging was the
combination function. This section explores other methods of canvassing the

The Basic Approach: Democracy
One common combination function is for the k nearest neighbors to vote on an
answer””democracy” in data mining. When MBR is used for classification,
each neighbor casts its vote for its own class. The proportion of votes for each
class is an estimate of the probability that the new record belongs to the corre­
sponding class. When the task is to assign a single class, it is simply the one
with the most votes. When there are only two categories, an odd number of
neighbors should be poled to avoid ties. As a rule of thumb, use c+1 neighbors
when there are c categories to ensure that at least one class has a plurality.
280 Chapter 8

In Table 8.12, the five test cases seen earlier have been augmented with a flag
that signals whether the customer has become inactive.
For this example, three of the customers have become inactive and two have
not, an almost balanced training set. For illustrative purposes, let™s try to deter­
mine if the new record is active or inactive by using different values of k for
two distance functions, deuclid and dnorm (Table 8.13).
The question marks indicate that no prediction has been made due to a tie
among the neighbors. Notice that different values of k do affect the classifica­
tion. This suggests using the percentage of neighbors in agreement to provide
the level of confidence in the prediction (Table 8.14).

Table 8.12 Customers with Attrition History


1 female 27 $19,000 no

2 male 51 $64,000 yes

3 male 52 $105,000 yes

4 female 33 $55,000 yes

5 male 45 $45,000 no

new female 45 $100,000 ?

Table 8.13 Using MBR to Determine if the New Customer Will Become Inactive


dsum 4,3,5,2,1 Y,Y,N,Y,N yes yes yes yes yes

dEuclid 4,1,5,2,3 Y,N,N,Y,Y yes ? no ? yes

Table 8.14 Attrition Prediction with Confidence

K=1 K=2 K=3 K=4 K=5

dsum yes, 100% yes, 100% yes, 67% yes, 75% yes, 60%
dEuclid yes, 100% yes, 50% no, 67% yes, 50% yes, 60%
Memory-Based Reasoning and Collaborative Filtering 281

The confidence level works just as well when there are more than two cate­
gories. However, with more categories, there is a greater chance that no single
category will have a majority vote. One of the key assumptions about MBR
(and data mining in general) is that the training set provides sufficient infor­
mation for predictive purposes. If the neighborhoods of new cases consistently
produce no obvious choice of classification, then the data simply may not con­
tain the necessary information and the choice of dimensions and possibly of
the training set needs to be reevaluated. By measuring the effectiveness of
MBR on the test set, you can determine whether the training set has a sufficient
number of examples.

WA R N I N G MBR is only as good as the training set it uses. To measure
whether the training set is effective, measure the results of its predictions on
the test set using two, three, and four neighbors. If the results are inconclusive
or inaccurate, then the training set is not large enough or the dimensions and
distance metrics chosen are not appropriate.

Weighted Voting
Weighted voting is similar to voting in the previous section except that the
neighbors are not all created equal”more like shareholder democracy than
one-person, one-vote. The size of the vote is inversely proportional to the dis­
tance from the new record, so closer neighbors have stronger votes than neigh­
bors farther away do. To prevent problems when the distance might be 0, it is
common to add 1 to the distance before taking the inverse. Adding 1 also
makes all the votes between 0 and 1.
Table 8.15 applies weighted voting to the previous example. The “yes, cus­
tomer will become inactive” vote is the first; the “no, this is a good customer”
vote is second.
Weighted voting has introduced enough variation to prevent ties. The con­
fidence level can now be calculated as the ratio of winning votes to total votes
(Table 8.16).

Table 8.15 Attrition Prediction with Weighted Voting

K=1 K=2 K=3 K=4 K=5

dsum 0.749 to 0 1.441 to 0 1.441 2.085 to 2.085 to
to 0.647 0.647 1.290

dEuclid 0.669 to 0 0.669 to 0.669 to 1.157 to 1.601 to
0.562 1.062 1.062 1.062
282 Chapter 8

Table 8.16 Confidence with Weighted Voting

1 2 3 4 5

dsum yes, 100% yes, 100% yes, 69% yes, 76% yes, 62%

dEuclid yes, 100% yes, 54% no, 61% yes, 52% yes, 60%

In this case, weighting the votes has only a small effect on the results and the
confidence. The effect of weighting is largest when some neighbors are con­
siderably further away than others.
Weighting can also be applied to estimation by replacing the simple average
of neighboring values with an average weighted by distance. This approach is
used in collaborative filtering systems, as described in the following section.

Collaborative Filtering: A Nearest Neighbor
Approach to Making Recommendations
Neither of the authors considers himself a country music fan, but one of them
is the proud owner of an autographed copy of an early Dixie Chicks CD. The

Chicks, who did not yet have a major record label, were performing in a local
bar one day and some friends who knew them from Texas made a very enthu­
siastic recommendation. The performance was truly memorable, featuring
Martie Erwin™s impeccable Bluegrass fiddle, her sister Emily on a bewildering
variety of other instruments (most, but not all, with strings), and the seductive
vocals of Laura Lynch (who also played a stand-up electric bass). At the break,
the band sold and autographed a self-produced CD that we still like better
than the one that later won them a Grammy. What does this have to do with
nearest neighbor techniques? Well, it is a human example of collaborative fil­
tering. A recommendation from trusted friends will cause one to try something
one otherwise might not try.
Collaborative filtering is a variant of memory-based reasoning particularly
well suited to the application of providing personalized recommendations. A
collaborative filtering system starts with a history of people™s preferences. The
distance function determines similarity based on overlap of preferences”
people who like the same thing are close. In addition, votes are weighted by
distances, so the votes of closer neighbors count more for the recommenda­
tion. In other words, it is a technique for finding music, books, wine, or any­
thing else that fits into the existing preferences of a particular person by using
the judgments of a peer group selected for their similar tastes. This approach
is also called social information filtering.

Memory-Based Reasoning and Collaborative Filtering 283

Collaborative filtering automates the process of using word-of-mouth to
decide whether they would like something. Knowing that lots of people liked
something is not enough. Who liked it is also important. Everyone values some
recommendations more highly than others. The recommendation of a close
friend whose past recommendations have been right on target may be enough
to get you to go see a new movie even if it is in a genre you generally dislike.
On the other hand, an enthusiastic recommendation from a friend who thinks
Ace Ventura: Pet Detective is the funniest movie ever made might serve to warn
you off one you might otherwise have gone to see.
Preparing recommendations for a new customer using an automated col­
laborative filtering system has three steps:
1. Building a customer profile by getting the new customer to rate a selec­
tion of items such as movies, songs, or restaurants.
2. Comparing the new customer™s profile with the profiles of other cus­
tomers using some measure of similarity.
3. Using some combination of the ratings of customers with similar pro­
files to predict the rating that the new customer would give to items he
or she has not yet rated.
The following sections examine each of these steps in a bit more detail.

Building Profiles
One challenge with collaborative filtering is that there are often far more items
to be rated than any one person is likely to have experienced or be willing to
rate. That is, profiles are usually sparse, meaning that there is little overlap
among the users™ preferences for making recommendations. Think of a user
profile as a vector with one element per item in the universe of items to be
rated. Each element of the vector represents the profile owner™s rating for the
corresponding item on a scale of “5 to 5 with 0 indicating neutrality and null
values for no opinion.
If there are thousands or tens of thousands of elements in the vector and
each customer decides which ones to rate, any two customers™ profiles are
likely to end up with few overlaps. On the other hand, forcing customers to
rate a particular subset may miss interesting information because ratings of


. 57
( 137 .)