. 53
( 137 .)


work. A neural network is explained by its weights, and a very complicated
mathematical formula. Unfortunately, making sense of this is beyond our
human powers of comprehension.
Variations on neural networks, such as self-organizing maps, extend the
technology to undirected clustering. Overall neural networks are very power­
ful and can produce good models; they just can™t tell us how they do it.

Nearest Neighbor Approaches:
Memory-Based Reasoning and
Collaborative Filtering

You hear someone speak and immediately guess that she is from Australia.
Why? Because her accent reminds you of other Australians you have met. Or
you try a new restaurant expecting to like it because a friend with good taste
recommended it. Both cases are examples of decisions based on experience.
When faced with new situations, human beings are guided by memories of
similar situations they have experienced in the past. That is the basis for the
data mining techniques introduced in this chapter.
Nearest neighbor techniques are based on the concept of similarity.
Memory-based reasoning (MBR) results are based on analogous situations in
the past”much like deciding that a new friend is Australian based on past
examples of Australian accents. Collaborative filtering adds more information,
using not just the similarities among neighbors, but also their preferences. The
restaurant recommendation is an example of collaborative filtering.
Central to all these techniques is the idea of similarity. What really makes
situations in the past similar to a new situation? Along with finding the simi­
lar records from the past, there is the challenge of combining the informa­
tion from the neighbors. These are the two key concepts for nearest neighbor
This chapter begins with an introduction to MBR and an explanation of how
it works. Since measures of distance and similarity are important to nearest
neighbor techniques, there is a section on distance metrics, including a discus­
sion of the meaning of distance for data types, such as free text, that have no

258 Chapter 8

obvious geometric interpretation. The ideas of MBR are illustrated through a
case study showing how MBR has been used to attach keywords to news sto­
ries. The chapter then looks at collaborative filtering, a popular approach to
making recommendations, especially on the Web. Collaborative filtering is
also based on nearest neighbors, but with a slight twist”instead of grouping
restaurants or movies into neighborhoods, it groups the people recommend­
ing them.

Memory Based Reasoning
The human ability to reason from experience depends on the ability to recog­
nize appropriate examples from the past. A doctor diagnosing diseases, a
claims analyst flagging fraudulent insurance claims, and a mushroom hunter
spotting Morels are all following a similar process. Each first identifies similar
cases from experience and then applies what their knowledge of those cases to
the problem at hand. This is the essence of memory-based reasoning. A data­
base of known records is searched to find preclassified records similar to a new
record. These neighbors are used for classification and estimation.
Applications of MBR span many areas:
Fraud detection. New cases of fraud are likely to be similar to known

cases. MBR can find and flag them for further investigation.

Customer response prediction. The next customers likely to respond

to an offer are probably similar to previous customers who have

responded. MBR can easily identify the next likely customers.

Medical treatments. The most effective treatment for a given patient is
probably the treatment that resulted in the best outcomes for similar
patients. MBR can find the treatment that produces the best outcome.
Classifying responses. Free-text responses, such as those on the U.S. Cen­
sus form for occupation and industry or complaints coming from cus­
tomers, need to be classified into a fixed set of codes. MBR can process
the free-text and assign the codes.
One of the strengths of MBR is its ability to use data “as is.” Unlike other data
mining techniques, it does not care about the format of the records. It only cares
about the existence of two operations: A distance function capable of calculating
a distance between any two records and a combination function capable of com­
bining results from several neighbors to arrive at an answer. These functions
are readily defined for many kinds of records, including records with complex
or unusual data types such as geographic locations, images, and free text that
Memory-Based Reasoning and Collaborative Filtering 259

are usually difficult to handle with other analysis techniques. A case study
later in the chapter shows MBR™s successful application to the classification of
news stories”an example that takes advantage of the full text of the news
story to assign subject codes.
Another strength of MBR is its ability to adapt. Merely incorporating new
data into the historical database makes it possible for MBR to learn about new
categories and new definitions of old ones. MBR also produces good results
without a long period devoted to training or to massaging incoming data into
the right format.
These advantages come at a cost. MBR tends to be a resource hog since a
large amount of historical data must be readily available for finding neighbors.
Classifying new records can require processing all the historical records to find
the most similar neighbors”a more time-consuming process than applying an
already-trained neural network or an already-built decision tree. There is also
the challenge of finding good distance and combination functions, which often
requires a bit of trial and error and intuition.

Example: Using MBR to Estimate
Rents in Tuxedo, New York
The purpose of this example is to illustrate how MBR works by estimating the
cost of renting an apartment in the target town by combining data on rents in
several similar towns”its nearest neighbors.
MBR works by first identifying neighbors and then combining information
from them. Figure 8.1 illustrates the first of these steps. The goal is to make
predictions about the town of Tuxedo in Orange County, New York by looking
at its neighbors. Not its geographic neighbors along the Hudson and Delaware
rivers, rather its neighbors based on descriptive variables”in this case, popu­
lation and median home value. The scatter plot shows New York towns
arranged by these two variables. Figure 8.1 shows that measured this way,
Brooklyn and Queens are close neighbors, and both are far from Manhattan.
Although Manhattan is nearly as populous as Brooklyn and Queens, its home
prices put it in a class by itself.

T I P Neighborhoods can be found in many dimensions. The choice of
dimensions determines which records are close to one another. For some
purposes, geographic proximity might be important. For other purposes home
price or average lot size or population density might be more important. The
choice of dimensions and the choice of a distance metric are crucial to any
nearest-neighbor approach.
260 Chapter 8

The first stage of MBR finds the closest neighbor on the scatter plot shown
in Figure 8.1. Then the next closest neighbor is found, and so on until the
desired number are available. In this case, the number of neighbors is two and
the nearest ones turn out to be Shelter Island (which really is an island) way
out by the tip of Long Island™s North Fork, and North Salem, a town in North­
ern Westchester near the Connecticut border. These towns fall at about the
middle of a list sorted by population and near the top of one sorted by home
value. Although they are many miles apart, along these two dimensions, Shel­
ter Island and North Salem are very similar to Tuxedo.
Once the neighbors have been located, the next step is to combine informa­
tion from the neighbors to infer something about the target. For this example,
the goal is to estimate the cost of renting a house in Tuxedo. There is more than
one reasonable way to combine data from the neighbors. The census provides
information on rents in two forms. Table 8.1 shows what the 2000 census
reports about rents in the two towns selected as neighbors. For each town,
there is a count of the number of households paying rent in each of several
price bands as well as the median rent for each town. The challenge is to figure
out how best to use this data to characterize rents in the neighbors and then
how to combine information from the neighbors to come up with an estimate
that characterizes rents in Tuxedo in the same way.
Tuxedo™s nearest neighbors, the towns of North Salem and Shelter Island,
have quite different distributions of rents even though the median rents are
similar. In Shelter Island, a plurality of homes, 34.6 percent, rent in the $500 to
$750 range. In the town of North Salem, the largest number of homes, 30.9 per­
cent, rent in the $1,000 to $1,500 range. Furthermore, while only 3.1 percent of
homes in Shelter Island rent for over $1,500, 24.2 percent of homes in North
Salem do. On the other hand, at $804, the median rent in Shelter Island is above
the $750 ceiling of the most common range, while the median rent in North
Salem, $1,150, is below the floor of the most common range for that town. If
the average rent were available, it too would be a good candidate for character­
izing the rents in the various towns.

Table 8.1 The Neighbors

POPULA­ MEDIAN <$500 $750 $1500 $1000 >$1500 RENT
TOWN TION RENT (%) (%) (%) (%) (%) (%)

Shelter 2228 $804 3.1 34.6 31.4 10.7 3.1 17

North 5173 $1150 3 10.2 21.6 30.9 24.2 10.2
Population vs Home Value

New York




North Salem,

Median Home Value
Orange Brooklyn,

Shelter Island,

0 2 4 6 8 10 12 14 16

Log Population

Figure 8.1 Based on 2000 census population and home value, the town of Tuxedo
in Orange County has Shelter Island and North Salem as its two nearest neighbors.
Memory-Based Reasoning and Collaborative Filtering
262 Chapter 8

One possible combination function would be to average the most common
rents of the two neighbors. Since only ranges are available, we use the mid­
points. For Shelter Island, the midpoint of the most common range is $1,000.
For North Salem, it is $1,250. Averaging the two leads to an estimate for rent in
Tuxedo of $1,125. Another combination function would pick the point midway
between the two median rents. This second method leads to an estimate of
$977 for rents in Tuxedo.
As it happens, a plurality of rents in Tuxedo are in the $1,000 to $1,500 range
with the midpoint at $1,250. The median rent in Tuxedo is $907. So, averaging
the medians slightly overestimates the median rent in Tuxedo and averaging
the most common rents slightly underestimates the most common rent in
Tuxedo. It is hard to say which is better. The moral is that there is not always
an obvious “best” combination function.

Challenges of MBR
In the simple example just given, the training set consisted of all towns in New
York, each described by a handful of numeric fields such as the population,
median home value, and median rent. Distance was determined by placement

on a scatter plot with axes scaled to appropriate ranges, and the number of
neighbors arbitrarily set to two. The combination function was a simple
All of these choices seem reasonable. In general, using MBR involves several
1. Choosing an appropriate set of training records
2. Choosing the most efficient way to represent the training records
3. Choosing the distance function, the combination function, and the

number of neighbors

Let™s look at each of these in turn.


. 53
( 137 .)