. 55
( 137 .)


in the database, were removed from the text. Because these words are
so common, they provide little information to distinguish between
3. The remaining words were collected into a dictionary of searchable terms.
Each was assigned a weight inversely proportional to its frequency in the
database. The particular weight was the negative of the base 2 log of the
term™s frequency in the training set.
4. Capitalized word pairs, such as “United States” and “New Mexico,” were
identified (automatically) and included in the dictionary of searchable
5. To calculate the relevance feedback score for two stories, the weights of
the searchable terms in both stories were added together. The algorithm
used for this case study included a bonus when searchable terms ap­
peared in close proximity in both stories.
The relevance feedback score is an example of the adaptation of an already-
existing function for use as a distance function. However, the score itself does
not quite fit the definition of a distance function. In particular, a score of 0
indicates that two stories have no words in common, instead of implying that
the stories are identical. The following transformation converts the relevance
feedback score to a function suitable for measuring the “distance” between
news stories:
dclassification (A,B) = 1 “
This is the function used to find the nearest neighbors. Actually, even this is
not a true distance function because d(A,B) is not the same as d(B,A), but it
works well enough.
Memory-Based Reasoning and Collaborative Filtering 269

Table 8.3 Classified Neighbors of a Not-Yet-Classified Story


1 0.076 0.924 R/FE,R/CA,R/CO

2 0.346 0.654 R/FE,R/JA,R/CA

3 0.369 0.631 R/FE,R/JA,R/MI

4 0.393 0.607 R/FE,R/JA,R/CA

The combination function used a weighted summation technique. Since the
maximum distance was 1, the weight was simply one minus the distance, so
weights would be big for neighbors at small distances and small for neighbors
at big distances. For example, say the neighbors of a story had the following
region codes and weights, shown in Table 8.3.
The total score for a code was then the sum of the weights of the neighbors
containing it. Then, codes with scores below a certain threshold value were
eliminated. For instance, the score for R/FE (which is the region code for the
Far East) is the sum of the weights of neighbors 1, 2, 3, and 4, since all of them
contain the R/FE, yielding a score of 2.816. Table 8.4 shows the results for the
six region codes contained by at least one of the four neighbors. For these
examples, a threshold of 1.0 leaves only three codes: R/CA, R/FE, and R/JA.
The particular choice of threshold was based on experimenting with different
values and is not important to understanding MBR.

Table 8.4 Code Scores for the Not-Yet-Classified Story

CODE 1 2 3 4 SCORE

R/CA 0.924 0 0 0.607 1.531

R/CO 0.924 0 0 0 0.924

R/FE 0.924 0.654 0.631 0.607 2.816

R/JA 0 0.654 0.631 0.607 1.892

R/MI 0 0.654 0 0 0.624
270 Chapter 8

Choosing the Number of Neighbors
The investigation varied the number of nearest neighbors between 1 and 11
inclusive. The best results came from using more neighbors. However, this
case study is different from many applications of MBR because it is assigning
multiple categories to each story. The more typical problem is to assign only a
single category or code and fewer neighbors would likely be sufficient for
good results.

The Results
To measure the effectiveness of MBR on coding, the news service had a panel
of editors review all the codes assigned, whether by editors or by MBR, to 200
stories. Only codes agreed upon by a majority of the panel were considered
The comparison of the “correct” codes to the codes originally assigned by
human editors was interesting. Eighty-eight percent of the codes originally
assigned to the stories (by humans) were correct. However, the human editors
made mistakes. A total of 17 percent of the codes originally assigned by human
editors were incorrect as shown in Figure 8.4.
MBR did not do quite as well. For MBR, the corresponding percentages
were 80 percent and 28 percent. That is, 80 percent of the codes assigned by
MBR were correct, but the cost was that 28 percent of the codes assigned were

Codes assigned by panel of experts

17% 88% 12%

Codes assigned by human editors Correct codes
Correct codes in not included in
codes in
classification classification

28% 80% 20%

Codes assigned by MBR

Figure 8.4 A comparison of results by human editors and by MBR on assigning codes
to news stories.
Memory-Based Reasoning and Collaborative Filtering 271

The mix of editors assigning the original codes, though, included novice,
intermediate, and experienced editors. The MBR system actually performed as
well as intermediate editors and better than novice editors. Also, MBR was
using stories classified by the same mix of editors, so the training set was not
consistently coded. Given the inconsistency in the training set, it is surprising
that MBR did as well as it did. The study was not able to investigate using
MBR on a training set whose codes were reviewed by the panel of experts
because there were not enough such stories for a viable training set.
This case study illustrates that MBR can be used for solving difficult prob­
lems that might not easily be solved by other means. Most data mining tech­
niques cannot handle textual data and assigning multiple categories at the
same time is problematic. This case study shows that, with some experimenta­
tion, MBR can produce results comparable to human experts. There is further
discussion of the metrics used to evaluate the performance of a document clas­
sification or retrieval system in the sidebar entitled Measuring the Effectiveness
of Assigning Codes. This study achieved these results with about two person-
months of effort (not counting development of the relevance feedback engine).
By comparison, other automated classification techniques, such as those based
on expert systems, require many person-years of effort to achieve equivalent
results for classifying news stories.

Measuring Distance
Say your travels are going to take you to a small town and you want to know
the weather. If you have a newspaper that lists weather reports for major cities,
what you would typically do is find the weather for cities near the small town.
You might look at the closest city and just take its weather, or do some sort of
combination of the forecasts for, say, the three closest cities. This is an example
of using MBR to find the weather forecast. The distance function being used is
the geographic distance between the two locations. It seems likely that the
Web services that provide a weather forecast for any zip code supplied by a
user do something similar.

What Is a Distance Function?
Distance is the way the MBR measures similarity. For any true distance metric,
the distance from point A to point B, denoted by d(A,B), has four key properties:
1. Well-defined. The distance between two points is always defined and
is a non-negative real number, d(A,B) ≥ 0.
2. Identity. The distance from one point to itself is always zero, so

d(A,A) = 0.

272 Chapter 8

3. Commutativity. Direction does not make a difference, so the distance
from A to B is the same as the distance from B to A: d(A,B) = d(B,A).
This property precludes one-way roads, for instance.
4. Triangle Inequality. Visiting an intermediate point C on the way from
A to B never shortens the distance, so d(A,B) ≥ d(A,C) + d(C,B).
For MBR, the points are really records in a database. This formal definition
of distance is the basis for measuring similarity, but MBR still works pretty
well when some of these constraints are relaxed a bit. For instance, the distance
function in the news story classification case study was not commutative; that
is, the distance from a news story A to another B was not always the same as
the distance from B to A. However, the similarity measure was still useful for
classification purposes.

What makes these properties useful for MBR? The fact that distance is well-

defined implies that every record has a neighbor somewhere in the database”
and MBR needs neighbors in order to work. The identity property makes
distance conform to the intuitive idea that the most similar record to a given
record is the original record itself. Commutativity and the Triangle Inequality
make the nearest neighbors local and well-behaved. Adding a new record into
the database will not bring an existing record any closer. Similarity is a matter

reserved for just two records at a time.
Although the distance measure used to find nearest neighbors is well-
behaved, the set of nearest neighbors can have some peculiar properties. For
instance, the nearest neighbor to a record B may be A, but A may have many
neighbors closer than B, as shown in Figure 8.5. This situation does not pose
problems for MBR.

B™s nearest
neighbor is A.




All these neighbors of X X

A are closer than B. X

Figure 8.5 B™s nearest neighbor is A, but A has many neighbors closer than B.

Memory-Based Reasoning and Collaborative Filtering 273


Recall and precision are two measurements that are useful for determining the
appropriateness of a set of assigned codes or keywords. The case study on
coding news stories, for instance, assigns many codes to news stories. Recall
and precision can be used to evaluate these assignments.
Recall answers the question: “How many of the correct codes did MBR
assign to the story?” It is the ratio of codes assigned by MBR that are correct
(as verified by editors) to the total number of correct codes on the story. If MBR
assigns all available codes to every story, then recall is 100 percent because the
correct codes all get assigned, along with many other irrelevant codes. If MBR
assigns no codes to any story, then recall is 0 percent.
Precision answers the question: “How many of the codes assigned by MBR
were correct?” It is the percentage of correct codes assigned by MBR to the
total number of codes assigned by MBR. Precision is 100 percent when MBR
assigns only correct codes to a story. It is close to 0 percent when MBR assigns
all codes to every story.
Neither recall nor precision individually give the full story of how good the
classification is. Ideally, we want 100 percent recall and 100 percent precision.
Often, it is possible to trade off one against the other. For instance, using more
neighbors increases recall, but decreases precision. Or, raising the threshold
increases precision but decreases recall. Table 8.5 gives some insight into these
measurements for a few specific cases.


. 55
( 137 .)