so common, they provide little information to distinguish between

documents.

3. The remaining words were collected into a dictionary of searchable terms.

Each was assigned a weight inversely proportional to its frequency in the

database. The particular weight was the negative of the base 2 log of the

term's frequency in the training set.

4. Capitalized word pairs, such as “United States” and “New Mexico,” were

identified (automatically) and included in the dictionary of searchable

terms.

5. To calculate the relevance feedback score for two stories, the weights of

the searchable terms in both stories were added together. The algorithm

used for this case study included a bonus when searchable terms appeared in close proximity in both stories.
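The weighting and scoring steps above can be sketched in Python. This is a minimal illustration, not the news service's actual code: it omits the proximity bonus from step 5 and assumes stop words have already been removed.

```python
import math
from collections import Counter

def term_weights(training_docs):
    """Weight each searchable term by the negative base-2 log of its
    frequency (fraction of documents containing it) in the training set."""
    n_docs = len(training_docs)
    doc_freq = Counter()
    for doc in training_docs:
        doc_freq.update(set(doc))          # count each term once per document
    return {term: -math.log2(count / n_docs)
            for term, count in doc_freq.items()}

def relevance_score(story_a, story_b, weights):
    """Add together the weights of searchable terms appearing in both
    stories.  The proximity bonus described in step 5 is omitted here."""
    shared = set(story_a) & set(story_b) & set(weights)
    return sum(weights[term] for term in shared)
```

Rare terms get large weights (a term in one document out of four weighs 2.0), while terms appearing in every document weigh 0, so common words contribute nothing to the score.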

The relevance feedback score is an example of the adaptation of an already-

existing function for use as a distance function. However, the score itself does

not quite fit the definition of a distance function. In particular, a score of 0

indicates that two stories have no words in common, instead of implying that

the stories are identical. The following transformation converts the relevance

feedback score to a function suitable for measuring the “distance” between

news stories:

dclassification(A,B) = 1 - score(A,B) / score(A,A)

This is the function used to find the nearest neighbors. Actually, even this is

not a true distance function because d(A,B) is not the same as d(B,A), but it

works well enough.
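The transformation can be sketched directly, assuming some scoring function score(a, b) is already available:

```python
def mbr_distance(score, a, b):
    """Convert a relevance feedback score into a distance-like value.

    score(a, a) is a story's similarity to itself, so identical stories
    get distance 0 and stories with no terms in common get distance 1.
    The result is not symmetric: the denominator depends on the first
    argument, which is why this is not a true distance function.
    """
    return 1.0 - score(a, b) / score(a, a)
```

For example, with a toy score that simply counts shared terms, a two-term story compared against a one-term story gives a different distance in each direction, illustrating the asymmetry noted above.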

Memory-Based Reasoning and Collaborative Filtering 269

Table 8.3 Classified Neighbors of a Not-Yet-Classified Story

NEIGHBOR   DISTANCE   WEIGHT   CODES
1          0.076      0.924    R/FE, R/CA, R/CO
2          0.346      0.654    R/FE, R/JA, R/CA
3          0.369      0.631    R/FE, R/JA, R/MI
4          0.393      0.607    R/FE, R/JA, R/CA

The combination function used a weighted summation technique. Since the

maximum distance was 1, the weight was simply one minus the distance, so

weights would be big for neighbors at small distances and small for neighbors

at big distances. For example, say the neighbors of a story had the region codes and weights shown in Table 8.3.

The total score for a code was then the sum of the weights of the neighbors

containing it. Then, codes with scores below a certain threshold value were

eliminated. For instance, the score for R/FE (which is the region code for the

Far East) is the sum of the weights of neighbors 1, 2, 3, and 4, since all of them

contain R/FE, yielding a score of 2.816. Table 8.4 shows the results for the

six region codes contained by at least one of the four neighbors. For these

examples, a threshold of 1.0 leaves only three codes: R/CA, R/FE, and R/JA.

The particular choice of threshold was based on experimenting with different

values and is not important to understanding MBR.

Table 8.4 Code Scores for the Not-Yet-Classified Story

CODE   1       2       3       4       SCORE
R/CA   0.924   0       0       0.607   1.531
R/CO   0.924   0       0       0       0.924
R/FE   0.924   0.654   0.631   0.607   2.816
R/JA   0       0.654   0.631   0.607   1.892
R/MI   0       0.654   0       0       0.654

270 Chapter 8

Choosing the Number of Neighbors

The investigation varied the number of nearest neighbors between 1 and 11

inclusive. The best results came from using more neighbors. However, this

case study is different from many applications of MBR because it is assigning

multiple categories to each story. The more typical problem is to assign only a

single category or code, and fewer neighbors would likely be sufficient for good results.

The Results

To measure the effectiveness of MBR on coding, the news service had a panel

of editors review all the codes assigned, whether by editors or by MBR, to 200

stories. Only codes agreed upon by a majority of the panel were considered

“correct.”

The comparison of the “correct” codes to the codes originally assigned by

human editors was interesting. Eighty-eight percent of the codes originally

assigned to the stories (by humans) were correct. However, the human editors

made mistakes. A total of 17 percent of the codes originally assigned by human

editors were incorrect, as shown in Figure 8.4.

MBR did not do quite as well. For MBR, the corresponding percentages

were 80 percent and 28 percent. That is, 80 percent of the codes assigned by

MBR were correct, but the cost was that 28 percent of the codes assigned were

incorrect.

Figure 8.4 A comparison of results by human editors and by MBR on assigning codes to news stories. (Measured against the codes assigned by the panel of experts: the human editors' classifications contained 17 percent incorrect codes and 88 percent of the correct codes, leaving 12 percent of the correct codes not included; the MBR classifications contained 28 percent incorrect codes and 80 percent of the correct codes, leaving 20 percent not included.)


The mix of editors assigning the original codes, though, included novice,

intermediate, and experienced editors. The MBR system actually performed as

well as intermediate editors and better than novice editors. Also, MBR was

using stories classified by the same mix of editors, so the training set was not

consistently coded. Given the inconsistency in the training set, it is surprising

that MBR did as well as it did. The study was not able to investigate using

MBR on a training set whose codes were reviewed by the panel of experts

because there were not enough such stories for a viable training set.

This case study illustrates that MBR can be used for solving difficult problems that might not easily be solved by other means. Most data mining techniques cannot handle textual data, and assigning multiple categories at the same time is problematic. This case study shows that, with some experimentation, MBR can produce results comparable to human experts. There is further discussion of the metrics used to evaluate the performance of a document classification or retrieval system in the sidebar entitled Measuring the Effectiveness of Assigning Codes. This study achieved these results with about two person-months of effort (not counting development of the relevance feedback engine).

By comparison, other automated classification techniques, such as those based

on expert systems, require many person-years of effort to achieve equivalent

results for classifying news stories.

Measuring Distance

Say your travels are going to take you to a small town and you want to know

the weather. If you have a newspaper that lists weather reports for major cities,

what you would typically do is find the weather for cities near the small town.

You might look at the closest city and just take its weather, or do some sort of

combination of the forecasts for, say, the three closest cities. This is an example

of using MBR to find the weather forecast. The distance function being used is

the geographic distance between the two locations. It seems likely that the

Web services that provide a weather forecast for any zip code supplied by a

user do something similar.
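As a toy sketch of this weather analogy (the city names and coordinates are invented, and straight-line distance on a plane stands in for true geographic distance):

```python
def nearest_cities(target, cities, k=3):
    """Return the k cities closest to the target location, ranked by
    squared straight-line distance (a stand-in for geographic distance)."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return sorted(cities, key=lambda name: dist2(cities[name], target))[:k]

# Hypothetical cities with (x, y) coordinates; the small town is at (1.0, 0.0).
cities = {
    "Alpha": (0.0, 0.0),
    "Beta":  (1.0, 1.0),
    "Gamma": (5.0, 5.0),
    "Delta": (1.5, 0.5),
}
```

Combining the forecasts of the returned cities (for example, by majority vote or averaging) is exactly the MBR combination step described later in the chapter.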

What Is a Distance Function?

Distance is the way MBR measures similarity. For any true distance metric, the distance from point A to point B, denoted by d(A,B), has four key properties:

1. Well-defined. The distance between two points is always defined and

is a non-negative real number, d(A,B) ≥ 0.

2. Identity. The distance from one point to itself is always zero, so

d(A,A) = 0.


3. Commutativity. Direction does not make a difference, so the distance

from A to B is the same as the distance from B to A: d(A,B) = d(B,A).

This property precludes one-way roads, for instance.

4. Triangle Inequality. Visiting an intermediate point C on the way from A to B never shortens the distance, so d(A,B) ≤ d(A,C) + d(C,B).
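These four properties can be spot-checked numerically. The sketch below tests ordinary Euclidean distance between records represented as coordinate tuples; the function names are our own:

```python
import math

def euclidean(a, b):
    """Ordinary straight-line distance between two coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_metric(d, points, tol=1e-9):
    """Spot-check the four distance properties on a sample of points."""
    for a in points:
        if d(a, a) > tol:                              # 2. identity
            return False
        for b in points:
            if d(a, b) < 0:                            # 1. well-defined
                return False
            if abs(d(a, b) - d(b, a)) > tol:           # 3. commutativity
                return False
            for c in points:
                if d(a, b) > d(a, c) + d(c, b) + tol:  # 4. triangle inequality
                    return False
    return True
```

A check like this only samples points, so it can refute the metric properties but never prove them; still, it is a quick sanity test for a home-grown distance function.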

For MBR, the points are really records in a database. This formal definition

of distance is the basis for measuring similarity, but MBR still works pretty

well when some of these constraints are relaxed a bit. For instance, the distance

function in the news story classification case study was not commutative; that

is, the distance from a news story A to another B was not always the same as

the distance from B to A. However, the similarity measure was still useful for

classification purposes.

What makes these properties useful for MBR? The fact that distance is well-defined implies that every record has a neighbor somewhere in the database, and MBR needs neighbors in order to work. The identity property makes distance conform to the intuitive idea that the most similar record to a given record is the original record itself. Commutativity and the Triangle Inequality make the nearest neighbors local and well-behaved. Adding a new record into the database will not bring an existing record any closer. Similarity is a matter reserved for just two records at a time.

Although the distance measure used to find nearest neighbors is well-

behaved, the set of nearest neighbors can have some peculiar properties. For

instance, the nearest neighbor to a record B may be A, but A may have many

neighbors closer than B, as shown in Figure 8.5. This situation does not pose

problems for MBR.

Figure 8.5 B's nearest neighbor is A, but A has many neighbors closer than B.



MEASURING THE EFFECTIVENESS OF ASSIGNING CODES:

RECALL AND PRECISION

Recall and precision are two measurements that are useful for determining the

appropriateness of a set of assigned codes or keywords. The case study on

coding news stories, for instance, assigns many codes to news stories. Recall

and precision can be used to evaluate these assignments.

Recall answers the question: “How many of the correct codes did MBR

assign to the story?” It is the ratio of codes assigned by MBR that are correct

(as verified by editors) to the total number of correct codes on the story. If MBR

assigns all available codes to every story, then recall is 100 percent because the

correct codes all get assigned, along with many other irrelevant codes. If MBR

assigns no codes to any story, then recall is 0 percent.

Precision answers the question: “How many of the codes assigned by MBR were correct?” It is the ratio of correct codes assigned by MBR to the total number of codes assigned by MBR. Precision is 100 percent when MBR

assigns only correct codes to a story. It is close to 0 percent when MBR assigns

all codes to every story.
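Given the set of codes assigned to a story and the set of correct codes, both measures reduce to simple set arithmetic; a minimal sketch:

```python
def recall_precision(assigned, correct):
    """Recall: fraction of the correct codes that were assigned.
    Precision: fraction of the assigned codes that are correct."""
    hits = len(assigned & correct)
    recall = hits / len(correct) if correct else 0.0
    precision = hits / len(assigned) if assigned else 0.0
    return recall, precision
```

For instance, if a story's correct codes are {R/FE, R/JA, R/CA, R/CO} and MBR assigns {R/FE, R/JA, R/MI}, two of four correct codes were found (recall 50 percent) and two of three assigned codes were correct (precision about 67 percent).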

Neither recall nor precision individually gives the full story of how good the classification is. Ideally, we want 100 percent recall and 100 percent precision.

Often, it is possible to trade off one against the other. For instance, using more

neighbors increases recall, but decreases precision. Or, raising the threshold

increases precision but decreases recall. Table 8.5 gives some insight into these

measurements for a few specific cases.