


Choosing a Balanced Set of Historical Records
The training set is a set of historical records. It needs to provide good coverage
of the population so that the nearest neighbors of an unknown record are use­
ful for predictive purposes. A random sample may not provide sufficient cov­
erage for all values. Some categories are much more frequent than others and
the more frequent categories dominate the random sample.
For instance, fraudulent transactions are much rarer than non-fraudulent
transactions, heart disease is much more common than liver cancer, news stories
about the computer industry are more common than stories about plastics, and so on.





To achieve balance, the training set should, if possible, contain roughly equal
numbers of records representing the different categories.

T I P When selecting the training set for MBR, be sure that each category has
roughly the same number of records supporting it. As a general rule of thumb,
several dozen records for each category are a minimum to get adequate
support and hundreds or thousands of examples are not unusual.
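
As an illustration of how such a balanced set might be assembled, here is a minimal Python sketch (the record list, the category accessor, and the per-category sample size are all hypothetical) that downsamples the more frequent categories so that each category contributes roughly the same number of records:

import random
from collections import defaultdict

def balanced_training_set(records, category_of, per_category=200, seed=0):
    """Downsample frequent categories so each contributes at most
    per_category records; rare categories keep everything they have."""
    by_category = defaultdict(list)
    for rec in records:
        by_category[category_of(rec)].append(rec)

    rng = random.Random(seed)
    balanced = []
    for category, recs in by_category.items():
        if len(recs) > per_category:
            balanced.extend(rng.sample(recs, per_category))
        else:
            balanced.extend(recs)
    return balanced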




Representing the Training Data
The performance of MBR in making predictions depends on how the training
set is represented. The scatter plot approach illustrated in Figure 8.2 works for
two or three variables and a small number of records, but it does not scale well.
The simplest method for finding nearest neighbors requires finding the dis­
tance from the unknown case to each of the records in the training set and
choosing the training records with the smallest distances. As the number of
records grows, the time needed to find the neighbors for a new record grows
quickly.
This is especially true if the records are stored in a relational database. In this
case, the query looks something like:

SELECT distance(), rec.category
FROM historical_records rec
ORDER BY 1 ASC;



The notation distance() stands in for whatever the particular distance function
happens to be. In this case, all the historical records need to be sorted just
to get the handful needed as nearest neighbors. This requires a full-table
scan plus a sort, quite an expensive pair of operations. It is possible to
eliminate the sort by walking through the table and maintaining a separate list
of the nearest records seen so far, inserting and deleting entries as appropriate.
Unfortunately, this approach is not readily expressible in SQL without using a
procedural language.
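
In a procedural language, however, the idea is easy to express. The following Python sketch (the distance function is a placeholder, just as distance() is in the query above) makes a single pass over the historical records and keeps only the k nearest neighbors found so far in a small heap, so no full sort is ever performed:

import heapq

def k_nearest(unknown, historical, distance, k=5):
    """Return the k historical records closest to the unknown record.
    A bounded heap of size k replaces the full sort; the heap keeps
    the farthest of the current neighbors on top so it can be evicted."""
    heap = []  # entries are (-dist, index, record)
    for i, rec in enumerate(historical):
        d = distance(unknown, rec)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i, rec))
        elif heap and -d > heap[0][0]:  # closer than the current worst neighbor
            heapq.heapreplace(heap, (-d, i, rec))
    return [rec for neg_d, i, rec in sorted(heap, reverse=True)]  # nearest first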
The performance of relational databases is pretty good nowadays. The chal­
lenge with scoring data for MBR is that each case being scored needs to be
compared against every case in the database. Scoring a single new record does
not take much time, even when there are millions of historical records. How­
ever, scoring many new records can have poor performance.
Another way to make MBR more efficient is to reduce the number of records
in the training set. Figure 8.2 shows a scatter plot for categorical data. This
graph has a well-defined boundary between the two regions. The points above
the line are all diamonds and those below the line are all circles. Although this
graph has forty points in it, most of the points are redundant. That is, they are
not really necessary for classification purposes.


Figure 8.2 Perhaps the cleanest training set for MBR is one that divides neatly into two
disjoint sets.


Figure 8.3 shows that only eight points are needed to get the same
results. Given that the size of the training set has such a large influence on the
performance of MBR, being able to reduce its size is a significant performance
boost.
How can this reduced set of records be found? The most practical method is
to look for clusters containing records belonging to different categories. The
centers of the clusters can then be used as a reduced set. This works well when
the different categories are quite separate. However, when there is some over­
lap and the categories are not so well-defined, using clusters to reduce the size
of the training set can cause MBR to produce poor results. Finding an optimal
set of “support records” has been an area of recent research. When such an
optimal set can be found, the historical records can sometimes be reduced to
the level where they fit inside a spreadsheet, making it quite efficient to apply
MBR to new records on less powerful machines.
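
As an illustration of the clustering approach, the sketch below uses scikit-learn's KMeans (one convenient choice, not a method prescribed by the text) to replace each category's records with a handful of cluster centers; the numeric feature matrix and the number of centers per category are assumptions:

import numpy as np
from sklearn.cluster import KMeans

def reduce_training_set(X, y, centers_per_category=4, seed=0):
    """Replace each category's records with a few cluster centers.
    X is a numeric array of records and y the matching category labels;
    the function returns a much smaller (X, y) pair for use with MBR."""
    reduced_X, reduced_y = [], []
    for category in np.unique(y):
        members = X[y == category]
        k = min(centers_per_category, len(members))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(members)
        reduced_X.append(km.cluster_centers_)
        reduced_y.extend([category] * k)
    return np.vstack(reduced_X), np.array(reduced_y)

As the text cautions, this shortcut is only safe when the categories are well separated; with overlapping categories, the cluster centers can misrepresent the boundary and degrade MBR's accuracy.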


Figure 8.3 This smaller set of points returns the same results as in Figure 8.2 using MBR.



Determining the Distance Function, Combination
Function, and Number of Neighbors
The distance function, combination function, and number of neighbors are the
key ingredients in using MBR. The same set of historical records can prove
very useful or not at all useful for predictive purposes, depending on these
choices. Fortunately, simple distance functions and combination functions usually
work quite well. Before discussing these issues in detail, let's look at a
case study.


Case Study: Classifying News Stories
This case study uses MBR to assign classification codes to news stories and is
based on work conducted by one of the authors. The results from this case
study show that MBR can perform as well as people on a problem involving
hundreds of categories and a difficult-to-use type of data: free text.1

1
This case study is a summarization of research conducted by one of the authors. Complete details
are available in the article “Classifying News Stories using Memory Based Reasoning,” by David
Waltz, Brij Masand, and Gordon Linoff, in Proceedings, SIGIR '92, published by ACM Press.


What Are the Codes?
The classification codes are keywords used to describe the content of news sto­
ries. These codes are added to stories by a news retrieval service to help users
search for stories of interest. They help automate the process of routing partic­
ular stories to particular customers and help implement personalized profiles.
For instance, an industry analyst who specializes in the automotive industry
(or anyone else with an interest in the topic) can simplify searches by looking
for documents with the “automotive industry” code. Because knowledgeable
experts, also known as editors, set up the codes, the right stories are retrieved.
Editors or expert systems have traditionally assigned these codes. This case
study investigated the use of MBR for this purpose.
The codes used in this study fall into six categories:
- Government Agency
- Industry
- Market Sector
- Product
- Region
- Subject


The data contained 361 separate codes, distributed as follows in the training
set (Table 8.2).
The number and types of codes assigned to stories varied. Almost all the
stories had region and subject codes; on average, there were almost three region
codes per story. At the other extreme, relatively few stories contained govern­ment
and product codes, and such stories rarely had more than one such code.

Table 8.2 Six Types of Codes Used to Classify News Stories

CATEGORY             # CODES    # DOCS    # OCCURRENCES
Government (G/)           28     3,926            4,200
Industry (I/)            112    38,308           57,430
Market Sector (M/)         9    38,562           42,058
Product (P/)              21     2,242            2,523
Region (R/)              121    47,083          116,358
Subject (N/)              70    41,902           52,751


Applying MBR
This section explains how MBR facilitated assigning codes to news stories for
a news service. The important steps were:
1. Choosing the training set
2. Determining the distance function
3. Choosing the number of nearest neighbors
4. Determining the combination function
The following sections discuss each of these steps in turn.

Choosing the Training Set
The training set consisted of 49,652 news stories, provided by the news
retrieval service for this purpose. These stories came from about three months
of news and from almost 100 different sources. Each story contained, on aver­
age, 2,700 words and had eight codes assigned to it. The training set was not
specially created, so the frequency of codes in the training set varied a great
deal, mimicking the overall frequency of codes in news stories in general.
Although this training set yielded good results, a better-constructed training
set with more examples of the less common codes would probably have per­
formed even better.

Choosing the Distance Function
The next step is choosing the distance function. In this case, a distance function
already existed, based on a notion called relevance feedback that measures the
similarity of two documents based on the words they contain. Relevance feed­
back, which is described more fully in the sidebar, was originally designed to
return documents similar to a given document, as a way of refining searches.
The most similar documents are the neighbors used for MBR.
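
The engine's actual scoring is described more fully in the sidebar; as a rough stand-in only, the sketch below treats each story as a bag of words, measures similarity as the cosine of the angle between word-count vectors, and turns that into a distance usable by MBR:

import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Bag-of-words cosine similarity between two lists of tokens."""
    a, b = Counter(doc_a), Counter(doc_b)
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def document_distance(doc_a, doc_b):
    """Similar documents are close together; identical ones are at distance 0."""
    return 1.0 - cosine_similarity(doc_a, doc_b)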

Choosing the Combination Function
The next decision is the combination function. Assigning classification codes
to news stories is a bit different from most classification problems. Most classi­
fication problems are looking for the single best solution. However, news sto­
ries can have multiple codes, even from the same category. The ability to adapt
MBR to this problem highlights its flexibility.
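
One simple way to build such a combination function (a sketch only, not the scoring actually used in the study) is to let each neighbor vote for every code it carries, weighted by its similarity to the new story, and then assign any code whose total weight clears a threshold:

from collections import defaultdict

def assign_codes(neighbors, threshold=1.0):
    """Combine neighbors into a set of codes rather than a single class.
    neighbors is a list of (similarity, codes) pairs for the nearest
    stories; any code whose summed weight reaches the threshold is
    assigned, so one story can receive several codes."""
    weights = defaultdict(float)
    for similarity, codes in neighbors:
        for code in codes:
            weights[code] += similarity
    return {code for code, weight in weights.items() if weight >= threshold}

Because several codes can clear the threshold at once, a single story can receive multiple codes, which is exactly the behavior this problem requires.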



USING RELEVANCE FEEDBACK TO CREATE A DISTANCE FUNCTION

Relevance feedback is a powerful technique that allows users to refine
searches on text databases by asking the database to return documents similar
to one they already have. (Hubs and authorities, another method for improving
search results on hyperlinked web pages, is described in Chapter 10.) In the
course of doing this, the text database scores all the other documents in the
database and returns those that are most similar, along with a measure of
similarity. This is the relevance feedback score, which can be used as the basis
for a distance measure for MBR.
In the case study, the calculation of the relevance feedback score went as
follows:
1. Common, non-content-bearing words, such as “it,” “and,” and “of,” were
removed from the text of all stories in the training set. A total of 368
words in this category were identified and removed.
2. The next most common words, accounting for 20 percent of the words
