. 58
( 137 .)


more obscure items may say more about the customer than ratings of common
ones. A fondness for the Beatles is less revealing than a fondness for Mose
A reasonable approach is to have new customers rate a list of the twenty or
so most frequently rated items (a list that might change over time) and then
free them to rate as many additional items as they please.
284 Chapter 8

Comparing Profiles
Once a customer profile has been built, the next step is to measure its distance
from other profiles. The most obvious approach would be to treat the profile
vectors as geometric points and calculate the Euclidean distance between
them, but many other distance measures have been tried. Some give higher
weight to agreement when users give a positive rating especially when most
users give negative ratings to most items. Still others apply statistical correla­
tion tests to the ratings vectors.

Making Predictions
The final step is to use some combination of nearby profiles in order to come
up with estimated ratings for the items that the customer has not rated. One
approach is to take a weighted average where the weight is inversely propor­
tional to the distance. The example shown in Figure 8.7 illustrates estimating
the rating that Nathaniel would give to Planet of the Apes based on the opinions
of his neighbors, Simon and Amelia.



Crouching Tiger
Apocalypse Now
Vertical Ray of Sun
Planet Of The Apes “1 Crouching Tiger
Osmosis Jones Apocalypse Now
American Pie 2 Vertical Ray of Sun
Plan 9 From Outer Space Planet Of The Apes “4
. Osmosis Jones
American Pie 2
Plan 9 From Outer Space


Figure 8.7 The predicted rating for Planet of the Apes is “2.66.
Memory-Based Reasoning and Collaborative Filtering 285

Simon, who is distance 2 away, gave that movie a rating of “1. Amelia, who
is distance 4 away, gave that movie a rating of “4. No one else™s profile is close
enough to Nathaniel™s to be included in the vote. Because Amelia is twice as
far away as Simon, her vote counts only half as much as his. The estimate for
Nathaniel™s rating is weighted by the distance:
(1„2 (“1) + 1„4 (“4)) / (1„2 +1„4)= “1.5/0.75= “2.
A good collaborative filtering system gives its users a chance to comment on
the predictions and adjust the profile accordingly. In this example, if Nathaniel
rents the video of Planet of the Apes despite the prediction that he will not like
it, he can then enter an actual rating of his own. If it turns out that he really
likes the movie and gives it a rating of 4, his new profile will be in a slightly
different neighborhood and Simon™s and Amelia™s opinions will count less for
Nathaniel™s next recommendation.

Lessons Learned
Memory based reasoning is a powerful data mining technique that can be used
to solve a wide variety of data mining problems involving classification or
estimation. Unlike other data mining techniques that use a training set of pre-
classified data to create a model and then discard the training set, for MBR, the
training set essentially is the model.
Choosing the right training set is perhaps the most important step in MBR.
The training set needs to include sufficient numbers of examples all possible
classifications. This may mean enriching it by including a disproportionate
number of instances for rare classifications in order to create a balanced train­
ing set with roughly the same number of instances for all categories. A training
set that includes only instances of bad customers will predict that all cus­
tomers are bad. In general, the size of the training set should have at least thou­
sands, if not hundreds of thousands or millions, of examples.
MBR is a k-nearest neighbors approach. Determining which neighbors are
near requires a distance function. There are many approaches to measuring the
distance between two records. The careful choice of an appropriate distance
function is a critical step in using MBR. The chapter introduced an approach to
creating an overall distance function by building a distance function for each
field and normalizing it. The normalized field distances can then be combined
in a Euclidean fashion or summed to produce a Manhattan distance.
When the Euclidean method is used, a large difference in any one field is
enough to cause two records to be considered far apart. The Manhattan method
is more forgiving”a large difference on one field can more easily be offset by
close values on other fields. A validation set can be used to pick the best dis­
tance function for a given model set by applying all candidates to see which
286 Chapter 8

produces better results. Sometimes, the right choice of neighbors depends on
modifying the distance function to favor some fields over others. This is easily
accomplished by incorporating weights into the distance function.
The next question is the number of neighbors to choose. Once again, inves­
tigating different numbers of neighbors using the validation set can help
determine the optimal number. There is no right number of neighbors. The
number depends on the distribution of the data and is highly dependent on
the problem being solved.
The basic combination function, weighted voting, does a good job for cate­
gorical data, using weights inversely proportional to distance. The analogous
operation for estimating numeric values is a weighted average.
One good application for memory based reasoning is making recommenda­
tions. Collaborative filtering is an approach to making recommendations that
works by grouping people with similar tastes together using a distance func­
tion that can compare two lists user-supplied ratings. Recommendations for a
new person are calculated using a weighted average of the ratings of his or her
nearest neighbors.

Market Basket Analysis
and Association Rules

To convey the fundamental ideas of market basket analysis, start with the
image of the shopping cart in Figure 9.1 filled with various products pur­
chased by someone on a quick trip to the supermarket. This basket contains an
assortment of products”orange juice, bananas, soft drink, window cleaner,
and detergent. One basket tells us about what one customer purchased at one
time. A complete list of purchases made by all customers provides much more
information; it describes the most important part of a retailing business”what
merchandise customers are buying and when.
Each customer purchases a different set of products, in different quantities,
at different times. Market basket analysis uses the information about what cus­
tomers purchase to provide insight into who they are and why they make cer­
tain purchases. Market basket analysis provides insight into the merchandise
by telling us which products tend to be purchased together and which are
most amenable to promotion. This information is actionable: it can suggest
new store layouts; it can determine which products to put on special; it can
indicate when to issue coupons, and so on. When this data can be tied to indi­
vidual customers through a loyalty card or Web site registration, it becomes
even more valuable.
The data mining technique most closely allied with market basket analysis
is the automatic generation of association rules. Association rules represent
patterns in the data without a specified target. As such, they are an example of
undirected data mining. Whether the patterns make sense is left to human
288 Chapter 9

In this shopping basket, the shopper purchased
a quart of orange juice, some bananas, dish
detergent, some window cleaner, and a six
pack of soda.

Is soda typically purchased with
How do the
bananas? Does the brand of soda
demographics of the
make a difference?
neighborhood affect
what customers buy?

What should be in the
basket but is not?
Are window cleaning products
purchased when detergent and orange
juice are bought together?
Figure 9.1 Market basket analysis helps you understand customers as well as items that
are purchased together.

Association rules were originally derived from point-of-sale data that
describes what products are purchased together. Although its roots are in ana­
lyzing point-of-sale transactions, association rules can be applied outside the
retail industry to find relationships among other types of “baskets.” Some
examples of potential applications are:
Items purchased on a credit card, such as rental cars and hotel rooms,

provide insight into the next product that customers are likely to
Optional services purchased by telecommunications customers (call

waiting, call forwarding, DSL, speed call, and so on) help determine
how to bundle these services together to maximize revenue.
Banking services used by retail customers (money market accounts,

CDs, investment services, car loans, and so on) identify customers
likely to want other services.
Unusual combinations of insurance claims can be a sign of fraud and

can spark further investigation.
Medical patient histories can give indications of likely complications

based on certain combinations of treatments.
Association rules often fail to live up to expectations. In our experience,
for instance, they are not a good choice for building cross-selling models in
Market Basket Analysis and Association Rules 289

industries such as retail banking, because the rules end up describing previous
marketing promotions. Also, in retail banking, customers typically start with a
checking account and then a savings account. Differentiation among products
does not appear until customers have more products. This chapter covers the
pitfalls as well as the uses of association rules.
The chapter starts with an overview of market basket analysis, including
more basic analyses of market basket data that do not require association rules.
It then dives into association rules, explaining how they are derived. The chap­
ter then continues with ways to extend association rules to include other facets
of the market basket analysis.

Defining Market Basket Analysis
Market basket analysis does not refer to a single technique; it refers to a set of
business problems related to understanding point-of-sale transaction data.
The most common technique is association rules, and much of this chapter
delves into that subject. Before talking about association rules, this section
talks about market basket data.

Three Levels of Market Basket Data
Market basket data is transaction data that describes three fundamentally
different entities:

Orders (also called purchases or baskets or, in academic papers, item sets)


In a relational database, the data structure for market basket data often
looks similar to Figure 9.2. This data structure includes four important entities.



. 58
( 137 .)