

Figure 9.8 A co-occurrence table in three dimensions can be visualized as a cube.

Building Association Rules
This basic process for finding association rules is illustrated in Figure 9.9.
There are three important concerns in creating association rules:
Choosing the right set of items.

Generating rules by deciphering the counts in the co-occurrence matrix.

Overcoming the practical limits imposed by thousands or tens of thousands of items.
The next three sections delve into these concerns in more detail.

Market Basket Analysis and Association Rules 303

First determine the right set of items and the right level. For instance, is
pizza an item or are the toppings items?

Next, calculate the probabilities and joint probabilities of items and
combinations of interest, perhaps limiting the search using thresholds on
support or value. [Table in figure: Topping, Probability]

Finally, analyze the probabilities to determine the right rules, for example:
If mushroom then pepperoni.

Figure 9.9 Finding association rules has these basic steps.
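
The steps in Figure 9.9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five orders are invented toy data, the helper names `probabilities` and `confidence` are hypothetical, and only single items and pairs are counted. Confidence is taken here in its usual sense, the joint probability divided by the probability of the "if" item.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set holds the toppings on one pizza order.
orders = [
    {"mushroom", "pepperoni"},
    {"mushroom", "pepperoni", "extra cheese"},
    {"pepperoni"},
    {"extra cheese"},
    {"mushroom", "extra cheese"},
]

def probabilities(orders, min_support=0.0):
    """Return P(item) and P(pair) as fractions of all orders.

    Items and pairs whose probability falls below min_support are dropped.
    """
    n = len(orders)
    single, pair = Counter(), Counter()
    for order in orders:
        single.update(order)
        pair.update(frozenset(p) for p in combinations(sorted(order), 2))
    p_single = {i: c / n for i, c in single.items() if c / n >= min_support}
    p_pair = {k: c / n for k, c in pair.items() if c / n >= min_support}
    return p_single, p_pair

def confidence(p_single, p_pair, if_item, then_item):
    """P(then_item given if_item) = P(both) / P(if_item)."""
    both = p_pair.get(frozenset((if_item, then_item)), 0.0)
    return both / p_single[if_item]

p_single, p_pair = probabilities(orders)
# Confidence of the rule "If mushroom then pepperoni" on the toy data.
print(confidence(p_single, p_pair, "mushroom", "pepperoni"))
```

In this toy data, mushroom appears in three of the five orders and mushroom with pepperoni in two, so the rule holds for two out of every three mushroom orders.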

Choosing the Right Set of Items
The data used for finding association rules is typically the detailed transaction
data captured at the point of sale. Gathering and using this data is a critical
part of applying market basket analysis, depending crucially on the items chosen
for analysis. What constitutes a particular item depends on the business
need. Within a grocery store where there are tens of thousands of products
on the shelves, a frozen pizza might be considered an item for analysis
purposes, regardless of its toppings (extra cheese, pepperoni, or mushrooms),
its crust (extra thick, whole wheat, or white), or its size. So, the purchase of a
large whole wheat vegetarian pizza contains the same “frozen pizza” item as
the purchase of a single-serving pepperoni pizza with extra cheese. A sample of
such transactions at this summarized level might look like Table 9.3.
304 Chapter 9

Table 9.3 Transactions with More Summarized Items


[Table body: five transactions, numbered 1 through 5; the item columns are missing from this copy.]

On the other hand, the manager of frozen foods or a chain of pizza restaurants
may be very interested in the particular combinations of toppings that
are ordered. He or she might decompose a pizza order into constituent parts,
as shown in Table 9.4.
At some later point in time, the grocery store may become interested in having
more detail in its transactions, so the single “frozen pizza” item would no
longer be sufficient. Or, the pizza restaurants might broaden their menu
choices and become less interested in all the different toppings. The items of
interest may change over time. This can pose a problem when trying to use
historical data if different levels of detail have been removed.
Choosing the right level of detail is a critical consideration for the analysis.
If the transaction data in the grocery store keeps track of every type, brand,
and size of frozen pizza (which probably account for several dozen
products), then all these items need to map up to the “frozen pizza” item for
analysis.

Table 9.4 Transactions with More Detailed Items


[Table body: five transactions, numbered 1 through 5; the item columns are missing from this copy.]
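
The difference between the summarized and the detailed view of the same orders can be sketched as follows. The order records, the "cola" item, and the function name `to_baskets` are hypothetical, chosen only to show how one source record can yield items at either level.

```python
# Hypothetical order lines: (transaction id, product, toppings).
order_lines = [
    (1, "pizza", ["pepperoni", "extra cheese"]),
    (1, "cola", []),
    (2, "pizza", ["mushrooms"]),
]

def to_baskets(order_lines, detailed):
    """Group order lines into one item set per transaction.

    detailed=False keeps one item per product line.
    detailed=True replaces a product that has toppings with its toppings.
    """
    baskets = {}
    for tid, product, toppings in order_lines:
        items = toppings if detailed and toppings else [product]
        baskets.setdefault(tid, set()).update(items)
    return baskets

summarized = to_baskets(order_lines, detailed=False)
by_topping = to_baskets(order_lines, detailed=True)
print(summarized)
print(by_topping)
```

Keeping the raw order lines and deriving the baskets on demand is one way to avoid the problem noted above, where historical data becomes hard to use once a level of detail has been discarded.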

Product Hierarchies Help to Generalize Items
In the real world, items have product codes and stock-keeping unit codes
(SKUs) that fall into hierarchical categories (see Figure 9.10), called a product
hierarchy or taxonomy. What level of the product hierarchy is the right one to
use? This brings up issues such as:
Are large fries and small fries the same product?
Is the brand of ice cream more relevant than its flavor?
Which is more important: the size, style, pattern, or designer of clothing?
Is the energy-saving option on a large appliance indicative of customer behavior?

[Figure: Partial Product Taxonomy, running from more general (top) to more detailed (bottom).
Top level: Frozen Desserts, Frozen Vegetables, Frozen Dinners.
Next level: Frozen Yogurt, Ice Cream, Frozen Fruit Bars; Peas, Carrots, Mixed, Other.
Next level: Rocky Road, Chocolate, Cherry Garcia, Strawberry, Vanilla, Other.
Lowest level: brands, sizes, and stock keeping units (SKUs).]

Figure 9.10 Product hierarchies start with the most general and move to increasing detail.

The number of combinations to consider grows very fast as the number of
items used in the analysis increases. This suggests using items from higher levels
of the product hierarchy, such as “frozen desserts” instead of “ice cream.” On the
other hand, the more specific the items are, the more likely the results are to be
actionable. Knowing what sells with a particular brand of frozen pizza, for
instance, can help in managing the relationship with the manufacturer. One
compromise is to use more general items initially, then to repeat the rule
generation to home in on more specific items. As the analysis focuses on more
specific items, use only the subset of transactions containing those items.
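
To make the growth in combinations concrete, the short sketch below counts how many distinct item combinations of each size exist for a given number of items. The item counts are illustrative, and `count_combinations` is a hypothetical helper built on the standard binomial coefficient (`math.comb`, Python 3.8 or later).

```python
from math import comb

def count_combinations(n_items, max_size):
    """Number of distinct combinations of each size from 1 to max_size."""
    return {k: comb(n_items, k) for k in range(1, max_size + 1)}

for n_items in (100, 1000, 10000):
    print(n_items, count_combinations(n_items, 3))
```

Going from 100 items to 1,000 items raises the number of possible pairs from 4,950 to 499,500 and the number of possible triples from 161,700 to 166,167,000, which is why rolling items up to a higher level of the hierarchy pays off so quickly.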
The complexity of a rule refers to the number of items it contains. The more
items in the transactions, the longer it takes to generate rules of a given complexity.
So, the desired complexity of the rules also determines how specific or
general the items should be. In some circumstances, customers do not make
large purchases. For instance, customers purchase relatively few items at any
one time at a convenience store or through some catalogs, so looking for rules
containing four or more items may apply to very few transactions and be a
wasted effort. In other cases, such as in supermarkets, the average transaction
is larger, so more complex rules are useful.
Moving up the product hierarchy reduces the number of items. Dozens or
hundreds of items may be reduced to a single generalized item, often corresponding
to a single department or product line. An item like a pint of Ben &
Jerry’s Cherry Garcia gets generalized to “ice cream” or “frozen foods.”
Instead of investigating “orange juice,” investigate “fruit juices,” and so on.
Often, the appropriate level of the hierarchy ends up matching a department
with a product-line manager; so using categories has the practical effect of
finding interdepartmental relationships. Generalized items also help find
rules with sufficient support. Many more transactions support an item at a
higher level of the taxonomy than an item at a lower level.
Just because some items are generalized does not mean that all items need
to move up to the same level. The appropriate level depends on the item, on its
importance for producing actionable results, and on its frequency in the data.
For instance, in a department store, big-ticket items (such as appliances) might
stay at a low level in the hierarchy, while less-expensive items (such as books)
might be higher. This hybrid approach is also useful when looking at individual
products. Since there are often thousands of products in the data, generalize
everything other than the product or products of interest.
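
A minimal sketch of this hybrid approach follows, assuming the hierarchy is available as a child-to-parent mapping. The `PARENT` table, the item names, and the functions `roll_up` and `generalize` are hypothetical; a real product hierarchy would come from the retailer's own product master data.

```python
# Hypothetical slice of a product hierarchy, stored as child -> parent.
PARENT = {
    "Cherry Garcia pint": "ice cream",
    "Rocky Road pint": "ice cream",
    "ice cream": "frozen desserts",
    "frozen yogurt": "frozen desserts",
    "frozen desserts": "frozen foods",
}

def roll_up(item, level_names, keep=frozenset()):
    """Climb the hierarchy until the item's name is in level_names.

    Items listed in keep are products of interest and are left unchanged.
    Items with no ancestor in level_names stop at the top of the hierarchy.
    """
    if item in keep:
        return item
    while item not in level_names and item in PARENT:
        item = PARENT[item]
    return item

def generalize(basket, level_names, keep=frozenset()):
    """Apply roll_up to every item in a basket."""
    return {roll_up(item, level_names, keep) for item in basket}

basket = {"Cherry Garcia pint", "Rocky Road pint", "frozen yogurt"}
hybrid = generalize(basket, {"frozen desserts"}, keep={"Cherry Garcia pint"})
print(hybrid)
```

Here the product of interest stays at the most detailed level while everything else in the basket collapses into a single "frozen desserts" item.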

TIP Market basket analysis produces the best results when the items occur in
roughly the same number of transactions in the data. This helps prevent rules
from being dominated by the most common items. Product hierarchies can help
here. Roll up rare items to higher levels in the hierarchy, so they become more
frequent. More common items may not have to be rolled up at all.
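
One way to act on this tip is to roll up only the items that are too rare, repeating until every remaining item either meets a minimum count or has no parent left. The sketch below assumes an acyclic child-to-parent mapping; the threshold, the baskets, and the name `roll_up_rare` are hypothetical.

```python
from collections import Counter

def roll_up_rare(baskets, parent, min_count):
    """Replace items found in fewer than min_count baskets with their parent.

    Repeats until no rare item has a parent left to move to. Common items
    are never rolled up.
    """
    baskets = [set(b) for b in baskets]
    while True:
        counts = Counter(item for b in baskets for item in b)
        rare = {i for i, c in counts.items() if c < min_count and i in parent}
        if not rare:
            return baskets
        baskets = [{parent[i] if i in rare else i for i in b} for b in baskets]

# Hypothetical hierarchy and baskets.
parent = {"Cherry Garcia": "ice cream", "Rocky Road": "ice cream"}
baskets = [
    {"orange juice", "Cherry Garcia"},
    {"orange juice", "Rocky Road"},
    {"orange juice"},
]
balanced = roll_up_rare(baskets, parent, min_count=2)
print(balanced)
```

Each flavor appears only once, so both roll up to "ice cream", which then appears in two baskets; "orange juice" is already common and stays as it is.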

Virtual Items Go beyond the Product Hierarchy
The purpose of virtual items is to enable the analysis to take advantage of information
that goes beyond the product hierarchy. Virtual items do not appear in
the product hierarchy of the original items, because they cross product boundaries.
Examples of virtual items might be designer labels such as Calvin Klein
that appear in both apparel departments and perfumes, low-fat and no-fat
products in a grocery store, and energy-saving options on appliances.
Virtual items may even include information about the transactions themselves,
such as whether the purchase was made with cash, a credit card, or
a check, and the day of the week or the time of day the transaction occurred.
However, it is not a good idea to crowd the data with too many virtual items.
Only include virtual items when you have some idea of how they could result in
actionable information if found in well-supported, high-confidence association rules.
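
As a rough sketch, virtual items can be added to each basket before rule generation. The angle-bracket naming, the function `add_virtual_items`, and the example basket are assumptions made for illustration; which virtual items are worth adding depends on the business question.

```python
from datetime import date

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")

def add_virtual_items(items, payment, purchased_on, low_fat_items=frozenset()):
    """Return a copy of the basket augmented with virtual items.

    Adds one virtual item for the payment type, one for the day of the
    week, and one flag if any purchased item is in low_fat_items.
    """
    augmented = set(items)
    augmented.add(f"<paid by {payment}>")
    augmented.add(f"<{DAYS[purchased_on.weekday()]}>")
    if set(items) & set(low_fat_items):
        augmented.add("<low-fat product>")
    return augmented

# Hypothetical basket; 6 January 2024 fell on a Saturday.
augmented_basket = add_virtual_items(
    {"skim milk", "bread"}, "cash", date(2024, 1, 6),
    low_fat_items={"skim milk"},
)
print(sorted(augmented_basket))
```

Each virtual item adds to the number of combinations to consider, which is one more reason to include only those that could lead to actionable rules.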
There is a danger, though. Virtual items can cause trivial rules. For instance,
imagine that there is a virtual item for “diet product” and one for “coke
product.” A rule like the following might then appear:
If “coke product” and “diet product” then “diet coke”
That is, the rule says that wherever <Coke> and <Diet Product> both appear
in a basket, <Diet Coke> appears as well. Every basket that has Diet
Coke satisfies this rule. Although some baskets may have regular Coke and
other diet products, the rule will have high lift because it is the definition of
Diet Coke. When using virtual items, it is worth checking and rechecking the
rules to be sure that such trivial rules are not arising.
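
Part of that checking can be automated when the definitions of the virtual items are on hand. The sketch below flags a rule as trivial when every item on its left-hand side is a virtual item derived from the product on its right-hand side. The mapping `VIRTUAL_ITEMS_OF` and the function `is_trivial` are hypothetical, and this test covers only that one pattern.

```python
# Hypothetical definitions: product -> virtual items derived from it.
VIRTUAL_ITEMS_OF = {
    "diet coke": {"coke product", "diet product"},
    "coke": {"coke product"},
}

def is_trivial(antecedent, consequent, virtual_items_of):
    """True if the whole left-hand side follows from the right-hand item.

    Such a rule only restates how the virtual items were defined.
    """
    derived = virtual_items_of.get(consequent, set())
    return bool(antecedent) and set(antecedent) <= derived

print(is_trivial({"coke product", "diet product"}, "diet coke", VIRTUAL_ITEMS_OF))
print(is_trivial({"coke product", "diet product"}, "pretzels", VIRTUAL_ITEMS_OF))
```

The first rule is flagged because both left-hand items are produced by Diet Coke itself; the second is not, since pretzels define neither virtual item.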
A similar but more subtle danger occurs when the right-hand side does not
include the associated item. So, a rule like:
If “coke product” and “diet product” then “pretzels”
probably means,
If “diet coke” then “pretzels”
The only danger from having such rules is that they can obscure what is happening.

TIP When applying market basket analysis, it is useful to have a hierarchical
taxonomy of the items being considered for analysis. When the levels of the
hierarchy are chosen carefully, the generalized items occur about the
same number of times in the data, which improves the results of the analysis. For
specific lifestyle-related choices that provide insight into customer behavior, such
as sugar-free items and specific brands, augment the data with virtual items.