. 63
( 137 .)


Data Quality
The data used for market basket analysis is generally not of very high quality.
It is gathered directly at the point of customer contact and used mainly for
operational purposes such as inventory control. The data is likely to have mul­
tiple formats, corrections, incompatible code types, and so on. Much of the
explanation of various code values is likely to be buried deep in programming
code running in legacy systems and may be difficult to extract. Different stores
within a single chain sometimes have slightly different product hierarchies or
different ways of handling situations like discounts.
Here is an example. The authors were once curious about the approximately
80 department codes present in a large set of transaction data. The client
assured us that there were 40 departments and provided a nice description of
each of them. More careful inspection revealed the problem. Some stores had
IBM cash registers and others had NCR. The two types of equipment had dif­
ferent ways of representing department codes”hence we saw many invalid
codes in the data.
These kinds of problems are typical when using any sort of data for data min­
ing. However, they are exacerbated for market basket analysis because this type
of analysis depends heavily on the unsummarized point-of-sale transactions.

Anonymous versus Identified
Market basket analysis has proven useful for mass-market retail, such as
supermarkets, convenience stores, drug stores, and fast food chains, where
many of the purchases have traditionally been made with cash. Cash transac­
tions are anonymous, meaning that the store has no knowledge about specific
customers because there is no information identifying the customer in the
transaction. For anonymous transactions, the only information is the date and
time, the location of the store, the cashier, the items purchased, any coupons
redeemed, and the amount of change. With market basket analysis, even this
limited data can yield interesting and actionable results.
The increasing prevalence of Web transactions, loyalty programs, and pur­
chasing clubs is resulting in more and more identified transactions, providing
analysts with more possibilities for information about customers and their
behavior over time. Demographic and trending information is available on
individuals and households to further augment customer profiles. This addi­
tional information can be incorporated into association rule analysis using vir­
tual items.

Generating Rules from All This Data
Calculating the number of times that a given combination of items appears in
the transaction data is well and good, but a combination of items is not a rule.
Market Basket Analysis and Association Rules 309

Sometimes, just the combination is interesting in itself, as in the Barbie doll
and candy bar example. But in other circumstances, it makes more sense to
find an underlying rule of the form:
if condition, then result.
Notice that this is just shorthand. If the rule says,
if Barbie doll, then candy bar
then we read it as: “if a customer purchases a Barbie doll, then the customer is
also expected to purchase a candy bar.” The general practice is to consider
rules where there is just one item on the right-hand side.

Calculating Confidence
Constructs such as the co-occurrence table provide information about which
combinations of items occur most commonly in the transactions. For the sake
of illustration, let™s say that the most common combination has three items, A,
B, and C. Table 9.5 provides an example, showing the probabilities that items
and various combinations are purchased.
The only rules to consider are those with all three items in the rule and with
exactly one item in the result:
If A and B, then C

If A and C, then B

If B and C, then A

Because these three rules contain the same items, they have the same sup­
port in the data, 5 percent. What about their confidence level? Confidence is
the ratio of the number of transactions with all the items in the rule to the num­
ber of transactions with just the items in the condition. The confidence for the
three rules is shown in Table 9.6.

Table 9.5 Probabilities of Three Items and Their Combinations


A 45.0 %

B 42.5%
C 40.0%

A and B 25.0 %

A and C 20.0 %

B and C 15.0%

A and B and C 5.0%
310 Chapter 9

Table 9.6 Confidence in Rules


If A 25% 5% 0.20
and B
then C

If A 20% 5% 0.25
and C
then B

If B 15% 5% 0.33
and C
then A

What is confidence really saying? Saying that the rule “if B and C then A” has
a confidence of 0.33 is equivalent to saying that when B and C appear in a
transaction, there is a 33 percent chance that A also appears in it. That is, one
time in three A occurs with B and C, and the other two times, B and C appear
without A. The most confident rule is the best rule, so the best rule is “if B and
C then A.”

Calculating Lift
As described earlier, lift is a good measure of how much better the rule is
doing. It is the ratio of the density of the target (using the left hand side of the
rule) to density of the target overall. So the formula is:
lift = (p(condition and result) / p (condition) ) / p(result)

= p(condition and result) / (p(condition) p(result))

When lift is greater than 1, then the resulting rule is better at predicting the
result than guessing whether the resultant item is present based on item fre­
quencies in the data. When lift is less than 1, the rule is doing worse than
informed guessing. The following table (Table 9.7) shows the lift for the three
rules and for the rule with the best lift.
None of the rules with three items shows improved lift. The best rule in the
data actually only has two items. When “A” is purchased, then “B” is 31 per­
cent more likely to be in the transaction than if “A” is not purchased. In this
case, as in many cases, the best rule actually contains fewer items than other
rules being considered.
Market Basket Analysis and Association Rules 311

Table 9.7 Lift Measurements for Four Rules


If A 5% 0.20 40% 0.50
and B
then C

If A 5% 0.25 42.5% 0.59
and C
then B

If B 5% 0.33 45% 0.74
and C
then A

If A 25% 0.59 42.5% 1.31
then B

The Negative Rule
When lift is less than 1, negating the result produces a better rule. If the rule
if B and C then A
has a confidence of 0.33, then the rule
if B and C then NOT A
has a confidence of 0.67. Since A appears in 45 percent of the transactions, it
does NOT occur in 55 percent of them. Applying the same lift measure shows
that the lift of this new rule is 1.22 (0.67/0.55), resulting in a lift of 1.33, better
than any of the other rules.

Overcoming Practical Limits
Generating association rules is a multistep process. The general algorithm is:
1. Generate the co-occurrence matrix for single items.
2. Generate the co-occurrence matrix for two items. Use this to find rules
with two items.
3. Generate the co-occurrence matrix for three items. Use this to find rules
with three items.
4. And so on.
312 Chapter 9

For instance, in the grocery store that sells orange juice, milk, detergent,
soda, and window cleaner, the first step calculates the counts for each of these
items. During the second step, the following counts are created:
Milk and detergent, milk and soda, milk and cleaner

Detergent and soda, detergent and cleaner

Soda and cleaner

This is a total of 10 pairs of items. The third pass takes all combinations of
three items and so on. Of course, each of these stages may require a separate
pass through the data or multiple stages can be combined into a single pass by
considering different numbers of combinations at the same time.
Although it is not obvious when there are just five items, increasing the

number of items in the combinations requires exponentially more computa­

tion. This results in exponentially growing run times”and long, long waits
when considering combinations with more than three or four items. The solu­
tion is pruning. Pruning is a technique for reducing the number of items and
combinations of items being considered at each step. At each stage, the algo­
rithm throws out a certain number of combinations that do not meet some
threshold criterion.

The most common pruning threshold is called minimum support pruning.
Support refers to the number of transactions in the database where the rule
holds. Minimum support pruning requires that a rule hold on a minimum
number of transactions. For instance, if there are one million transactions and
the minimum support is 1 percent, then only rules supported by 10,000 trans­
actions are of interest. This makes sense, because the purpose of generating
these rules is to pursue some sort of action”such as striking a deal with
Mattel (the makers of Barbie dolls) to make a candy-bar-eating doll”and the
action must affect enough transactions to be worthwhile.
The minimum support constraint has a cascading effect. Consider a rule
with four items in it:
if A, B, and C, then D.
Using minimum support pruning, this rule has to be true on at least 10,000
transactions in the data. It follows that:
A must appear in at least 10,000 transactions, and,

B must appear in at least 10,000 transactions, and,

C must appear in at least 10,000 transactions, and,

D must appear in at least 10,000 transactions.

Market Basket Analysis and Association Rules 313

In other words, minimum support pruning eliminates items that do not
appear in enough transactions. The threshold criterion applies to each step in
the algorithm. The minimum threshold also implies that:
A and B must appear together in at least 10,000 transactions, and,
A and C must appear together in at least 10,000 transactions, and,
A and D must appear together in at least 10,000 transactions,
and so on.
Each step of the calculation of the co-occurrence table can eliminate combi­
nations of items that do not meet the threshold, reducing its size and the num­
ber of combinations to consider during the next pass.
Figure 9.11 is an example of how the calculation takes place. In this example,
choosing a minimum support level of 10 percent would eliminate all the com­
binations with three items”and their associated rules”from consideration.
This is an example where pruning does not have an effect on the best rule since


. 63
( 137 .)