fairly common, so are not pruned individually. If anchovies were included in

the analysis”and there are only 15 pizzas containing them out of the 2,000”

then a minimum support of 10 percent, or even 1 percent, would eliminate

anchovies during the first pass.

The best choice for minimum support depends on the data and the situa

tion. It is also possible to vary the minimum support as the algorithm pro

gresses. For instance, using different levels at different stages you can find

uncommon combinations of common items (by decreasing the support level

for successive steps) or relatively common combinations of uncommon items

(by increasing the support level).

The Problem of Big Data

A typical fast food restaurant offers several dozen items on its menu, say 100.

To use probabilities to generate association rules, counts have to be calculated

for each combination of items. The number of combinations of a given size

tends to grow exponentially. A combination with three items might be a small

fries, cheeseburger, and medium Diet Coke. On a menu with 100 items, how

many combinations are there with three different menu items? There are

161,700! This calculation is based on the binomial formula On the other hand,

a typical supermarket has at least 10,000 different items in stock, and more typ

ically 20,000 or 30,000.

314 Chapter 9

A pizza restaurant has sold 2000 pizzas, of which:

100 are mushroom only, 150 are pepperoni, 200 are extra cheese

400 are mushroom and pepperoni, 300 are mushroom and extra cheese, 200 are pepperoni and extra cheese

100 are mushroom, pepperoni, and extra cheese.

550 have no extra toppings.

We need to calculate the probabilities for all possible combinations of items.

100 + 400 + 300 + 100 = 900 pizzas or 45%

The works

Mushroom and pepperoni

Just mushroom Mushroom and extra cheese

150 + 400 + 200 + 100 = 850 pizzas or 42.5%

200 + 300 + 200 + 100 = 800 pizzas or 40%

400 + 100 = 500 pizzas or 25%

300 + 100 = 400 pizzas or 20%

200 + 100 = 300 pizzas or 15%

100 pizzas or 5%

There are three rules with all three items:

Support = 5%

Confidence = 5% divided by 25% = 0.2

Lift = 20%(100/500) divided by 40%(800/2000) = 0.5

Support = 5%

Confidence = 5% divided by 20% = 0.25

Lift = 25%(100/400) divided by 42.5%(850/2000) = 0.588

Support = 5%

Confidence = 5% divided by 15% = 0.333

Lift = 33.3%(100/300) divided by 45%(900/2000) = 0.74

Support = 25%

The best rule has Confidence = 25% divided by 42.5% = 0.588

only two items: Lift = 55.6%(500/900) divided by 43.5%(200/850) = 1.31

Figure 9.11 This example shows how to count up the frequencies on pizza sales for

market basket analysis.

Calculating the support, confidence, and lift quickly gets out of hand as the

number of items in the combinations grows. There are almost 50 million pos-

sible combinations of two items in the grocery store and over 100 billion com-

binations of three items. Although computers are getting more powerful and

Market Basket Analysis and Association Rules 315

cheaper, it is still very time-consuming to calculate the counts for this number

of combinations. Calculating the counts for five or more items is prohibitively

expensive. The use of product hierarchies reduces the number of items to a

manageable size.

The number of transactions is also very large. In the course of a year, a

decent-size chain of supermarkets will generate tens or hundreds of millions

of transactions. Each of these transactions consists of one or more items, often

several dozen at a time. So, determining if a particular combination of items is

present in a particular transaction may require a bit of effort”multiplied a

million-fold for all the transactions.

Extending the Ideas

The basic ideas of association rules can be applied to different areas, such as

comparing different stores and making some enhancements to the definition

of the rules. These are discussed in this section.

Using Association Rules to Compare Stores

Market basket analysis is commonly used to make comparisons between loca

tions within a single chain. The rule about toilet bowl cleaner sales in hardware

stores is an example where sales at new stores are compared to sales at existing

stores. Different stores exhibit different selling patterns for many reasons:

regional trends, the effectiveness of management, dissimilar advertising, and

varying demographic patterns in the catchment area, for example. Air condi

tioners and fans are often purchased during heat waves, but heat waves affect

only a limited region. Within smaller areas, demographics of the catchment

area can have a large impact; we would expect stores in wealthy areas to exhibit

different sales patterns from those in poorer neighborhoods. These are exam

ples where market basket analysis can help to describe the differences and

serve as an example of using market basket analysis for directed data mining.

How can association rules be used to make these comparisons? The first

step is augmenting the transactions with virtual items that specify which

group, such as an existing location or a new location, that the transaction

comes from. Virtual items help describe the transaction, although the virtual

item is not a product or service. For instance, a sale at an existing hardware

store might include the following products:

A hammer

––

A box of nails

––

Extra-fine sandpaper

––

316 Chapter 9

T I P Adding virtual transactions in to the market basket data makes it possible

to find rules that include store characteristics and customer characteristics.

After augmenting the data to specify where it came from, the transaction

looks like:

a hammer,

a box of nails,

extra fine sandpaper,

“at existing hardware store.”

To compare sales at store openings versus existing stores, the process is:

1. Gather data for a specific period (such as 2 weeks) from store openings.

Augment each of the transactions in this data with a virtual item saying

that the transaction is from a store opening.

2. Gather about the same amount of data from existing stores. Here you

might use a sample across all existing stores, or you might take all the

data from stores in comparable locations. Augment the transactions in

this data with a virtual item saying that the transaction is from an exist

ing store.

3. Apply market basket analysis to find association rules in each set.

4. Pay particular attention to association rules containing the virtual items.

Because association rules are undirected data mining, the rules act as start

ing points for further hypothesis testing. Why does one pattern exist at exist

ing stores and another at new stores? The rule about toilet bowl cleaners and

store openings, for instance, suggests looking more closely at toilet bowl

cleaner sales in existing stores at different times during the year.

Using this technique, market basket analysis can be used for many other

types of comparisons:

Sales during promotions versus sales at other times

––

Sales in various geographic areas, by county, standard statistical metro

––

politan area (SSMA), direct marketing area (DMA), or country

Urban versus suburban sales

––

Seasonal differences in sales patterns

––

Adding virtual items to each basket of goods enables the standard associa

tion rule techniques to make these comparisons.

Market Basket Analysis and Association Rules 317

Dissociation Rules

A dissociation rule is similar to an association rule except that it can have the

connector “and not” in the condition in addition to “and.” A typical dissocia

tion rule looks like:

if A and not B, then C.

Dissociation rules can be generated by a simple adaptation of the basic mar

ket basket analysis algorithm. The adaptation is to introduce a new set of items

that are the inverses of each of the original items. Then, modify each transaction

so it includes an inverse item if, and only if, it does not contain the original item.

For example, Table 9.8 shows the transformation of a few transactions. The ¬

before the item denotes the inverse item.

There are three downsides to including these new items. First, the total

number of items used in the analysis doubles. Since the amount of computa

tion grows exponentially with the number of items, doubling the number of

items seriously degrades performance. Second, the size of a typical transaction

grows because it now includes inverted items. The third issue is that the fre

quency of the inverse items tends to be much larger than the frequency of the

original items. So, minimum support constraints tend to produce rules in

which all items are inverted, such as:

if NOT A and NOT B then NOT C.

These rules are less likely to be actionable.

Sometimes it is useful to invert only the most frequent items in the set used

for analysis. This is particularly valuable when the frequency of some of the

original items is close to 50 percent, so the frequencies of their inverses are also

close to 50 percent.

Table 9.8 Transformation of Transactions to Generate Dissociation Rules

CUSTOMER ITEMS CUSTOMER WITH INVERSE ITEMS

1 {A, B, C} 1 {A, B, C}

2 {A} 2 {A, ¬B, ¬C}

3 {A, C} 3 {A, ¬B, C}

4 {A} 4 {A, ¬B, ¬C}

5 {} 5 {¬A, ¬B, ¬C}

318 Chapter 9

Sequential Analysis Using Association Rules

Association rules find things that happen at the same time”what items are

purchased at a given time. The next natural question concerns sequences of

events and what they mean. Examples of results in this area are:

New homeowners purchase shower curtains before purchasing furniture.

––

Customers who purchase new lawnmowers are very likely to purchase

––

a new garden hose in the following 6 weeks.

When a customer goes into a bank branch and asks for an account rec

––

onciliation, there is a good chance that he or she will close all his or her

accounts.

Time-series data usually requires some way of identifying the customer

over time. Anonymous transactions cannot reveal that new homeowners buy

shower curtains before they buy furniture. This requires tracking each cus

tomer, as well as knowing which customers recently purchased a home. Since

larger purchases are often made with credit cards or debit cards, this is less of

a problem. For problems in other domains, such as investigating the effects of

medical treatments or customer behavior inside a bank, all transactions typi

cally include identity information.

WA R N I N G In order to consider time-series analyses on your customers,

there has to be some way of identifying customers. Without a way of tracking

individual customers, there is no way to analyze their behavior over time.

For the purposes of this section, a time series is an ordered sequence of items.

It differs from a transaction only in being ordered. In general, the time series

contains identifying information about the customer, since this information is

used to tie the different transactions together into a series. Although there are

many techniques for analyzing time series, such as ARIMA (a statistical tech

nique) and neural networks, this section discusses only how to manipulate the

time-series data to apply the market basket analysis.

In order to use time series, the transaction data must have two additional

features: