<<

. 64
( 137 .)



>>

the best rule has only two items. In the case of pizza, these toppings are all
fairly common, so are not pruned individually. If anchovies were included in
the analysis”and there are only 15 pizzas containing them out of the 2,000”
then a minimum support of 10 percent, or even 1 percent, would eliminate
anchovies during the first pass.
The best choice for minimum support depends on the data and the situa­
tion. It is also possible to vary the minimum support as the algorithm pro­
gresses. For instance, using different levels at different stages you can find
uncommon combinations of common items (by decreasing the support level
for successive steps) or relatively common combinations of uncommon items
(by increasing the support level).


The Problem of Big Data
A typical fast food restaurant offers several dozen items on its menu, say 100.
To use probabilities to generate association rules, counts have to be calculated
for each combination of items. The number of combinations of a given size
tends to grow exponentially. A combination with three items might be a small
fries, cheeseburger, and medium Diet Coke. On a menu with 100 items, how
many combinations are there with three different menu items? There are
161,700! This calculation is based on the binomial formula On the other hand,
a typical supermarket has at least 10,000 different items in stock, and more typ­
ically 20,000 or 30,000.
314 Chapter 9

A pizza restaurant has sold 2000 pizzas, of which:
100 are mushroom only, 150 are pepperoni, 200 are extra cheese
400 are mushroom and pepperoni, 300 are mushroom and extra cheese, 200 are pepperoni and extra cheese
100 are mushroom, pepperoni, and extra cheese.
550 have no extra toppings.

We need to calculate the probabilities for all possible combinations of items.

100 + 400 + 300 + 100 = 900 pizzas or 45%
The works
Mushroom and pepperoni
Just mushroom Mushroom and extra cheese

150 + 400 + 200 + 100 = 850 pizzas or 42.5%


200 + 300 + 200 + 100 = 800 pizzas or 40%


400 + 100 = 500 pizzas or 25%


300 + 100 = 400 pizzas or 20%


200 + 100 = 300 pizzas or 15%


100 pizzas or 5%




There are three rules with all three items:

Support = 5%
Confidence = 5% divided by 25% = 0.2
Lift = 20%(100/500) divided by 40%(800/2000) = 0.5


Support = 5%
Confidence = 5% divided by 20% = 0.25
Lift = 25%(100/400) divided by 42.5%(850/2000) = 0.588


Support = 5%
Confidence = 5% divided by 15% = 0.333
Lift = 33.3%(100/300) divided by 45%(900/2000) = 0.74

Support = 25%
The best rule has Confidence = 25% divided by 42.5% = 0.588
only two items: Lift = 55.6%(500/900) divided by 43.5%(200/850) = 1.31
Figure 9.11 This example shows how to count up the frequencies on pizza sales for
market basket analysis.


Calculating the support, confidence, and lift quickly gets out of hand as the
number of items in the combinations grows. There are almost 50 million pos-
sible combinations of two items in the grocery store and over 100 billion com-
binations of three items. Although computers are getting more powerful and
Market Basket Analysis and Association Rules 315


cheaper, it is still very time-consuming to calculate the counts for this number
of combinations. Calculating the counts for five or more items is prohibitively
expensive. The use of product hierarchies reduces the number of items to a
manageable size.
The number of transactions is also very large. In the course of a year, a
decent-size chain of supermarkets will generate tens or hundreds of millions
of transactions. Each of these transactions consists of one or more items, often
several dozen at a time. So, determining if a particular combination of items is
present in a particular transaction may require a bit of effort”multiplied a
million-fold for all the transactions.


Extending the Ideas
The basic ideas of association rules can be applied to different areas, such as
comparing different stores and making some enhancements to the definition
of the rules. These are discussed in this section.


Using Association Rules to Compare Stores
Market basket analysis is commonly used to make comparisons between loca­
tions within a single chain. The rule about toilet bowl cleaner sales in hardware
stores is an example where sales at new stores are compared to sales at existing
stores. Different stores exhibit different selling patterns for many reasons:
regional trends, the effectiveness of management, dissimilar advertising, and
varying demographic patterns in the catchment area, for example. Air condi­
tioners and fans are often purchased during heat waves, but heat waves affect
only a limited region. Within smaller areas, demographics of the catchment
area can have a large impact; we would expect stores in wealthy areas to exhibit
different sales patterns from those in poorer neighborhoods. These are exam­
ples where market basket analysis can help to describe the differences and
serve as an example of using market basket analysis for directed data mining.
How can association rules be used to make these comparisons? The first
step is augmenting the transactions with virtual items that specify which
group, such as an existing location or a new location, that the transaction
comes from. Virtual items help describe the transaction, although the virtual
item is not a product or service. For instance, a sale at an existing hardware
store might include the following products:
A hammer
––

A box of nails
––

Extra-fine sandpaper
––
316 Chapter 9


T I P Adding virtual transactions in to the market basket data makes it possible
to find rules that include store characteristics and customer characteristics.

After augmenting the data to specify where it came from, the transaction
looks like:
a hammer,

a box of nails,

extra fine sandpaper,

“at existing hardware store.”

To compare sales at store openings versus existing stores, the process is:
1. Gather data for a specific period (such as 2 weeks) from store openings.
Augment each of the transactions in this data with a virtual item saying
that the transaction is from a store opening.
2. Gather about the same amount of data from existing stores. Here you
might use a sample across all existing stores, or you might take all the
data from stores in comparable locations. Augment the transactions in
this data with a virtual item saying that the transaction is from an exist­
ing store.
3. Apply market basket analysis to find association rules in each set.
4. Pay particular attention to association rules containing the virtual items.
Because association rules are undirected data mining, the rules act as start­
ing points for further hypothesis testing. Why does one pattern exist at exist­
ing stores and another at new stores? The rule about toilet bowl cleaners and
store openings, for instance, suggests looking more closely at toilet bowl
cleaner sales in existing stores at different times during the year.
Using this technique, market basket analysis can be used for many other
types of comparisons:
Sales during promotions versus sales at other times
––

Sales in various geographic areas, by county, standard statistical metro­
––

politan area (SSMA), direct marketing area (DMA), or country

Urban versus suburban sales

––

Seasonal differences in sales patterns

––


Adding virtual items to each basket of goods enables the standard associa­
tion rule techniques to make these comparisons.
Market Basket Analysis and Association Rules 317


Dissociation Rules
A dissociation rule is similar to an association rule except that it can have the
connector “and not” in the condition in addition to “and.” A typical dissocia­
tion rule looks like:
if A and not B, then C.
Dissociation rules can be generated by a simple adaptation of the basic mar­
ket basket analysis algorithm. The adaptation is to introduce a new set of items
that are the inverses of each of the original items. Then, modify each transaction
so it includes an inverse item if, and only if, it does not contain the original item.
For example, Table 9.8 shows the transformation of a few transactions. The ¬
before the item denotes the inverse item.
There are three downsides to including these new items. First, the total
number of items used in the analysis doubles. Since the amount of computa­
tion grows exponentially with the number of items, doubling the number of
items seriously degrades performance. Second, the size of a typical transaction
grows because it now includes inverted items. The third issue is that the fre­
quency of the inverse items tends to be much larger than the frequency of the
original items. So, minimum support constraints tend to produce rules in
which all items are inverted, such as:
if NOT A and NOT B then NOT C.
These rules are less likely to be actionable.
Sometimes it is useful to invert only the most frequent items in the set used
for analysis. This is particularly valuable when the frequency of some of the
original items is close to 50 percent, so the frequencies of their inverses are also
close to 50 percent.

Table 9.8 Transformation of Transactions to Generate Dissociation Rules

CUSTOMER ITEMS CUSTOMER WITH INVERSE ITEMS
1 {A, B, C} 1 {A, B, C}

2 {A} 2 {A, ¬B, ¬C}

3 {A, C} 3 {A, ¬B, C}

4 {A} 4 {A, ¬B, ¬C}

5 {} 5 {¬A, ¬B, ¬C}
318 Chapter 9


Sequential Analysis Using Association Rules

Association rules find things that happen at the same time”what items are
purchased at a given time. The next natural question concerns sequences of
events and what they mean. Examples of results in this area are:
New homeowners purchase shower curtains before purchasing furniture.
––


Customers who purchase new lawnmowers are very likely to purchase
––

a new garden hose in the following 6 weeks.
When a customer goes into a bank branch and asks for an account rec­
––

onciliation, there is a good chance that he or she will close all his or her
accounts.
Time-series data usually requires some way of identifying the customer
over time. Anonymous transactions cannot reveal that new homeowners buy
shower curtains before they buy furniture. This requires tracking each cus­
tomer, as well as knowing which customers recently purchased a home. Since
larger purchases are often made with credit cards or debit cards, this is less of
a problem. For problems in other domains, such as investigating the effects of
medical treatments or customer behavior inside a bank, all transactions typi­
cally include identity information.

WA R N I N G In order to consider time-series analyses on your customers,
there has to be some way of identifying customers. Without a way of tracking
individual customers, there is no way to analyze their behavior over time.

For the purposes of this section, a time series is an ordered sequence of items.
It differs from a transaction only in being ordered. In general, the time series
contains identifying information about the customer, since this information is
used to tie the different transactions together into a series. Although there are
many techniques for analyzing time series, such as ARIMA (a statistical tech­
nique) and neural networks, this section discusses only how to manipulate the
time-series data to apply the market basket analysis.
In order to use time series, the transaction data must have two additional
features:

<<

. 64
( 137 .)



>>