––

tions occurred relative to each other

Identifying information, such as account number, household ID, or cus

––

tomer ID that identifies different transactions as belonging to the same

customer or household (sometimes called an economic marketing unit)

Market Basket Analysis and Association Rules 319

Building sequential rules is similar to the process of building association

rules:

1. All items purchased by a customer are treated as a single order, and

each item retains the timestamp indicating when it was purchased.

2. The process is the same for finding groups of items that appear

together.

3. To develop the rules, only rules where the items on the left-hand side

were purchased before items on the right-hand side are considered.

The result is a set of association rules that can reveal sequential patterns.

Lessons Learned

Market basket data describes what customers purchase. Analyzing this data is

complex, and no single technique is powerful enough to provide all the

answers. The data itself typically describes the market basket at three different

levels. The order is the event of the purchase; the line-items are the items in the

purchase, and the customer connects orders together over time.

Many important questions about customer behavior can be answered by

looking at product sales over time. Which are the best selling items? Which

items that sold well last year are no longer selling well this year? Inventory

curves do not require transaction level data. Perhaps the most important

insight they provide is the effect of marketing interventions”did sales go up

or down after a particular event?

However, inventory curves are not sufficient for understanding relation

ships among items in a single basket. One technique that is quite powerful is

association rules. This technique finds products that tend to sell together in

groups. Sometimes is the groups are sufficient for insight. Other times, the

groups are turned into explicit rules”when certain items are present then we

expect to find certain other items in the basket.

There are three measures of association rules. Support tells how often the

rule is found in the transaction data. Confidence says how often when the “if”

part is true that the “then” part is also true. And, lift tells how much better the

rule is at predicting the “then” part as compared to having no rule at all.

The rules so generated fall into three categories. Useful rules explain a rela

tionship that was perhaps unexpected. Trivial rules explain relationships that

are known (or should be known) to exist. And inexplicable rules simply do not

make sense. Inexplicable rules often have weak support.

320 Chapter 9

Market basket analysis and association rules provide ways to analyze item-

level detail, where the relationships between items are determined by the

baskets they fall into. In the next chapter, we™ll turn to link analysis, which

generalizes the ideas of “items” linked by “relationships,” using the back

ground of an area of mathematics called graph theory.

CHAPTER

10

Link Analysis

The international route maps of British Airways and Air France offer more

than just trip planning help. They also provide insights into the history and

politics of their respective homelands and of lost empires. A traveler bound

from New York to Mombasa changes planes at Heathrow; one bound for

Abidjan changes at Charles de Gaul. The international route maps show how

much information can be gained from knowing how things are connected.

Which Web sites link to which other ones? Who calls whom on the tele

phone? Which physicians prescribe which drugs to which patients? These

relationships are all visible in data, and they all contain a wealth of informa

tion that most data mining techniques are not able to take direct advantage of.

In our ever-more-connected world (where, it has been claimed, there are no

more than six degrees of separation between any two people on the planet),

understanding relationships and connections is critical. Link analysis is the

data mining technique that addresses this need.

Link analysis is based on a branch of mathematics called graph theory. This

chapter reviews the key notions of graphs, then shows how link analysis has

been applied to solve real problems. Link analysis is not applicable to all types

of data nor can it solve all types of problems. However, when it can be used, it

321

322 Chapter 10

often yields very insightful and actionable results. Some areas where it has

yielded good results are:

Identifying authoritative sources of information on the World Wide

––

Web by analyzing the links between its pages

Analyzing telephone call patterns to identify particular market seg

––

ments such as people working from home

Understanding physician referral patterns; a referral is a relationship

––

between two physicians, once again, naturally susceptible to link analysis

Even where links are explicitly recorded, assembling them into a useful

graph can be a data-processing challenge. Links between Web pages are

encoded in the HTML of the pages themselves. Links between telephones

Y

are recorded in call detail records. Neither of these data sources is useful for

FL

link analysis without considerable preprocessing, however. In other cases, the

links are implicit and part of the data mining challenge is to recognize them.

The chapter begins with a brief introduction to graph theory and some of

AM

the classic problems that it has been used to solve. It then moves on to appli

cations in data mining such as search engine rankings and analysis of call

detail records.

TE

Basic Graph Theory

Graphs are an abstraction developed specifically to represent relationships.

They have proven very useful in both mathematics and computer science for

developing algorithms that exploit these relationships. Fortunately, graphs are

quite intuitive, and there is a wealth of examples that illustrate how to take

advantage of them.

A graph consists of two distinct parts:

Nodes (sometimes called vertices) are the things in the graph that have

––

relationships. These have names and often have additional useful

properties.

Edges are pairs of nodes connected by a relationship. An edge is repre

––

sented by the two nodes that it connects, so (A, B) or AB represents the

edge that connects A and B. An edge might also have a weight in a

weighted graph.

Figure 10.1 illustrates two graphs. The graph on the left has four nodes con

nected by six edges and has the property that there is an edge between every

pair of nodes. Such a graph is said to be fully connected. It could be represent

ing daily flights between Atlanta, New York, Cincinnati, and Salt Lake City on

an airline where these four cities serve as regional hubs. It could also represent

Team-Fly®

Link Analysis 323

four people, all of whom know each other, or four mutually related leads for a

criminal investigation. The graph on the right has one node in the center con

nected to four other nodes. This could represent daily flights connecting

Atlanta to Birmingham, Greenville, Charlotte, and Savannah on an airline that

serves the Southeast from a hub in Atlanta, or a restaurant frequented by four

credit card customers. The graph itself captures the information about what is

connected to what. Without any labels, it can describe many different situa

tions. This is the power of abstraction.

A few points of terminology about graphs. Because graphs are so useful for

visualizing relationships, it is nice when the nodes and edges can be drawn

with no intersecting edges. The graphs in Figure 10.2 have this property. They

are planar graphs, since they can be drawn on a sheet of paper (what mathe

maticians call a plane) without having any edges intersect. Figure 10.2 shows

two graphs that cannot be drawn without having at least two edges cross.

There is, in fact, a theorem in graph theory that says that if a graph is nonpla

nar, then lurking inside it is one of the two previously described graphs.

When a path exists between any two nodes in a graph, the graph is said to

be connected. For the rest of this chapter, we assume that all graphs are con

nected, unless otherwise specified. A path, as its name implies, is an ordered

sequence of nodes connected by edges. Consider a graph where each node

represents a city, and the edges are flights between pairs of cities. On such a

graph, a node is a city and an edge is a flight segment”two cities that are con

nected by a nonstop flight. A path is an itinerary of flight segments that go

from one city to another, such as from Greenville, South Carolina to Atlanta,

from Atlanta to Chicago, and from Chicago to Peoria.

A fully connected graph with A graph with five nodes

four nodes and six edges. In and four edges.

a fully connected graph, there

is an edge between every pair

of nodes.

Figure 10.1 Two examples of graphs.

324 Chapter 10

Oops! These edges

intersect.

Three nodes cannot connect A fully-connected graph

to three other nodes without with five nodes must also

two edges crossing over have edges that intersect.

each other.

Figure 10.2 Not all graphs can be drawn without having some edges cross over each other.

Figure 10.3 is an example of a weighted graph, one in which the edges have

weights associated with them. In this case, the nodes represent products pur

chased by customers. The weights on the edges represent the support for the

association, the percentage of market baskets containing both products. Such

graphs provide an approach for solving problems in market basket analysis

and are also a useful means of visualizing market basket data. This product

association graph is an example of an undirected graph. The graph shows that

22.12 percent of market baskets at this health food grocery contain both yellow

peppers and bananas. By itself, this does not explain whether yellow pepper

sales drive banana sales or vice versa, or whether something else drives the

purchase of all yellow fruits and vegetables.

One very common problem in link analysis is finding the shortest path

between two nodes. Which is shortest, though, depends on the weights

assigned to the edges. Consider the graph of flights between cities. Does short

est refer to distance? To the fewest number of flight segments? To the shortest

flight time? Or to the least expensive? All these questions are answered the

same way using graphs”the only difference is the weights on the edges.

The following two sections describe two classic problems in graph theory

that illustrate the power of graphs to represent and solve problems. Few data

mining problems are exactly like these two problems, but the problems give a

flavor of how the simple construction of graphs leads to some interesting solu

tions. They are presented to familiarize the reader with graphs by providing

examples of key concepts in graph theory and to provide a stronger basis for

discussing link analysis.

Link Analysis 325

Bananas

Red Leaf

8

Vine Tomatoes

8. 5

3.25

7

Red Peppers

3. Organic Peaches

3

6. 6

5.

21

3.35