<<

. 65
( 137 .)



>>

A timestamp or sequencing information to determine when transac­
––

tions occurred relative to each other
Identifying information, such as account number, household ID, or cus­
––

tomer ID that identifies different transactions as belonging to the same
customer or household (sometimes called an economic marketing unit)
Market Basket Analysis and Association Rules 319


Building sequential rules is similar to the process of building association
rules:
1. All items purchased by a customer are treated as a single order, and
each item retains the timestamp indicating when it was purchased.
2. The process is the same for finding groups of items that appear

together.

3. To develop the rules, only rules where the items on the left-hand side
were purchased before items on the right-hand side are considered.
The result is a set of association rules that can reveal sequential patterns.


Lessons Learned
Market basket data describes what customers purchase. Analyzing this data is
complex, and no single technique is powerful enough to provide all the
answers. The data itself typically describes the market basket at three different
levels. The order is the event of the purchase; the line-items are the items in the
purchase, and the customer connects orders together over time.
Many important questions about customer behavior can be answered by
looking at product sales over time. Which are the best selling items? Which
items that sold well last year are no longer selling well this year? Inventory
curves do not require transaction level data. Perhaps the most important
insight they provide is the effect of marketing interventions”did sales go up
or down after a particular event?
However, inventory curves are not sufficient for understanding relation­
ships among items in a single basket. One technique that is quite powerful is
association rules. This technique finds products that tend to sell together in
groups. Sometimes is the groups are sufficient for insight. Other times, the
groups are turned into explicit rules”when certain items are present then we
expect to find certain other items in the basket.
There are three measures of association rules. Support tells how often the
rule is found in the transaction data. Confidence says how often when the “if”
part is true that the “then” part is also true. And, lift tells how much better the
rule is at predicting the “then” part as compared to having no rule at all.
The rules so generated fall into three categories. Useful rules explain a rela­
tionship that was perhaps unexpected. Trivial rules explain relationships that
are known (or should be known) to exist. And inexplicable rules simply do not
make sense. Inexplicable rules often have weak support.
320 Chapter 9


Market basket analysis and association rules provide ways to analyze item-
level detail, where the relationships between items are determined by the
baskets they fall into. In the next chapter, we™ll turn to link analysis, which
generalizes the ideas of “items” linked by “relationships,” using the back­
ground of an area of mathematics called graph theory.
CHAPTER

10

Link Analysis




The international route maps of British Airways and Air France offer more
than just trip planning help. They also provide insights into the history and
politics of their respective homelands and of lost empires. A traveler bound
from New York to Mombasa changes planes at Heathrow; one bound for
Abidjan changes at Charles de Gaul. The international route maps show how
much information can be gained from knowing how things are connected.
Which Web sites link to which other ones? Who calls whom on the tele­
phone? Which physicians prescribe which drugs to which patients? These
relationships are all visible in data, and they all contain a wealth of informa­
tion that most data mining techniques are not able to take direct advantage of.
In our ever-more-connected world (where, it has been claimed, there are no
more than six degrees of separation between any two people on the planet),
understanding relationships and connections is critical. Link analysis is the
data mining technique that addresses this need.
Link analysis is based on a branch of mathematics called graph theory. This
chapter reviews the key notions of graphs, then shows how link analysis has
been applied to solve real problems. Link analysis is not applicable to all types
of data nor can it solve all types of problems. However, when it can be used, it




321
322 Chapter 10


often yields very insightful and actionable results. Some areas where it has
yielded good results are:
Identifying authoritative sources of information on the World Wide
––

Web by analyzing the links between its pages
Analyzing telephone call patterns to identify particular market seg­
––

ments such as people working from home
Understanding physician referral patterns; a referral is a relationship
––

between two physicians, once again, naturally susceptible to link analysis
Even where links are explicitly recorded, assembling them into a useful
graph can be a data-processing challenge. Links between Web pages are
encoded in the HTML of the pages themselves. Links between telephones




Y
are recorded in call detail records. Neither of these data sources is useful for




FL
link analysis without considerable preprocessing, however. In other cases, the
links are implicit and part of the data mining challenge is to recognize them.
The chapter begins with a brief introduction to graph theory and some of
AM
the classic problems that it has been used to solve. It then moves on to appli­
cations in data mining such as search engine rankings and analysis of call
detail records.
TE


Basic Graph Theory
Graphs are an abstraction developed specifically to represent relationships.
They have proven very useful in both mathematics and computer science for
developing algorithms that exploit these relationships. Fortunately, graphs are
quite intuitive, and there is a wealth of examples that illustrate how to take
advantage of them.
A graph consists of two distinct parts:
Nodes (sometimes called vertices) are the things in the graph that have
––

relationships. These have names and often have additional useful
properties.
Edges are pairs of nodes connected by a relationship. An edge is repre­
––

sented by the two nodes that it connects, so (A, B) or AB represents the
edge that connects A and B. An edge might also have a weight in a
weighted graph.
Figure 10.1 illustrates two graphs. The graph on the left has four nodes con­
nected by six edges and has the property that there is an edge between every
pair of nodes. Such a graph is said to be fully connected. It could be represent­
ing daily flights between Atlanta, New York, Cincinnati, and Salt Lake City on
an airline where these four cities serve as regional hubs. It could also represent


Team-Fly®
Link Analysis 323


four people, all of whom know each other, or four mutually related leads for a
criminal investigation. The graph on the right has one node in the center con­
nected to four other nodes. This could represent daily flights connecting
Atlanta to Birmingham, Greenville, Charlotte, and Savannah on an airline that
serves the Southeast from a hub in Atlanta, or a restaurant frequented by four
credit card customers. The graph itself captures the information about what is
connected to what. Without any labels, it can describe many different situa­
tions. This is the power of abstraction.
A few points of terminology about graphs. Because graphs are so useful for
visualizing relationships, it is nice when the nodes and edges can be drawn
with no intersecting edges. The graphs in Figure 10.2 have this property. They
are planar graphs, since they can be drawn on a sheet of paper (what mathe­
maticians call a plane) without having any edges intersect. Figure 10.2 shows
two graphs that cannot be drawn without having at least two edges cross.
There is, in fact, a theorem in graph theory that says that if a graph is nonpla­
nar, then lurking inside it is one of the two previously described graphs.
When a path exists between any two nodes in a graph, the graph is said to
be connected. For the rest of this chapter, we assume that all graphs are con­
nected, unless otherwise specified. A path, as its name implies, is an ordered
sequence of nodes connected by edges. Consider a graph where each node
represents a city, and the edges are flights between pairs of cities. On such a
graph, a node is a city and an edge is a flight segment”two cities that are con­
nected by a nonstop flight. A path is an itinerary of flight segments that go
from one city to another, such as from Greenville, South Carolina to Atlanta,
from Atlanta to Chicago, and from Chicago to Peoria.




A fully connected graph with A graph with five nodes
four nodes and six edges. In and four edges.
a fully connected graph, there
is an edge between every pair
of nodes.
Figure 10.1 Two examples of graphs.
324 Chapter 10


Oops! These edges
intersect.




Three nodes cannot connect A fully-connected graph
to three other nodes without with five nodes must also
two edges crossing over have edges that intersect.
each other.
Figure 10.2 Not all graphs can be drawn without having some edges cross over each other.


Figure 10.3 is an example of a weighted graph, one in which the edges have
weights associated with them. In this case, the nodes represent products pur­
chased by customers. The weights on the edges represent the support for the
association, the percentage of market baskets containing both products. Such
graphs provide an approach for solving problems in market basket analysis
and are also a useful means of visualizing market basket data. This product
association graph is an example of an undirected graph. The graph shows that
22.12 percent of market baskets at this health food grocery contain both yellow
peppers and bananas. By itself, this does not explain whether yellow pepper
sales drive banana sales or vice versa, or whether something else drives the
purchase of all yellow fruits and vegetables.
One very common problem in link analysis is finding the shortest path
between two nodes. Which is shortest, though, depends on the weights
assigned to the edges. Consider the graph of flights between cities. Does short­
est refer to distance? To the fewest number of flight segments? To the shortest
flight time? Or to the least expensive? All these questions are answered the
same way using graphs”the only difference is the weights on the edges.
The following two sections describe two classic problems in graph theory
that illustrate the power of graphs to represent and solve problems. Few data
mining problems are exactly like these two problems, but the problems give a
flavor of how the simple construction of graphs leads to some interesting solu­
tions. They are presented to familiarize the reader with graphs by providing
examples of key concepts in graph theory and to provide a stronger basis for
discussing link analysis.
Link Analysis 325


Bananas

Red Leaf




8
Vine Tomatoes




8. 5
3.25

7
Red Peppers
3. Organic Peaches




3
6. 6
5.
21
3.35



<<

. 65
( 137 .)



>>