

each customer (or for each group of customers that share the same values for
model input variables such as geography, credit class, and acquisition
channel) the probability that having made it as far as today, he or she will leave
before tomorrow. For any one tenure this hazard, as it is called, is quite small,
but it is higher for some tenures than for others. The chance that a customer
will survive to reach some more distant future date can be calculated from the
intervening hazards.
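As a rough illustration of that last point, the chance of surviving to a future tenure is the product of one minus each intervening hazard. The sketch below assumes one hazard per day of tenure; the function name survival_curve and the hazard values are hypothetical, chosen only to show the arithmetic.

```python
from itertools import accumulate
from operator import mul


def survival_curve(hazards):
    """Return the probability of surviving past each tenure in turn.

    hazards[t] is the probability of leaving at tenure t, given that the
    customer has made it that far. Survival is the running product of
    (1 - hazard).
    """
    return list(accumulate((1.0 - h for h in hazards), mul))


# Hypothetical daily hazards for the first five days of tenure.
hazards = [0.010, 0.004, 0.004, 0.020, 0.003]
curve = survival_curve(hazards)

for day, survival in enumerate(curve, start=1):
    print(f"day {day}: {survival:.4f}")
```

Each hazard is small, but the survival curve can only go down, and it drops fastest at the tenures where the hazard is highest.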

Lessons Learned
The data mining techniques described in this book have applications in fields
as diverse as biotechnology research and manufacturing process control. This
book, however, is written for people who, like the authors, will be applying
these techniques to the kinds of business problems that arise in marketing
and customer relationship management. In most of the book, the focus on
customer-centric applications is implicit in the choice of examples used to
illustrate the techniques. In this chapter, that focus is more explicit.
Data mining is used in support of both advertising and direct marketing to
identify the right audience, choose the best communications channels, and
pick the most appropriate messages. Prospective customers can be compared
to a profile of the intended audience and given a fitness score. Should
information on individual prospects not be available, the same method can be used
to assign fitness scores to geographic neighborhoods using data of the type
available from the U.S. Census Bureau, Statistics Canada, and similar official
sources in many countries.
A common application of data mining in direct marketing is response
modeling. A response model scores prospects on their likelihood to respond to a
direct marketing campaign. This information can be used to improve the
response rate of a campaign, but is not, by itself, enough to determine
campaign profitability. Estimating campaign profitability requires estimates
of the underlying response rate to a future campaign, estimates of
average order sizes associated with the response, and cost estimates for
fulfillment and for the campaign itself. A more customer-centric use of response
scores is to choose the best campaign for each customer from among a number
of competing campaigns. This approach avoids the usual problem of
independent, score-based campaigns, which tend to pick the same people every time.
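The profitability estimate described above can be sketched as simple arithmetic. This is a minimal illustration, not a complete profit model: the function campaign_profit, its parameter names, and all the figures are hypothetical, and it assumes for simplicity that the average order value is already net of the cost of goods.

```python
def campaign_profit(n_contacts, response_rate, avg_order_value,
                    fulfillment_cost_per_order, cost_per_contact,
                    fixed_campaign_cost=0.0):
    """Estimate campaign profit from response, order size, and cost estimates."""
    expected_orders = n_contacts * response_rate
    # Assumption: avg_order_value is the net value of an order before fulfillment.
    contribution = expected_orders * (avg_order_value - fulfillment_cost_per_order)
    campaign_cost = n_contacts * cost_per_contact + fixed_campaign_cost
    return contribution - campaign_cost


# Hypothetical campaign: 100,000 pieces mailed at a 2 percent response rate.
profit = campaign_profit(
    n_contacts=100_000,
    response_rate=0.02,
    avg_order_value=60.0,
    fulfillment_cost_per_order=15.0,
    cost_per_contact=0.75,
    fixed_campaign_cost=5_000.0,
)
print(f"estimated profit: {profit:,.2f}")
```

The response rate is only one of several inputs, which is why a response score alone cannot say whether a campaign will make money.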
It is important to distinguish between the ability of a model to recognize
people who are interested in a product or service and its ability to recognize
people who are moved to make a purchase based on a particular campaign or
offer. Differential response analysis offers a way to identify the market
segments where a campaign will have the greatest impact. Differential response
models seek to maximize the difference in response between a treated group
and a control group rather than trying to maximize the response itself.
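A small sketch may make the distinction concrete. It assumes that response counts are available for a treated group and a control group within each segment. The segment names, the counts, and the function uplift_by_segment are all hypothetical.

```python
def uplift_by_segment(counts):
    """Return treated response rate minus control response rate per segment.

    counts maps segment -> (treated_n, treated_responders,
                            control_n, control_responders).
    """
    uplift = {}
    for segment, (t_n, t_resp, c_n, c_resp) in counts.items():
        uplift[segment] = t_resp / t_n - c_resp / c_n
    return uplift


# Hypothetical test results for two segments.
counts = {
    "segment_a": (1000, 80, 1000, 70),  # responds well, with or without the offer
    "segment_b": (1000, 50, 1000, 20),  # responds less, but the offer moves it
}
uplift = uplift_by_segment(counts)
best_segment = max(uplift, key=uplift.get)
print(best_segment, uplift)
```

In this made-up data, a plain response model would favor segment_a because its treated response rate is higher, while the differential view favors segment_b, where the campaign changes behavior the most.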
Information about current customers can be used to identify likely prospects
by finding predictors of desired outcomes in the information that was known
about current customers before they became customers. This sort of analysis is
valuable for selecting acquisition channels and contact strategies as well as for
screening prospect lists. Companies can increase the value of their customer
data by beginning to track customers from their first response, even before they
become customers, and gathering and storing additional information when
customers are acquired.
Once customers have been acquired, the focus shifts to customer relationship
management. The data available for active customers is richer than that
available for prospects and, because it is behavioral in nature rather than
simply geographic and demographic, it is more predictive. Data mining is used to
identify additional products and services that should be offered to customers
based on their current usage patterns. It can also suggest the best time to make
a cross-sell or up-sell offer.
One of the goals of a customer relationship management program is to
retain valuable customers. Data mining can help identify which customers are
the most valuable and evaluate the risk of voluntary or involuntary churn
associated with each customer. Armed with this information, companies can
target retention offers at customers who are both valuable and at risk, and take
steps to protect themselves from customers who are likely to default.
From a data mining perspective, churn modeling can be approached as
either a binary-outcome prediction problem or through survival analysis.
There are advantages and disadvantages to both approaches. The binary-outcome
approach works well for a short horizon, while the survival analysis
approach can be used to make forecasts far into the future and provides insight
into customer loyalty and customer value as well.



The Lure of Statistics: Data
Mining Using Familiar Tools

For statisticians (and economists too), the term “data mining” has long had a
pejorative meaning. Instead of finding useful patterns in large volumes of
data, data mining has the connotation of searching for data to fit preconceived
ideas. This is much like what politicians do around election time: search for
data to show the success of their deeds. This is certainly not what we mean by
data mining! This chapter is intended to bridge some of the gap between
statisticians and data miners.
The two disciplines are very similar. Statisticians and data miners
commonly use many of the same techniques, and statistical software vendors now
include many of the techniques described in the next eight chapters in their
software packages. Statistics developed as a discipline separate from
mathematics over the past century and a half to help scientists make sense of
observations and to design experiments that yield the reproducible and accurate
results we associate with the scientific method. For almost all of this period,
the issue was not too much data, but too little. Scientists had to figure out
how to understand the world using data collected by hand in notebooks.
These quantities were sometimes recorded incorrectly, rendered illegible by faded
or smudged ink, and so on. Early statisticians were practical people who
invented techniques to handle whatever problem was at hand. Statisticians are
still practical people who use modern techniques as well as the tried and true.

What is remarkable and a testament to the founders of modern statistics is
that techniques developed on tiny amounts of data have survived and still
prove their utility. These techniques have proven their worth not only in the
original domains but also in virtually all areas where data is collected, from
agriculture to psychology to astronomy and even to business.
Perhaps the greatest statistician of the twentieth century was R. A. Fisher,
considered by many to be the father of modern statistics. In the 1920s, before
the invention of modern computers, he devised methods for designing and
analyzing experiments. For two years, while living on a farm outside London,
he collected various measurements of crop yields along with potential
explanatory variables (amount of rain, sun, and fertilizer, for instance). To
understand what has an effect on crop yields, he invented new techniques
(such as analysis of variance, or ANOVA) and performed perhaps a million
calculations on the data he collected. Although twenty-first-century computer
chips easily handle many millions of calculations in a second, each of Fisher's
calculations required pulling a lever on a manual calculating machine. Results
trickled in slowly over weeks and months, along with sore hands and calluses.
The advent of computing power has clearly simplified some aspects of
analysis, although its bigger effect is probably the wealth of data produced. Our
goal is no longer to extract every last iota of possible information from each rare
datum. Our goal is instead to make sense of quantities of data so large that they
are beyond the ability of our brains to comprehend in their raw format.
The purpose of this chapter is to present some key ideas from statistics that
have proven to be useful tools for data mining. This is intended to be neither a
thorough nor a comprehensive introduction to statistics; rather, it is an
introduction to a handful of useful statistical techniques and ideas. These tools are
shown by demonstration, rather than through mathematical proof.
The chapter starts with an introduction to what is probably the most
important aspect of applied statistics: the skeptical attitude. It then discusses looking
at data through a statistician's eye, introducing important concepts and
terminology along the way. Sprinkled through the chapter are examples, especially
for confidence intervals and the chi-square test. The final example, using the
chi-square test to understand geography and channel, is an unusual application of
the ideas presented in the chapter. The chapter ends with a brief discussion of
some of the differences between data miners and statisticians, differences in
attitude that are more a matter of degree than of substance.

Occam's Razor
William of Occam was a Franciscan monk born in a small English town in
1280, not only before modern statistics was invented, but also before the
Renaissance and the printing press. He was an influential philosopher, theologian,
and professor who expounded many ideas about many things, including church
politics. As a monk, he was an ascetic who took his vow of poverty very
seriously. He was also a fervent advocate of the power of reason, denying the
existence of universal truths and espousing a modern philosophy that was
quite different from the views of most of his contemporaries living in the
Middle Ages.
What does William of Occam have to do with data mining? His name has
become associated with a very simple idea. He himself explained it in Latin
(the language of learning, even among the English, at the time), “Entia non sunt
multiplicanda sine necessitate.” In more familiar English, we would say “the
simpler explanation is the preferable one” or, more colloquially, “Keep it simple,
stupid.” Any explanation should strive to reduce the number of causes to a
bare minimum. This line of reasoning is referred to as Occam's Razor and is
William of Occam's gift to data analysis.
The story of William of Occam had an interesting ending. Perhaps because
of his focus on the power of reason, he also believed that the powers of the
church should be separate from the powers of the state; that is, the church
should be confined to religious matters. This resulted in his opposition to the
meddling of Pope John XXII in politics and eventually to his own
excommunication. He died in Munich during an outbreak of the plague in
1349, leaving a legacy of clear and critical thinking for future generations.

The Null Hypothesis
Occam's Razor is very important for data mining and statistics, although
statistics expresses the idea a bit differently. The null hypothesis is the assumption
that differences among observations are due simply to chance. To give an
example, consider a presidential poll that gives Candidate A 45 percent and
Candidate B 47 percent. Because this data is from a poll, there are several
sources of error, so the values are only approximate estimates of the
popularity of each candidate. The layperson is inclined to ask, “Are these two values
different?” The statistician phrases the question slightly differently, “What is
the probability that these two values are really the same?”
Although the two questions are very similar, the statistician's has a bit of an
attitude. This attitude is that the difference may have no significance at all and
is an example of using the null hypothesis. There is an observed difference of
2 percent in this example. However, this observed value may be explained by
the particular sample of people who responded. Another sample may have a
difference of 2 percent in the other direction, or may have a difference of 0
percent. All are reasonably likely results from a poll. Of course, if the preferences
differed by 20 percent, then sampling variation is much less likely to be the
cause. Such a large difference would greatly improve the confidence that one
candidate is doing better than the other, and greatly reduce the probability of
the null hypothesis being true.
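One way to get a feel for this sampling variation is to simulate many polls in which the two candidates are truly tied and count how often a gap of 2 percentage points shows up anyway. The poll above does not state its sample size, so the 1,000 respondents below are an assumption, as are the 46/46/8 split of true support and the function name share_of_polls_with_gap.

```python
import random


def share_of_polls_with_gap(n_respondents, min_gap, n_polls, seed=0):
    """Simulate polls of a truly tied race and return the share of polls
    in which the two candidates differ by at least min_gap respondents."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_polls):
        # Assumed true support: A 46%, B 46%, everyone else ("O") 8%.
        votes = rng.choices("ABO", weights=[46, 46, 8], k=n_respondents)
        gap = abs(votes.count("A") - votes.count("B"))
        if gap >= min_gap:
            hits += 1
    return hits / n_polls


# A gap of 20 respondents out of 1,000 is 2 percentage points.
share = share_of_polls_with_gap(n_respondents=1000, min_gap=20, n_polls=2000)
print(f"share of tied-race polls with a gap of 2 points or more: {share:.2f}")

# A gap of 200 respondents out of 1,000 is 20 percentage points.
big_gap_share = share_of_polls_with_gap(n_respondents=1000, min_gap=200, n_polls=200)
print(f"share with a gap of 20 points or more: {big_gap_share:.2f}")
```

Under these assumptions a substantial share of the simulated polls show a gap of 2 points even though nothing separates the candidates, while a gap of 20 points essentially never appears by chance.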
TIP The simplest explanation is usually the best one, even (or especially) if it
does not prove the hypothesis you want to prove.

This skeptical attitude is very valuable for both statisticians and data
miners. Our goal is to demonstrate results that work, and to discount the null
hypothesis. One difference between data miners and statisticians is that data
miners often work with amounts of data large enough that there is little need
to worry about the mechanics of calculating the probability of
something being due to chance.

The null hypothesis is not merely an approach to analysis; it can also be
quantified. The p-value is the probability of seeing a difference at least as
large as the one observed when the null hypothesis is true. Remember,
when the null hypothesis is true, nothing is really happening, because
differences are due to chance. Much of statistics is devoted to determining bounds
for the p-value.
Consider the previous example of the presidential poll. Suppose that the
p-value is calculated to be 60 percent (more on how this is done later in the
chapter). This means that, if the two candidates had equal support in the
general population, chance alone would produce a difference at least as large as
the one measured by the poll 60 percent of the time. In this case,
there is little evidence that the support for the two candidates is different.
Let's say the p-value is 5 percent, instead. This is a relatively small number,
and it means that we are 95 percent confident that Candidate B is doing better
than Candidate A. Confidence, sometimes called the q-value, is the flip side of
the p-value. Generally, the goal is to aim for a confidence level of at least 90
percent, if not 95 percent or more (meaning that the corresponding p-value is
less than 10 percent, or 5 percent, respectively).
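As a hedged sketch of how such a number might be obtained for the poll example, the code below uses a normal approximation to ask how often a tied race would produce a gap as large as the one observed. This is not necessarily the calculation the chapter presents later. The function name poll_gap_p_value and both sample sizes are assumptions, since the poll's size is not given.

```python
from math import sqrt
from statistics import NormalDist


def poll_gap_p_value(share_a, share_b, n_respondents):
    """Two-sided p-value for the gap between two candidates in one poll,
    under the null hypothesis that their true support is equal."""
    pooled = (share_a + share_b) / 2
    # Both shares come from the same respondents, so they are negatively
    # correlated; under the null, the variance of their difference is
    # 2 * pooled / n_respondents.
    std_err = sqrt(2 * pooled / n_respondents)
    z = abs(share_b - share_a) / std_err
    return 2 * (1 - NormalDist().cdf(z))


# Same 45 versus 47 percent result, two assumed poll sizes.
p_small_poll = poll_gap_p_value(0.45, 0.47, n_respondents=1_000)
p_large_poll = poll_gap_p_value(0.45, 0.47, n_respondents=10_000)

for label, p in [("1,000 respondents", p_small_poll),
                 ("10,000 respondents", p_large_poll)]:
    print(f"{label}: p-value {p:.3f}, confidence {1 - p:.3f}")
```

The same 2-point gap gives a large p-value in the smaller poll and a p-value under 5 percent in the larger one, which is why sample size matters as much as the size of the difference.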
These ideas (null hypothesis, p-value, and confidence) are three basic
ideas in statistics. The next section carries these ideas further and introduces
the statistical concept of distributions, with particular attention to the normal
distribution.

A Look at Data
A statistic refers to a measure taken on a sample of data. Statistics is the study
of these measures and the samples they are measured on. A good place to start,
then, is with such useful measures, and how to look at data.
Looking at Discrete Values
Much of the data used in data mining is discrete by nature, rather than
continuous. Discrete data shows up in the form of products, channels, regions, and
descriptive information about businesses. This section discusses ways of
looking at and analyzing discrete fields.

The most basic descriptive statistic about discrete fields is the number of
times different values occur. Figure 5.1 shows a histogram of stop reason codes
during a period of time. A histogram shows how often each value occurs in the

