model input variables such as geography, credit class, and acquisition channel) the probability that, having made it as far as today, he or she will leave before tomorrow. For any one tenure this hazard, as it is called, is quite small, but it is higher for some tenures than for others. The chance that a customer will survive to reach some more distant future date can be calculated from the intervening hazards.
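The calculation from the intervening hazards can be sketched in a few lines. The hazard values below are invented for illustration; the text gives no numbers:

```python
# Hypothetical monthly hazards: the probability that a customer who has
# survived to the start of month t leaves during month t. Invented numbers.
hazards = [0.05, 0.04, 0.04, 0.03, 0.02, 0.02]

# Survival to the end of month t is the product of (1 - hazard) over all
# intervening tenures.
survival = 1.0
curve = []
for h in hazards:
    survival *= 1.0 - h
    curve.append(round(survival, 4))

print(curve)
```

Each successive entry is smaller than the last, which is why survival curves slope downward even when every individual hazard is tiny.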

Lessons Learned

The data mining techniques described in this book have applications in fields as diverse as biotechnology research and manufacturing process control. This book, however, is written for people who, like the authors, will be applying these techniques to the kinds of business problems that arise in marketing and customer relationship management. In most of the book, the focus on customer-centric applications is implicit in the choice of examples used to illustrate the techniques. In this chapter, that focus is more explicit.

Data mining is used in support of both advertising and direct marketing to identify the right audience, choose the best communications channels, and pick the most appropriate messages. Prospective customers can be compared to a profile of the intended audience and given a fitness score. Should information on individual prospects not be available, the same method can be used to assign fitness scores to geographic neighborhoods using data of the type available from the U.S. Census Bureau, Statistics Canada, and similar official sources in many countries.

A common application of data mining in direct marketing is response modeling. A response model scores prospects on their likelihood to respond to a direct marketing campaign. This information can be used to improve the response rate of a campaign, but is not, by itself, enough to determine campaign profitability. Estimating campaign profitability requires reliance on estimates of the underlying response rate to a future campaign, estimates of average order sizes associated with the response, and cost estimates for fulfillment and for the campaign itself. A more customer-centric use of response scores is to choose the best campaign for each customer from among a number of competing campaigns. This approach avoids the usual problem of independent, score-based campaigns, which tend to pick the same people every time.
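The profitability arithmetic described above can be made concrete with a small sketch. All of the input figures here are hypothetical assumptions, not numbers from the text:

```python
# Illustrative sketch of estimating campaign profitability from a response
# rate, an average order size, and fulfillment/campaign costs.
def campaign_profit(n_mailed, response_rate, avg_order, margin,
                    cost_per_contact, fixed_cost):
    """Expected profit of a direct marketing campaign (all inputs assumed)."""
    revenue = n_mailed * response_rate * avg_order * margin
    cost = n_mailed * cost_per_contact + fixed_cost
    return revenue - cost

# A response model that doubles the response rate on a smaller, targeted list:
mass = campaign_profit(100_000, 0.01, 80.0, 0.5, 0.50, 2_000)
targeted = campaign_profit(40_000, 0.02, 80.0, 0.5, 0.50, 2_000)
print(mass, targeted)
```

With these invented numbers, mailing everyone loses money while the smaller targeted list turns a profit, which is exactly why a response score alone, without the order-size and cost estimates, cannot determine profitability.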

It is important to distinguish between the ability of a model to recognize people who are interested in a product or service and its ability to recognize people who are moved to make a purchase based on a particular campaign or offer. Differential response analysis offers a way to identify the market segments where a campaign will have the greatest impact. Differential response models seek to maximize the difference in response between a treated group and a control group rather than trying to maximize the response itself.
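A minimal sketch of the differential response idea, using invented segment-level counts: segments are ranked by the treated-minus-control response difference rather than by raw response rate.

```python
# Hypothetical counts per market segment:
# (treated_responses, treated_size, control_responses, control_size)
segments = {
    "A": (120, 1000, 110, 1000),  # high response, but responds anyway: tiny lift
    "B": (90, 1000, 40, 1000),    # moved by the campaign: large lift
    "C": (30, 1000, 28, 1000),
}

def uplift(treated_resp, treated_n, control_resp, control_n):
    # Difference in response rate between the treated and control groups.
    return treated_resp / treated_n - control_resp / control_n

ranked = sorted(segments, key=lambda s: uplift(*segments[s]), reverse=True)
print(ranked)
```

Note that segment "A" has the highest raw response rate but nearly the same rate in its control group, so it ranks below "B" on incremental impact, which is the quantity a differential response model tries to maximize.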

Information about current customers can be used to identify likely prospects by finding predictors of desired outcomes in the information that was known about current customers before they became customers. This sort of analysis is valuable for selecting acquisition channels and contact strategies as well as for screening prospect lists. Companies can increase the value of their customer data by beginning to track customers from their first response, even before they become customers, and gathering and storing additional information when customers are acquired.

Once customers have been acquired, the focus shifts to customer relationship management. The data available for active customers is richer than that available for prospects and, because it is behavioral in nature rather than simply geographic and demographic, it is more predictive. Data mining is used to identify additional products and services that should be offered to customers based on their current usage patterns. It can also suggest the best time to make a cross-sell or up-sell offer.

One of the goals of a customer relationship management program is to retain valuable customers. Data mining can help identify which customers are the most valuable and evaluate the risk of voluntary or involuntary churn associated with each customer. Armed with this information, companies can target retention offers at customers who are both valuable and at risk, and take steps to protect themselves from customers who are likely to default.


From a data mining perspective, churn modeling can be approached as either a binary-outcome prediction problem or through survival analysis. There are advantages and disadvantages to both approaches. The binary-outcome approach works well for a short horizon, while the survival analysis approach can be used to make forecasts far into the future and provides insight into customer loyalty and customer value as well.


CHAPTER 5

The Lure of Statistics: Data Mining Using Familiar Tools

For statisticians (and economists too), the term “data mining” has long had a pejorative meaning. Instead of finding useful patterns in large volumes of data, data mining has the connotation of searching for data to fit preconceived ideas. This is much like what politicians do around election time, searching for data to show the success of their deeds; this is certainly not what we mean by data mining! This chapter is intended to bridge some of the gap between statisticians and data miners.

The two disciplines are very similar. Statisticians and data miners commonly use many of the same techniques, and statistical software vendors now include many of the techniques described in the next eight chapters in their software packages. Statistics developed as a discipline separate from mathematics over the past century and a half to help scientists make sense of observations and to design experiments that yield the reproducible and accurate results we associate with the scientific method. For almost all of this period, the issue was not too much data, but too little. Scientists had to figure out how to understand the world using data collected by hand in notebooks. These quantities were sometimes mistakenly recorded, illegible due to fading and smudged ink, and so on. Early statisticians were practical people who invented techniques to handle whatever problem was at hand. Statisticians are still practical people who use modern techniques as well as the tried and true.


What is remarkable, and a testament to the founders of modern statistics, is that techniques developed on tiny amounts of data have survived and still prove their utility. These techniques have proven their worth not only in the original domains but also in virtually all areas where data is collected, from agriculture to psychology to astronomy and even to business.

Perhaps the greatest statistician of the twentieth century was R. A. Fisher, considered by many to be the father of modern statistics. In the 1920s, before the invention of modern computers, he devised methods for designing and analyzing experiments. For two years, while living on a farm outside London, he collected various measurements of crop yields along with potential explanatory variables: the amount of rain, sun, and fertilizer, for instance. To understand what has an effect on crop yields, he invented new techniques (such as analysis of variance, or ANOVA) and performed perhaps a million calculations on the data he collected. Although twenty-first-century computer chips easily handle many millions of calculations in a second, each of Fisher's calculations required pulling a lever on a manual calculating machine. Results trickled in slowly over weeks and months, along with sore hands and calluses.

The advent of computing power has clearly simplified some aspects of analysis, although its bigger effect is probably the wealth of data produced. Our goal is no longer to extract every last iota of possible information from each rare datum. Our goal is instead to make sense of quantities of data so large that they are beyond the ability of our brains to comprehend in their raw format.

The purpose of this chapter is to present some key ideas from statistics that have proven to be useful tools for data mining. This is intended to be neither a thorough nor a comprehensive introduction to statistics; rather, it is an introduction to a handful of useful statistical techniques and ideas. These tools are shown by demonstration, rather than through mathematical proof.

The chapter starts with an introduction to what is probably the most important aspect of applied statistics: the skeptical attitude. It then discusses looking at data through a statistician's eye, introducing important concepts and terminology along the way. Sprinkled through the chapter are examples, especially for confidence intervals and the chi-square test. The final example, using the chi-square test to understand geography and channel, is an unusual application of the ideas presented in the chapter. The chapter ends with a brief discussion of some of the differences between data miners and statisticians, differences in attitude that are more a matter of degree than of substance.

Occam's Razor

William of Occam was a Franciscan monk born in a small English town in 1280, not only before modern statistics was invented, but also before the Renaissance and the printing press. He was an influential philosopher, theologian, and professor who expounded many ideas about many things, including church politics. As a monk, he was an ascetic who took his vow of poverty very seriously. He was also a fervent advocate of the power of reason, denying the existence of universal truths and espousing a modern philosophy that was quite different from the views of most of his contemporaries living in the Middle Ages.

What does William of Occam have to do with data mining? His name has become associated with a very simple idea. He himself explained it in Latin (the language of learning, even among the English, at the time): “Entia non sunt multiplicanda sine necessitate.” In more familiar English, we would say “the simpler explanation is the preferable one” or, more colloquially, “Keep it simple, stupid.” Any explanation should strive to reduce the number of causes to a bare minimum. This line of reasoning is referred to as Occam's Razor and is William of Occam's gift to data analysis.

The story of William of Occam had an interesting ending. Perhaps because of his focus on the power of reason, he also believed that the powers of the church should be separate from the powers of the state; that is, the church should be confined to religious matters. This resulted in his opposition to the meddling of Pope John XXII in politics and eventually to his own excommunication. He eventually died in Munich during an outbreak of the plague in 1349, leaving a legacy of clear and critical thinking for future generations.

The Null Hypothesis

Occam's Razor is very important for data mining and statistics, although statistics expresses the idea a bit differently. The null hypothesis is the assumption that differences among observations are due simply to chance. To give an example, consider a presidential poll that gives Candidate A 45 percent and Candidate B 47 percent. Because this data is from a poll, there are several sources of error, so the values are only approximate estimates of the popularity of each candidate. The layperson is inclined to ask, “Are these two values different?” The statistician phrases the question slightly differently: “What is the probability that these two values are really the same?”

Although the two questions are very similar, the statistician's has a bit of an attitude. This attitude, that the difference may have no significance at all, is an example of using the null hypothesis. There is an observed difference of 2 percent in this example. However, this observed value may be explained by the particular sample of people who responded. Another sample may have a difference of 2 percent in the other direction, or may have a difference of 0 percent. All are reasonably likely results from a poll. Of course, if the preferences differed by 20 percent, then sampling variation is much less likely to be the cause. Such a large difference would greatly improve the confidence that one candidate is doing better than the other, and greatly reduce the probability of the null hypothesis being true.
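A quick simulation makes the point. Assume, hypothetically (the text gives no sample size), a poll of 1,000 people and two candidates who are truly tied:

```python
# How often does pure sampling variation produce a gap of 2 points or more
# when the two candidates are actually tied? Sample size is an assumption.
import random

random.seed(0)
n, trials, big_gaps = 1000, 2000, 0
for _ in range(trials):
    a = sum(random.random() < 0.5 for _ in range(n))  # respondents for Candidate A
    gap = abs(a - (n - a)) / n                        # observed difference in shares
    if gap >= 0.02:
        big_gaps += 1

print(big_gaps / trials)  # fraction of "null" polls showing at least a 2-point gap
```

Roughly half of such polls show a 2-point gap even though the candidates are tied, so the observed difference is weak evidence; a 20-point gap, by contrast, would almost never arise by chance at this sample size.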

126 Chapter 5

TIP: The simplest explanation is usually the best one, even (or especially) if it does not prove the hypothesis you want to prove.

This skeptical attitude is very valuable for both statisticians and data miners. Our goal is to demonstrate results that work, and to discount the null hypothesis. One difference between data miners and statisticians is that data miners often work with data sets large enough that they need not worry about the mechanics of calculating the probability of something being due to chance.

P-Values

The null hypothesis is not merely an approach to analysis; it can also be quantified. Loosely speaking, the p-value is the probability that the null hypothesis is true (more precisely, it is the probability of seeing a difference at least as large as the observed one when the null hypothesis holds). Remember, when the null hypothesis is true, nothing is really happening, because differences are due to chance. Much of statistics is devoted to determining bounds for the p-value.

Consider the previous example of the presidential poll, and suppose that the p-value is calculated to be 60 percent (more on how this is done later in the chapter). This means that there is a 60 percent likelihood that the difference in the support for the two candidates as measured by the poll is due strictly to chance and not to the overall support in the general population. In this case, there is little evidence that the support for the two candidates is different.

Let's say the p-value is 5 percent, instead. This is a relatively small number, and it means that we are 95 percent confident that Candidate B is doing better than Candidate A. Confidence, sometimes called the q-value, is the flip side of the p-value. Generally, the goal is to aim for a confidence level of at least 90 percent, if not 95 percent or more (meaning that the corresponding p-value is less than 10 percent, or 5 percent, respectively).
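As a sketch of where such numbers come from (the details appear later in the chapter), here is a normal-approximation calculation for the poll example. The sample size of 1,000 is an assumption, so the resulting p-value only roughly matches the 60 percent figure used above:

```python
# Normal-approximation p-value for the observed 2-point gap in the poll.
# Sample size n is an assumed value, not given in the text.
import math

n = 1000
p_a, p_b = 0.45, 0.47
diff = p_b - p_a

# Standard error of the difference between two shares measured on the SAME
# sample (multinomial): sqrt((pA + pB - (pA - pB)^2) / n).
se = math.sqrt((p_a + p_b - diff ** 2) / n)
z = diff / se

# Two-sided p-value from the standard normal distribution, via math.erf.
phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
p_value = 2 * (1 - phi)
confidence = 1 - p_value
print(round(p_value, 2), round(confidence, 2))
```

With these assumptions the p-value comes out near one half: the 2-point gap gives little confidence that the candidates' support really differs.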

These ideas (the null hypothesis, the p-value, and confidence) are three basic ideas in statistics. The next section carries these ideas further and introduces the statistical concept of distributions, with particular attention to the normal distribution.

A Look at Data

A statistic refers to a measure taken on a sample of data. Statistics is the study of these measures and the samples they are measured on. A good place to start, then, is with such useful measures, and how to look at data.


Looking at Discrete Values

Much of the data used in data mining is discrete by nature, rather than continuous. Discrete data shows up in the form of products, channels, regions, and descriptive information about businesses. This section discusses ways of looking at and analyzing discrete fields.

Histograms

The most basic descriptive statistic about discrete fields is the number of times different values occur. Figure 5.1 shows a histogram of stop reason codes during a period of time. A histogram shows how often each value occurs in the