

Generating Hypotheses
The key to generating hypotheses is getting diverse input from throughout the
organization and, where appropriate, outside it as well. Often, all that is needed
to start the ideas flowing is a clear statement of the problem itself, especially if
it is something that has not previously been recognized as a problem.
More often than one might suppose, problems go unrecognized
because they are not captured by the metrics being used to evaluate the
organization's performance. If a company has always measured its sales force
on the number of new sales made each month, the salespeople may never
have given much thought to the question of how long new customers remain
active or how much they spend over the course of their relationship with the
firm. When asked the right questions, however, the sales force may have
insights into customer behavior that marketing, with its greater distance from
the customer, has missed.

Testing Hypotheses
Consider the following hypotheses:
* Frequent roamers are less sensitive than others to the price per minute
  of cellular phone time.
* Families with high-school age children are more likely to respond to a
  home equity line offer than others.
* The save desk in the call center is saving customers who would have
  returned anyway.
Such hypotheses must be transformed in a way that allows them to be tested
on real data. Depending on the hypothesis, this may mean interpreting a single
value returned from a simple query, plowing through a collection of association
rules generated by market basket analysis, determining the significance of a
correlation found by a regression model, or designing a controlled experiment.
In all cases, careful critical thinking is necessary to be sure that the result is not
biased in unexpected ways.
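
As one illustration of turning a hypothesis into something testable, the home equity hypothesis above could be checked by comparing response rates for the two groups. The sketch below uses a two-proportion z-test, which is only one of several reasonable choices and is not prescribed by the text; the function name and the counts are made up for the example.

```python
import math


def two_proportion_z_test(responders_a, total_a, responders_b, total_b):
    """Return (z, two-sided p-value) for the difference in response rates."""
    rate_a = responders_a / total_a
    rate_b = responders_b / total_b
    # Pooled rate under the null hypothesis that both groups respond alike.
    pooled = (responders_a + responders_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: families with high-school age children vs. all others.
z, p = two_proportion_z_test(120, 2000, 300, 8000)
print(f"z = {z:.2f}, p = {p:.6f}")
```

A small p-value says only that the difference is unlikely to be due to chance. It does not rule out bias in how the two groups were selected, which is why the critical thinking mentioned above is still needed.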
Proper evaluation of data mining results requires both analytical and business
knowledge. Where these are not present in the same person, it takes
cross-functional cooperation to make good use of the new information.

Models, Profiling, and Prediction
Hypothesis testing is certainly useful, but there comes a time when it is not
sufficient. The data mining techniques described in the rest of this book are all
designed for learning new things by creating models based on data.
52 Chapter 3

In the most general sense, a model is an explanation or description of how
something works that reflects reality well enough that it can be used to make
inferences about the real world. Without realizing it, human beings use
models all the time. When you see two restaurants and decide that the one
with white tablecloths and real flowers on each table is more expensive than
the one with Formica tables and plastic flowers, you are making an inference
based on a model you carry in your head. When you set out to walk to the
store, you again consult a mental model of the town.
Data mining is all about creating models. As shown in Figure 3.3, models
take a set of inputs and produce an output. The data used to create the model
is called a model set. The new data to which a model is applied is called the
score set. The model set has three components, which are discussed in more
detail later in the chapter:

* The training set is used to build a set of models.
* The validation set¹ is used to choose the best model of these.
* The test set is used to determine how the model performs on unseen
  data.
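
The three partitions can be sketched in a few lines of code. The example below is only an illustration: the 60/20/20 proportions, the random shuffle, and the function name are assumptions for the sketch, not recommendations from the text.

```python
import random


def split_model_set(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Partition a model set into training, validation, and test sets."""
    shuffled = list(records)
    # A fixed seed makes the partition repeatable.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return training, validation, test


# Hypothetical model set of 1,000 customer IDs.
training_set, validation_set, test_set = split_model_set(range(1000))
print(len(training_set), len(validation_set), len(test_set))
```

Each record lands in exactly one partition, so the test set really is unseen by the time it is used.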

Data mining techniques can be used to make three kinds of models for three
kinds of tasks: descriptive profiling, directed profiling, and prediction. The
distinctions are not always clear.
Descriptive models describe what is in the data. The output is one or more
charts or numbers or graphics that explain what is going on. Hypothesis testing
often produces descriptive models. On the other hand, both directed profiling
and prediction have a goal in mind when the model is being built. The
difference between them has to do with time frames, as shown in Figure 3.4. In
profiling models, the target is from the same time frame as the input. In
predictive models, the target is from a later time frame. Prediction means finding
patterns in data from one period that are capable of explaining outcomes in a
later period. The reason for emphasizing the distinction between profiling and
prediction is that it has implications for the modeling methodology, especially
the treatment of time in the creation of the model set.

Figure 3.3 Models take an input and produce an output.

¹ The first edition called the three partitions of the model set the training set, the test set, and the
evaluation set. The authors still like that terminology, but standard usage in the data mining
community is now training/validation/test. To avoid confusion, this edition adopts the
training/validation/test nomenclature.

Data Mining Methodology and Best Practices 53

[Figure: two rows of calendars for August, September, October, and November
2004, each row marking an "Input variables" period and a "Target variable"
period.]
Figure 3.4 Profiling and prediction differ only in the time frames of the input and target
variables.

Profiling is a familiar approach to many problems. It need not involve any
sophisticated data analysis. Surveys, for instance, are one common method of
building customer profiles. Surveys reveal what customers and prospects look
like, or at least the way survey responders answer questions.
Profiles are often based on demographic variables, such as geographic location,
gender, and age. Since advertising is sold according to these same variables,
demographic profiles can turn directly into media strategies. Simple
profiles are used to set insurance premiums. A 17-year-old male pays more for
car insurance than a 60-year-old female. Similarly, the application form for a
simple term life insurance policy asks about age, sex, and smoking, and not
much more.
Powerful though it is, profiling has serious limitations. One is the inability
to distinguish cause and effect. So long as the profiling is based on familiar
demographic variables, this is not noticeable. If men buy more beer than
women, we do not have to wonder whether beer drinking might be the cause
of maleness. It seems safe to assume that the link is from men to beer and not
vice versa.
With behavioral data, the direction of causality is not always so clear.
Consider a couple of actual examples from real data mining projects:
* People who have purchased certificates of deposit (CDs) have little or
  no money in their savings accounts.
* Customers who use voice mail make a lot of short calls to their own
  number.
Not keeping money in a savings account is a common behavior of CD holders,
just as being male is a common feature of beer drinkers. Beer companies seek
out males to market their product, so should banks seek out people with no
money in savings in order to sell them certificates of deposit? Probably not!
Presumably, the CD holders have no money in their savings accounts because they
used that money to buy CDs. A more common reason for not having money in a
savings account is not having any money, and people with no money are not
likely to purchase certificates of deposit. Similarly, the voice mail users call their
own number so much because in this particular system that is one way to check
voice mail. The pattern is useless for finding prospective users.

Profiling uses data from the past to describe what happened in the past.
Prediction goes one step further. Prediction uses data from the past to predict what
is likely to happen in the future. This is a more powerful use of data. While the
correlation between low savings balances and CD ownership may not be useful
in a profile of CD holders, it is likely that having a high savings balance is (in
combination with other indicators) a predictor of future CD purchases.
Building a predictive model requires separation in time between the model
inputs or predictors and the model output, the thing to be predicted. If this
separation is not maintained, the model will not work. This is one example of
why it is important to follow a sound data mining methodology.
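
The separation in time can be enforced mechanically when the model set is built. The following is a minimal sketch, assuming each customer's history is a list of (date, amount) purchase records; that layout, the function name, the summary fields, and the window dates are all hypothetical and chosen only for illustration.

```python
from datetime import date


def build_predictive_row(events, input_start, input_end, target_start, target_end):
    """Summarize one customer's events into model inputs and a later target.

    events is a list of (date, amount) purchase records (a hypothetical layout).
    """
    # Enforce separation in time: every input must precede the target window.
    if not input_end < target_start:
        raise ValueError("input window must end before target window starts")

    input_amounts = [amt for day, amt in events if input_start <= day <= input_end]
    target_amounts = [amt for day, amt in events if target_start <= day <= target_end]

    return {
        "num_purchases": len(input_amounts),
        "total_spend": sum(input_amounts),
        "purchased_in_target_window": len(target_amounts) > 0,
    }


# Made-up history: inputs from August through October, target in November.
events = [
    (date(2004, 8, 14), 40.0),
    (date(2004, 9, 3), 25.0),
    (date(2004, 11, 20), 60.0),
]
row = build_predictive_row(
    events,
    input_start=date(2004, 8, 1), input_end=date(2004, 10, 31),
    target_start=date(2004, 11, 1), target_end=date(2004, 11, 30),
)
print(row)
```

A profiling model set would instead draw the inputs and the target from the same window, which is exactly what the check at the top of the function refuses to do.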

The Methodology
The data mining methodology has 11 steps.
1. Translate the business problem into a data mining problem.
2. Select appropriate data.
3. Get to know the data.
4. Create a model set.
5. Fix problems with the data.
6. Transform data to bring information to the surface.
7. Build models.
8. Assess models.
9. Deploy models.
10. Assess results.
11. Begin again.
As shown in Figure 3.5, the data mining process is best thought of as a set of
nested loops rather than a straight line. The steps do have a natural order, but
it is not necessary or even desirable to completely finish with one before moving
on to the next. And things learned in later steps will cause earlier ones to
be revisited.

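[Figure 3.5: the steps of the methodology drawn as a loop. Labels recovered from
the figure: Translate the business problem into a data mining problem; Select
appropriate data; Get to know the data; Create a model set; Fix problems with
the data; Transform data; Assess models; Deploy models; Assess results.]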
