

When the American League wins the World Series, Republicans take
the White House.
When the Washington Redskins win their last home game, the incumbent
party keeps the White House.
In U.S. presidential contests, the taller man usually wins.

The first pattern (the one involving off-year elections) seems explainable in
purely political terms. Because there is an underlying explanation, this pattern
seems likely to continue into the future and therefore has predictive value. The
next two alleged predictors, the ones involving sporting events, seem just as
clearly to have no predictive value. No matter how many times Republicans
and the American League may have shared victories in the past (and the
authors have not researched this point), there is no reason to expect the
association to continue in the future.
What about candidates' heights? At least since 1948, when Truman (who was
short, but taller than Dewey) was elected, the election in which Carter beat
Ford is the only one in which the shorter candidate won. (So long as "winning"
is defined as "receiving the most votes," the 2000 election that pitted
6'1" Gore against 6'0" Bush still fits the pattern.) Height does not seem to
have anything to do with the job of being president. On the other hand, height
is positively correlated with income and other social marks of success so
consciously or unconsciously, voters may perceive a taller candidate as more
presidential. As this chapter explains, the right way to decide if a rule is stable
and predictive is to compare its performance on multiple samples selected at
random from the same population. In the case of presidential height, we leave
this as an exercise for the reader. As is often the case, the hardest part of the
task will be collecting the data; even in the age of Google, it is not easy to
locate the heights of unsuccessful presidential candidates from the eighteenth,
nineteenth, and twentieth centuries!
The technical term for finding patterns that fail to generalize is overfitting.
Overfitting leads to unstable models that work one day, but not the next.
Building stable models is the primary goal of the data mining methodology.
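The random-sample comparison described above can be sketched in a few lines. The figures below are invented for illustration (a rule that holds in roughly 80 percent of cases, samples of 30), not actual election data:

```python
import random

random.seed(0)

# Hypothetical population of past contests: 1 means the rule held
# ("the taller candidate won"), 0 means it did not. The 80/20 split
# is a made-up figure for illustration only.
population = [1] * 80 + [0] * 20

def rule_accuracy(sample):
    """Fraction of contests in the sample where the rule holds."""
    return sum(sample) / len(sample)

# Evaluate the rule on several samples drawn at random
# from the same population.
accuracies = []
for _ in range(5):
    sample = random.sample(population, 30)
    accuracies.append(rule_accuracy(sample))

# A stable rule performs similarly on every sample; wildly varying
# accuracy across samples is a symptom of overfitting to one sample.
print([round(a, 2) for a in accuracies])
```

If the accuracies cluster tightly, the rule is stable on this population; if they scatter widely, the pattern found in any one sample should not be trusted.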

The Model Set May Not Reflect the Relevant Population
The model set is the collection of historical data that is used to develop data
mining models. For inferences drawn from the model set to be valid, the
model set must reflect the population that the model is meant to describe,
classify, or score. A sample that does not properly reflect its parent population is
biased. Using a biased sample as a model set is a recipe for learning things that
are not true. It is also hard to avoid. Consider:
Customers are not like prospects.
Survey responders are not like nonresponders.
People who read email are not like people who do not read email.
People who register on a Web site are not like people who fail to register.
After an acquisition, customers from the acquired company are not
necessarily like customers from the acquirer.
Records with no missing values reflect a different population from
records with missing values.

Customers are not like prospects because they represent people who
responded positively to whatever messages, offers, and promotions were made
to attract customers in the past. A study of current customers is likely to suggest
more of the same. If past campaigns have gone after wealthy, urban consumers,
then any comparison of current customers with the general population will
likely show that customers tend to be wealthy and urban. Such a model may
miss opportunities in middle-income suburbs. The consequences of using a
biased sample can be worse than simply a missed marketing opportunity.
Data Mining Methodology and Best Practices 47

In the United States, there is a history of “redlining,” the illegal practice of
refusing to write loans or insurance policies in certain neighborhoods. A
search for patterns in the historical data from a company that had a history of
redlining would reveal that people in certain neighborhoods are unlikely to be
customers. If future marketing efforts were based on that finding, data mining
would help perpetuate an illegal and unethical practice.
Careful attention to selecting and sampling data for the model set is crucial
to successful data mining.
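One simple check is to compare the distribution of a key attribute in the model set against the same distribution in the target population. The income bands and proportions below are hypothetical:

```python
from collections import Counter

# Hypothetical income bands for the model set (current customers)
# and for the overall prospect population. All values are invented.
model_set  = ["high"] * 70 + ["middle"] * 20 + ["low"] * 10
population = ["high"] * 30 + ["middle"] * 45 + ["low"] * 25

def proportions(values):
    """Share of each band in a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {band: counts[band] / total for band in counts}

model_props = proportions(model_set)
pop_props = proportions(population)

# Large gaps between the two distributions signal a biased model set:
# here the model set heavily over-represents wealthy consumers.
for band in ["high", "middle", "low"]:
    gap = model_props.get(band, 0) - pop_props.get(band, 0)
    print(f"{band}: model={model_props.get(band, 0):.2f} "
          f"population={pop_props.get(band, 0):.2f} gap={gap:+.2f}")
```

A model built on the skewed sample above would "learn" that customers are wealthy, and would score middle-income prospects poorly for reasons that have nothing to do with their actual behavior.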

Data May Be at the Wrong Level of Detail
In more than one industry, we have been told that usage often goes down in
the month before a customer leaves. Upon closer examination, this turns out to
be an example of learning something that is not true. Figure 3.1 shows the
monthly minutes of use for a cellular telephone subscriber. For 7 months, the
subscriber used about 100 minutes per month. Then, in the eighth month,
usage went down to about half that. In the ninth month, there was no usage
at all.
This subscriber appears to fit the pattern in which a month with decreased
usage precedes abandonment of the service. But appearances are deceiving.
Looking at minutes of use by day instead of by month would show that the
customer continued to use the service at a constant rate until the middle of the
month and then stopped completely, presumably because on that day, he or
she began using a competing service. The putative period of declining usage
does not actually exist and so certainly does not provide a window of
opportunity for retaining the customer. What appears to be a leading indicator is
actually a trailing one.
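The effect is easy to reproduce. The sketch below uses invented daily usage figures and 30-day months; aggregating to the monthly level manufactures an apparent decline that the daily data does not contain:

```python
# Hypothetical daily usage (minutes) for one subscriber: a constant
# rate of about 3.3 minutes per day through month 7, then service
# stops abruptly in the middle of month 8. Months are 30 days here
# for simplicity.
daily_minutes = [100 / 30] * (7 * 30)          # months 1-7: steady use
daily_minutes += [100 / 30] * 15 + [0] * 15    # month 8: stops mid-month
daily_minutes += [0] * 30                      # month 9: no usage at all

def monthly_totals(daily, days_per_month=30):
    """Aggregate a daily series into monthly sums."""
    return [sum(daily[i:i + days_per_month])
            for i in range(0, len(daily), days_per_month)]

totals = monthly_totals(daily_minutes)

# Aggregated by month, month 8 shows about 50 minutes -- an apparent
# gradual "decline" -- even though the daily data shows constant
# usage followed by a complete, abrupt stop.
print([round(t) for t in totals])
```

The monthly view makes a trailing indicator (the customer already left mid-month) look like a leading one.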

[Chart: Minutes of Use by Tenure, months 1 through 11]
Figure 3.1 Does declining usage in month 8 predict attrition in month 9?

Figure 3.2 shows another example of confusion caused by aggregation. Sales
appear to be down in October compared to August and September. The
picture comes from a business that has sales activity only on days when the
financial markets are open. Because of the way that weekends and holidays fell in
2003, October had fewer trading days than August and September. That fact
alone accounts for the entire drop-off in sales.
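Normalizing by the number of trading days makes the artifact disappear. The sales figures and day counts below are illustrative, not actual 2003 data:

```python
# Hypothetical monthly sales totals and trading-day counts.
# (Both the figures and the day counts are invented for illustration.)
months = {
    "August":    {"sales": 2100, "trading_days": 21},
    "September": {"sales": 2100, "trading_days": 21},
    "October":   {"sales": 1900, "trading_days": 19},
}

# Raw monthly totals suggest October was a weak month...
for name, m in months.items():
    print(f"{name}: total sales {m['sales']}")

# ...but sales per trading day show activity was actually constant.
for name, m in months.items():
    per_day = m["sales"] / m["trading_days"]
    print(f"{name}: {per_day:.1f} per trading day")
```

In this toy example every month runs at exactly 100 sales per trading day; the October "drop-off" is entirely an artifact of the calendar.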
In the previous examples, aggregation led to confusion. Failure to aggregate
to the appropriate level can also lead to confusion. In one case, data provided
by a charitable organization showed an inverse correlation between donors'
likelihood to respond to solicitations and the size of their donations. Those
more likely to respond sent smaller checks. This counterintuitive finding is a
result of the large number of solicitations the charity sent out to its supporters
each year. Imagine two donors, each of whom plans to give $500 to the charity.
One responds to an offer in January by sending in the full $500 contribution
and tosses the rest of the solicitation letters in the trash. The other sends a $100
check in response to each of five solicitations. On their annual income tax
returns, both donors report having given $500, but when seen at the
individual campaign level, the second donor seems much more responsive. When
aggregated to the yearly level, the effect disappears.
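A small sketch with two hypothetical donors makes the point concrete. At the campaign level one donor looks five times as responsive; at the yearly level the two are identical:

```python
# Two hypothetical donors, each solicited ten times in a year and
# each giving $500 in total. Donor A answers one solicitation with
# the full amount; donor B answers five with $100 each.
donations = {
    "A": [500, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "B": [100, 0, 100, 0, 100, 0, 100, 0, 100, 0],
}

campaign_response = {}
yearly_total = {}
for donor, gifts in donations.items():
    # Campaign-level view: fraction of solicitations answered.
    campaign_response[donor] = sum(1 for g in gifts if g > 0) / len(gifts)
    # Yearly view: total amount given.
    yearly_total[donor] = sum(gifts)
    print(donor, campaign_response[donor], yearly_total[donor])

# At the campaign level, B looks far more responsive and sends
# smaller checks -- the inverse correlation from the text. Aggregated
# to the yearly level, the effect disappears entirely.
```

The "finding" that responsive donors give less is an artifact of analyzing at the campaign level when the business question is about yearly giving.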

Learning Things That Are True, but Not Useful
Although not as dangerous as learning things that aren't true, learning things
that aren't useful is more common.

[Chart: Sales by Month (2003), showing August, September, and October]
Figure 3.2 Did sales drop off in October?

Learning Things That Are Already Known
Data mining should provide new information. Many of the strongest patterns in
data represent things that are already known. People over retirement age tend
not to respond to offers for retirement savings plans. People who live where there
is no home delivery do not become newspaper subscribers. Even though they
may respond to subscription offers, service never starts. For the same reason,
people who live where there are no cell towers tend not to purchase cell phones.
Often, the strongest patterns reflect business rules. If data mining "discovers"
that people who have anonymous call blocking also have caller ID, it is
perhaps because anonymous call blocking is only sold as part of a bundle of
services that also includes caller ID. If there are no sales of certain products in
a particular location, it is possible that they are not offered there. We have seen
many such discoveries. Not only are these patterns uninteresting, but their
strength may also obscure less obvious patterns.
Learning things that are already known does serve one useful purpose. It
demonstrates that, on a technical level, the data mining effort is working and
the data is reasonably accurate. This can be quite comforting. If the data and
the data mining techniques applied to it are powerful enough to discover
things that are known to be true, it provides confidence that other discoveries
are also likely to be true. It is also true that data mining often uncovers things
that ought to have been known, but were not: that retired people do not
respond well to solicitations for retirement savings accounts, for instance.

Learning Things That Can't Be Used
It can also happen that data mining uncovers relationships that are both true
and previously unknown, but still hard to make use of. Sometimes the
problem is regulatory. A customer's wireless calling patterns may suggest an
affinity for certain land-line long-distance packages, but a company that
provides both services may not be allowed to take advantage of the fact.
Similarly, a customer's credit history may be predictive of future insurance
claims, but regulators may prohibit making underwriting decisions based on it.
Other times, data mining reveals that important outcomes are outside the
company™s control. A product may be more appropriate for some climates than
others, but it is hard to change the weather. Service may be worse in some
regions for reasons of topography, but that is also hard to change.

T I P Sometimes it is only a failure of imagination that makes new information
appear useless. A study of customer attrition is likely to show that the strongest
predictor of customers leaving is the way they were acquired. It is too late to
go back and change that for existing customers, but that does not make the
information useless. Future attrition can be reduced by changing the mix of
acquisition channels to favor those that bring in longer-lasting customers.

The data mining methodology is designed to steer clear of the Scylla of
learning things that aren't true and the Charybdis of not learning anything
useful. In a more positive light, the methodology is designed to ensure that the
data mining effort leads to a stable model that successfully addresses the
business problem it is designed to solve.

Hypothesis Testing
Hypothesis testing is the simplest approach to integrating data into a
company's decision-making processes. The purpose of hypothesis testing is
to substantiate or disprove preconceived ideas, and it is a part of almost all
data mining endeavors. Data miners often bounce back and forth between
approaches, first thinking up possible explanations for observed behavior
(often with the help of business experts) and letting those hypotheses
dictate the data to be analyzed, and then letting the data suggest new
hypotheses to test.
Hypothesis testing is what scientists and statisticians traditionally spend
their lives doing. A hypothesis is a proposed explanation whose validity can
be tested by analyzing data. Such data may simply be collected by observation
or generated through an experiment, such as a test mailing. Hypothesis testing
is at its most valuable when it reveals that the assumptions that have been
guiding a company's actions in the marketplace are incorrect. For example,
suppose that a company's advertising is based on a number of hypotheses
about the target market for a product or service and the nature of the
responses. It is worth testing whether these hypotheses are borne out by actual
responses. One approach is to use different call-in numbers in different ads
and record the number that each responder dials. Information collected during
the call can then be compared with the profile of the population the
advertisement was designed to reach.
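Whether two ads draw genuinely different response rates can be checked with a standard two-proportion z-test. The audience sizes and responder counts below are made up for illustration:

```python
import math

# Hypothetical responses from two ads carrying different call-in
# numbers: (responders, audience reached). All figures are invented.
ad_a = (120, 10_000)
ad_b = (90, 10_000)

def two_proportion_z(a, b):
    """z statistic for the difference between two response rates."""
    (xa, na), (xb, nb) = a, b
    pa, pb = xa / na, xb / nb
    # Pool the two samples to estimate the common rate under the
    # null hypothesis that the ads perform identically.
    pooled = (xa + xb) / (na + nb)
    se = math.sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
    return (pa - pb) / se

z = two_proportion_z(ad_a, ad_b)
# |z| > 1.96 suggests the response rates differ at the 5% level.
print(round(z, 2))
```

With these made-up counts the difference clears the 5 percent threshold, so the hypothesis that the two ads pull equally well would be rejected.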

T I P Each time a company solicits a response from its customers, whether
through advertising or a more direct form of communication, it has an
opportunity to gather information. Slight changes in the design of the
communication, such as including a way to identify the channel when a
prospect responds, can greatly increase the value of the data collected.

By its nature, hypothesis testing is ad hoc, so the term “methodology” might
be a bit strong. However, there are some identifiable steps to the process, the
first and most important of which is generating good ideas to test.

