


Medical insurance claims data

Web log data

E-commerce server application logs

Direct mail response records

Call-center records, including memos written by the call-center reps

Printing press run records
Data Mining Methodology and Best Practices 61

Motor vehicle registration records

Noise level in decibels from microphones placed in communities near an airport

Telephone call detail records

Survey response data

Demographic and lifestyle data

Economic data

Hourly weather readings (wind direction, wind strength, precipitation)

Census data

Once the business problem has been formulated, it is possible to form a wish
list of data that would be nice to have. For a study of existing customers, this
should include data from the time they were acquired (acquisition channel,
acquisition date, original product mix, original credit score, and so on), similar
data describing their current status, and behavioral data accumulated during
their tenure. Of course, it may not be possible to find everything on the wish
list, but it is better to start out with an idea of what you would like to find.
Occasionally, a data mining effort starts without a specific business problem. A company becomes aware that it is not getting good value from the data it collects, and sets out to determine whether the data could be made more useful through data mining. The trick to making such a project successful is to turn it into a project designed to solve a specific problem. The first step is to explore the available data and make a list of candidate business problems. Invite business users to create a lengthy wish list, which can then be reduced to a small number of achievable goals: the data mining problem.

What Is Available?
The first place to look for data is in the corporate data warehouse. Data in the warehouse has already been cleaned and verified and brought together from multiple sources. A single data model hopefully ensures that similarly named fields have the same meaning and compatible data types throughout the database. The corporate data warehouse is a historical repository; new data is appended, but the historical data is never changed. Since it was designed for decision support, the data warehouse provides detailed data that can be aggregated to the right level for data mining. Chapter 15 goes into more detail about the relationship between data mining and data warehousing.

The only problem is that in many organizations such a data warehouse does not actually exist, or one or more data warehouses exist but don't live up to the promises. That being the case, data miners must seek data from various departmental databases and from within the bowels of operational systems.

These operational systems are designed to perform a certain task such as claims processing, call switching, order entry, or billing. They are designed with the primary goal of processing transactions quickly and accurately. The data is in whatever format best suits that goal, and the historical record, if any, is likely to be in a tape archive. It may require significant political and programming effort to get the data in a form useful for knowledge discovery.

In some cases, operational procedures have to be changed in order to supply data. We know of one major catalog retailer that wanted to analyze the buying habits of its customers so as to market differentially to new customers and long-standing customers. Unfortunately, anyone who hadn't ordered anything in the past six months was routinely purged from the records. The substantial population of people who loyally used the catalog for Christmas shopping, but not during the rest of the year, went unrecognized, and indeed were unrecognizable, until the company began keeping historical customer records.

In many companies, determining what data is available is surprisingly difficult. Documentation is often missing or out of date. Typically, there is no one person who can provide all the answers. Determining what is available requires looking through data dictionaries, interviewing users and database administrators, and examining existing reports.

WARNING Use database documentation and data dictionaries as a guide
but do not accept them as unalterable fact. The fact that a field is defined in a
table or mentioned in a document does not mean the field exists, is actually
available for all customers, and is correctly loaded.

How Much Data Is Enough?
Unfortunately, there is no simple answer to this question. The answer depends on the particular algorithms employed, the complexity of the data, and the relative frequency of possible outcomes. Statisticians have spent years developing tests for determining the smallest model set that can be used to produce a model. Machine learning researchers have spent much time and energy devising ways to let parts of the training set be reused for validation and test. All of this work ignores an important point: in the commercial world, statisticians are scarce, and data is anything but.

In any case, where data is scarce, data mining is not only less effective, it is less likely to be useful. Data mining is most useful when the sheer volume of data obscures patterns that might be detectable in smaller databases. Therefore, our advice is to use so much data that the questions about what constitutes an adequate sample size simply do not arise. We generally start with tens of thousands, if not millions, of preclassified records so that the training, validation, and test sets each contain many thousands of records.
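As a concrete sketch, partitioning a file of preclassified records into training, validation, and test sets might look like the following; the record layout, field names, and split fractions here are illustrative assumptions, not something prescribed by the text:

```python
import random

def split_model_set(records, val_frac=0.2, test_frac=0.2, seed=42):
    """Randomly partition preclassified records into training,
    validation, and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    training = shuffled[n_test + n_val:]
    return training, validation, test

# 10,000 made-up preclassified records
records = [{"id": i, "outcome": i % 2} for i in range(10000)]
train, val, test = split_model_set(records)
print(len(train), len(val), len(test))  # 6000 2000 2000
```

Shuffling before splitting matters: operational extracts are often sorted by date or account number, and an unshuffled split would put systematically different customers into each set.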


In data mining, more is better, but with some caveats. The first caveat has to do with the relationship between the size of the model set and its density. Density refers to the prevalence of the outcome of interest. Often the target variable represents something relatively rare. It is rare for prospects to respond to a direct mail offer. It is rare for credit card holders to commit fraud. In any given month, it is rare for newspaper subscribers to cancel their subscriptions. As discussed later in this chapter (in the section on creating the model set), it is desirable for the model set to be balanced, with equal numbers of each of the outcomes during the model-building process. A smaller, balanced sample is preferable to a larger one with a very low proportion of rare outcomes.
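A minimal sketch of this kind of balancing, assuming a binary target; the field name `response` and the 2 percent response rate are invented for illustration:

```python
import random

def balance_model_set(records, target="response", seed=1):
    """Downsample the common outcome so that both outcomes appear
    in equal numbers in the model set (binary target assumed)."""
    rng = random.Random(seed)
    positives = [r for r in records if r[target] == 1]
    negatives = [r for r in records if r[target] == 0]
    rare, common = sorted([positives, negatives], key=len)
    # Keep every rare example; sample the common class down to match
    balanced = rare + rng.sample(common, len(rare))
    rng.shuffle(balanced)
    return balanced

# A made-up 2 percent response rate: 100 responders among 5,000 prospects
records = [{"id": i, "response": 1 if i % 50 == 0 else 0} for i in range(5000)]
balanced = balance_model_set(records)
print(len(balanced), sum(r["response"] for r in balanced))  # 200 100
```

Note that all of the rare responders are kept and only the plentiful non-responders are sampled, which is why a smaller balanced set can beat a larger unbalanced one.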
The second caveat has to do with the data miner's time. When the model set is large enough to build good, stable models, making it larger is counterproductive because everything will take longer to run on the larger dataset. Since data mining is an iterative process, the time spent waiting for results can become very large if each run of a modeling routine takes hours instead of minutes.
A simple test for whether the sample used for modeling is large enough is to try doubling it and measure the improvement in the model's accuracy. If the model created using the larger sample is significantly better than the one created using the smaller sample, then the smaller sample is not big enough. If there is no improvement, or only a slight improvement, then the original sample is probably adequate.

How Much History Is Required?
Data mining uses data from the past to make predictions about the future. But how far in the past should the data come from? This is another simple question without a simple answer. The first thing to consider is seasonality. Most businesses display some degree of seasonality. Sales go up in the fourth quarter. Leisure travel goes up in the summer. There should be enough historical data to capture periodic events of this sort.

On the other hand, data from too far in the past may not be useful for mining because of changing market conditions. This is especially true when some external event, such as a change in the regulatory regime, has intervened. For many customer-focused applications, 2 to 3 years of history is appropriate. However, even in such cases, data about the beginning of the customer relationship often proves very valuable: what was the initial channel, what was the initial offer, how did the customer initially pay, and so on.

How Many Variables?
Inexperienced data miners are sometimes in too much of a hurry to throw out
variables that seem unlikely to be interesting, keeping only a few carefully
chosen variables they expect to be important. The data mining approach calls
for letting the data itself reveal what is and is not important.

Often, variables that had previously been ignored turn out to have predictive value when used in combination with other variables. For example, one credit card issuer that had never included data on cash advances in its customer profitability models discovered through data mining that people who use cash advances only in November and December are highly profitable. Presumably, these are people who are prudent enough to avoid borrowing money at high interest rates most of the time (a prudence that makes them less likely to default than habitual users of cash advances) but who need some extra cash for the holidays and are willing to pay exorbitant interest to get it.

It is true that a final model is usually based on just a few variables. But these few variables are often derived by combining several other variables, and it may not have been obvious at the beginning which ones would end up being important.

What Must the Data Contain?
At a minimum, the data must contain examples of all possible outcomes of interest. In directed data mining, where the goal is to predict the value of a particular target variable, it is crucial to have a model set composed of preclassified data. To distinguish people who are likely to default on a loan from people who are not, there need to be thousands of examples from each class to build a model that distinguishes one from the other. When a new applicant comes along, his or her application is compared with those of past customers, either directly, as in memory-based reasoning, or indirectly through rules or neural networks derived from the historical data. If the new application “looks like” those of people who defaulted in the past, it will be rejected.

Implicit in this description is the idea that it is possible to know what happened in the past. To learn from our mistakes, we first have to recognize that we have made them. This is not always possible. One company had to give up on an attempt to use directed knowledge discovery to build a warranty claims fraud model because, although they suspected that some claims might be fraudulent, they had no idea which ones. Without a training set containing warranty claims clearly marked as fraudulent or legitimate, it was impossible to apply these techniques. Another company wanted a direct mail response model built, but could only supply data on people who had responded to past campaigns. They had not kept any information on people who had not responded, so there was no basis for comparison.

Step Three: Get to Know the Data
It is hard to overstate the importance of spending time exploring the data before rushing into building models. Because of its importance, Chapter 17 is devoted to this topic in detail. Good data miners seem to rely heavily on intuition: somehow being able to guess what a good derived variable to try might be, for instance. The only way to develop intuition for what is going on in an unfamiliar dataset is to immerse yourself in it. Along the way, you are likely to discover many data quality problems and be inspired to ask many questions that would not otherwise have come up.

Examine Distributions
A good first step is to examine a histogram of each variable in the dataset and
think about what it is telling you. Make note of anything that seems surpris­
ing. If there is a state code variable, is California the tallest bar? If not, why not?
Are some states missing? If so, does it seem plausible that this company does
not do business in those states? If there is a gender variable, are there similar
numbers of men and women? If not, is that unexpected? Pay attention to the
range of each variable. Do variables that should be counts take on negative
values? Do the highest and lowest values sound like reasonable values for that
variable to take on? Is the mean much different from the median? How many
missing values are there? Have the variable counts been consistent over time?
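These checks are easy to automate. The sketch below, using only the Python standard library, computes the ones mentioned above for a single numeric variable; the variable name and values are invented for illustration:

```python
import statistics

def profile_numeric(name, values):
    """Basic sanity checks for one numeric variable: missing values,
    range, and mean versus median (a large gap suggests skew or outliers)."""
    present = [v for v in values if v is not None]
    return {
        "variable": name,
        "n_missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": round(statistics.mean(present), 2),
        "median": statistics.median(present),
        "has_negatives": any(v < 0 for v in present),  # a bad sign for a count
    }

# A hypothetical count variable with one missing value and a suspicious outlier
calls = [0, 1, 2, 2, 3, None, 250, 4]
report = profile_numeric("calls_per_month", calls)
print(report)
```

Here the mean (about 37.4) sits far above the median (2), exactly the kind of mean-versus-median gap the text suggests investigating; a single value of 250 is doing all the work.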

TIP As soon as you get your hands on a data file from a new source, it is a good idea to profile the data to understand what is going on, including getting counts and summary statistics for each field, counts of the number of distinct values taken on by categorical variables, and, where appropriate, cross-tabulations such as sales by product by region. In addition to providing insight into the data, the profiling exercise is likely to raise warning flags about inconsistencies or definitional problems that could destroy the usefulness of later analysis.
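A minimal profiling sketch along these lines covers distinct-value counts for a categorical field and a sales-by-product-by-region cross-tabulation; all field names and figures are hypothetical:

```python
from collections import Counter, defaultdict

def profile_categorical(records, field):
    """Count the distinct values taken on by a categorical field."""
    return Counter(r[field] for r in records)

def crosstab(records, row_field, col_field, value_field):
    """A simple cross-tabulation, e.g. sales by product by region."""
    table = defaultdict(float)
    for r in records:
        table[(r[row_field], r[col_field])] += r[value_field]
    return dict(table)

sales = [
    {"product": "widget", "region": "East", "amount": 100.0},
    {"product": "widget", "region": "West", "amount": 40.0},
    {"product": "gadget", "region": "East", "amount": 75.0},
    {"product": "widget", "region": "East", "amount": 25.0},
]
print(profile_categorical(sales, "product"))  # Counter({'widget': 3, 'gadget': 1})
print(crosstab(sales, "product", "region", "amount"))
```

A surprise at this stage, such as a region that should exist but never appears, or a product whose sales are implausibly zero, is precisely the kind of warning flag the tip describes.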

Data visualization tools can be very helpful during the initial exploration of
a database. Figure 3.6 shows some data from the 2000 census of the state of
New York. (This dataset may be downloaded from the companion Web site at
www.data-miners.com/companion where you will also find suggested exer­
cises that make use of it.) The red bars indicate the proportion of towns in the
county where more than 15 percent of homes are heated by wood. (In New

