

Data Mining Methodology and Best Practices 71

It may not be immediately obvious why some recent data (from the not-so-distant past) is not used in a particular customer signature. The answer is that
when the model is applied in the present, no data from the present is available
as input. The diagram in Figure 3.8 makes this clearer.
If a model were built using data from June (the not-so-distant past) in order
to predict July (the recent past), then it could not be used to predict September
until August data was available. But when is August data available? Certainly
not in August, since it is still being created. Chances are, not in the first week
of September either, since it has to be collected and cleaned and loaded and
tested and blessed. In many companies, the August data will not be available
until mid-September or even October, by which point nobody will care about
predictions for September. The solution is to include a month of latency in the
model set.
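The arithmetic of latency can be sketched in a few lines of code. The function below is illustrative only (the names are ours, not the book's): it works out the most recent month whose data is actually usable as model input when scoring in a given month.

```python
from datetime import date

def shift_month(d, delta):
    """Move a first-of-month date forward or backward by delta months."""
    m = d.month - 1 + delta
    return date(d.year + m // 12, m % 12 + 1, 1)

def latest_usable_input_month(score_month, latency=1):
    """When scoring in score_month, the previous month's data is still
    being collected, cleaned, loaded, and tested; with one month of
    latency, the most recent usable inputs are two months back."""
    return shift_month(score_month, -(latency + 1))
```

For example, scoring predictions for September with one month of latency means July is the most recent month of input data, which is exactly why the model set must include that month of latency.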

Partitioning the Model Set
Once the preclassified data has been obtained from the appropriate time-
frames, the methodology calls for dividing it into three parts. The first part, the
training set, is used to build the initial model. The second part, the validation
set, is used to adjust the initial model to make it more general and less tied to
the idiosyncrasies of the training set. The third part, the test set, is used to
gauge the likely effectiveness of the model when applied to unseen data. Three
sets are necessary because once data has been used for one step in the process,
it can no longer be used for the next step: the information it contains
has already become part of the model, so it cannot be used to correct or
judge the model.

Figure 3.8 Time when the model is built compared to time when the model is used.

People often find it hard to understand why the training set and validation
set are “tainted” once they have been used to build a model. An analogy may
help: Imagine yourself back in the fifth grade. The class is taking a spelling
test. Suppose that, at the end of the test period, the teacher asks you to estimate
your own grade on the quiz by marking the words you got wrong. You will
give yourself a very good grade, but your spelling will not improve. If, at the
beginning of the period, you thought there should be an 'e' at the end of
“tomato,” nothing will have happened to change your mind when you
grade your paper. No new information has entered the system. You need a validation set!
Now, imagine that at the end of the test the teacher allows you to look at the
papers of several neighbors before grading your own. If they all agree that
“tomato” has no final 'e', you may decide to mark your own answer wrong. If
the teacher gives the same quiz tomorrow, you will do better. But how much
better? If you use the papers of the very same neighbors to evaluate your
performance tomorrow, you may still be fooling yourself. If they all agree that
“potatoes” has no more need of an 'e' than “tomato,” and you have changed
your own guess to agree with theirs, then you will overestimate your actual
grade on the second quiz as well. That is why the test set should be different
from the validation set.

For predictive models, the test set should also come from a different time
period than the training and validation sets. The proof of a model's stability is
in its ability to perform well month after month. A test set from a different time
period, often called an out-of-time test set, is a good way to verify model stability, although such a test set is not always available.
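As a concrete sketch of the random three-way split described above (the 60/20/20 fractions here are illustrative, not a prescription from the text), each preclassified record lands in exactly one of the three sets:

```python
import random

def partition(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Randomly split preclassified rows into training, validation,
    and test sets. A fixed seed makes the split reproducible."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])
```

For an out-of-time test set, the test portion would instead be drawn from a different time period rather than from this random shuffle.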

Step Five: Fix Problems with the Data
All data is dirty. All data has problems. What is or isn't a problem varies with
the data mining technique. For some techniques, such as decision trees, missing
values and outliers do not cause much trouble. For others, such as neural
networks, they cause all sorts of trouble. For that reason, some of what we have
to say about fixing problems with data can be found in the chapters on the
techniques where they cause the most difficulty; the rest can be found in
Chapter 17.
The next few sections talk about some of the common problems that need to
be fixed.


Categorical Variables with Too Many Values
Variables such as zip code, county, telephone handset model, and occupation
code are all examples of variables that convey useful information, but not in a
way that most data mining algorithms can handle. The problem is that while
where a person lives and what he or she does for work are important predictors,
the variables that carry this information have so many possible values, and
your data has so few examples for most of those values, that variables such as
zip code and occupation end up being thrown away along with their valuable
information content.
Variables like these must either be grouped so that many classes that all
have approximately the same relationship to the target variable are grouped
together, or they must be replaced by informative attributes of the zip code,
handset model, or occupation. Replace zip codes with the zip code's median
home price, population density, historical response rate, or whatever else
seems likely to be predictive. Replace occupation with the median salary for
that occupation. And so on.
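One way to implement the response-rate replacement is sketched below (a toy version; the minimum-count threshold and fallback to the overall rate are our assumptions, added because rare zip codes have too few examples to estimate a rate reliably):

```python
from collections import defaultdict

def response_rate_by_zip(records, min_count=50):
    """Map each zip code to its historical response rate; zip codes
    with fewer than min_count examples fall back to the overall rate."""
    counts = defaultdict(lambda: [0, 0])  # zip -> [responses, total]
    for zip_code, responded in records:
        counts[zip_code][0] += int(responded)
        counts[zip_code][1] += 1
    overall = (sum(r for r, _ in counts.values())
               / sum(n for _, n in counts.values()))
    return {z: (r / n if n >= min_count else overall)
            for z, (r, n) in counts.items()}
```

The resulting numeric rate can then stand in for the raw zip code as a model input.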

Numeric Variables with Skewed
Distributions and Outliers
Skewed distributions and outliers cause problems for any data mining tech­
nique that uses the values arithmetically (by multiplying them by weights and
adding them together, for instance). In many cases, it makes sense to discard
records that have outliers. In other cases, it is better to divide the values into
equal-sized ranges, such as deciles. Sometimes, the best approach is to
transform such variables to reduce the range of values by replacing each value
with its logarithm, for instance.
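Both remedies are easy to sketch in plain Python (the rank-based decile assignment below is one simple choice among several):

```python
import math

def log_transform(values):
    """Compress a right-skewed, positive variable with log10."""
    return [math.log10(v) for v in values]

def decile(values):
    """Assign each value to an equal-sized range (0-9) by rank,
    so outliers land in the top bin instead of stretching the scale."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(9, rank * 10 // len(values))
    return bins
```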

Missing Values
Some data mining algorithms are capable of treating “missing” as a value and
incorporating it into rules. Others, unfortunately, cannot handle missing
values. None of the obvious solutions preserve the true distribution of the
variable. Throwing out all records with missing values introduces bias because
it is unlikely that such records are distributed randomly. Replacing the
missing value with some likely value such as the mean or the most common value
adds spurious information. Replacing the missing value with an unlikely value
is even worse, since the data mining algorithms will not recognize that 999,
say, is an unlikely value for age. The algorithms will go ahead and use it.

When missing values must be replaced, the best approach is to impute them
by creating a model that has the missing value as its target variable.
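A toy version of model-based imputation, assuming a single numeric predictor and a linear relationship (real imputation models would use many inputs and whatever technique fits the data), fits the model on complete records and predicts the rest:

```python
def impute_with_model(x, y):
    """Fill missing y values (None) by fitting simple linear regression
    y ~ x on the complete records and predicting the missing ones."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
             / sum((xi - mean_x) ** 2 for xi, _ in pairs))
    intercept = mean_y - slope * mean_x
    return [yi if yi is not None else intercept + slope * xi
            for xi, yi in zip(x, y)]
```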

Values with Meanings That Change over Time
When data comes from several different points in history, it is not uncommon
for the same value in the same field to have changed its meaning over time.
Credit class “A” may always be the best, but the exact range of credit scores
that get classed as an “A” may change from time to time. Dealing with this
properly requires a well-designed data warehouse where such changes in
meaning are recorded so a new variable can be defined that has a constant
meaning over time.
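With such a warehouse in place, deriving a constant-meaning variable is a matter of looking up the definition in effect on each record's date. The cutoff dates and scores below are made-up examples, not real credit policy:

```python
from datetime import date

# Hypothetical history of the minimum score classed as "A".
A_CUTOFFS = [
    (date(2000, 1, 1), 700),
    (date(2002, 1, 1), 720),  # definition tightened here
]

def min_score_for_class_a(as_of):
    """Return the class-A cutoff in effect on a given date, so a
    variable with constant meaning can replace the raw class code."""
    cutoff = A_CUTOFFS[0][1]
    for start, score in A_CUTOFFS:
        if as_of >= start:
            cutoff = score
    return cutoff
```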

Inconsistent Data Encoding
When information on the same topic is collected from multiple sources, the
various sources often represent the same data in different ways. If these
differences are not caught, they add spurious distinctions that can lead to
erroneous conclusions. In one call-detail analysis project, each of the markets
studied had a different way of indicating a call to check one's own voice mail.
In one city, a
call to voice mail from the phone line associated with that mailbox was
recorded as having the same origin and destination numbers. In another city,
the same situation was represented by the presence of a specific nonexistent
number as the call destination. In yet another city, the actual number dialed to
reach voice mail was recorded. Understanding apparent differences in voice
mail habits between cities required putting the data in a common form.
The same data set contained multiple abbreviations for some states and, in
some cases, a particular city was counted separately from the rest of the state.
If issues like this are not resolved, you may find yourself building a model of
calling patterns to California based on data that excludes calls to Los Angeles.
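Putting such data in a common form usually means one normalized flag computed by per-market rules. The rules below only illustrate the three conventions described above; the actual numbers are invented:

```python
# Each market's convention for flagging a voice mail check call
# (illustrative rules; the specific numbers are made up).
RULES = {
    "city_a": lambda r: r["origin"] == r["destination"],
    "city_b": lambda r: r["destination"] == "0000000000",  # reserved fake number
    "city_c": lambda r: r["destination"] == "5551234",     # actual voice mail number
}

def is_voicemail_call(record, market_rules):
    """Apply the record's own market convention to produce one
    common-form flag across all markets."""
    return market_rules[record["market"]](record)
```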

Step Six: Transform Data to Bring
Information to the Surface
Once the data has been assembled and major data problems fixed, the data
must still be prepared for analysis. This involves adding derived fields to
bring information to the surface. It may also involve removing outliers,
binning numeric variables, grouping classes for categorical variables, applying
transformations such as logarithms, turning counts into proportions, and the
like. Data preparation is such an important topic that our colleague Dorian
Pyle has written a book about it, Data Preparation for Data Mining (Morgan
Kaufmann 1999), which should be on the bookshelf of every data miner. In this
book, these issues are addressed in Chapter 17. Here are a few examples of
such transformations.

Capture Trends
Most corporate data contains time series: monthly snapshots of billing
information, usage, contacts, and so on. Most data mining algorithms do not
understand time series data. Signals such as “three months of declining
revenue” cannot be spotted by treating each month's observation independently.
It is up to the data miner to bring trend information to the surface by adding
derived variables, such as the ratio of spending in the most recent month to
spending the month before for a short-term trend, and the ratio of the most
recent month to the same month a year ago for a long-term trend.
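Those two derived variables can be computed directly from a customer's monthly series (this sketch assumes at least 13 months of history, oldest first):

```python
def trend_ratios(monthly_spend):
    """Given at least 13 monthly values, oldest first, return
    (short_term, long_term): last month vs. the month before,
    and last month vs. the same month a year earlier."""
    short_term = monthly_spend[-1] / monthly_spend[-2]
    long_term = monthly_spend[-1] / monthly_spend[-13]
    return short_term, long_term
```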

Create Ratios and Other Combinations of Variables
Trends are one example of bringing information to the surface by combining
multiple variables. There are many others. Often, these additional fields are
derived from the existing ones in ways that might be obvious to a knowledgeable
analyst, but are unlikely to be considered by mere software. Typical examples
include:

obesity_index = height^2 / weight

PE = price / earnings

pop_density = population / area

rpm = revenue_passengers * miles

Adding fields that represent relationships considered important by experts
in the field is a way of letting the mining process benefit from that expertise.
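A sketch of adding such derived fields to a record (the input field names here are assumptions for illustration):

```python
def add_derived_fields(record):
    """Return a copy of the record with expert-suggested ratio
    fields added alongside the raw values."""
    r = dict(record)
    r["pe_ratio"] = r["price"] / r["earnings"]
    r["pop_density"] = r["population"] / r["area"]
    return r
```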

Convert Counts to Proportions
Many datasets contain counts or dollar values that are not particularly
interesting in themselves because they vary according to some other value.
Larger households spend more money on groceries than smaller households. They
spend more money on produce, more money on meat, more money on packaged goods,
more money on cleaning products, more money on everything. So comparing the
dollar amount spent by different households in any one
category, such as bakery, will only reveal that large households spend more. It
is much more interesting to compare the proportion of each household's spending
that goes to each category.
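The conversion itself is a one-liner per household: divide each category's amount by the household total.

```python
def spending_proportions(spend_by_category):
    """Convert dollar amounts per category into each category's
    share of the household's total spending."""
    total = sum(spend_by_category.values())
    return {cat: amount / total for cat, amount in spend_by_category.items()}
```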
The value of converting counts to proportions can be seen by comparing
two charts based on the NY State towns dataset. Figure 3.9 compares the count
of houses with bad plumbing to the prevalence of heating with wood. A
relationship is visible, but it is not strong. In Figure 3.10, where the count
of houses with bad plumbing has been converted into the proportion of houses
with bad plumbing, the relationship is much stronger. Towns where many houses
have bad plumbing also have many houses heated by wood. Does this mean that
wood smoke destroys plumbing? No; it is important to remember that the patterns
we find show correlation, not causation.

Figure 3.9 Chart comparing count of houses with bad plumbing to prevalence of heating
with wood.
Data Mining Methodology and Best Practices 77

Figure 3.10 Chart comparing proportion of houses with bad plumbing to prevalence of
heating with wood.

