York, a town is a subdivision of a county that may or may not include any incorporated villages or cities. For instance, the town of Cortlandt is in Westchester County and includes the village of Croton-on-Hudson, whereas the city of Cortland is in Cortland County, in another part of the state.) The picture, generated by software from Quadstone, shows at a glance that wood-burning stoves are not much used to heat homes in the urbanized counties close to New York City, but are popular in rural areas upstate.
66 Chapter 3




Figure 3.6 Prevalence of wood as the primary source of heat varies by county in New York

Compare Values with Descriptions
Look at the values of each variable and compare them with the description
given for that variable in available documentation. This exercise often reveals
that the descriptions are inaccurate or incomplete. In one dataset of grocery purchases, a variable that was labeled as being an item count had many noninteger values. Upon further investigation, it turned out that the field contained an item count for products sold by the item, but a weight for items sold by weight. Another dataset, this one from a retail catalog company, included a field that was described as containing total spending over several quarters. This field was mysteriously capable of predicting the target variable: whether a customer had placed an order from a particular catalog mailing. Everyone who had not placed an order had a zero value in the mystery field. Everyone who had placed an order had a number greater than zero in the field. We surmise that the field actually contained the value of the customer's order from the mailing in question. In any case, it certainly did not contain the documented value.
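A quick programmatic check makes this kind of mismatch easy to spot. The sketch below is plain Python; the field name "quantity" and the sample values are hypothetical stand-ins for a field that the data dictionary describes as an integer item count:

```python
# Sketch: compare a field's values against its documented type.
# The (hypothetical) data dictionary says "quantity" is an integer
# item count; the sample values below are made up for illustration.

def noninteger_share(values):
    """Fraction of non-null values that are not whole numbers."""
    nums = [v for v in values if v is not None]
    return sum(1 for v in nums if v % 1 != 0) / len(nums)

quantities = [1, 2, 0.454, 3, 1.13, None]  # weights mixed in with counts
share = noninteger_share(quantities)
print(f"{share:.0%} of 'quantity' values are noninteger")  # 40%
```

A share well above zero is a cue to go back to the data suppliers and ask what the field really contains.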
Data Mining Methodology and Best Practices 67

Validate Assumptions
Using simple cross-tabulation and visualization tools such as scatter plots, bar
graphs, and maps, validate assumptions about the data. Look at the target
variable in relation to various other variables to see such things as response by
channel or churn rate by market or income by sex. Where possible, try to
match reported summary numbers by reconstructing them directly from the
base-level data. For example, if reported monthly churn is 2 percent, count up the number of customers that cancel one month and see if it is around 2 percent of the total.

T I P Trying to recreate reported aggregate numbers from the detail data that
supposedly goes into them is an instructive exercise. In trying to explain the
discrepancies, you are likely to learn much about the operational processes and
business rules behind the reported numbers.
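As a sketch of this exercise, the plain-Python function below recomputes monthly churn from detail records so it can be compared against the reported figure. The record layout is hypothetical: one dict per customer, with `cancel_date` set only for customers who have canceled.

```python
# Sketch: recompute a reported aggregate (monthly churn) from detail data.
from datetime import date

def monthly_churn(customers, year, month):
    """Cancellations in the given month as a share of all customers."""
    cancels = sum(
        1 for c in customers
        if c["cancel_date"] is not None
        and (c["cancel_date"].year, c["cancel_date"].month) == (year, month)
    )
    return cancels / len(customers)

# Hypothetical detail data: 50 customers, one March cancellation.
customers = [{"id": i, "cancel_date": None} for i in range(49)]
customers.append({"id": 49, "cancel_date": date(2004, 3, 15)})
print(f"March churn: {monthly_churn(customers, 2004, 3):.1%}")  # 2.0%
```

When the recomputed number does not match the reported one, the discrepancy itself is informative: perhaps the official figure uses only customers active at the start of the month as its base, or excludes certain cancellation codes.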

Ask Lots of Questions
Wherever the data does not seem to bear out received wisdom or your own expectations, make a note of it. An important output of the data exploration process is a list of questions for the people who supplied the data. Often these questions will require further research because few users look at data as carefully as data miners do. Examples of the kinds of questions that are likely to come out of the preliminary exploration are:
Why are no auto insurance policies sold in New Jersey or Massachusetts?
Why were some customers active for 31 days in February, but none were active for more than 28 days in January?
Why were so many customers born in 1911? Are they really that old?
Why are there no examples of repeat purchasers?
What does it mean when the contract begin date is after the contract end date?
Why are there negative numbers in the sale price field?
How can active customers have a non-null value in the cancelation reason code field?
These are all real questions we have had occasion to ask about real data.
Sometimes the answers taught us things we hadn't known about the client's
industry. New Jersey and Massachusetts do not allow automobile insurers
much flexibility in setting rates, so a company that sees its main competitive

advantage as smarter pricing does not want to operate in those markets. Other
times we learned about idiosyncrasies of the operational systems, such as the
data entry screen that insisted on a birth date even when none was known,
which led to a lot of people being assigned the birthday November 11, 1911
because 11/11/11 is the date you get by holding down the “1” key and letting
it auto-repeat until the field is full (and no other keys work to fill in valid
dates). Sometimes we discovered serious problems with the data such as the
data for February being misidentified as January. And in the last instance, we
learned that the process extracting the data had bugs.
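Checks like the ones behind these questions can be automated so that each new extract generates its own question list. The sketch below is plain Python; the field names and sample records are hypothetical:

```python
# Sketch: automated sanity checks that generate questions for data suppliers.
from datetime import date

def sanity_questions(rows):
    """Scan records and return (id, issue) pairs worth asking about."""
    issues = []
    for r in rows:
        if r.get("birth_date") == date(1911, 11, 11):
            issues.append((r["id"], "default-looking birth date 11/11/11"))
        if r.get("sale_price") is not None and r["sale_price"] < 0:
            issues.append((r["id"], "negative sale price"))
        if r.get("begin_date") and r.get("end_date") and r["begin_date"] > r["end_date"]:
            issues.append((r["id"], "contract begins after it ends"))
        if r.get("active") and r.get("cancelation_reason") is not None:
            issues.append((r["id"], "active customer with a cancelation reason"))
    return issues

rows = [
    {"id": 1, "birth_date": date(1911, 11, 11), "sale_price": 19.95},
    {"id": 2, "sale_price": -5.00, "active": True, "cancelation_reason": "03"},
]
for cust_id, issue in sanity_questions(rows):
    print(cust_id, issue)
```

The point is not that the code decides anything; it only surfaces the anomalies so a person can take them back to whoever owns the operational systems.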

Step Four: Create a Model Set
The model set contains all the data that is used in the modeling process. Some
of the data in the model set is used to find patterns. Some of the data in the
model set is used to verify that the model is stable. Some is used to assess
the model's performance. Creating a model set requires assembling data from multiple sources to form customer signatures and then preparing the data for mining.
Assembling Customer Signatures
The model set is a table or collection of tables with one row per item to be studied, and fields for everything known about that item that could be useful for modeling. When the data describes customers, the rows of the model set are often called customer signatures. Assembling the customer signatures from relational databases often requires complex queries to join data from many tables, and then augmenting the result with data from other sources.
Part of the data assembly process is getting all data to be at the correct level
of summarization so there is one value per customer, rather than one value per
transaction or one value per zip code. These issues are discussed in Chapter 17.
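A minimal sketch of that summarization step, assuming a hypothetical transaction feed of (customer, amount) pairs, rolls the detail up to one signature row per customer:

```python
# Sketch: summarize transaction detail into one signature row per customer.
from collections import defaultdict

def build_signatures(transactions):
    """transactions: iterable of (customer_id, amount) pairs."""
    sigs = defaultdict(lambda: {"n_transactions": 0, "total_spend": 0.0})
    for customer_id, amount in transactions:
        row = sigs[customer_id]
        row["n_transactions"] += 1      # one value per customer, not per transaction
        row["total_spend"] += amount
    return dict(sigs)

transactions = [("cust_a", 12.50), ("cust_a", 3.25), ("cust_b", 40.00)]
signatures = build_signatures(transactions)
print(signatures["cust_a"])  # {'n_transactions': 2, 'total_spend': 15.75}
```

In practice the same roll-up is usually expressed as a SQL GROUP BY over the transaction table; the choice of which aggregates to keep (counts, totals, recency, and so on) is where most of the design effort goes.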

Creating a Balanced Sample
Very often, the data mining task involves learning to distinguish between groups such as responders and nonresponders, goods and bads, or members of different customer segments. As explained in the sidebar, data mining algorithms do best when these groups have roughly the same number of members. This is unlikely to occur naturally. In fact, it is usually the more interesting groups that are underrepresented.
Before modeling, the dataset should be made balanced, either by sampling from the different groups at different rates or by adding a weighting factor so that the members of the most popular groups are not weighted as heavily as members of the smaller ones.


In standard statistical analysis, it is common practice to throw out outliers: observations that are far outside the normal range. In data mining, however, these outliers may be just what we are looking for. Perhaps they represent fraud, some sort of error in our business procedures, or some fabulously profitable niche market. In these cases, we don't want to throw out the outliers; we want to get to know and understand them!
The problem is that knowledge discovery algorithms learn by example. If
there are not enough examples of a particular class or pattern of behavior, the
data mining tools will not be able to come up with a model for predicting it. In
this situation, we may be able to improve our chances by artificially enriching
the training data with examples of the rare event.

(Figure: two panels, Stratified Sampling and Weights, illustrating the two ways of balancing a sample: drawing a subset of records from the full population versus keeping every record and assigning it a weight.)
When an outcome is rare, there are two ways to create a balanced sample.

For example, a bank might want to build a model of who is a likely prospect
for a private banking program. These programs appeal only to the very
wealthiest clients, few of whom are represented in even a fairly large sample of
bank customers. To build a model capable of spotting these fortunate
individuals, we might create a training set of checking transaction histories of a
population that includes 50 percent private banking clients even though they
represent fewer than 1 percent of all checking accounts.
Alternately, each private banking client might be given a weight of 1 and
other customers a weight of 0.01, so the total weight of the exclusive customers
equals the total weight of the rest of the customers (we prefer to have the
maximum weight be 1).
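Both approaches can be sketched in a few lines of plain Python. The population, the `private_banking` flag, and the 1 percent incidence below are hypothetical, chosen to mirror the private-banking example:

```python
# Sketch: balance a rare outcome by stratified sampling or by weighting.
import random

population = [{"id": i, "private_banking": i < 10} for i in range(1000)]  # 1% rare

def balanced_sample(population, flag, seed=0):
    """50/50 sample: all rare records plus an equal number of common ones."""
    rare = [r for r in population if r[flag]]
    common = [r for r in population if not r[flag]]
    return rare + random.Random(seed).sample(common, len(rare))

def add_weights(population, flag):
    """Weights so both groups carry equal total weight, max weight 1."""
    rare = sum(1 for r in population if r[flag])
    common_weight = rare / (len(population) - rare)  # ~0.01 here
    for r in population:
        r["weight"] = 1.0 if r[flag] else common_weight
    return population

print(len(balanced_sample(population, "private_banking")))  # 20
```

Note that `add_weights` keeps the maximum weight at 1, matching the preference stated above, and that with a 1 percent incidence the common-record weight comes out near the 0.01 used in the example.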

Including Multiple Timeframes
The primary goal of the methodology is creating stable models. Among other
things, that means models that will work at any time of year and well into the
future. This is more likely to happen if the data in the model set does not all
come from one time of year. Even if the model is to be based on only 3 months of history, different rows of the model set should use different 3-month windows. The idea is to let the model generalize from the past rather than memorize what happened at one particular time in the past.
Building a model on data from a single time period increases the risk of
learning things that are not generally true. One amusing example that the
authors once saw was an association rules model built on a single week's worth
of point of sale data from a supermarket. Association rules try to predict items
a shopping basket will contain given that it is known to contain certain other
items. In this case, all the rules predicted eggs. This surprising result became
less so when we realized that the model set was from the week before Easter.
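One simple way to build this in is to assign each model-set row its own history window. The sketch below uses hypothetical month indices and window start months:

```python
# Sketch: spread model-set rows across several 3-month history windows
# so the model cannot memorize one particular season.
import random

def assign_windows(customer_ids, window_starts, seed=0):
    """Map each customer id to a (start, end) 3-month history window."""
    rng = random.Random(seed)
    windows = {}
    for cid in customer_ids:
        start = rng.choice(window_starts)  # month index, e.g. 1 = January
        windows[cid] = (start, start + 3)
    return windows

# Hypothetical: quarterly window starts across one year of history.
windows = assign_windows(range(8), window_starts=[1, 4, 7, 10])
print(sorted(set(windows.values())))
```

Each customer's predictor variables are then computed only from that customer's own window, so the model set as a whole spans the full year even though each row sees just 3 months.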

Creating a Model Set for Prediction
When the model set is going to be used for prediction, there is another aspect
of time to worry about. Although the model set should contain multiple timeframes, any one customer signature should have a gap in time between the
predictor variables and the target variable. Time can always be divided into
three periods: the past, present, and future. When making a prediction, a
model uses data from the past to make predictions about the future.
As shown in Figure 3.7, all three of these periods should be represented in
the model set. Of course all data comes from the past, so the time periods in the
model set are actually the distant past, the not-so-distant past, and the recent
past. Predictive models are built by finding patterns in the distant past that
explain outcomes in the recent past. When the model is deployed, it is then
able to use data from the recent past to make predictions about the future.

(Figure: a timeline running Distant Past, Not-So-Distant Past, Recent Past, Present, Future. Model building time uses the distant past to explain the recent past; model scoring time uses the recent past to predict the future.)
Figure 3.7 Data from the past mimics data from the past, present, and future.
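A sketch of a single signature built this way, using a hypothetical dict of spend by month index, takes its predictors from the distant past, skips a gap month, and reads its target from the recent past:

```python
# Sketch: one customer signature with a gap between predictors and target.
def make_signature(monthly_spend, target_month, history=3, gap=1):
    """monthly_spend: dict mapping month index -> spend for one customer.
    Predictors: `history` months ending `gap` months before target_month.
    Target: whether the customer spent anything in target_month."""
    end = target_month - gap  # the gap mimics scoring latency at deployment
    predictors = [monthly_spend.get(m, 0.0) for m in range(end - history, end)]
    target = monthly_spend.get(target_month, 0.0) > 0
    return predictors, target

spend = {1: 10.0, 3: 5.0, 6: 20.0}  # hypothetical spend by month
print(make_signature(spend, target_month=6))  # ([0.0, 5.0, 0.0], True)
```

At scoring time the same function is applied with `target_month` set to a future month: the predictors then come from the most recent available data, and the gap covers the time it takes to score and act on the results.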

