. 112
( 137 .)


modeling, the target needs to be whether or not the customer churned in
August. This is in the account status field for the August UNIT_MASTER
record. Note that only customers who were active on or before June 30 are
included in the model set. A customer that starts in July and cancels in August
is not included.
564 Chapter 17

Making Progress
Although quite rudimentary, the customer signature is ready for use in a
model set. Having a well-defined time frame, a target variable, and input vari­
ables, it is functional, even if minimally so. Although useful and a place to get
started, the signature is missing a few things.
First, the definition of customer does not take into account changes in tele­
phone numbers. The TRANS_MASTER file solves this problem, because it
keeps track of these types of changes on customers™ accounts. To fix the defini­
tion of customer requires creating a table, which has the original telephone
number on the account (with perhaps a counter, since a telephone number can
actually be reused). A typical row in this table would have the following
Telephone Number

Effective Date

End Date

Unique Customer Identifier

With this table, the customer identifier can be used instead of the telephone
number, so the customer signatures are robust with respect to changes in tele­
phone number.
Another shortcoming of the customer signature is its reliance on only one
data source. Additional data sources should be added in, one at a time, to build
a richer signature of customer behavior. The model set only has one time frame
of data. Additional time frames make models that are more stable. This cus­
tomer signature also lacks derived variables, which are the subject of much of
the rest of this chapter.

Practical Issues
There are some practical issues encountered when building customer signa­
tures. Customer signatures often bring together the largest sources of data and
perform complex operations on them. This becomes an issue in terms of com­
puting resources. Although the resulting model set probably has at most tens
or hundreds of megabytes, the data being summarized could be thousands of
times larger. For this reason, it is often a good idea to do as much of the pro­
cessing as possible in relational databases, because these can take advantage of
multiple processors and multiple disks at the same time.
Although the resulting queries are complicated, much of the work of
putting together the signatures can be done in SQL or in the database™s script­
ing language. This is useful not only because it increases efficiency, but also
because the code then resides in only one place”reducing the possibility of
Preparing Data for Mining 565

error and increasing the ability to find bugs when they occur. Alternatively, the
data can be extracted from the source and pieced together. Increasingly, data
mining tools are becoming better at manipulating data. However, this gener­
ally requires some amount of programming, in a language such as SAS, SPSS,
S-Plus, or Perl. The additional processing not only adds time to the effort, but
it also introduces a second level where bugs might creep in.
It is important when creating signatures to realize that data mining is an
iterative process that often requires rebuilding the signature. A good approach
is to create a template for pulling one time frame of data from the data sources,
and then to do multiple such pulls to create the model set. For the score set, the
same process can be used, since the score set closely resembles the model set.

Exploring Variables
Data exploration is critically intertwined with the data mining process. In
many ways, data mining and data exploration are complementary ways of
achieving the same goal. Where data mining tends to highlight the interesting
algorithms for finding patterns, data exploration focuses more on presenting
data so that people can intuit the patterns. When it comes to communicating
results, pretty pictures that show what is happening are often much more effec­
tive than dry tables of numbers. Similarly, when preparing data for data min­
ing, seeing the data provides insight into what is happening, and this insight
can help improve models.

Distributions Are Histograms
The place to start when looking at data is with histograms of each field; his­
tograms show the distribution of values in fields. Actually, there is a slight dif­
ference between histograms and distributions, because histograms count
occurrences, whereas distributions are normalized. However, for our purposes,
the similarities are much more important”histograms and distributions (or
strictly speaking, the density function associated with the distribution) have
similar shapes; it is only the scale of the Y-axis that changes.
Most data mining tools provide the ability to look at the values that a single
variable takes on as a histogram. The vertical axis is the number of times each
value occurs in the sample; the horizontal axis shows the various values.
Numeric variables are often binned when creating histograms. For the pur­
pose of exploring the variables, these bins should be of equal width and not of
equal height. Remember that equal-height binning creates bins that all contain
the same number of values. Bins containing similar numbers of records are
useful for modeling; however, they are less useful for understanding the vari­
ables themselves.
566 Chapter 17

Changes over Time
Perhaps the most revealing information becomes apparent when the time ele­
ment is incorporated into a histogram. In this case, only a single value of a
variable is used. The chart shows how the frequency of this value changes over
As an example, the chart in Figure 17.10 shows fairly clearly that something
happened during one March with respect to the value “DN.” This type of pat­
tern is important. In this case, the “DN” represents duplicate accounts that
needed to be canceled when two different systems were merged. In fact, we
stumbled across the explanation only after seeing such a patterns and asking
questions about what was happening during this time period.
The top chart shows the raw values, and that can be quite useful. The bot­
tom one shows the standardized values. The curves in the two charts have the
same shape; the only difference is the vertical scale. Remember that standard­
izing values converts them into the number of standard deviations from the
mean, so values outside the range of “2 to 2 are unusual; values less then “3 or
greater than 3 should be very rare. Visualizing the same data shows that the
peaks are many standard deviations outside expected values”and 14 stan­
dard deviations is highly suspect. The likelihood of this happening randomly
is so remote that the chart suggests that something external is affecting the
variable”something external like the one-time even of merging of two com­
puter systems, which is how the duplicate accounts were created.
Creating one cross-tabulation by time is not very difficult. Unfortunately,
however, there is not much support in data mining tools for this type of dia­
gram. They are easy to create in Excel or with a bit of programming in SAS,
SPSS, S-Plus, or just about any other programming language. The challenge is
that many such diagrams are needed”one for each value taken on by each cat­
egorical variable. For instance, it is useful to look at:
Different types of accounts opened over time.

Different reasons why customers stop over time.

Performance of certain geographies over time.

Performance of different channels over time.

Because these charts explicitly go back in time, they bring up issues of what
happened when. They can be useful for spotting particularly effective combi­
nations that might not otherwise be obvious”such as “oh, our Web banner
click-throughs go up after we do email campaigns.”
Preparing Data for Mining 567



Count 6,000




Standard Devs




Figure 17.10 This histogram suggests that something unusual was happening with this
stop code. The top diagram is the raw data; in the bottom one, the values are standardized.

Looking at variables over time is one example of a cross-tabulation. In general,
cross-tabulations show how frequently two variables occur with respect to
each other. Figure 17.11 shows a cross-tabulation between two variables, chan­
nel and credit card payment. The size of the bubble shows the proportion of
customers starting in the channel with that payment method. This is the same
data shown in Table 17.2.
Cross-tabulations without time show static images rather than trends. This
is useful, but trend information is usually even more useful.

Table 17.2 Cross Tabulation of Channels by Payment Method


DM 69,126 51,481

TM 50,105 249,208

WEB 67,830 29,608
568 Chapter 17




Credit Card Direct Bill
Figure 17.11 Cross-tabulations show relationships between variables.

Deriving Variables
There have been many examples of derived variables in this chapter and
throughout this book. Such variables are predigested, making it easier for data
mining algorithms to incorporate them into models. Perhaps more important,
derived variables make it possible to incorporate domain knowledge into the
data mining process. Put the domain information into the data so that the data
mining algorithms can use it to find patterns.
Because adding variables is central to any successful data mining project, it
is worth looking at the six basic ways that derived variables are calculated in a
bit of detail. These six methods are:
Extracting features from a single value

Combining values within a record (used, among other things, for cap­

turing trends)

Looking up auxiliary information in another table


Pivoting time-dependent data into multiple columns


Summarizing transactional records


Summarizing fields across the model set

Preparing Data for Mining 569

The following sections discuss these methods, giving examples of derived
variables and highlighting important points about computing them.


. 112
( 137 .)