

[Figure 17.7 flowchart steps: identify a working definition of customer; copy
the most recent input data snapshot of customer; pivot to produce multiple
months of data for some data elements; calculate the churn flag for the
prediction period; incorporate other data sources; add derived variables.
A side box reads "Revisit the customer definition."]
Figure 17.7 Building customer signatures is an iterative process; start small and work
through the process step-by-step, as in this example for building a customer signature for
churn prediction.
Preparing Data for Mining 559

Once the customer has been identified, data sources need to be mapped to the
customer level. This may require additional lookup tables, for instance, to
convert accounts into households. It may not be possible to find the customers in the
available data. Such a situation requires revisiting the customer definition.
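As a minimal sketch of mapping a data source to the customer level, the following rolls account-level amounts up to households through a lookup table. The account and household identifiers, the amounts, and the function name are all hypothetical; the point is only that unmapped accounts should be surfaced, since they are the signal that the customer definition may need revisiting.

```python
from collections import defaultdict

# Hypothetical lookup table: billing account -> household identifier.
ACCOUNT_TO_HOUSEHOLD = {"A1": "H1", "A2": "H1", "A3": "H2"}

# Hypothetical account-level records: (account, monthly bill amount).
account_rows = [("A1", 40.0), ("A2", 25.0), ("A3", 60.0), ("A9", 15.0)]


def roll_up_to_households(rows, lookup):
    """Sum account-level amounts per household and collect unmapped accounts."""
    totals = defaultdict(float)
    unmapped = []
    for account, amount in rows:
        household = lookup.get(account)
        if household is None:
            # No household found: keep the account so it can be investigated.
            unmapped.append(account)
            continue
        totals[household] += amount
    return dict(totals), unmapped


household_totals, unmapped_accounts = roll_up_to_households(
    account_rows, ACCOUNT_TO_HOUSEHOLD
)
```

A long list of unmapped accounts would suggest that the chosen customer level cannot be supported by the available data.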
The key to building customer signatures is to start simple and build up.
Prioritize the data sources by the ease with which they map to the customer. Start
with the easiest one, and build the signature using it. You can use a signature
before all the data is put into it. While awaiting more complicated data
transformations, get your feet wet and understand what is available. When
building customer signatures out of transactions, be sure to get all the transactions
associated with a particular customer.

Cataloging the Data
The data mining group at a mobile telecommunications company wants to
develop a churn model in-house. This churn model will predict churn for one
month, given a one-month lag time. So, if the data is available for February,
then the churn prediction is for April. Such a model provides time for
gathering the data and scoring new customers, since the February data is available
sometime in March.
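The timing rule above can be sketched as simple month arithmetic. The function names and the year used in the usage note are assumptions for illustration; the rule itself (data month, then a one-month lag, then the prediction month) comes from the text.

```python
def add_months(year, month, offset):
    """Return (year, month) shifted forward by `offset` calendar months."""
    index = year * 12 + (month - 1) + offset
    return index // 12, index % 12 + 1


def prediction_month(data_year, data_month, lag_months=1):
    """Month being predicted, given the last month of input data.

    The lag months follow the data month; the prediction month follows the lag.
    """
    return add_months(data_year, data_month, lag_months + 1)
```

For example, `prediction_month(2004, 2)` gives `(2004, 4)`: February data, March as the lag, April as the month whose churn is predicted.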
At this company, there are several potential sources of data for the customer
signatures. All of these are kept in a data repository with 18 months of history.
Each file is an end-of-the-month snapshot, basically a dump of an operational
system into a data repository.
The UNIT_MASTER file contains a description of every telephone number
in service and a snapshot of what is known about the telephone number at the
end of the month. Examples of fields in this file are the telephone number,
billing account, billing plan, handset model, last billed date, and last payment.
The TRANS_MASTER file contains every transaction that occurs on a
particular telephone number during the course of the month. These are
account-level transactions, which include connections, disconnections, handset
upgrades, and so on.
The BILL_MASTER file describes billing information at the account level.
Multiple handsets might be attached to the same billing account, particularly
for business customers and customers on family billing plans.
Although other sources of data were available in the company, these were
not immediately highlighted for use for the customer signature. One source,
for instance, was the call detail records (a record of every telephone call), which
are useful for predicting churn. Although this data was eventually used by the
data mining group, it was not part of this initial effort.
560 Chapter 17

Identifying the Customer
The data is typical of the real world. Although the focus might be on one type
of customer or another, the data has multiple groups. The sidebar “Residential
Versus Business Customers” talks about distinguishing between these two
types of customers.
The business problem being addressed in this example is churn. As shown
in Figure 17.8, the customer data model is rather complex, resulting in
different options for the definition of customer:
Telephone number

Customer ID

Billing account

This being the real world, though, it is important to remember that these
relationships are complex and change over time. Customers might change
their telephone numbers. Telephones might be added or removed from
accounts. Customers change handsets, and so on. For the purposes of building
the signature, the decision was to use the telephone number, because this was
how the business reported churn.

[Figure 17.8 diagram not reproduced; surviving labels: Sales Rep, Supervisor,
Customer, Sales Rep ID, Contract, Telephone Number.]
Figure 17.8 The customer model is complicated and takes into account sales, billing, and
business hierarchy information.


Residential Versus Business Customers
Often data mining efforts focus on one type of customer, such as residential
customers or small businesses. However, data for all customers is often mixed
together in operational systems and data warehouses. Typically, there are
multiple ways to distinguish between these types of customers:
- Often there is a customer type field, which has values like “residential”
  and “small business.”
- There might be a sales hierarchy; some sales channels are business-only
  while others are residential-only.
- Some billing plans are only for businesses; others are only for residential
  customers.
- There might be business rules, so any customer with more than two lines
  is considered business.
These examples illustrate the fact that there are typically several different
rules for distinguishing between different types of customers. Given the
opportunity to be inconsistent, most data sources will not fail. The different
rules select different subsets of customers.
Is this a problem? That depends on the particular model being worked on. The
hope is that the rules are all very close, so the customers included (or missed) by
one rule are essentially the same as those included by the others. It is important
to investigate whether or not this is true, and when the rules disagree.
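One way to investigate where the rules disagree is to apply each rule to the same customers and compare the selected sets. The sketch below uses made-up customer records and three hypothetical rules (a customer type field, a sales channel named `business_direct`, and a line-count threshold); none of these field names or values come from a real system.

```python
# Hypothetical customer records; field names and values are illustrative only.
customers = [
    {"id": 1, "customer_type": "residential", "channel": "retail", "lines": 1},
    {"id": 2, "customer_type": "small business", "channel": "business_direct", "lines": 5},
    {"id": 3, "customer_type": "residential", "channel": "retail", "lines": 3},
    {"id": 4, "customer_type": "small business", "channel": "retail", "lines": 1},
    {"id": 5, "customer_type": "residential", "channel": "business_direct", "lines": 2},
]

# Each rule answers the same question: is this customer a business?
rules = {
    "customer_type": lambda c: c["customer_type"] == "small business",
    "sales_channel": lambda c: c["channel"] == "business_direct",
    "line_count": lambda c: c["lines"] > 2,
}

# Set of customer ids selected by each rule.
selected = {
    name: {c["id"] for c in customers if rule(c)} for name, rule in rules.items()
}

# Customers every rule agrees on, and customers flagged by only some rules.
agreed = set.intersection(*selected.values())
disputed = set.union(*selected.values()) - agreed
```

The `disputed` set is the one worth examining by hand: if it is small relative to `agreed`, the choice of rule matters little for the model.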
What usually happens in practice is that one of the rules is predominant,
because that is the way the business is organized. So, although the customer
type might be interesting, the sales hierarchy is probably more important, since it
corresponds to people who have responsibility for different customer segments.
The distinction between businesses and residences is important for
prospects as well as customers. A long-distance telephone company sees many
calls traversing its network that were originated by customers of other carriers.
Their switches create call detail records containing the originating and
destination telephone numbers. Any domestic number that does not belong to
an existing customer belongs to a prospect. One long-distance company builds
signatures to describe the behavior of the unknown telephone numbers over
time by tracking such things as how frequently the number is seen, what times
of day and days of the week it is typically active, and the typical call duration.
Among other things, this signature is used to score the unknown telephone
numbers for the likelihood that they are businesses because business and
residential customers are attracted by different offers.
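A rough sketch of such a signature for unknown telephone numbers follows. The call records, the telephone numbers, and the definition of "weekday daytime" (Monday to Friday, 08:00 to 18:00) are assumptions made for illustration; the text only says that frequency, time of day, day of week, and call duration are tracked.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical call detail records: (originating number, start time, seconds).
calls = [
    ("555-0199", datetime(2004, 6, 1, 9, 30), 120),
    ("555-0199", datetime(2004, 6, 2, 14, 0), 240),
    ("555-0199", datetime(2004, 6, 5, 20, 0), 60),   # a Saturday evening
    ("555-0188", datetime(2004, 6, 5, 21, 15), 600),
]


def build_signatures(records):
    """Summarize calling behavior per originating telephone number."""
    grouped = defaultdict(list)
    for number, start, duration in records:
        grouped[number].append((start, duration))

    signatures = {}
    for number, items in grouped.items():
        # Assumed definition: Monday-Friday, 08:00 up to (not including) 18:00.
        weekday_daytime = sum(
            1 for start, _ in items if start.weekday() < 5 and 8 <= start.hour < 18
        )
        signatures[number] = {
            "call_count": len(items),
            "mean_duration_sec": sum(d for _, d in items) / len(items),
            "weekday_daytime_share": weekday_daytime / len(items),
        }
    return signatures


signatures = build_signatures(calls)
```

A high weekday-daytime share is the kind of feature that might separate businesses from residences, although that would have to be checked against numbers whose type is already known.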

One simplification would be to focus only on customers whose accounts
have only one telephone number. Since the purpose is to build a model for
residential customers, this was a good way of simplifying the data model for
getting started. If the purpose were to build a model for business customers, a
better choice for the customer level would be the billing account level, since
business customers often turn handsets and telephone numbers on and off.
However, churn in this case would mean the cancelation of the entire account,
rather than the cancelation of a single telephone number. These two situations
are the same for those residential customers who have only one line.
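The single-line simplification can be sketched as a count of telephone numbers per billing account followed by a filter. The rows below stand in for UNIT_MASTER records; the field names `telephone_number` and `billing_account` and all values are hypothetical.

```python
from collections import Counter

# Hypothetical UNIT_MASTER rows for one month, reduced to two fields.
unit_rows = [
    {"telephone_number": "555-0101", "billing_account": "A1"},
    {"telephone_number": "555-0102", "billing_account": "A2"},
    {"telephone_number": "555-0103", "billing_account": "A2"},
    {"telephone_number": "555-0104", "billing_account": "A3"},
]

# Number of telephone numbers attached to each billing account.
lines_per_account = Counter(row["billing_account"] for row in unit_rows)

# Keep only telephone numbers whose account has exactly one line.
single_line_rows = [
    row for row in unit_rows if lines_per_account[row["billing_account"]] == 1
]
```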

First Attempt
The first attempt to build the customer signature needs to focus on the
simplest data source. In this case, the simplest data source is the UNIT_MASTER
file, which conveniently stores data at the telephone number level, the level
being used for the customer signature.
It is worth pointing out two problems with this file and the customer
level being used:

Customers may change their telephone number.

Telephone numbers may be reassigned to new customers.

These problems will be addressed later; the first customer signature is at the
telephone number level to get started. The process used to build the signature
has four steps: identifying the time frames, creating a recent snapshot,
pivoting columns, and calculating the target.

Identifying the Time Frames
The first attempt at building the customer signature needs to take into account
the time frame for the data, as discussed in Chapter 3. Figure 17.9 shows a
model time chart for this data. The ultimate model set should have more than
one time frame in it. However, the first attempt focuses on only one time frame.
The time frame defined churn during 1 month: August. All of the input
data come from at least 1 month before. The cutoff date is June 30, in order to
provide 1 month of latency.

Taking a Recent Snapshot
The most recent snapshot of data is defined by the cutoff date. These fields in
the signature describe the most recent information known about a customer
before he or she churned (or did not churn).

[Figure 17.9 chart not reproduced; the month axis runs Jan through Nov, and
each of three rows carries the labels 4, 3, 2, 1.]

Figure 17.9 A model time chart shows the time frame for the input columns and targets
when building a customer signature.


This is a set of fields from the UNIT_MASTER file for June: fields such as
the handset type, billing plan, and so on. It is important to keep the time frame
in mind when filling the customer signature. It is a good idea to use a naming
convention to avoid confusion. In this case, all the fields might have a suffix of
“_01,” indicating that they are from the most recent month of input data.

TIP Use a naming convention when building the customer signature to
indicate the time frame for each variable. For instance, the most recent month
of input data would have a “_01” suffix; the month before, “_02”; and so on.
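The naming convention can be expressed as a small helper that turns a month into its suffix relative to the cutoff month. The function name and the `(year, month)` tuple representation are assumptions for this sketch, as is the year 2004 in the usage note.

```python
def month_suffix(cutoff, month):
    """Suffix for `month` relative to the `cutoff` month; both are (year, month).

    The cutoff month itself gets "_01", the month before it "_02", and so on.
    """
    months_back = (cutoff[0] - month[0]) * 12 + (cutoff[1] - month[1])
    if months_back < 0:
        # Data from after the cutoff must never appear among the inputs.
        raise ValueError("month is after the cutoff")
    return f"_{months_back + 1:02d}"
```

With a June cutoff, `month_suffix((2004, 6), (2004, 6))` is `"_01"` and `month_suffix((2004, 6), (2004, 5))` is `"_02"`. Raising an error for months after the cutoff is a design choice here, meant to catch accidental leakage of future data into the inputs.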

At this point, presumably not much is known about the fields, so descriptive
information is useful. For instance, the billing plan might have a description,
monthly base, per-minute cost, and so on. All of these features are interesting
and of potential value for modeling, so it is reasonable to bring them into the
model set. Although descriptions are not going to be used for modeling (codes
are much better), they help the data miners understand the data.

Pivoting Columns
Some of the fields in UNIT_MASTER represent data that is reported in a
regular time series. For instance, bill amount has a value for every month, and each
of these values needs to be put into a separate column. These columns come
from different UNIT_MASTER records, one for June, one for May, one for
April, and so on. Using a naming convention, the fields would be, for example:
Last_billed_amount_01 for June (which may already be in the snapshot)

Last_billed_amount_02 for May

Last_billed_amount_03 for April

At this point, the customer signature is starting to take shape. Although the
input fields only come from one source, the appropriate fields have been
chosen as input and aligned in time.
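A minimal sketch of the pivot follows, assuming each monthly snapshot row carries a `(year, month)` tuple, a telephone number, and the field to pivot. The row layout, the year, and the amounts are hypothetical; the column names follow the naming convention described above.

```python
# Hypothetical monthly snapshot rows drawn from successive UNIT_MASTER files.
snapshots = [
    {"month": (2004, 7), "telephone_number": "555-0101", "Last_billed_amount": 45.0},
    {"month": (2004, 6), "telephone_number": "555-0101", "Last_billed_amount": 42.0},
    {"month": (2004, 5), "telephone_number": "555-0101", "Last_billed_amount": 40.5},
    {"month": (2004, 4), "telephone_number": "555-0101", "Last_billed_amount": 39.0},
    {"month": (2004, 3), "telephone_number": "555-0101", "Last_billed_amount": 38.0},
    {"month": (2004, 6), "telephone_number": "555-0102", "Last_billed_amount": 20.0},
]


def pivot_field(rows, field, cutoff, n_months):
    """Turn monthly rows into one row per telephone number with suffixed columns."""
    signature = {}
    for row in rows:
        year, month = row["month"]
        months_back = (cutoff[0] - year) * 12 + (cutoff[1] - month)
        # Skip months after the cutoff and months older than the window.
        if not 0 <= months_back < n_months:
            continue
        column = f"{field}_{months_back + 1:02d}"
        signature.setdefault(row["telephone_number"], {})[column] = row[field]
    return signature


signature = pivot_field(snapshots, "Last_billed_amount", cutoff=(2004, 6), n_months=3)
```

Telephone numbers with fewer months of history simply end up with fewer columns, which leaves the question of how to treat the missing months open for a later step.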

Calculating the Target
A customer signature for predictive modeling would not be useful without a
target variable. Since the customer signature is going to be used for churn

