

data may not result in the best models, but it does start a process that can
improve over time. For another thing, waiting for perfect data is often a way of
delaying a project so that nothing gets done.
This section covers some of the important issues that make working with
data a sometimes painful process.

Missing Values
Missing values refer to data that should be there but is not. In many cases, missing
values are represented as NULLs in the data source, making it easy to identify
them. However, be careful: NULL is sometimes an acceptable value. In this
case, we say that the value is empty rather than missing, although the two look
the same in source data. For instance, the stop code of an account might be
NULL, indicating that the account is still active. This information, which
indicates whether data is censored or not, is critical for survival analysis.
Another time when NULL is an acceptable value is when working with
overlay data describing demographics and other characteristics of customers
and prospects. In this case, NULL often has one of two meanings:

- There is not enough evidence to indicate whether the field is true for
the individual. For instance, lack of subscriptions to golfing magazines
suggests the person is not a golfer, but does not prove it.

- There is no matching record for the individual in the overlay data.
T I P When working with overlay data, it is useful to replace NULLs with two
alternative values, one meaning that the record does not match and the other
meaning that the value is unknown.

It is worth distinguishing between these situations. One way is to separate
the data where the records do not match, creating two different model sets.
The other is to replace the NULL values with alternative values, indicating
whether the failure to match is at the record level or the field level.
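The second approach, replacing NULLs with codes that distinguish a record-level mismatch from a field-level unknown, can be sketched in a few lines of Python. The overlay field name and the two category labels here are invented for illustration:

```python
# Distinguish field-level from record-level NULLs in overlay data.
# The field name and the replacement labels are hypothetical.
overlay = {                        # customer_id -> overlay record (or None)
    1: {"golf_subscriber": "Y"},
    2: {"golf_subscriber": None},  # record matched, but field is unknown
    3: None,                       # no matching overlay record at all
}

def overlay_value(customer_id, field):
    record = overlay.get(customer_id)
    if record is None:
        return "NO_MATCH"          # failure to match is at the record level
    value = record.get(field)
    if value is None:
        return "UNKNOWN"           # failure is at the field level
    return value

cleaned = {cid: overlay_value(cid, "golf_subscriber") for cid in overlay}
```

With the two distinct codes in place, a modeling tool can treat "no overlay record" and "field not populated" as different categories rather than one undifferentiated NULL.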
Preparing Data for Mining 591

Because customer signatures use so much aggregated data, they often contain
"0" for various features. So, missing data in the customer signatures is not
the most significant issue for the algorithms. However, this can be taken too
far. Consider a customer signature that has 12 months of billing data. Customers
who started in the past 12 months have missing data for the earlier
months. In this case, replacing the missing data with some arbitrary value is
not a good idea. The best thing is to split the model set into two pieces: those
with 12 months of tenure and those who are more recent.
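The split on tenure amounts to a simple partition of the model set. A minimal sketch in Python, with an invented tenure field and cutoff:

```python
# Split the model set on tenure instead of imputing the missing months.
# The cutoff and field names are illustrative, not from any real system.
CUTOFF_MONTHS = 12

customers = [
    {"id": 1, "tenure_months": 30},
    {"id": 2, "tenure_months": 7},   # too recent for a full billing history
    {"id": 3, "tenure_months": 12},
]

full_history = [c for c in customers if c["tenure_months"] >= CUTOFF_MONTHS]
recent       = [c for c in customers if c["tenure_months"] <  CUTOFF_MONTHS]
```

Each piece can then be modeled separately: the full-history set keeps all 12 monthly billing fields, while the recent set simply drops the fields that would otherwise be missing.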
When missing data is a problem, it is important to find its cause. For
instance, one database we encountered had missing data for customers' start
dates. With further investigation, it turned out that these were all customers
who had started and ended their relationship prior to March 1999. Subsequent
use of this data source focused on either customers who started after this date
or who were active on this date. In another case, a transaction table was missing
a particular type of transaction before a certain date. During the creation of
the data warehouse, different transactions were implemented at different
times. Only carefully looking at crosstabulations of transaction types by time
made it clear that one type was implemented much later than the rest.
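A crosstabulation of transaction type by period makes that kind of loading gap stand out as a block of zero counts. The following Python sketch, on invented data, flags period/type combinations with no transactions at all:

```python
from collections import Counter

# Crosstabulate transaction type by period; a type with zero counts before
# some date is a candidate loading gap. The data here is invented.
transactions = [
    ("1997", "payment"), ("1997", "payment"),
    ("1998", "payment"), ("1998", "adjustment"),
    ("1999", "payment"), ("1999", "adjustment"),
]

crosstab = Counter(transactions)          # (period, type) -> count
periods = sorted({p for p, _ in transactions})
types = sorted({t for _, t in transactions})

# Empty cells are the suspicious ones worth investigating.
gaps = [(p, t) for p in periods for t in types if crosstab[(p, t)] == 0]
```

Here the adjustment type is absent before 1998, which is exactly the pattern that revealed the late implementation in the example above.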
In another case, the missing data in a data warehouse was just that: missing,
because the data warehouse had failed to load it properly. When
there is such a clear cause, the database should be fixed, especially since misleading
data is worse than no data at all.
One approach to dealing with missing data is to try to fill in the values, for
example, with the average value or the most common value. Either of these
substitutions changes the distribution of the variable and may lead to poor
models. A more clever variation of this approach is to try to calculate the value
based on other fields, using a technique such as regression or neural networks.
We discourage such an approach as well, unless absolutely necessary, since the
field no longer means what it is supposed to mean.
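The distortion caused by mean imputation is easy to see with a toy calculation in Python (the values are invented): the mean of the variable is preserved, but its variance shrinks, pulling the distribution in toward the center.

```python
# Mean imputation leaves the mean unchanged but shrinks the variance,
# one concrete way the distribution of the variable gets distorted.
values = [10.0, 20.0, 30.0, None, None, None]

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
imputed = [v if v is not None else mean for v in values]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

With half the values imputed, the variance of the imputed column is half that of the observed values, even though the mean is untouched.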

WA R N I N G One of the worst ways to handle missing values is to replace
them with some "special" value such as 9999 or -1 that is supposed to stick
out due to its unreasonableness. Data mining algorithms will happily use these
values as if they were real, leading to incorrect results.

Usually data is missing for systematic reasons, as in the new customers
scenario mentioned earlier. A better approach is to split the model set into parts,
eliminating the missing fields from one data set. Although one data set has
more fields, neither will have missing values.
It is also important to understand whether the data is going to be missing in
the future. Sometimes the right approach is to build models on records that
have complete data (and hope that these records are sufficiently representative
of all records) and to have someone fix the data sources, eliminating this
headache in the future.
592 Chapter 17

Dirty Data
Dirty data refers to fields that contain values that might look correct, but are
not. These can often be identified because such values are outliers. For
instance, once upon a time, a company thought that it was very important for
their call-center reps to collect the birth dates of customers. They thought it
was so important that the input field on the screen was mandatory. When they
looked at the data, they were surprised to see that more than 5 percent of their
customers were born in 1911; and not just in 1911, but on November 11th. It
turns out that not all customers wanted to share their birth date, so the call-
center reps quickly learned that typing six "1"s was the quickest way to fill the
field (the day, month, and year each took two characters). The result: many
customers with the exact same birthday.
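A simple frequency check is enough to surface this kind of default value: any single birth date held by an implausibly large share of customers deserves a closer look. A Python sketch on invented data, with an arbitrary 2 percent threshold:

```python
from collections import Counter

# Flag any single birth date that covers a suspiciously large share of
# customers. The data and the 2 percent threshold are invented.
birth_dates = (
    ["1911-11-11"] * 60   # reps typing six 1s to get past a mandatory field
    + [f"19{50 + i % 40}-06-{1 + i % 28:02d}" for i in range(940)]
)

counts = Counter(birth_dates)
total = len(birth_dates)
suspicious = [d for d, n in counts.items() if n / total > 0.02]
```

No real birth date should account for several percent of a customer base on its own, so the spike at 11/11/11 stands out immediately.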

The attempt to collect accurate data often runs into conflict with efforts to

manage the business. Many stores offer discounts to customers who have
membership cards. What happens when a customer does not have a card? The
business rules probably say "no discount." What may really happen is that a
store employee enters a default number, so that the customer can still qualify.
This friendly gesture leads to certain member numbers appearing to have
exceptionally high transaction volumes.

One company found several customers in Elizabeth, NJ with the zip code
07209. Unfortunately, that zip code does not exist, which was discovered when
analyzing the data by zip code and appending zip code information. The error
had not been discovered earlier because the post office can often figure out
how to route incorrectly addressed mail. Such errors can be fixed by using
software or an outside service bureau to standardize the address data.
What looks like dirty data might actually provide insight into the business.
A telephone number, for instance, should consist only of numbers. The billing
system for one regional telephone company stored the number as a string (this
is quite common actually). The surprise was several hundred "telephone
numbers" that included alphabetic characters. Several weeks (!) after being asked
about this, the systems group determined that these were essentially calling
card numbers, not attached to a telephone line, that were used only for third-
party billing services.
Another company used media codes to determine how customers were
acquired. So, media codes starting with “W” indicated that customers came
from the Web, “D” indicated response to direct mail, and so on. Additional
characters in the code distinguished between particular banner ads and par­
ticular email campaigns. When looking at the data, it was surprising to dis­
cover Web customers starting as early as the 1980s. No, these were not
bleeding-edge customers. It turned out that the coding scheme for media
codes was created in October 1997. Earlier codes were essentially gibberish.
The solution was to create a new channel for analysis, the “pre-1998” channel.
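That fix amounts to recoding on the acquisition date before interpreting the media code at all. A minimal Python sketch, with invented code prefixes and an assumed scheme start date of October 1997:

```python
from datetime import date

# Map a media code to an acquisition channel, routing anything recorded
# before the coding scheme existed into a catch-all channel.
# The prefix table and cutoff date are illustrative.
SCHEME_START = date(1997, 10, 1)

def channel(media_code, acquired):
    if acquired < SCHEME_START:
        return "pre-1998"          # earlier codes are effectively gibberish
    prefixes = {"W": "web", "D": "direct mail"}
    return prefixes.get(media_code[:1], "other")
```

The point of the catch-all is honesty: a customer "acquired on the Web in 1985" tells you about the coding scheme, not the customer, and the recoding keeps that artifact from leaking into the analysis.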


WA R N I N G The most pernicious data problems are the ones you don't know
about. For this reason, data mining cannot be performed in a vacuum; input
from business people and data analysts is critical for success.

All of these cases are examples where dirty data could be identified. The
biggest problems in data mining, though, are the unknown ones. Sometimes,
data problems are hidden by intervening systems. In particular, some data
warehouse builders abhor missing data. So, in an effort to clean data, they may
impute values. For instance, one company had more than half their loyal
customers enrolling in a loyalty program in 1998. The program had been around
longer, but the data was loaded into the data warehouse in 1998. Guess what?
For the participants in the initial load, the data warehouse builders simply put
in the current date, rather than the date when the customers actually enrolled.
The purpose of data mining is to find patterns in data, preferably interesting,
actionable patterns. The most obvious patterns are based on how the business
is run. Usually, the goal is to gain an understanding of customers more
than an understanding of how the business is run. To do this, it is necessary to
understand what was happening when the data was created.

Inconsistent Values
Once upon a time, computers were expensive, so companies did not have
many of them. That time is long past, and there are now many systems for
many different purposes. In fact, most companies have dozens or hundreds
of systems, some on the operational side, some on the decision-support side.
In such a world, it is inevitable that data in different systems does not always
agree.
One reason that systems disagree is that they are referring to different things.
Consider the start date for mobile telephone service. The order-entry system
might consider this the date that the customer signs up for the service. An
operational system might consider it the date that the service is activated. The billing
system might consider it the effective date of the first bill. A downstream
decision-support system might have yet another definition. All of these dates
should be close to each other. However, there are always exceptions. The best
solution is to include all these dates, since they can all shed light on the business.
For instance, when are there long delays between the time a customer
signs up for the service and the time the service actually becomes effective?
Is this related to churn? A more common solution is to choose one of the dates
and call that the start date.
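Keeping all the candidate dates makes questions like the one above directly answerable. A short Python sketch, with invented field names and dates, computing the signup-to-activation lag and flagging long waits:

```python
from datetime import date

# With both the signup date and the activation date in the signature, the
# lag between them becomes a feature. All names and values are invented.
customers = [
    {"id": 1, "signup": date(2004, 1, 5), "activated": date(2004, 1, 6)},
    {"id": 2, "signup": date(2004, 1, 7), "activated": date(2004, 2, 20)},
]

def activation_delay_days(c):
    return (c["activated"] - c["signup"]).days

# Flag customers who waited more than a week; the cutoff is arbitrary.
long_waits = [c["id"] for c in customers if activation_delay_days(c) > 7]
```

The delay itself can then be tested against churn, which is impossible once a single "start date" has been chosen and the other dates discarded.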
Another reason has to do with the good intentions of systems developers.
For instance, a decision-support system might keep a current snapshot of
customers, including a code for why the customer stopped. One code value might
indicate that some customers stopped for nonpayment; other code values
might represent other reasons: going to a competitor, not liking the service,

and so on. However, it is not uncommon for customers who have stopped
voluntarily to not pay their last bill. In this data source, the actual stop code was
simply overwritten. The longer ago a customer stopped, the greater the
chance that the original stop reason was subsequently overwritten when the
company determined, at a later time, that a balance was owed. The problem
here is that one field is being used for two different things: the stop reason
and nonpayment information. This is an example of poor data modeling that
comes back to bite the analysts.
A problem that arises when using data warehouses involves the distinction
between the initial loads and subsequent incremental loads. Often, the initial
load is not as rich in information, so there are gaps going back in time. For
instance, the start date may be correct, but there is no product or billing plan
for that date. Every source of data has its peculiarities; the best advice is to get
to know the data and ask lots of questions.

Computational Issues
Creating useful customer signatures requires considerable computational
power. Fortunately, computers are up to the task. The question is more which
system to use. There are several possibilities for doing the transformation work:
- Source systems, typically databases of some sort (either operational or
decision support)

- Data extraction tools (used for populating data warehouses and data
marts)

- Special-purpose code (such as SAS, SPSS, S-Plus, Perl)

- Data mining tools

Each of these has its own advantages and disadvantages.

Source Systems
Source systems are usually relational databases or mainframe systems. Often,
these systems are highly restricted, because they have many users. Such source
systems are not viable platforms for performing data transformations. Instead,
data is dumped (usually as flat files) from these systems and manipulated
elsewhere.
In other cases, the databases may be available for ad hoc query use. Such
queries are useful for generating customer signatures because of the power of
relational databases. In particular, databases make it possible to:
- Extract features from individual fields, even when these fields are dates
and strings

- Combine multiple fields using arithmetic operations

- Look up values in reference tables

- Summarize transactional data

Relational databases are not particularly good at pivoting fields, although as
shown earlier in this chapter, they can be used for that as well.
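As a sketch of the kind of aggregation a customer signature needs, the following Python snippet runs a summarization query against an in-memory SQLite table, extracting a date feature and rolling transactions up to one row per customer. The table and column names are invented:

```python
import sqlite3

# Summarize transactions into one row per customer: a count, a sum, and
# a feature extracted from a date field. Schema and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn (customer_id INTEGER, txn_date TEXT, amount REAL);
    INSERT INTO txn VALUES
        (1, '2004-01-10', 25.0),
        (1, '2004-02-11', 75.0),
        (2, '2004-01-15', 40.0);
""")

signature = conn.execute("""
    SELECT customer_id,
           COUNT(*)                      AS txn_count,
           SUM(amount)                   AS total_amount,
           strftime('%m', MAX(txn_date)) AS last_txn_month
    FROM txn
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
```

Real signature queries are far larger, but they are built from the same ingredients: feature extraction on individual fields and aggregation down to one row per customer.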
On the downside, expressing transformations in SQL can be cumbersome,
to say the least, requiring considerable SQL expertise. The queries may extend
for hundreds of lines, filled with subqueries, joins, and aggregations. Such
queries are not particularly readable, except by whoever constructed them.
These queries are also killer queries, although databases are becoming
increasingly powerful and able to handle them. On the plus side, databases do take
advantage of parallel hardware, a big advantage for transforming data.

