However, as the earlier example with Alabama and Alaska shows, this ordering
might be useful for librarians, but it is less useful for data miners. When there
is a sensible ordering, it makes sense to replace the codes with numbers. For
instance, one company segmented customers into three groups: NEW customers
with less than 1 year of tenure, MARGINAL customers with between 1
and 2 years, and CORE customers with more than 2 years. These categories
clearly have an ordering. In practice, one way to incorporate the ordering
would be to map the groups to the numbers 1, 2, and 3. A better way would
be to include the actual tenure for data mining purposes, although reports
could still be based on the tenure groups.
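The mapping just described can be sketched as follows. The group names and boundaries come from the text; the function names, the exact treatment of the boundary cases (exactly 1 or 2 years), and the sample tenures are assumptions for illustration.

```python
def tenure_group(tenure_years):
    """Map raw tenure (in years) to the segments described in the text."""
    if tenure_years < 1:
        return "NEW"
    elif tenure_years <= 2:  # boundary handling is an assumption
        return "MARGINAL"
    else:
        return "CORE"

# Ordinal codes that preserve the NEW < MARGINAL < CORE ordering.
GROUP_RANK = {"NEW": 1, "MARGINAL": 2, "CORE": 3}

# For mining, keep the raw tenure; the rank is still handy for reports.
for tenure in (0.5, 1.5, 7.0):  # hypothetical tenures
    group = tenure_group(tenure)
    print(tenure, group, GROUP_RANK[group])
```

Keeping the raw tenure alongside the rank gives the mining algorithm the finer-grained number while reports can still roll up to the three groups.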
Data mining algorithms usually perform better when there are fewer categories
rather than more. One way to reduce the number of categories is to use
attributes of the codes, rather than the codes themselves. For instance, a
mobile phone company is likely to have customers with hundreds of different
handset equipment codes (although just a few popular models account
for the vast bulk of customers). Instead of treating each model code as an
independent category, include features such as the handset's weight, its
original release date, and the features it provides.
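In practice this replacement is a reference-table lookup. The sketch below is a minimal illustration; the handset codes, attribute names, and values are entirely hypothetical.

```python
# Hypothetical reference table: equipment code -> descriptive attributes.
HANDSETS = {
    "H100": {"weight_g": 180, "release_year": 1999, "has_web": False},
    "H205": {"weight_g": 120, "release_year": 2002, "has_web": True},
}

# Missing values for codes not found in the reference table.
UNKNOWN = {"weight_g": None, "release_year": None, "has_web": None}

def handset_features(code):
    """Replace a raw equipment code with attributes of the handset."""
    return HANDSETS.get(code, UNKNOWN)

print(handset_features("H205"))
```

A few numeric attributes replace hundreds of arbitrary codes, which is exactly the reduction in categories the text recommends.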
Zip codes in the United States provide a good example of a potentially useful
variable that takes on many values. One way to reduce the number of values
is to use only the first three characters (digits). These identify the sectional
center facility (SCF), which is usually at the center of a county or large town.
The SCF retains most of the geographic information in the zip code, but at a
higher level. Even though SCFs and zip codes look like numbers, they need to be
treated as codes. One clue is that the leading "0" in a zip code is significant:
the zip code of Data Miners, Inc. is 02114, and it would not make sense
without the leading "0".
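Because of that leading zero, zip codes should be stored as strings, and the SCF is then simply the first three characters. A minimal sketch:

```python
def scf(zip_code):
    """Return the sectional center facility: the first three digits.

    Zip codes must be kept as strings so leading zeros survive;
    storing 02114 as the integer 2114 would lose the "0" and
    produce the wrong SCF.
    """
    assert isinstance(zip_code, str), "keep zip codes as strings"
    return zip_code[:3]

print(scf("02114"))  # the Data Miners, Inc. example from the text
```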
Some businesses are regional; consequently, almost all customers are located
in a small number of zip codes. However, there may still be many other customers
spread thinly across many other places. In this case, it might be best to
group all the rare values into a single "other" category. Another, and often
better, approach is to replace the zip codes with information about the zip code.
There could be several items of information, such as median income and average
home value (from the census bureau), along with penetration and
response rate to a recent marketing campaign. Replacing string values with
descriptive numbers is a powerful way to introduce business knowledge into
the data mining process.

T I P Replacing categorical variables with numeric summaries of the categories,
such as product penetration within a zip code, improves data mining models and
solves the problem of working with categoricals that have too many values.
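Both ideas in this passage are easy to sketch: grouping rare zip codes into an "other" category, and summarizing a zip code by a numeric value such as campaign response rate. The threshold, function names, and sample data below are hypothetical.

```python
from collections import Counter

def group_rare(values, min_count):
    """Replace values occurring fewer than min_count times with 'OTHER'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "OTHER" for v in values]

def response_rate_by_zip(records):
    """records: (zip_code, responded) pairs -> {zip_code: response rate}."""
    totals, responses = Counter(), Counter()
    for zip_code, responded in records:
        totals[zip_code] += 1
        if responded:
            responses[zip_code] += 1
    return {z: responses[z] / totals[z] for z in totals}

zips = ["02114", "02114", "02114", "10001", "60601"]
print(group_rare(zips, min_count=2))
# the three 02114 customers keep their zip; the two singletons become OTHER
```

The second function is the kind of numeric summary the tip recommends: the string-valued zip code is replaced by a number that carries business meaning.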
554 Chapter 17

Neural networks and K-means clustering are examples of algorithms that
expect their inputs to be interval or true numeric values. This poses a problem
for strings. The naïve approach is to assign a number to each value. However,
the numbers carry additional information that is not present in the codes, such
as an ordering. This spurious ordering can hide information in the data. A better
approach is to create a set of flags, called indicator variables, one for each
possible value. Although this increases the number of variables, it eliminates
the problem of spurious ordering and improves results. Neural network tools
often do this automatically.
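Creating indicator variables (often called one-hot encoding) can be sketched as below; as the text notes, many tools do this automatically, so this is only an illustration of the idea.

```python
def indicator_variables(values):
    """One 0/1 flag per distinct value, avoiding any spurious ordering."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, rows = indicator_variables(["red", "blue", "red"])
print(categories)  # ['blue', 'red']
print(rows)        # [[0, 1], [1, 0], [0, 1]]
```

Each input row becomes a vector of flags, so no category is accidentally "greater than" another.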
In summary, there are several ways to handle fixed-length character strings:
- If there are just a few values, then the values can be used directly.
- If the values have a useful ordering, then the values can be turned into rankings representing the ordering.
- If there are reference tables, then information describing the code is likely to be more useful.
- If a few values predominate, but there are many values, then the rarer values can be grouped into an "other" category.
- For neural networks and other algorithms that expect only numeric inputs, values can be mapped to indicator variables.
A general feature of these approaches is that they incorporate domain information
into the coding process, so the data mining algorithms can look for
unexpected patterns rather than finding out what is already known.

IDs and Keys
The purpose of some variables is to provide links to other records with more
information. IDs and keys are often stored as numbers, although they may also
be stored as character strings. As a general rule, such IDs and keys should not
be used directly for modeling purposes.
A good example of a field that should generally be ignored for data mining
purposes is the account number. The irony is that such fields may improve
models, because account numbers are not assigned randomly. Often, they are
assigned sequentially, so older accounts have lower account numbers; possibly
they are based on acquisition channel, so all Web accounts have higher numbers
than other accounts. It is better to include the relevant information explicitly
in the customer signature, rather than relying on hidden business rules.
In some cases, IDs do encode meaningful information. In these cases, the
information should be extracted to make it more accessible to the data mining
algorithms. Here are some examples.
Telephone numbers contain country codes, area codes, and exchanges, all of
which contain geographic information. The standard 10-digit number in North
America starts with a three-digit area code, followed by a three-digit exchange
and a four-digit line number. In most databases, the area code provides good
geographic information. Outside North America, the format of telephone numbers
differs from place to place. In some cases, the area codes and telephone numbers
are of variable length, making it more difficult to extract geographic information.
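Extracting the geographic part of a standard North American number is a simple slice. The sketch below assumes a clean 10-digit string with no country code or punctuation; real data would need cleansing first.

```python
def parse_nanp(number):
    """Split a 10-digit North American number into its three parts."""
    assert len(number) == 10 and number.isdigit(), "expect 10 clean digits"
    return {
        "area_code": number[:3],  # the most useful geographic field
        "exchange": number[3:6],
        "line": number[6:],
    }

print(parse_nanp("2125551234")["area_code"])  # a fictional 555 number
```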
Uniform product codes (Type A UPC) are the 12-digit codes that identify many
of the products passed in front of scanners. The first six digits identify the
manufacturer, and the next five encode the specific product. The final digit
carries no product information; it is a check digit used to verify the data.
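Splitting a Type A UPC into the parts described above is straightforward. The check-digit rule shown is the standard UPC-A calculation (general knowledge about the format, not from the text), and the example code is a commonly cited valid UPC.

```python
def parse_upc_a(upc):
    """Split a 12-digit Type A UPC into manufacturer, product, check digit."""
    assert len(upc) == 12 and upc.isdigit()
    return {"manufacturer": upc[:6], "product": upc[6:11], "check": upc[11]}

def upc_check_ok(upc):
    """Standard UPC-A check: three times the digits in odd positions plus
    the digits in even positions (1-indexed) must be a multiple of 10."""
    digits = [int(d) for d in upc]
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2]) + digits[11]
    return total % 10 == 0

print(parse_upc_a("036000291452")["manufacturer"])
print(upc_check_ok("036000291452"))
```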
Vehicle identification numbers are the 17-character codes inscribed on automobiles
that describe the make, model, and year of the vehicle. The first character
describes the country of origin and the second, the manufacturer. The third is
the vehicle type, with characters 4 through 8 recording specific features of the
vehicle. The 10th is the model year; the 11th is the assembly plant that produced
the vehicle. The remaining six characters are sequential production numbers.
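The positional breakdown above translates directly into string slices. The text does not describe position 9 (in practice a check digit), so the sketch below simply skips it; the field names are my own.

```python
def parse_vin(vin):
    """Split a 17-character VIN using the positions described in the text."""
    assert len(vin) == 17
    return {
        "country": vin[0],        # position 1: country of origin
        "manufacturer": vin[1],   # position 2
        "vehicle_type": vin[2],   # position 3
        "features": vin[3:8],     # positions 4-8: vehicle features
        "model_year": vin[9],     # position 10 (position 9 skipped)
        "plant": vin[10],         # position 11: assembly plant
        "serial": vin[11:],       # positions 12-17: production sequence
    }

print(parse_vin("ABCDEFGHIJKLMNOPQ"))  # synthetic example
```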
Credit card numbers have 13 to 16 digits. The first few digits encode the card
network; in particular, they can distinguish American Express, Visa, MasterCard,
Discover, and so on. Unfortunately, the use of the remaining digits
depends on the network, so there are no uniform standards for distinguishing
gold cards from platinum cards, for instance. The last digit, by the way, is a
check digit used for rudimentary verification that the credit card number is
valid. The algorithm for the check digit is called the Luhn algorithm, after the
IBM researcher who developed it.
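The Luhn check works by doubling every second digit from the right, reducing any two-digit result, and requiring the total to be divisible by 10. A standard implementation, tested against a textbook-valid number rather than a real card:

```python
def luhn_valid(number):
    """Return True if the digit string passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:      # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9      # same as summing the two digits of d
        total += d
    return total % 10 == 0

print(luhn_valid("79927398713"))  # a classic valid test number
```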
National ID numbers in some countries (although not the United States)
encode the gender and date of birth of the individual. When available, this is
a good and accurate source of demographic information.

Names

Although we want to get to know the customers, the goal of data mining is not
actually to meet them. In general, names are not a useful source of information
for data mining. There are some cases where it might be interesting to classify
names according to ethnicity (such as Hispanic names or Asian names) when
trying to reach a particular market, or by gender for messaging purposes.
However, such efforts are at best rough approximations and are not widely
used for modeling purposes.

Addresses

Addresses describe the geography of customers, which is very important for
understanding customer behavior. Unfortunately, the post office can understand
many different variations on how addresses are written. Fortunately,
there are service bureaus and software packages that can standardize address fields.

One of the most important uses of an address is to understand when two
addresses are the same and when they are different. For instance, is the delivery
address for a product ordered on the Web the same as the billing address
of the credit card? If not, there is a suggestion that the purchase is a gift (and
the suggestion is even stronger if the distance between the two is great and the
giver pays for gift wrapping!).
Other than finding exact matches, the entire address itself is not particularly
useful; it is better to extract useful information and present it as additional
fields. Some useful features are:
- Presence or absence of apartment numbers
- City
- State
- Zip code
The last three are typically stored in separate fields. Because geography
often plays such an important role in understanding customer behavior, we
recommend standardizing address fields and appending useful information
such as census block group, multi-unit or single unit building, residential or
business address, latitude, longitude, and so on.
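A few of these features can be extracted with simple string tests once addresses have been standardized. The tag list below is a hypothetical sketch and no substitute for a real address-standardization service; substring matching like this can misfire on unstandardized text.

```python
# Hypothetical, non-exhaustive markers of an apartment or unit number.
APARTMENT_TAGS = ("APT", "UNIT", "SUITE", "#")

def address_features(street_line):
    """Extract simple flags from a standardized street-address line."""
    upper = street_line.upper()
    return {
        "has_apartment": any(tag in upper for tag in APARTMENT_TAGS),
        "is_po_box": upper.startswith("PO BOX") or upper.startswith("P.O. BOX"),
    }

print(address_features("123 MAIN ST APT 4B"))
```

Attributes such as census block group, latitude, and longitude come from appending external reference data rather than from parsing the text itself.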

Free Text
Free text poses a challenge for data mining because these fields provide a
wealth of information that is often readily understood by human beings, but
not by automated algorithms. We have found that the best approach is to
extract features from the text intelligently, rather than presenting the entire
text fields to the computer.
Text can come from many sources, such as:
- Doctors' annotations on patient visits
- Memos typed in by call-center personnel
- Email sent to customer service centers
- Comments typed into forms, whether Web forms or insurance forms
- Voice recognition algorithms at call centers
Text from these business sources tends to be ungrammatical and filled with
misspellings and abbreviations. Human beings generally understand it, but
automating that understanding is very difficult. Hence, it is quite difficult to
write software that automatically filters spam, even though people readily
recognize spam.

Our recommended approach is to look for specific features by searching for
specific substrings. For instance, once upon a time, a Jewish group was boycotting
a company because of the company's position on Israel. Memo fields
typed in by call-center service reps were the best source of information on why
customers were stopping. Unfortunately, these fields did not uniformly say
"Cancelled due to Israel policy." In fact, many of the comments contained references
to "Isreal," "Is rael," "Palistine" [sic], and so on. Classifying the text
memos required looking for specific features in the text (in this case, the presence
of "Israel," "Isreal," and "Is rael" were all used) and then analyzing the
memos that contained them.
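This kind of feature extraction amounts to substring matching against a hand-built list of variant spellings. A minimal sketch, using the three variants named in the text:

```python
# The variant spellings actually observed in the memos, per the text.
ISRAEL_VARIANTS = ("ISRAEL", "ISREAL", "IS RAEL")

def mentions_israel(memo):
    """Flag a call-center memo that refers to Israel under any known spelling."""
    upper = memo.upper()
    return any(variant in upper for variant in ISRAEL_VARIANTS)

print(mentions_israel("cust cancelled due to Isreal policy"))  # True
print(mentions_israel("price too high"))                       # False
```

The resulting 0/1 flag is a derived variable that a mining algorithm can use, even though the raw memo text cannot be.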

Binary Data (Audio, Image, Etc.)
Not surprisingly, there are other types of data that do not fall into these neat
categories. Audio and images are becoming increasingly common, but data
mining tools do not generally support them.
Because these types of data can contain a wealth of information, what can be
done with them? The answer is to extract features into derived variables.
However, such feature extraction is very specific to the data being used and is
outside the scope of this book.

Data for Data Mining
Data mining expects data to be in a particular format:
- All data should be in a single table.
- Each row should correspond to an entity, such as a customer, that is relevant to the business.
- Columns with a single value should be ignored.
- Columns with a different value for every row should be ignored, although their information may be included in derived columns.
- For predictive modeling, the target column should be identified and all synonymous columns removed.
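The two "should be ignored" rules can be checked mechanically. The sketch below flags single-valued columns and columns that take a distinct value on every row (such as keys); the sample table and column names are hypothetical.

```python
def droppable_columns(column_names, rows):
    """Flag columns with one distinct value, or a distinct value per row."""
    n_rows = len(rows)
    drop = []
    for i, name in enumerate(column_names):
        distinct = {row[i] for row in rows}
        if len(distinct) == 1 or (n_rows > 1 and len(distinct) == n_rows):
            drop.append(name)
    return drop

rows = [("NY", 101, "Y"), ("NY", 102, "N"), ("NY", 103, "Y")]
print(droppable_columns(["state", "account_id", "responder"], rows))
# -> ['state', 'account_id']
```

As the text notes, a dropped key column may still contribute indirectly through derived columns built from the tables it links to.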
Alas, this is not how data is found in the real world. In the real world, data
comes from source systems, which may store each field in a particular way.
Often, we want to replace fields with values stored in reference tables, or to
extract features from more complicated data types. The next section talks
about putting this data together into a customer signature.

Constructing the Customer Signature
Building the customer signature, especially the first time, is a very iterative
process. At a minimum, customer signatures need to be built at least twice:
once for building the model and once for scoring it. In practice, exploring
data and building models suggests new variables and transformations, so
the process is repeated many times. Having a repeatable process simplifies the
data mining work.
The first step in the process, shown in Figure 17.7, is to identify the available
sources of data. After all, the customer signature is a summary, at the customer
level, of what is known about each customer. The summary is based on available
data. This data may reside in a data warehouse. It might equally well
reside in operational systems, and some might be provided by outside vendors.
When doing predictive modeling, it is particularly important to identify
where the target variable is coming from.
The second step is identifying the customer. In some cases, the customer is
at the account level. In others, the customer is at the individual or household
level. In some cases, the signature may have nothing to do with a person at all.
We have used signatures for understanding products, zip codes, and counties,
for instance, although the most common use is for accounts and households.
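Whatever level is chosen, building the signature is essentially aggregating source records up to that level, one row per customer. A minimal sketch at the household level, with hypothetical transaction data and field names:

```python
from collections import defaultdict

# Hypothetical source data: (household_id, order_amount)
transactions = [("H1", 20.0), ("H1", 35.0), ("H2", 10.0)]

# One signature row per household, summarizing its transactions.
signature = defaultdict(lambda: {"n_orders": 0, "total_spend": 0.0})
for household, amount in transactions:
    row = signature[household]
    row["n_orders"] += 1
    row["total_spend"] += amount

print(dict(signature))
```

The same pattern works for accounts, individuals, or even zip codes and products; only the grouping key and the summarized fields change.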

