decision trees. The decision tree will choose one synonymous variable, which

can then be ignored. If the decision tree tool lets you see alternative splits, then

all such variables can be found at once.

Model Roles in Modeling

Columns contain data with data types. In addition, columns have roles with

respect to the data mining algorithms. Three important roles are:

Input columns. These are columns that are used as input into the model.

Target column(s). This column or set of columns is only used when building predictive models. These are the quantities of interest, such as propensity to buy a particular product, likelihood to respond to an offer, or probability of remaining a customer. When building undirected models, there does not need to be a target.

Ignored columns. These are columns that are not used.
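Tools assign these roles through their interfaces, but in hand-written code the same idea often reduces to how columns are routed into the modeling step. A minimal sketch using pandas conventions (the table and column names here are invented for illustration):

```python
import pandas as pd

# Hypothetical customer table; column names are invented for illustration
df = pd.DataFrame({
    "customer_id": [101, 102, 103],       # identification column (not an input)
    "tenure_days": [40, 385, 120],        # input column
    "monthly_spend": [20.0, 55.5, 31.0],  # input column
    "responded": [0, 1, 0],               # target column
})

input_cols = ["tenure_days", "monthly_spend"]
target_col = "responded"

X = df[input_cols]   # inputs used to build the model
y = df[target_col]   # target; omitted when building undirected models
```

Columns that appear in neither list, such as `customer_id` above, are effectively "ignored" for modeling while remaining available for scoring.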

Different tools have different names for these roles. Figure 17.4 shows how

a column is removed from consideration in Angoss Knowledge Studio.

548 Chapter 17

Figure 17.4 Angoss Knowledge Studio supports several model

roles, such as ignoring a column when building a model.

TIP: Ignored columns play a very important role in clustering. Since ignored columns are not used to build the clusters, their distribution in the clusters can be very informative. By ignoring columns such as customer profitability or response flags, we can see how these "ignored" columns are distributed in the clusters. And we might just discover something very interesting about customer profit or responders.

There are some more advanced roles as well, which are used under specific

circumstances. Figure 17.5 shows the many model roles available in SAS

Enterprise Miner. These model roles include:

Identification column. These are columns that uniquely identify each row.

In general, these columns are ignored for data mining purposes, but are

important for scoring.

Weight column. This is a column that specifies a “weight” to be applied to

each row. This is a way of creating a weighted sample by including the

weight in the data.

Cost column. The cost column specifies a cost associated with a row. For instance, if we are building a customer retention model, then the "cost" might include an estimate of each customer's value. Some tools can use this information to optimize the models that they are building.

The additional model roles available in the tool are specific to SAS Enterprise Miner.

Preparing Data for Mining 549

Figure 17.5 SAS Enterprise Miner has a wide range of available model roles.

Variable Measures

Variables appear in data and have some important properties. Although databases are concerned with the type of variables (and we'll return to this topic in a moment), data mining is concerned with the measure of variables. It is the measure that determines how the algorithms treat the values. The following measures are important for data mining:

Categorical variables can be compared for equality but there is no meaningful ordering. For example, state abbreviations are categorical. The fact that Alabama is next to Alaska alphabetically does not mean that they are closer to each other than Alabama and Tennessee, which share a geographic border but appear much further apart alphabetically.

Ordered variables can be compared with equality and with greater than and less than. Classroom grades, which range from A to F, are an example of ordered values.

Interval variables are ordered and support the operation of subtraction (although not necessarily any other mathematical operation such as addition and multiplication). Dates and temperatures are examples of intervals.


True numeric variables are interval variables that support addition and other mathematical operations. Monetary amounts and customer tenure (measured in days) are examples of numeric variables.

The difference between true numerics and intervals is subtle. However, data mining algorithms treat both of these the same way. Also, note that these measures form a hierarchy: any ordered variable is also categorical, any interval is also ordered, and any true numeric is also an interval.

There is a difference between measure and data type. A numeric variable, for instance, might represent a coding scheme, say for account status or even for state abbreviations. Although the values look like numbers, they are really categorical. Zip codes are a common example of this phenomenon.

Some algorithms expect variables to be of a certain measure. Statistical regression and neural networks, for instance, expect their inputs to be numeric. So, if a zip code field is included and stored as a number, then these algorithms treat its values as numeric, which is generally not a good approach. Decision trees, on the other hand, treat all their inputs as categorical or ordered, even when they are numbers.
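In pandas, one common way to prevent a numeric-looking code from being treated as a number is to recast it as a categorical. A minimal sketch (the zip code values are invented for illustration):

```python
import pandas as pd

# Hypothetical table with zip codes stored as numbers
df = pd.DataFrame({"zip_code": [10001, 94105, 60601, 10001]})

# As numbers, the values carry an ordering and distances that are
# meaningless for zip codes; converting to a categorical removes them.
df["zip_code"] = df["zip_code"].astype(str).astype("category")
```

After the conversion, downstream code sees distinct labels rather than quantities, so no arithmetic or ordering can accidentally be applied.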

Measure is one important property. In practice, variables have associated

types in databases and file layouts. The following sections talk about data

types and measures in more detail.

Numbers

Numbers usually represent quantities and are good variables for modeling purposes. Numeric quantities have both an ordering (which is used by decision trees) and an ability to perform arithmetic (used by other algorithms such as clustering and neural networks). Sometimes, what looks like a number really represents a code or an ID. In such cases, it is better to treat the number as a categorical value (discussed in the next two sections), since the ordering and arithmetic properties of the numbers may mislead data mining algorithms attempting to find patterns.

There are many different ways to transform numeric quantities. Figure 17.6

illustrates several common methods:

Normalization. The resulting values are made to fall within a certain range, for example, by subtracting the minimum value and dividing by the range. This process does not change the form of the distribution of the values. Normalization can be useful when using techniques that perform mathematical operations such as multiplication directly on the values, such as neural networks and K-means clustering. Decision trees are unaffected by normalization, since the normalization does not change the order of the values.


[Figure 17.6 shows the same time series in four panels: the original data, the data normalized to [0, 1], the standardized data, and the data binned as deciles.]

Figure 17.6 Normalization, standardization, and binning are typical ways to transform

a numeric variable.

Standardization. This transforms the values into the number of standard deviations from the mean, which gives a good sense of how unexpected the value is. The arithmetic is easy: subtract the average value and divide by the standard deviation. These standardized values are also called z-scores. As with normalization, standardization does not affect the ordering, so it has no effect on decision trees.

Equal-width binning. This transforms the variables into ranges that are

fixed in width. The resulting variable has roughly the same distribution

as the original variable. However, binning values affects all data mining

algorithms.

Equal-height binning. This transforms the variables into n-tiles (such as

quintiles or deciles) so that the same number of records falls into each

bin. The resulting variable has a uniform distribution.
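The four transformations above can be sketched with pandas and NumPy (the sample data here is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1,000 values drawn from a normal distribution
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=3000, scale=1000, size=1000))

# Normalization: rescale into [0, 1]; the shape of the distribution is unchanged
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization: number of standard deviations from the mean (z-scores)
standardized = (values - values.mean()) / values.std()

# Equal-width binning: ten bins of fixed width over the range of the data
equal_width = pd.cut(values, bins=10, labels=False)

# Equal-height binning: deciles, so each bin gets about the same number of rows
deciles = pd.qcut(values, q=10, labels=False)
```

Note the division of labor in pandas: `pd.cut` produces equal-width bins, while `pd.qcut` produces equal-height (n-tile) bins with a roughly uniform distribution.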

Perhaps unexpectedly, binning values can improve the performance of data mining algorithms. In the case of neural networks, binning is one of several ways of reducing the influence of outliers, because all outliers are grouped together into the same bin. In the case of decision trees, binned variables may result in child nodes having more equal sizes at high levels of the tree (that is, instead of one child getting 5 percent of the records and the other 95 percent, with the corresponding binned variable one might get 20 percent and the other 80 percent). Although the split on the binned variable is not optimal, subsequent splits may produce better trees.


Dates and Times

Dates and times are the most common examples of interval variables. These variables are very important, because they introduce the time element into data analysis. Often, the importance of date and time variables is that they provide sequence and timestamp information for other variables, such as the cause and resolution of the last complaint call.

Because there is a myriad of different formats, working with dates and time stamps can be difficult. Excel has fifteen different date formats prebuilt for cells, and the ability to customize many more. One typical internal format for dates and times is a single number: the number of days or seconds since some date in the past. When this is the case, data mining algorithms treat dates as numbers. This representation is adequate for the algorithms to detect what happened earlier and later. However, it misses other important properties, which are worth adding into the data:

Time of day

Day of the week, and whether it is a workday or weekend

Month and season

Holidays
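Deriving these properties from a raw time stamp is straightforward with the pandas datetime accessor. A sketch, assuming a hypothetical holiday calendar represented as a simple set of dates:

```python
import pandas as pd
from datetime import date

# Hypothetical time stamps, e.g., from a call-center log
stamps = pd.to_datetime(pd.Series([
    "2004-01-01 09:30:00",   # New Year's Day, a Thursday
    "2004-01-03 23:15:00",   # a Saturday night
]))

# Hypothetical holiday calendar; a real one would come from a calendar table
holidays = {date(2004, 1, 1)}

features = pd.DataFrame({
    "hour_of_day": stamps.dt.hour,              # time of day
    "day_of_week": stamps.dt.day_name(),        # day of the week
    "is_weekend": stamps.dt.dayofweek >= 5,     # workday versus weekend
    "month": stamps.dt.month,                   # month (and hence season)
    "is_holiday": stamps.dt.date.isin(holidays),
})
```

Each derived column then becomes an ordinary input variable for the modeling step.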

In his book The Data Warehouse Toolkit (Wiley, 2002), Ralph Kimball strongly recommends that a calendar be one of the first tables built for a data warehouse. We strongly agree with this recommendation, since the attributes of the calendar are often important for data mining work.

One challenge when working with dates and times is time zones. Especially in the interconnected world of the Web, the time stamp is generally the time stamp from the server computer, rather than the time where the customer is. It is worth remembering that the customer who is visiting the Web site repeatedly in the wee hours of the morning might actually be a Singapore lunchtime surfer rather than a New York night owl.

Fixed-Length Character Strings

Fixed-length character strings usually represent categorical variables, which take on a known set of values. It is always worth comparing the actual values that appear in the data to the list of legal values: to check for illegal values, to verify that the field is always populated, and to see which values are most and least frequent.
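These three checks are a few lines of pandas. A sketch, with a hypothetical set of legal status codes and a small invented sample:

```python
import pandas as pd

# Hypothetical legal codes: active, closed, suspended
legal_values = {"A", "C", "S"}

# Invented sample of a fixed-length status field; None marks a missing value
status = pd.Series(["A", "C", "A", "X", None, "S", "A"])

illegal = set(status.dropna()) - legal_values   # codes not in the legal list
missing = int(status.isna().sum())              # rows where the field is unpopulated
frequencies = status.value_counts()             # most and least frequent values
```

Here `illegal` would flag the stray "X", `missing` would count the unpopulated row, and `frequencies` would show "A" as the most common code.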

Fixed-length character strings often represent codes of some sort. Helpfully, there are often reference tables that describe what these codes mean. The reference tables can be particularly useful for data mining, because they provide hierarchies and other attributes that might not be apparent just looking at the code itself.



Character strings do have an ordering: the alphabetical ordering. How