
T I P An easy way to find columns synonymous with the target is to build
decision trees. The decision tree will choose one synonymous variable, which
can then be ignored. If the decision tree tool lets you see alternative splits, then
all such variables can be found at once.
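For readers working outside a graphical tool, the same check can be scripted. The following is a minimal sketch in Python, assuming a pandas DataFrame df whose columns are already numeric or encoded; it fits a shallow decision tree on each column by itself and flags any column that predicts the candidate target almost perfectly. The function name near_synonyms and the threshold are our own illustration, not part of any tool.

# Sketch: flag columns that are near-synonyms of the target by checking
# whether a shallow one-variable decision tree predicts it almost perfectly.
# Assumes a pandas DataFrame `df` whose columns are already numeric or encoded.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def near_synonyms(df: pd.DataFrame, target: str, threshold: float = 0.99):
    y = df[target]
    flagged = []
    for col in df.columns:
        if col == target:
            continue
        tree = DecisionTreeClassifier(max_depth=4)
        tree.fit(df[[col]], y)
        # Near-perfect accuracy from a single column suggests synonymy.
        # (ID-like columns will also score perfectly and should already be ignored.)
        if tree.score(df[[col]], y) >= threshold:
            flagged.append(col)
    return flagged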




Model Roles in Modeling
Columns contain data with data types. In addition, columns have roles with
respect to the data mining algorithms. Three important roles are:
Input columns. These are columns that are used as input into the model.
Target column(s). This column or set of columns is only used when building predictive models. These are what is interesting, such as propensity to buy a particular product, likelihood to respond to an offer, or probability of remaining a customer. When building undirected models, there does not need to be a target.
Ignored columns. These are columns that are not used.
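In code, these roles simply determine how the modeling dataset is split apart. Here is a minimal sketch using pandas; the file name and column names (responded_to_offer, customer_id, free_text_notes) are hypothetical placeholders.

import pandas as pd

# Hypothetical file and column names illustrating the three roles.
customers = pd.read_csv("customers.csv")

target_col = "responded_to_offer"                 # target column
ignored = ["customer_id", "free_text_notes"]      # ignored columns
input_cols = [c for c in customers.columns
              if c != target_col and c not in ignored]

X = customers[input_cols]   # inputs to the model
y = customers[target_col]   # what the model predicts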
Different tools have different names for these roles. Figure 17.4 shows how
a column is removed from consideration in Angoss Knowledge Studio.




Figure 17.4 Angoss Knowledge Studio supports several model
roles, such as ignoring a column when building a model.


T I P Ignored columns play a very important role in clustering. Since ignored
columns are not used to build the clusters, their distribution in the clusters can
be very informative. By ignoring columns such as customer profitability or
response flags, we can see how these “ignored” columns are distributed in the
clusters. And we might just discover something very interesting about customer
profit or responders.
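A rough sketch of this idea in Python, assuming a customer table with numeric behavioral columns plus a profit column that is deliberately left out of the clustering (the file and column names are hypothetical):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer table with numeric behavioral columns plus "profit".
df = pd.read_csv("customers.csv")
inputs = df.drop(columns=["profit", "customer_id"])   # profit is ignored

clusters = KMeans(n_clusters=5, random_state=0).fit_predict(
    StandardScaler().fit_transform(inputs))

# Profile the ignored column by cluster; big differences in average profit
# across clusters are the kind of discovery the tip describes.
print(df.assign(cluster=clusters).groupby("cluster")["profit"].mean())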

There are some more advanced roles as well, which are used under specific
circumstances. Figure 17.5 shows the many model roles available in SAS
Enterprise Miner. These model roles include:
Identification column. These are columns that uniquely identify each row.
In general, these columns are ignored for data mining purposes, but are
important for scoring.
Weight column. This is a column that specifies a “weight” to be applied to
each row. This is a way of creating a weighted sample by including the
weight in the data.
Cost column. The cost column specifies a cost associated with a row. For
instance, if we are building a customer retention model, then the “cost”
might include an estimate of each customer's value. Some tools can use
this information to optimize the models that they are building.
The additional model roles available in the tool are specific to SAS Enterprise Miner.
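In many open-source tools, the weight role surfaces as a sample-weight argument rather than a named column role. A minimal sketch with scikit-learn, using synthetic data to stand in for a real customer table:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # input columns
y = X[:, 0] + rng.normal(size=1000) > 1.5      # rare-ish binary target
weights = np.where(y, 10.0, 1.0)               # weight column: upweight responders

# The weight role maps onto sample_weight; a cost column could be used
# the same way to emphasize high-value customers during training.
model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)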




Figure 17.5 SAS Enterprise Miner has a wide range of available model roles.




Variable Measures
Variables appear in data and have some important properties. Although databases are concerned with the type of variables (and we'll return to this topic in a moment), data mining is concerned with the measure of variables. It is the
measure that determines how the algorithms treat the values. The following
measures are important for data mining:
Categorical variables can be compared for equality but there is no meaningful ordering. For example, state abbreviations are categorical. The fact that Alabama is next to Alaska alphabetically does not mean that they are closer to each other than Alabama and Tennessee, which share a geographic border but appear much further apart alphabetically.
Ordered variables can be compared with equality and with greater than and less than. Classroom grades, which range from A to F, are an example of ordered values.
Interval variables are ordered and support the operation of subtraction (although not necessarily any other mathematical operation such as addition and multiplication). Dates and temperatures are examples of intervals.


True numeric variables are interval variables that support addition and other mathematical operations. Monetary amounts and customer tenure (measured in days) are examples of numeric variables.
The difference between true numerics and intervals is subtle. However, data mining algorithms treat both of these the same way. Also, note that these measures form a hierarchy. Any ordered variable is also categorical, any interval variable is also ordered, and any numeric variable is also an interval.
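These measures map loosely onto the column types offered by a tool such as pandas: unordered and ordered categoricals, datetimes for intervals, and plain numbers for true numerics. A small illustrative sketch (the column names are invented):

import pandas as pd

df = pd.DataFrame({
    "state": ["AL", "AK", "TN"],                 # categorical: equality only
    "grade": ["B", "A", "F"],                    # ordered
    "signup": pd.to_datetime(["2003-01-05", "2003-02-10", "2003-03-02"]),  # interval
    "tenure_days": [120, 45, 310],               # true numeric
})

df["state"] = df["state"].astype("category")     # no meaningful ordering
df["grade"] = pd.Categorical(df["grade"],
                             categories=["F", "D", "C", "B", "A"], ordered=True)

# Differences between interval values are meaningful even though sums are not;
# true numerics support the full range of arithmetic.
print(df["signup"].diff())
print(df["tenure_days"].sum())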
There is a difference between measure and data type. A numeric variable,
for instance, might represent a coding scheme, say for account status or even
for state abbreviations. Although the values look like numbers, they are really
categorical. Zip codes are a common example of this phenomenon.
Some algorithms expect variables to be of a certain measure. Statistical
regression and neural networks, for instance, expect their inputs to be numeric.
So, if a zip code field is included and stored as a number, then the algorithms
treat its values as numeric, which is generally not a good approach. Decision trees, on
the other hand, treat all their inputs as categorical or ordered, even when they
are numbers.
Measure is one important property. In practice, variables have associated
types in databases and file layouts. The following sections talk about data
types and measures in more detail.

Numbers
Numbers usually represent quantities and are good variables for modeling
purposes. Numeric quantities have both an ordering (which is used by decision trees) and an ability to perform arithmetic (used by other algorithms such
as clustering and neural networks). Sometimes, what looks like a number
really represents a code or an ID. In such cases, it is better to treat the number
as a categorical value (discussed in the next two sections), since the ordering
and arithmetic properties of the numbers may mislead data mining algorithms
attempting to find patterns.
There are many different ways to transform numeric quantities. Figure 17.6
illustrates several common methods:
Normalization. The resulting values are made to fall within a certain
range, for example, by subtracting the minimum value and dividing by
the range. This process does not change the form of the distribution of
the values. Normalization can be useful when using techniques that perform mathematical operations such as multiplication directly on the values, such as neural networks and K-means clustering. Decision trees are
unaffected by normalization, since the normalization does not change
the order of the values.


[Figure 17.6 panels, plotted over time: Original Data, Normalized to [0, 1], Standardized, and Binned as Deciles.]
Figure 17.6 Normalization, standardization, and binning are typical ways to transform
a numeric variable.


Standardization. This transforms the values into the number of standard
deviations from the mean, which gives a good sense of how unexpected
the value is. The arithmetic is easy: subtract the average value and
divide by the standard deviation. These standardized values are also
called z-scores. As with normalization, standardization does not affect
the ordering, so it has no effect on decision trees.
Equal-width binning. This transforms the variables into ranges that are
fixed in width. The resulting variable has roughly the same distribution
as the original variable. However, binning values affects all data mining
algorithms.
Equal-height binning. This transforms the variables into n-tiles (such as
quintiles or deciles) so that the same number of records falls into each
bin. The resulting variable has a uniform distribution.
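The four transformations shown in Figure 17.6 are easy to reproduce outside any particular tool. A minimal sketch with pandas and numpy, using a synthetic skewed variable in place of real data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
balance = pd.Series(rng.lognormal(mean=7, sigma=1, size=1000), name="balance")

normalized = (balance - balance.min()) / (balance.max() - balance.min())  # into [0, 1]
standardized = (balance - balance.mean()) / balance.std()                 # z-scores
equal_width = pd.cut(balance, bins=10, labels=False)                      # fixed-width bins
deciles = pd.qcut(balance, q=10, labels=False)                            # equal-height bins

print(deciles.value_counts().sort_index())   # roughly 100 records per decile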
Perhaps unexpectedly, binning values can improve the performance of data
mining algorithms. In the case of neural networks, binning is one of several
ways of reducing the influence of outliers, because all outliers are grouped
together into the same bin. In the case of decision trees, binned variables may
result in child nodes having more equal sizes at high levels of the tree (that is,
instead of one child getting 5 percent of the records and the other 95 percent,
with the corresponding binned variable one might get 20 percent and the other
80 percent). Although the split on the binned variables is not optimal, subsequent splits may produce better trees.


Dates and Times
Dates and times are the most common examples of interval variables. These variables are very important, because they introduce the time element into data analysis. Often, the importance of date and time variables is that they provide sequence and timestamp information for other variables, such as the cause and resolution of the last complaint call.
Because there is a myriad of different formats, working with dates and time stamps can be difficult. Excel has fifteen different date formats prebuilt for cells, and the ability to customize many more. One typical internal format for dates and times is as a single number: the number of days or seconds since some date in the past. When this is the case, data mining algorithms treat dates
as numbers. This representation is adequate for the algorithms to detect what happened earlier and later. However, it misses other important properties, which are worth adding into the data (a short sketch follows this list):
Time of day
Day of the week, and whether it is a workday or weekend
Month and season
Holidays
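A minimal sketch of deriving these attributes with pandas, using an invented timestamp column; in practice the holiday flag would come from joining against a calendar table like the one described next.

import pandas as pd

df = pd.DataFrame({"order_ts": pd.to_datetime(
    ["2003-07-04 09:15", "2003-12-25 23:40", "2004-02-14 12:05"])})

df["hour_of_day"] = df["order_ts"].dt.hour
df["day_of_week"] = df["order_ts"].dt.dayofweek          # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5
df["month"] = df["order_ts"].dt.month
df["season"] = df["month"].map({12: "winter", 1: "winter", 2: "winter",
                                3: "spring", 4: "spring", 5: "spring",
                                6: "summer", 7: "summer", 8: "summer",
                                9: "fall", 10: "fall", 11: "fall"})

# A holiday flag typically comes from joining against a calendar table;
# this two-entry list is only a stand-in.
holidays = pd.to_datetime(["2003-07-04", "2003-12-25"])
df["is_holiday"] = df["order_ts"].dt.normalize().isin(holidays)
print(df)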

In his book The Data Warehouse Toolkit (Wiley, 2002), Ralph Kimball strongly recommends that a calendar be one of the first tables built for a data warehouse. We strongly agree with this recommendation, since the attributes of the calendar are often important for data mining work.
One challenge when working with dates and times is time zones. Especially
in the interconnected world of the Web, the time stamp is generally the time
stamp from the server computer, rather than the time where the customer is. It
is worth remembering that the customer who is visiting the Web site repeatedly in the wee hours of the morning might actually be a Singapore lunchtime surfer rather than a New York night owl.
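When the customer's time zone is known or can be inferred, the server time stamp can be shifted before deriving time-of-day attributes. A minimal sketch using Python's standard zoneinfo module; the customer's time zone here is an invented attribute.

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The server logs the visit in UTC; the customer record carries a time zone
# (an invented attribute for this sketch).
server_ts = datetime(2004, 3, 15, 4, 30, tzinfo=timezone.utc)
customer_tz = "Asia/Singapore"

local_ts = server_ts.astimezone(ZoneInfo(customer_tz))
print(local_ts)   # 12:30 local time: lunchtime, not the wee hours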

Fixed-Length Character Strings
Fixed-length character strings usually represent categorical variables, which
take on a known set of values. It is always worth comparing the actual values
that appear in the data to the list of legal values: to check for illegal values, to
verify that the field is always populated, and to see which values are most and
least frequent.
Fixed-length character strings often represent codes of some sort. Helpfully,
there are often reference tables that describe what these codes mean. The reference tables can be particularly useful for data mining, because they provide hierarchies and other attributes that might not be apparent just from looking at the code itself.
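A minimal sketch of using such a reference table, joining a code column against an invented lookup to pick up the product hierarchy behind the code:

import pandas as pd

transactions = pd.DataFrame({"product_code": ["A17", "B02", "A17"]})

# Invented reference table describing what the codes mean.
ref = pd.DataFrame({
    "product_code": ["A17", "B02"],
    "description": ["Gold savings account", "Auto loan"],
    "product_line": ["Deposits", "Lending"],
})

enriched = transactions.merge(ref, on="product_code", how="left")
print(enriched)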



Character strings do have an ordering: the alphabetical ordering. How-
