[Figure 17.2: histogram panels. Recoverable axis labels: Duration (Minutes),
Count, and Value. Annotation: “This histogram shows a normal distribution with
a mean of 50 and a standard deviation of 10. Notice that high and low values
are very rare.”]

Figure 17.2 Histograms show the distribution of data values.
544 Chapter 17


The distribution of the values provides important insights into the data. It
shows which values are common and which are less common. Just looking at
the distribution of values brings up questions, such as why an amount is
negative or why some categorical values are not present. Although statisticians
tend to be more concerned with distributions than data miners, it is still
important to look at variable values. Here, we illustrate some special cases of
distributions that are important for data mining purposes, as well as the
special case of variables synonymous with the target.
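
The kind of first look described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the column
names (payment_type, amount), the sample values, and the helper functions are
all hypothetical, and NULL is represented by None.

```python
from collections import Counter

def value_distribution(values):
    """Return (value, count) pairs, most common first. None (NULL) counts as a value."""
    return Counter(values).most_common()

def missing_categories(values, expected):
    """Return the expected categories that never occur in the column."""
    return sorted(set(expected) - set(values))

def negative_amounts(amounts):
    """Return the amounts below zero, skipping NULLs."""
    return [a for a in amounts if a is not None and a < 0]

# Hypothetical columns, made up for illustration.
payment_type = ["card", "card", "cash", None, "card"]
amount = [19.99, -5.00, 42.50, None, 7.25]

print(value_distribution(payment_type))
print(missing_categories(payment_type, expected=["card", "cash", "check"]))
print(negative_amounts(amount))
```

Each result is a prompt for a question to the business (why is “check” never
used, why is one amount negative), not an answer in itself.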

Columns with One Value
The most degenerate distribution is a column that has only one value.
Unary-valued columns, as they are more formally known, do not contain any
information that helps to distinguish between different rows. Because they lack
any information content, they should be ignored for data mining purposes.
Having only one value is sometimes a property of the data. It is not uncommon,
for instance, for a database to define fields that are not yet populated. The
fields are only placeholders for future values, so all the values are
uniformly something such as “null” or “no” or “0.”
Before throwing out unary variables, check that NULLs are being counted
as values. Appended demographic variables sometimes have only a single
value or NULL when the value is not known. For instance, if the data provider
knows that someone is interested in golf (say, because the person subscribes
to a golfing magazine or belongs to a country club), then the “golf-enthusiast”
flag would be set to “Y.” When there is no evidence, many providers set the
flag to NULL, meaning unknown, rather than “N.”

TIP When a variable has only one value, be sure (1) that NULL is being
included in the count of the number of values and (2) that other values were
not inadvertently left out when selecting rows.


Unary-valued columns also arise when the data mining effort is focused on
a subset of customers, and the field used to filter the records is retained in the
resulting table. The fields that define this subset may all contain the same
value. If we are building a model to predict the loss-ratio (an insurance
measure) for automobile customers in New Jersey, then the state field will
always have “NJ” filled in. This field has no information content for the
sample being used, so it should be ignored for modeling purposes.
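
As a rough sketch of the check described in this section, the following Python
counts distinct values per column while treating NULL (None here) as a value
in its own right. The rows, column names, and function names are hypothetical
and serve only as an illustration.

```python
from collections import Counter

def distinct_value_counts(rows, column):
    """Count how often each value occurs in a column; None (NULL) counts as a value."""
    return Counter(row.get(column) for row in rows)

def unary_columns(rows, columns):
    """Return the columns that hold exactly one distinct value, NULL included."""
    return [c for c in columns if len(distinct_value_counts(rows, c)) == 1]

# Hypothetical sample: New Jersey customers with an appended golf flag
# and a field that has not been populated yet.
customers = [
    {"state": "NJ", "golf_enthusiast": "Y", "future_field": None},
    {"state": "NJ", "golf_enthusiast": None, "future_field": None},
    {"state": "NJ", "golf_enthusiast": None, "future_field": None},
]

print(unary_columns(customers, ["state", "golf_enthusiast", "future_field"]))
```

Because NULL is counted, the golf flag is seen as having two values (“Y” and
NULL) and is not discarded, while the state and placeholder fields are
reported as unary.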

Columns with Almost Only One Value
In “almost-unary” columns, almost all the records have the same value for that
column. A few records may differ, but only a very few. For example, retail
data may summarize all the purchases made by each customer in each department.
Very few customers may make a purchase from the automotive department of a
grocery store or the tobacco department of a department store. So, almost all
customers will have $0 for total purchases from these departments.
Purchased data often comes in an “almost-unary” format, as well. Fields
such as “people who collect porcelain dolls” or “amount spent on greens fees”
will have a null or $0 value for all but very few people. Or, some data, such as
survey data, is only available for a very small subset of the customers. These
are all extreme examples of data skew, shown in Figure 17.3.
The big question with “almost-unary” columns is, “When can they be
ignored?” To justify ignoring them, the values must have two characteristics.
First, almost all the records must have the same value. Second, there must be
so few records with a different value that they constitute a negligible portion
of the data.
What is a negligible portion of the data? It is a group so small that even if the
data mining algorithms identified it perfectly, the group would be too small to
be significant.
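
One way to put a number on “almost all the records” is to measure the share
held by the most common value. The sketch below is an assumption-laden
illustration: the function names and the 0.95 default threshold are
hypothetical choices, and whether the remaining group is negligible is still a
business judgment, not something the code decides.

```python
from collections import Counter

def dominant_share(values):
    """Return (most common value, share of records holding it) for a non-empty column."""
    if not values:
        raise ValueError("column is empty")
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

def is_almost_unary(values, threshold=0.95):
    """Flag a column whose most common value covers at least `threshold` of the records."""
    return dominant_share(values)[1] >= threshold

# Hypothetical column: total automotive purchases per grocery customer.
automotive_spend = [0.0] * 98 + [12.50, 31.00]

print(dominant_share(automotive_spend))
print(is_almost_unary(automotive_spend))
```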


[Figure 17.3: bar chart of Count by Binned Duration. The column was created by
binning telephone call durations into 10 equal-width bins, from [0,639.6] to
[5756.4,6396]. Almost all values, 9,988 out of 9,995, are in the first bin. If
variable-width bins had been chosen, then the resulting column would have been
more useful.]

Figure 17.3 An almost-unary field, such as the bins produced by equal-width bins in this
case, is useless for data mining purposes.
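
The contrast between equal-width and variable-width bins can be sketched as
follows. Equal-frequency binning is used here as one example of variable-width
binning; that choice, the function names, and the made-up durations are
assumptions for illustration, not the method used to produce the figure.

```python
from bisect import bisect_right
from collections import Counter

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins using cut points taken at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    cuts = [ordered[i * len(ordered) // n_bins] for i in range(1, n_bins)]
    return [bisect_right(cuts, v) for v in values]

# Hypothetical call durations: mostly short calls plus a handful of very long ones.
durations = list(range(1, 96)) + [1000, 3000, 5000, 6000, 6396]

print(sorted(Counter(equal_width_bins(durations, 10)).items()))
print(sorted(Counter(equal_frequency_bins(durations, 10)).items()))
```

With these made-up values, the equal-width version piles the short calls into
the first bin, while the rank-based version spreads the records evenly, so the
binned column distinguishes between rows again.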

Before ignoring a column, though, it is important to understand why the values
are so heavily skewed. What does this column tell us about the business?
Perhaps few people ever buy automotive products because only a handful of
the stores in question even sell them. Identifying customers as “automotive-
product-buyers,” in this case, may not be useful.
In other cases, an event might be rare for other reasons. The number of people
who cancel their telephone service on any given day is negligible, but over
time the numbers accumulate. So the cancellations need to be accumulated
over a longer time period, such as a month, quarter, or year. Or, the number of
people who collect porcelain dolls may be very small in itself, but when
combined with other fields, this might suggest an important segment of collectors.
The rule of thumb is that, even if a column proves to be very informative, it
is unlikely to be useful for data mining if it is almost-unary. That is, fully
understanding the rows with different values does not yield actionable results.
More specifically, if 95 to 99 percent of the values in the column are
identical, the column (in isolation) is likely to be useless without some work.
For instance, if the column in question represents the target variable for a
model, then stratified sampling can create a sample where the rare values are
more highly populated. Another approach is to combine several such columns
for creating derived variables that might prove to be valuable. As an example,
some census fields are sparsely populated, such as those for particular
occupations. However, combining some of these fields into a single field, such
as “high status occupation,” can prove useful for modeling purposes.
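
Both remedies can be sketched briefly. In this illustration the function
names, the 50 percent default mix, the column names, and the sample rows are
all hypothetical. The sampling function keeps every rare row and draws only as
many common rows as the requested mix allows.

```python
import random

def oversample_rare(rows, target, rare_value, rare_share=0.5, seed=0):
    """Stratified sample: keep all rare rows, draw common rows to reach rare_share.

    rare_share must satisfy 0 < rare_share <= 1.
    """
    rare = [r for r in rows if r[target] == rare_value]
    common = [r for r in rows if r[target] != rare_value]
    wanted = round(len(rare) * (1 - rare_share) / rare_share)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rare + rng.sample(common, min(len(common), wanted))

def combine_flags(row, columns):
    """Derive one flag that is 1 when any of the sparse columns is set."""
    return int(any(row.get(c) for c in columns))

# Hypothetical data: 1 percent of customers churned.
customers = [{"id": i, "churned": int(i % 100 == 0)} for i in range(1000)]
sample = oversample_rare(customers, target="churned", rare_value=1)
print(len(sample), sum(r["churned"] for r in sample))

# Hypothetical sparse occupation flags folded into a single derived variable.
person = {"is_physician": 0, "is_attorney": 1, "is_executive": 0}
print(combine_flags(person, ["is_physician", "is_attorney", "is_executive"]))
```

A model built on such a sample sees the rare outcome often enough to learn
from it, but its scores reflect the sample mix and not the original
proportions.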

Columns with Unique Values
At the other extreme are categorical columns that take on a different value for
every single row, or almost every row. These columns identify each customer
uniquely (or close enough), for example:
–– Customer name
–– Address
–– Telephone number
–– Customer ID
–– Vehicle identification number

These columns are also not very helpful. Why? They do not have predictive
value, because they uniquely identify each row. Such variables cause overfitting.
One caveat, which will be investigated later in this chapter, is that these
columns sometimes contain a wealth of information. Lurking inside telephone
numbers and addresses is important geographical information. Customers’ first
names give an indication of gender. Customer numbers may be sequentially
assigned, telling us which customers are more recent, and hence show up as important
variables in decision trees. These are cases where the important features (such as
geography and customer recency) should be extracted from the fields as derived
variables. However, data mining algorithms are not yet powerful enough to
extract such information from values; data miners need to do the extraction.
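
A minimal sketch of this kind of extraction follows. It assumes North American
10-digit telephone numbers and sequentially assigned numeric customer IDs;
both assumptions, along with the function names and sample values, are
hypothetical and would need to be checked against the real data.

```python
import re

def area_code(phone):
    """Return the area code of a 10-digit North American number, or None if it does not fit."""
    digits = re.sub(r"\D", "", phone or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return digits[:3] if len(digits) == 10 else None

def recency_rank(customer_ids):
    """Map each ID to a 0-1 rank, assuming that larger IDs were assigned more recently."""
    ordered = sorted(set(customer_ids))
    span = max(len(ordered) - 1, 1)
    return {cid: i / span for i, cid in enumerate(ordered)}

print(area_code("(201) 555-0123"))
print(area_code("555-0123"))
print(recency_rank([1001, 1050, 1020]))
```

The derived area code and recency rank take on far fewer distinct values than
the raw identifiers, so they can generalize across customers.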

Columns Correlated with Target
When a column is too highly correlated with the target column, it can mean
that the column is just a synonym. Here are two examples:
–– “Account number is NULL” may be synonymous with failure to
   respond to a marketing campaign. Only responders opened accounts
   and were assigned account numbers.
–– “Date of churn is not NULL” is synonymous with having churned.

Another danger is that the column reflects previous business practices. For
instance, the data may show that all customers with call forwarding also have
call waiting. This is a result of product bundling; call forwarding is sold in a
product bundle that always includes call waiting. Or the data may show that
almost all customers reside in the wealthiest areas, because this is where
customer acquisition campaigns in the past were targeted. This illustrates that
data miners need to know historical business practices. Columns synonymous
with the targets should be ignored.
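
The NULL-pattern synonyms in the examples above can be screened for
mechanically. The sketch below is a heuristic under stated assumptions: the
function names, the 0.99 threshold, and the sample rows are hypothetical, the
target is taken to be a 0/1 flag, and any column it flags still needs review
by someone who knows the business.

```python
def null_pattern_agreement(rows, column, target):
    """Share of rows where 'column is not NULL' lines up with the target, in either direction."""
    present = [row[column] is not None for row in rows]
    if all(present) or not any(present):
        return 0.0  # the NULL pattern never varies, so it cannot mirror the target
    matches = sum(p == bool(row[target]) for p, row in zip(present, rows))
    share = matches / len(rows)
    return max(share, 1 - share)  # also catches "NULL exactly when the target is set"

def suspected_synonyms(rows, columns, target, threshold=0.99):
    """Return columns whose NULL pattern agrees with the target at least `threshold` of the time."""
    return [c for c in columns
            if null_pattern_agreement(rows, c, target) >= threshold]

# Hypothetical customer rows with a 0/1 churn target.
customers = [
    {"churned": 1, "churn_date": "2003-05-01", "plan": "A"},
    {"churned": 0, "churn_date": None, "plan": "B"},
    {"churned": 0, "churn_date": None, "plan": None},
    {"churned": 1, "churn_date": "2003-06-11", "plan": "A"},
]

print(suspected_synonyms(customers, ["churn_date", "plan"], target="churned"))
```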
