

censored data.
162 Chapter 5

Figure 5.11 shows another situation with the same result. This curve shows
sales and inventory for a retailer for one product. Sales are always less than or
equal to the inventory. On the days with the Xs, though, the inventory sold
out. What were the potential sales on these days? The potential sales are
greater than or equal to the observed sales, another example of censored data.
Truncated data poses another problem in terms of biasing samples. Truncated data is not included in databases, often because it is too old. For instance,
when Company A purchases Company B, their systems are merged. Often, the
active customers from Company B are moved into the data warehouse for
Company A. That is, all customers active on a given date are moved over.
Customers who had stopped the day before are not moved over. This is an
example of left truncation, and it pops up throughout corporate databases,
usually with no warning (unless the documentation is very good about saying what is not in the warehouse as well as what is). This can cause confusion when looking at when customers started, and discovering that all customers who started 5 years before the merger were mysteriously active for at least 5 years. This is not due to a miraculous acquisition program. This is because all the ones who stopped earlier were excluded.

Lessons Learned
This chapter talks about some basic statistical methods that are useful for analyzing data. When looking at data, it is useful to look at histograms and cumulative histograms to see what values are most common. More important, though, is looking at values over time.
One of the big questions addressed by statistics is whether observed values
are expected or not. For this, the number of standard deviations from the mean
(z-score) can be used to calculate the probability of the value being due to
chance (the p-value). High p-values mean that the results are consistent with the null hypothesis; that is, there is no evidence that anything interesting is happening. Low p-values suggest that other
factors may be influencing the results. Converting z-scores to p-values
depends on the normal distribution.
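The conversion from z-score to p-value can be sketched in a few lines of Python using only the standard library; the function name here is illustrative, and the two-tailed p-value comes from the complementary error function for the normal distribution:

```python
import math

def p_value_from_z(z):
    """Two-tailed p-value for a z-score under the standard normal
    distribution, computed via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

# A value 2 standard deviations from the mean has a p-value of
# about 0.046, so it is unlikely to be due to chance alone.
print(round(p_value_from_z(2.0), 3))
```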
Business problems often require analyzing data expressed as proportions.
Fortunately, these behave similarly to normal distributions. The formula for the
standard error for proportions (SEP) makes it possible to define a confidence
interval on a proportion such as a response rate. The standard error for the difference of proportions (SEDP) makes it possible to determine whether two values are similar. This works by defining a confidence interval for the difference between two values.
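These formulas can be sketched in Python. The function names are illustrative; the calculations use the standard definitions, SEP as the square root of p(1 - p)/n and SEDP as the combination of the two standard errors:

```python
import math

def sep(p, n):
    """Standard error of a proportion p observed on n cases."""
    return math.sqrt(p * (1 - p) / n)

def sedp(p1, n1, p2, n2):
    """Standard error of the difference of two proportions."""
    return math.sqrt(sep(p1, n1) ** 2 + sep(p2, n2) ** 2)

# 95% confidence interval for a 5% response rate on 10,000 mailings
# (1.96 standard errors on either side of the observed rate).
p, n = 0.05, 10_000
half_width = 1.96 * sep(p, n)
print(f"{p - half_width:.4f} to {p + half_width:.4f}")
```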
When designing marketing tests, the SEP and SEDP can be used for sizing
test and control groups. In particular, these groups should be large enough to
measure differences in response with a high enough confidence. Tests that
have more than two groups need to take into account an adjustment, called Bonferroni's correction, when setting the group sizes.
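Bonferroni's correction simply divides the desired overall significance level by the number of comparisons being made; a minimal sketch (the function name is illustrative):

```python
def bonferroni_alpha(overall_alpha, num_comparisons):
    """Per-comparison significance level under Bonferroni's
    correction: divide the overall level by the number of
    comparisons."""
    return overall_alpha / num_comparisons

# Comparing 4 test groups against a control at an overall 5 percent
# significance level requires each comparison to meet a stricter
# level of 0.0125.
print(bonferroni_alpha(0.05, 4))
```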
The chi-square test is another statistical method that is often useful. This
method directly calculates the estimated values for data laid out in rows and
columns. Based on these estimates, the chi-square test can determine whether
the results are likely or unlikely. As shown in an example, the chi-square test
and SEDP methods produce similar results.
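The chi-square calculation can be sketched as follows; the expected count in each cell comes from the row and column totals, and the counts used here are hypothetical, not the chapter's example:

```python
def chi_square(table):
    """Chi-square statistic for a table of observed counts
    (a list of rows). Expected counts come from row and column
    totals, assuming rows and columns are independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical responders/non-responders for two versions of an offer.
print(round(chi_square([[58, 942], [46, 954]]), 2))  # about 1.46
```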
Statisticians and data miners solve similar problems. However, because of
historical differences and differences in the nature of the problems, there are
some differences in approaches. Data miners generally have lots and lots of
data with few measurement errors. This data changes over time, and values
are sometimes incomplete. The data miner has to be particularly suspicious
about bias introduced into the data by business processes.
The next eight chapters dive into more detail on more modern techniques
for building models and understanding data. Many of these techniques have
been adopted by statisticians and build on over a century of work in this area.


Decision Trees

Decision trees are powerful and popular for both classification and prediction.
The attractiveness of tree-based methods is due largely to the fact that decision
trees represent rules. Rules can readily be expressed in English so that we
humans can understand them; they can also be expressed in a database access
language such as SQL to retrieve records in a particular category. Decision
trees are also useful for exploring data to gain insight into the relationships of
a large number of candidate input variables to a target variable. Because decision trees combine both data exploration and modeling, they are a powerful
first step in the modeling process even when building the final model using
some other technique.
There is often a trade-off between model accuracy and model transparency.
In some applications, the accuracy of a classification or prediction is the only
thing that matters; if a direct mail firm obtains a model that can accurately predict which members of a prospect pool are most likely to respond to a certain solicitation, the firm may not care how or why the model works. In other situations, the ability to explain the reason for a decision is crucial. In insurance underwriting, for example, there are legal prohibitions against discrimination based on certain variables. An insurance company could find itself in the position of having to demonstrate to a court of law that it has not used illegal discriminatory practices in granting or denying coverage. Similarly, it is more
acceptable to both the loan officer and the credit applicant to hear that an
application for credit has been denied on the basis of a computer-generated
rule (such as income below some threshold and number of existing revolving
accounts greater than some other threshold) than to hear that the decision has
been made by a neural network that provides no explanation for its action.
This chapter begins with an examination of what decision trees are, how
they work, and how they can be applied to classification and prediction problems. It then describes the core algorithm used to build decision trees and discusses some of the most popular variants of that core algorithm. Practical examples drawn from the authors' experience are used to demonstrate the utility and general applicability of decision tree models and to illustrate practical considerations that must be taken into account.

What Is a Decision Tree?
A decision tree is a structure that can be used to divide up a large collection of
records into successively smaller sets of records by applying a sequence of
simple decision rules. With each successive division, the members of the resulting sets become more and more similar to one another. The familiar division of living things into kingdoms, phyla, classes, orders, families, genera, and species, invented by the Swedish botanist Carl Linnaeus in the 1730s, provides a good example. Within the animal kingdom, a particular animal is assigned to the phylum chordata if it has a spinal cord. Additional characteristics are used to further subdivide the chordates into the birds, mammals, reptiles, and so on. These classes are further subdivided until, at the lowest level in the taxonomy, members of the same species are not only morphologically similar, they are capable of breeding and producing fertile offspring.
A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable. A decision tree may be painstakingly constructed by hand in the manner of Linnaeus and the generations of taxonomists that followed him, or it may be grown automatically by applying any one of several decision tree algorithms to a model set composed of preclassified data. This
chapter is mostly concerned with the algorithms for automatically generating
decision trees. The target variable is usually categorical and the decision tree
model is used either to calculate the probability that a given record belongs to
each of the categories, or to classify the record by assigning it to the most likely
class. Decision trees can also be used to estimate the value of a continuous
variable, although there are other techniques more suitable to that task.

Anyone familiar with the game of Twenty Questions will have no difficulty
understanding how a decision tree classifies records. In the game, one player
thinks of a particular place, person, or thing that would be known or recognized
by all the participants, but the player gives no clue to its identity. The other play­
ers try to discover what it is by asking a series of yes-or-no questions. A good
player rarely needs the full allotment of 20 questions to move all the way from
“Is it bigger than a bread box?” to “the Golden Gate Bridge.”
A decision tree represents such a series of questions. As in the game, the
answer to the first question determines the follow-up question. The initial
questions create broad categories with many members; follow-on questions
divide the broad categories into smaller and smaller sets. If the questions are
well chosen, a surprisingly short series is enough to accurately classify an
incoming record.
The game of Twenty Questions illustrates the process of using a tree for
appending a score or class to a record. A record enters the tree at the root node.
The root node applies a test to determine which child node the record will
encounter next. There are different algorithms for choosing the initial test, but
the goal is always the same: To choose the test that best discriminates among
the target classes. This process is repeated until the record arrives at a leaf node.
All the records that end up at a given leaf of the tree are classified the same
way. There is a unique path from the root to each leaf. That path is an expression of the rule used to classify the records.
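The traversal just described can be sketched with a minimal node structure. Everything here, the class names, the toy fruit tree, and its thresholds, is illustrative rather than taken from any real model or tool:

```python
class Node:
    def __init__(self, label=None, test=None, left=None, right=None):
        self.label = label  # class assigned at a leaf; None for internal nodes
        self.test = test    # function record -> bool; True means go left
        self.left = left
        self.right = right

def classify(node, record):
    """Follow the unique path from the root to a leaf, applying
    each node's test, and return the leaf's class label."""
    while node.label is None:
        node = node.left if node.test(record) else node.right
    return node.label

# A toy two-level tree: first split on size, then on color.
tree = Node(
    test=lambda r: r["size"] < 10,
    left=Node(label="cherry"),
    right=Node(
        test=lambda r: r["color"] == "red",
        left=Node(label="apple"),
        right=Node(label="melon"),
    ),
)
print(classify(tree, {"size": 80, "color": "red"}))  # apple
```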
Different leaves may make the same classification, although each leaf makes
that classification for a different reason. For example, in a tree that classifies
fruits and vegetables by color, the leaves for apple, tomato, and cherry might all
predict “red,” albeit with varying degrees of confidence since there are likely to
be examples of green apples, yellow tomatoes, and black cherries as well.
The decision tree in Figure 6.1 classifies potential catalog recipients as likely
(1) or unlikely (0) to place an order if sent a new catalog.
The tree in Figure 6.1 was created using the SAS Enterprise Miner Tree
Viewer tool. The chart is drawn according to the usual convention in data
mining circles, with the root at the top and the leaves at the bottom, perhaps
indicating that data miners ought to get out more to see how real trees grow.
Each node is labeled with a node number in the upper-right corner and the
predicted class in the center. The decision rules to split each node are printed
on the lines connecting each node to its children. The split at the root node is on “lifetime orders”; the left branch is for customers who had six or fewer orders and the right branch is for customers who had seven or more.
Any record that reaches leaf nodes 19, 14, 16, 17, or 18 is classified as likely
to respond, because the predicted class in this case is 1. The paths to these leaf
nodes describe the rules in the tree. For example, the rule for leaf 19 is If the customer has made more than 6.5 orders and it has been fewer than 756 days since the last order, the customer is likely to respond.

[Tree diagram: the root splits on lifetime orders (< 6.5 vs. ≥ 6.5). The left branch splits on $ last 24 months (at 19.475); the right branch splits on days since last order (at 756). Lower levels split on kitchen totals, women's underwear, and lifetime orders (at 1.5), ending in numbered leaf nodes each labeled 0 or 1.]
Figure 6.1 A binary decision tree classifies catalog recipients as likely or unlikely to place
an order.

Alert readers may notice that some of the splits in the decision tree appear
to make no difference. For example, nodes 17 and 18 are differentiated by the
number of orders they have made that included items in the food category, but
both nodes are labeled as responders. That is because although the probability
of response is higher in node 18 than in node 17, in both cases it is above the threshold that has been set for classifying a record as a responder. As a classifier, the model has only two outputs, one and zero. This binary classification
throws away useful information, which brings us to the next topic, using
decision trees to produce scores and probabilities.
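The point can be made concrete with a small sketch: two leaves whose scores differ, such as the 52.8 and 66.9 percent reported for nodes 17 and 18, both collapse to the same class once a threshold is applied (the function name is illustrative):

```python
def classify_leaf(proportion_responders, threshold=0.5):
    """Collapse a leaf's score (its proportion of responders)
    into a binary class by applying a threshold."""
    return 1 if proportion_responders >= threshold else 0

# Two leaves with different scores get the same class label,
# which is why the binary output throws away information.
print(classify_leaf(0.528), classify_leaf(0.669))  # 1 1
```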

Figure 6.2 is a picture of the same tree as in Figure 6.1, but using a different tree viewer and with settings modified so that the tree is now annotated with additional information, namely the percentage of records in class 1 at each node.
It is now clear that the tree describes a dataset containing half responders
and half nonresponders, because the root node has a proportion of 50 percent.
As described in Chapter 3, this is typical of a training set for a response model
with a binary target variable. Any node with more than 50 percent responders
is labeled with “1” in Figure 6.1, including nodes 17 and 18. Figure 6.2 clarifies the difference between these nodes. In Node 17, 52.8 percent of records represent responders, while in Node 18, 66.9 percent do. Clearly, a record in Node 18 is more likely to represent a responder than a record in Node 17. The proportion of records in the desired class can be used as a score, which is often

