. 38
( 137 .)


more useful than just the classification. For a binary outcome, a classification
merely splits records into two groups. A score allows the records to be sorted
from most likely to least likely to be members of the desired class.

Figure 6.2 A decision tree annotated with the proportion of records in class 1 at each
node shows the probability of the classification.
170 Chapter 6

For many applications, a score capable of rank-ordering a list is all that is
required. This is sufficient to choose the top N percent for a mailing and to cal­
culate lift at various depths in the list. For some applications, however, it is not
sufficient to know that A is more likely to respond than B; we want to know
that actual likelihood of a response from A. Assuming that the prior probabil­
ity of a response is known, it can be used to calculate the probability of
response from the score generated on the oversampled data used to build the
tree. Alternatively, the model can be applied to preclassified data that has a
distribution of responses that reflects the true population. This method, called
backfitting, creates scores using the class proportions at the tree™s leaves to rep­
resent the probability that a record drawn from a similar population is a mem­
ber of the class. These, and related issues, are discussed in detail in Chapter 3.

Suppose the important business question is not who will respond but what will
be the size of the customer™s next order? The decision tree can be used to answer
that question too. Assuming that order amount is one of the variables available
in the preclassified model set, the average order size in each leaf can be used as
the estimated order size for any unclassified record that meets the criteria for
that leaf. It is even possible to use a numeric target variable to build the tree;
such a tree is called a regression tree. Instead of increasing the purity of a cate­
gorical variable, each split in the tree is chosen to decrease the variance in the
values of the target variable within each child node.
The fact that trees can be (and sometimes are) used to estimate continuous
values does not make it a good idea. A decision tree estimator can only gener­
ate as many discrete values as there are leaves in the tree. To estimate a contin­
uous variable, it is preferable to use a continuous function. Regression models
and neural network models are generally more appropriate for estimation.

Trees Grow in Many Forms
The tree in Figure 6.1 is a binary tree of nonuniform depth; that is, each nonleaf
node has two children and leaves are not all at the same distance from the root.
In this case, each node represents a yes-or-no question, whose answer deter­
mines by which of two paths a record proceeds to the next level of the tree. Since
any multiway split can be expressed as a series of binary splits, there is no real
need for trees with higher branching factors. Nevertheless, many data mining
tools are capable of producing trees with more than two branches. For example,
some decision tree algorithms split on categorical variables by creating a branch
for each class, leading to trees with differing numbers of branches at different
nodes. Figure 6.3 illustrates a tree that uses both three-way and two-way splits
for the same classification problem as the tree in Figures 6.1 and 6.2.
Decision Trees 171


tot units demand

< 4.5 ≥ 15.5
[4.5, 15.5]

40% 47% 66%

num VOM orders days since last days since last

<1 ≥1 ≥ 725 < 756 ≥ 756
1 221.5

39% 42% 58% 52% 37% 72% 40%

average $ demand GMM buyer flag

< 47.6 ≥ 47.6 0 1

40% 33% 52% 54%

avg $/month tot $ 9604

<2 2.10132,4. ≥ 4.116 9.15, 4 ≥ 41.255

37% 48% 85% 49% 69% 50%
Figure 6.3 This ternary decision tree is applied to the same the same classification
problem as in Figure 6.1.

T I P There is no relationship between the number of branches allowed at a
node and the number of classes in the target variable. A binary tree (that is,
one with two-way splits) can be used to classify records into any number of
categories, and a tree with multiway splits can be used to classify a binary
target variable.

How a Decision Tree Is Grown
Although there are many variations on the core decision tree algorithm, all of
them share the same basic procedure: Repeatedly split the data into smaller
and smaller groups in such a way that each new generation of nodes has
greater purity than its ancestors with respect to the target variable. For most of
this discussion, we assume a binary, categorical target variable, such as
responder/nonresponder. This simplifies the explanations without much loss
of generality.
172 Chapter 6

Finding the Splits
At the start of the process, there is a training set consisting of preclassified
records”that is, the value of the target variable is known for all cases. The goal
is to build a tree that assigns a class (or a likelihood of membership in each class)
to the target field of a new record based on the values of the input variables.
The tree is built by splitting the records at each node according to a function
of a single input field. The first task, therefore, is to decide which of the input
fields makes the best split. The best split is defined as one that does the best job
of separating the records into groups where a single class predominates in
each group.
The measure used to evaluate a potential split is purity. The next section
talks about specific methods for calculating purity in more detail. However,

they are all trying to achieve the same effect. With all of them, low purity

means that the set contains a representative distribution of classes (relative to
the parent node), while high purity means that members of a single class pre­
dominate. The best split is the one that increases the purity of the record sets
by the greatest amount. A good split also creates nodes of similar size, or at
least does not create nodes containing very few records.
These ideas are easy to see visually. Figure 6.4 illustrates some good and bad


Original Data

Poor Split Poor Split

Good Split
Figure 6.4 A good split increases purity for all the children.

Decision Trees 173

The first split is a poor one because there is no increase in purity. The initial
population contains equal numbers of the two sorts of dot; after the split, so
does each child. The second split is also poor, because all though purity is
increased slightly, the pure node has few members and the purity of the larger
child is only marginally better than that of the parent. The final split is a good
one because it leads to children of roughly same size and with much higher
purity than the parent.
Tree-building algorithms are exhaustive. They proceed by taking each input
variable in turn and measuring the increase in purity that results from every
split suggested by that variable. After trying all the input variables, the one
that yields the best split is used for the initial split, creating two or more chil­
dren. If no split is possible (because there are too few records) or if no split
makes an improvement, then the algorithm is finished with that node and the
node become a leaf node. Otherwise, the algorithm performs the split and
repeats itself on each of the children. An algorithm that repeats itself in this
way is called a recursive algorithm.
Splits are evaluated based on their effect on node purity in terms of the tar­
get variable. This means that the choice of an appropriate splitting criterion
depends on the type of the target variable, not on the type of the input variable.
With a categorical target variable, a test such as Gini, information gain, or chi-
square is appropriate whether the input variable providing the split is numeric
or categorical. Similarly, with a continuous, numeric variable, a test such as
variance reduction or the F-test is appropriate for evaluating the split regard­
less of whether the input variable providing the split is categorical or numeric.

Splitting on a Numeric Input Variable
When searching for a binary split on a numeric input variable, each value that
the variable takes on in the training set is treated as a candidate value for
the split. Splits on a numeric variable take the form X<N. All records where the
value of X (the splitting variable) is less than some constant N are sent to one
child and all records where the value of X is greater than or equal to N are sent
to the other. After each trial split, the increase in purity, if any, due to the
split is measured. In the interests of efficiency, some implementations of
the splitting algorithm do not actually evaluate every value; they evaluate a
sample of the values instead.
When the decision tree is scored, the only use that it makes of numeric
inputs is to compare their values with the split points. They are never multi­
plied by weights or added together as they are in many other types of models.
This has the important consequence that decision trees are not sensitive to out­
liers or skewed distributions of numeric variables, because the tree only uses
the rank of numeric variables and not their absolute values.
174 Chapter 6

Splitting on a Categorical Input Variable
The simplest algorithm for splitting on a categorical input variable is simply to
create a new branch for each class that the categorical variable can take on. So,
if color is chosen as the best field on which to split the root node, and the train­
ing set includes records that take on the values red, orange, yellow, green, blue,
indigo, and violet, then there will be seven nodes in the next level of the tree.
This approach is actually used by some software packages, but it often yields
poor results. High branching factors quickly reduce the population of training
records available at each node in lower levels of the tree, making further splits
less reliable.
A more common approach is to group together the classes that, taken indi­
vidually, predict similar outcomes. More precisely, if two classes of the input
variable yield distributions of the classes of the output variable that do not
differ significantly from one another, the two classes can be merged. The usual
test for whether the distributions differ significantly is the chi-square test.

Splitting in the Presence of Missing Values
One of the nicest things about decision trees is their ability to handle missing
values in either numeric or categorical input fields by simply considering null
to be a possible value with its own branch. This approach is preferable to
throwing out records with missing values or trying to impute missing values.
Throwing out records due to missing values is likely to create a biased training
set because the records with missing values are not likely to be a random sam­
ple of the population. Replacing missing values with imputed values has the
risk that important information provided by the fact that a value is missing will
be ignored in the model. We have seen many cases where the fact that a partic­
ular value is null has predictive value. In one such case, the count of non-null
values in appended household-level demographic data was positively corre­
lated with response to an offer of term life insurance. Apparently, people who
leave many traces in Acxiom™s household database (by buying houses, getting
married, registering products, and subscribing to magazines) are more likely to
be interested in life insurance than those whose lifestyles leave more fields null.

T I P Decision trees can produce splits based on missing values of an input
variable. The fact that a value is null can often have predictive value so do not
be hasty to filter out records with missing values or to try to replace them with
imputed values.

Although splitting on null as a separate class is often quite valuable, at least
one data mining product offers an alternative approach as well. In Enterprise
Miner, each node stores several possible splitting rules, each one based on a
different input field. When a null value is encountered in the field that yields
Decision Trees 175

the best splits, the software uses the surrogate split based on the next best avail­
able input variable.

Growing the Full Tree
The initial split produces two or more child nodes, each of which is then split in
the same manner as the root node. Once again, all input fields are considered as
candidate splitters, even fields already used for splits. However, fields that take
on only one value are eliminated from consideration since there is no way that
they can be used to create a split. A categorical field that has been used as a
splitter higher up in the tree is likely to become single-valued fairly quickly. The
best split for each of the remaining fields is determined. When no split can be
found that significantly increases the purity of a given node, or when the


. 38
( 137 .)