number of records in the node reaches some preset lower bound, or when
the depth of the tree reaches some preset limit, the split search for that branch
is abandoned and the node is labeled as a leaf node.
Eventually, it is not possible to find any more splits anywhere in the tree and
the full decision tree has been grown. As we will see, this full tree is generally
not the tree that does the best job of classifying a new set of records.
Decision-tree-building algorithms begin by trying to find the input variable
that does the best job of splitting the data among the desired categories. At each
succeeding level of the tree, the subsets created by the preceding split are themselves split according to whatever rule works best for them. The tree continues
to grow until it is no longer possible to find better ways to split up incoming
records. If there were a completely deterministic relationship between the input
variables and the target, this recursive splitting would eventually yield a tree
with completely pure leaves. It is easy to manufacture examples of this sort, but
they do not occur very often in marketing or CRM applications.
Customer behavior data almost never contains such clear, deterministic
relationships between inputs and outputs. The fact that two customers have
the exact same description in terms of the available input variables does not
ensure that they will exhibit the same behavior. A decision tree for a catalog
response model might include a leaf representing females with age greater
than 50, three or more purchases within the last year, and total lifetime spending of over $145. The customers reaching this leaf will typically be a mix of
responders and nonresponders. If the leaf in question is labeled “responder,”
then the proportion of nonresponders is the error rate for this leaf. The ratio of
the proportion of responders in this leaf to the proportion of responders in the
population is the lift at this leaf.
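As a quick sketch of the arithmetic, suppose the leaf described above held 70 responders and 30 nonresponders, against a 5 percent response rate in the population. These counts are purely illustrative, not from the text:

```python
# Hypothetical counts for the leaf described above (not from the text)
leaf_responders, leaf_nonresponders = 70, 30
population_response_rate = 0.05          # assumed overall response rate

leaf_size = leaf_responders + leaf_nonresponders
# The leaf is labeled "responder", so the nonresponders are the errors
error_rate = leaf_nonresponders / leaf_size
# Lift: the leaf's response rate relative to the population's
lift = (leaf_responders / leaf_size) / population_response_rate  # 14.0
```

Here the leaf's error rate is 0.3, and its lift is 0.7 / 0.05 = 14, meaning responders are 14 times more concentrated in this leaf than in the population at large.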
One circumstance where deterministic rules are likely to be discovered is
when patterns in data reflect business rules. The authors had this fact driven
home to them by an experience at Caterpillar, a manufacturer of diesel
engines. We built a decision tree model to predict which warranty claims
would be approved. At the time, the company had a policy by which certain
176 Chapter 6

claims were paid automatically. The results were startling: The model was 100
percent accurate on unseen test data. In other words, it had discovered the
exact rules used by Caterpillar to classify the claims. On this problem, a neural
network tool was less successful. Of course, discovering known business rules
may not be particularly useful; it does, however, underline the effectiveness of
decision trees on rule-oriented problems.
Many domains, ranging from genetics to industrial processes, really do have
underlying rules, though these may be quite complex and obscured by noisy
data. Decision trees are a natural choice when you suspect the existence of
underlying rules.

Measuring the Effectiveness of a Decision Tree
The effectiveness of a decision tree, taken as a whole, is determined by applying it to the test set (a collection of records not used to build the tree) and
observing the percentage classified correctly. This provides the classification
error rate for the tree as a whole, but it is also important to pay attention to the
quality of the individual branches of the tree. Each path through the tree represents a rule, and some rules are better than others.
At each node, whether a leaf node or a branching node, we can measure:
- The number of records entering the node
- The proportion of records in each class
- How those records would be classified if this were a leaf node
- The percentage of records classified correctly at this node
- The variance in distribution between the training set and the test set

Of particular interest is the percentage of records classified correctly at this
node. Surprisingly, sometimes a node higher up in the tree does a better job of
classifying the test set than nodes lower down.
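A minimal sketch in Python of the first four measures, assuming the node is labeled with its majority class (the train/test distribution comparison is omitted):

```python
from collections import Counter

def node_statistics(labels):
    """Compute the per-node measures listed above for the records
    reaching a node, labeling the node with its majority class."""
    counts = Counter(labels)
    n = len(labels)
    label, label_count = counts.most_common(1)[0]
    return {
        "records": n,                                       # records entering
        "proportions": {cls: c / n for cls, c in counts.items()},
        "label_if_leaf": label,                             # majority class
        "pct_correct": label_count / n,                     # if labeled so
    }

stats = node_statistics(["responder"] * 9 + ["nonresponder"])
# stats["label_if_leaf"] is "responder"; stats["pct_correct"] is 0.9
```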

Tests for Choosing the Best Split
A number of different measures are available to evaluate potential splits. Algorithms developed in the machine learning community focus on the increase in purity resulting from a split, while those developed in the statistics community focus on the statistical significance of the difference between the distributions of the child nodes. Alternate splitting criteria often lead to trees that look quite different from one another, but have similar performance. That is because there are usually many candidate splits with very similar performance. Different purity measures lead to different candidates being selected, but since all of the measures are trying to capture the same idea, the resulting models tend to behave similarly.
Decision Trees 177

Purity and Diversity
The first edition of this book described splitting criteria in terms of the decrease
in diversity resulting from the split. In this edition, we refer instead to the
increase in purity, which seems slightly more intuitive. The two phrases refer to
the same idea. A purity measure that ranges from 0 (when no two items in the
sample are in the same class) to 1 (when all items in the sample are in the same
class) can be turned into a diversity measure by subtracting it from 1. Some of
the measures used to evaluate decision tree splits assign the lowest score to a
pure node; others assign the highest score to a pure node. This discussion
refers to all of them as purity measures, and the goal is to optimize purity by
minimizing or maximizing the chosen measure.
Figure 6.5 shows a good split. The parent node contains equal numbers of
light and dark dots. The left child contains nine light dots and one dark dot.
The right child contains nine dark dots and one light dot. Clearly, the purity
has increased, but how can the increase be quantified? And how can this split
be compared to others? That requires a formal definition of purity, several of
which are listed below.

Figure 6.5 A good split on a binary categorical variable increases purity.

Purity measures for evaluating splits for categorical target variables include:
- Gini (also called population diversity)
- Entropy (also called information gain)
- Information gain ratio
- Chi-square test

When the target variable is numeric, one approach is to bin the value and use
one of the above measures. There are, however, two measures in common
use for numeric targets:
- Reduction in variance
- F test

Note that the choice of an appropriate purity measure depends on whether
the target variable is categorical or numeric. The type of the input variable does
not matter, so an entire tree is built with the same purity measure. The split
illustrated in Figure 6.5 might be provided by a numeric input variable (AGE > 46) or
by a categorical variable (STATE is a member of CT, MA, ME, NH, RI, VT). The
purity of the children is the same regardless of the type of split.

Gini or Population Diversity
One popular splitting criterion is named Gini, after Italian statistician and
economist, Corrado Gini. This measure, which is also used by biologists and
ecologists studying population diversity, gives the probability that two items
chosen at random from the same population are in the same class. For a pure
population, this probability is 1.
The Gini measure of a node is simply the sum of the squares of the proportions of the classes. For the split shown in Figure 6.5, the parent population has an equal number of light and dark dots. A node with equal numbers of each of two classes has a score of 0.5² + 0.5² = 0.5, which is expected because the chance of picking the same class twice by random selection with replacement is one out of two. The Gini score for either of the resulting nodes is 0.1² + 0.9² = 0.82. A perfectly pure node would have a Gini score of 1. A node that is evenly balanced would have a Gini score of 0.5. Sometimes the score is doubled and 1 is subtracted, so that it falls between 0 and 1. However, such a manipulation makes no difference when comparing different scores to optimize purity.
To calculate the impact of a split, take the Gini score of each child node and
multiply it by the proportion of records that reach that node and then sum the
resulting numbers. In this case, since the records are split evenly between
the two nodes resulting from the split and each node has the same Gini score,
the score for the split is the same as for either of the two nodes.
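The calculation can be sketched in Python, using the counts from Figure 6.5 (a 10/10 parent split into 9/1 and 1/9 children):

```python
def gini_score(counts):
    """Gini purity as defined above: the sum of squared class proportions.
    1.0 for a pure node; 0.5 for an evenly balanced two-class node."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def gini_split_score(children):
    """Score of a split: each child's Gini score weighted by the
    proportion of records reaching that child."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini_score(child) for child in children)

parent = gini_score([10, 10])               # 0.5
split = gini_split_score([[9, 1], [1, 9]])  # 0.82
```

Because both children have the same score (0.82) and receive half the records each, the split score equals the child score, just as the text describes.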

Entropy Reduction or Information Gain
Information gain uses a clever idea for defining purity. If a leaf is entirely pure, then the classes in the leaf can be easily described: they all fall in the same class. On the other hand, if a leaf is highly impure, then describing it is much
more complicated. Information theory, a part of computer science, has devised
a measure for this situation called entropy. In information theory, entropy is a
measure of how disorganized a system is. A comprehensive introduction to
information theory is far beyond the scope of this book. For our purposes, the
intuitive notion is that the number of bits required to describe a particular situation or outcome depends on the size of the set of possible outcomes. Entropy can be thought of as a measure of the number of yes/no questions it would take to determine the state of the system. If there are 16 possible states, it takes log2(16), or four bits, to enumerate them or identify a particular one. Additional information reduces the number of questions needed to determine the state of the system, so information gain means the same thing as entropy reduction. Both terms are used to describe decision tree algorithms.
The entropy of a particular decision tree node is the sum, over all the classes represented in the node, of the proportion of records belonging to a particular class multiplied by the base two logarithm of that proportion. (Actually, this sum is usually multiplied by -1 in order to obtain a positive number.) The entropy of a split is simply the sum of the entropies of all the nodes resulting from the split weighted by each node's proportion of the records. When entropy reduction is chosen as a splitting criterion, the algorithm searches for the split that reduces entropy (or, equivalently, increases information) by the greatest amount.
For a binary target variable such as the one shown in Figure 6.5, the formula
for the entropy of a single node is

-1 * ( P(dark)log2P(dark) + P(light)log2P(light) )

In this example, P(dark) and P(light) are both one half. Plugging 0.5 into the
entropy formula gives:

-1 * (0.5 log2(0.5) + 0.5 log2(0.5))

The first term is for the light dots and the second term is for the dark dots, but since there are equal numbers of light and dark dots, the expression simplifies to -1 * log2(0.5), which is +1. What is the entropy of the nodes resulting from the split? One of them has one dark dot and nine light dots, while the other has nine dark dots and one light dot. Clearly, they each have the same level of entropy. Namely,

-1 * (0.1 log2(0.1) + 0.9 log2(0.9)) = 0.33 + 0.14 = 0.47

To calculate the total entropy of the system after the split, multiply the
entropy of each node by the proportion of records that reach that node and
add them up to get an average. In this example, each of the new nodes receives
half the records, so the total entropy is the same as the entropy of each of the
nodes, 0.47. The total entropy reduction or information gain due to the split is
therefore 0.53. This is the figure that would be used to compare this split with
other candidates.
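The entire calculation for Figure 6.5 can be sketched in Python:

```python
import math

def entropy(counts):
    """Node entropy: -1 times the sum of p * log2(p) over the classes
    present in the node (empty classes contribute nothing)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, children):
    """Entropy reduction: parent entropy minus the weighted average
    of the child entropies."""
    total = sum(parent_counts)
    after = sum(sum(ch) / total * entropy(ch) for ch in children)
    return entropy(parent_counts) - after

entropy([10, 10])                              # 1.0 (evenly balanced)
entropy([9, 1])                                # about 0.47
information_gain([10, 10], [[9, 1], [1, 9]])   # about 0.53
```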

Information Gain Ratio
The entropy split measure can run into trouble when combined with a splitting
methodology that handles categorical input variables by creating a separate
branch for each value. This was the case for ID3, a decision tree tool developed by Australian researcher J. Ross Quinlan in the nineteen-eighties, which became part of several commercial data mining software packages. The problem is that just by breaking the larger data set into many small subsets, the number of classes represented in each node tends to go down, and with it, the entropy. The decrease in entropy due solely to the number of branches is called the intrinsic information of a split. (Recall that entropy is defined as -1 times the sum over all the branches of the probability of each branch times the log base 2 of that probability. For a random n-way split, the probability of each branch is 1/n, so the intrinsic information of an n-way split is simply -1 * n * (1/n)log2(1/n), or log2(n).) Because of the intrinsic information of many-way splits, decision trees built using the entropy reduction splitting criterion without any correction for the intrinsic information due to the split tend to be quite bushy.
Bushy trees with many multi-way splits are undesirable as these splits lead to
small numbers of records in each node, a recipe for unstable models.
In reaction to this problem, C5 and other descendants of ID3 that once used information gain now use the ratio of the total information gain due to a proposed split to the intrinsic information attributable solely to the number of branches created as the criterion for evaluating proposed splits. This test reduces the tendency towards very bushy trees that was a problem in earlier decision tree software packages.
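A sketch of the gain ratio under the same conventions as before (the entropy helper is repeated so the snippet stands alone):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratio(parent_counts, children):
    """Information gain divided by the intrinsic information of the
    split, i.e. the entropy of the branch sizes alone."""
    total = sum(parent_counts)
    gain = entropy(parent_counts) - sum(
        sum(ch) / total * entropy(ch) for ch in children)
    intrinsic = entropy([sum(ch) for ch in children])
    return gain / intrinsic

# For the even two-way split of Figure 6.5, the intrinsic information
# is exactly 1 bit, so gain and gain ratio coincide (about 0.53).
gain_ratio([10, 10], [[9, 1], [1, 9]])
```

A many-way split inflates the denominator (log2 of the number of branches for a random split), so it must buy proportionally more gain to be chosen, which is exactly the correction described above.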

Chi-Square Test
As described in Chapter 5, the chi-square (χ²) test is a test of statistical significance developed by the English statistician Karl Pearson in 1900. Chi-square is
defined as the sum of the squares of the standardized differences between the
expected and observed frequencies of some occurrence between multiple disjoint
samples. In other words, the test is a measure of the probability that an
observed difference between samples is due only to chance. When used to
measure the purity of decision tree splits, higher values of chi-square mean
that the variation is more significant, and not due merely to chance.
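The standard Pearson statistic for a split can be sketched as follows, where the expected count for each cell gives each child the parent's class proportions; the 9/1 and 1/9 counts are the Figure 6.5 split:

```python
def chi_square(children):
    """Pearson chi-square for a split: sum over every (child, class)
    cell of (observed - expected)^2 / expected, where the expected
    count assumes each child inherits the parent's class proportions."""
    n_classes = len(children[0])
    class_totals = [sum(ch[i] for ch in children) for i in range(n_classes)]
    grand_total = sum(class_totals)
    chi2 = 0.0
    for child in children:
        child_total = sum(child)
        for i, observed in enumerate(child):
            expected = child_total * class_totals[i] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Figure 6.5 split: every cell expects 5 records but sees 9 or 1,
# giving 4 cells * (4^2 / 5) = 12.8
chi_square([[9, 1], [1, 9]])
```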


Consider the following two splits, illustrated in the figure below. In both cases,
the population starts out perfectly balanced between dark and light dots with
ten of each type. One proposed split is the same as in Figure 6.5, yielding two

