

equal-sized nodes, one 90 percent dark and the other 90 percent light. The
second split yields one node that is 100 percent dark but has only 6 dots,
and another that has 14 dots and is 71.4 percent light.

Which of these two proposed splits increases purity the most?

As explained in the main text, the Gini score for each of the two children in the
first proposed split is 0.1² + 0.9² = 0.820. Since the children are the same size,
this is also the score for the split.
What about the second proposed split? The Gini score of the left child is 1,
since only one class is represented. The Gini score of the right child is

Gini_right = (4/14)² + (10/14)² = 0.082 + 0.510 = 0.592

and the Gini score for the split is:

(6/20)Gini_left + (14/20)Gini_right = 0.3 × 1 + 0.7 × 0.592 = 0.714

Since the Gini score for the first proposed split (0.820) is greater than for the
second proposed split (0.714), a tree built using the Gini criterion will prefer the
split that yields two nearly pure children over the split that yields one
completely pure child along with a larger, less pure one.
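This comparison can be reproduced in a few lines of Python. The dot counts come from the example above; the helper function names are invented for illustration:

```python
def gini_score(counts):
    """Sum of squared class proportions (higher means purer)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def split_score(children):
    """Weighted average of the child Gini scores, weighted by child size."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini_score(c) for c in children)

# Each child is a (dark, light) count pair.
split1 = [(9, 1), (1, 9)]    # two nearly pure, equal-sized children
split2 = [(6, 0), (4, 10)]   # one pure child, one larger mixed child

print(round(split_score(split1), 3))  # 0.82
print(round(split_score(split2), 3))  # 0.714
```

Since 0.820 > 0.714, the first split wins under the Gini criterion, matching the calculation above.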
182 Chapter 6


As calculated in the main text, the entropy of the parent node is 1. The entropy
of the first proposed split is also calculated in the main text and found to be
0.47, so the information gain for the first proposed split is 0.53.
How much information is gained by the second proposed split? The left child
is pure and so has entropy of 0. As for the right child, the formula for entropy is
−(P(dark) log₂ P(dark) + P(light) log₂ P(light))

so the entropy of the right child is:
Entropy_right = −((4/14) log₂(4/14) + (10/14) log₂(10/14)) = 0.516 + 0.347 = 0.863

The entropy of the split is the weighted average of the entropies of the
resulting nodes. In this case,

0.3 × Entropy_left + 0.7 × Entropy_right = 0.3 × 0 + 0.7 × 0.863 = 0.604
Subtracting 0.604 from the entropy of the parent (which is 1) yields an
information gain of 0.396. This is less than 0.53, the information gain from the
first proposed split, so in this case the entropy splitting criterion also prefers the
first split to the second. Compared to Gini, the entropy criterion does have a
stronger preference for nodes that are purer, even if smaller. This may be
appropriate in domains where there really are clear underlying rules, but it
tends to lead to less stable trees in “noisy” domains such as response to
marketing offers.
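The same comparison can be sketched in terms of information gain; the counts are those of the example, and the function names are again invented:

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts (0 for a pure node)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent)
    split_entropy = sum(sum(c) / total * entropy(c) for c in children)
    return entropy(parent) - split_entropy

# Parent node: 10 dark, 10 light dots.
print(round(info_gain((10, 10), [(9, 1), (1, 9)]), 2))   # 0.53
print(round(info_gain((10, 10), [(6, 0), (4, 10)]), 3))  # 0.396
```

The `if c` guard skips empty classes, since a pure node contributes zero entropy rather than an undefined log of zero.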

For example, suppose the target variable is a binary flag indicating whether
or not customers continued their subscriptions at the end of the introductory
offer period and the proposed split is on acquisition channel, a categorical
variable with three classes: direct mail, outbound call, and email. If the acquisition
channel had no effect on renewal rate, we would expect the number of
renewals in each class to be proportional to the number of customers acquired
through that channel. For each channel, the chi-square test subtracts that
expected number of renewals from the actual observed renewals, squares the
difference, and divides the difference by the expected number. The values for
each class are added together to arrive at the score. As described in Chapter 5,
the chi-square distribution provides a way to translate this chi-square score into
a probability. To measure the purity of a split in a decision tree, the score is
sufficient. A high score means that the proposed split successfully splits the
population into subpopulations with significantly different distributions.
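The score described above can be sketched as follows; note that the renewal counts per channel are hypothetical numbers invented only to illustrate the formula:

```python
# Hypothetical (renewed, not renewed) counts by acquisition channel.
observed = {
    "direct mail":   (180, 120),
    "outbound call": (60, 90),
    "email":         (60, 90),
}

total = sum(r + n for r, n in observed.values())
renew_rate = sum(r for r, _ in observed.values()) / total  # overall renewal rate

score = 0.0
for r, n in observed.values():
    size = r + n
    # Expected counts if channel had no effect on renewal.
    for obs, exp in ((r, size * renew_rate), (n, size * (1 - renew_rate))):
        score += (obs - exp) ** 2 / exp  # (observed - expected)^2 / expected

print(round(score, 2))  # 24.0
```

A score this large relative to the degrees of freedom would indicate that renewal rate really does differ by channel, making acquisition channel a good split.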
The chi-square test gives its name to CHAID, a well-known decision tree
algorithm first published by Gordon V. Kass in 1980. The full acronym stands
for Chi-square Automatic Interaction Detector. As the phrase “automatic interaction
detector” implies, the original motivation for CHAID was detecting


statistical relationships between variables. It does this by building a decision
tree, so the method has come to be used as a classification tool as well. CHAID
makes use of the chi-square test in several ways: first to merge classes that do
not have significantly different effects on the target variable; then to choose a
best split; and finally to decide whether it is worth performing any additional
splits on a node. In the research community, the current fashion is away from
methods that continue splitting only as long as it seems likely to be useful and
towards methods that involve pruning. Some researchers, however, still prefer
the original CHAID approach, which does not rely on pruning.
The chi-square test applies to categorical variables so in the classic CHAID
algorithm, input variables must be categorical. Continuous variables must be
binned or replaced with ordinal classes such as high, medium, and low. Some
current decision tree tools, such as SAS Enterprise Miner, use the chi-square test
for creating splits using categorical variables, but use another statistical test,
the F test, for creating splits on continuous variables. Also, some implementations
of CHAID continue to build the tree even when the splits are not statistically
significant, and then apply pruning algorithms to prune the tree back.

Reduction in Variance
The four previous measures of purity all apply to categorical targets. When the
target variable is numeric, a good split should reduce the variance of the target
variable. Recall that variance is a measure of the tendency of the values in a
population to stay close to the mean value. In a sample with low variance,
most values are quite close to the mean; in a sample with high variance, many
values are quite far from the mean. The formula for the variance is the
mean of the squared deviations from the mean. Although the
reduction in variance split criterion is meant for numeric targets, the dark and
light dots in Figure 6.5 can still be used to illustrate it by considering the dark
dots to be 1 and the light dots to be 0. The mean value in the parent node is
clearly 0.5. Every one of the 20 observations differs from the mean by 0.5, so
the variance is (20 × 0.5²) / 20 = 0.25. After the split, the left child has 9 dark
dots and one light dot, so the node mean is 0.9. Nine of the observations differ
from the mean value by 0.1 and one observation differs from the mean
value by 0.9, so the variance is (0.9² + 9 × 0.1²) / 10 = 0.09. Since both nodes
resulting from the split have variance 0.09, the total variance after the split is
also 0.09. The reduction in variance due to the split is 0.25 − 0.09 = 0.16.
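The arithmetic can be checked directly, coding dark dots as 1 and light dots as 0:

```python
def variance(values):
    """Mean of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

parent = [1] * 10 + [0] * 10   # 10 dark, 10 light dots
left   = [1] * 9 + [0]         # 9 dark, 1 light
right  = [0] * 9 + [1]         # 9 light, 1 dark

before = variance(parent)      # 0.25
# Variance after the split: size-weighted average of the child variances.
after = (len(left) * variance(left) + len(right) * variance(right)) / len(parent)
print(round(before - after, 2))  # 0.16
```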

F Test
Another split criterion that can be used for numeric target variables is the F test,
named for another famous Englishman: the statistician, astronomer, and
geneticist Ronald A. Fisher. Fisher and Pearson reportedly did not get along despite,
or perhaps because of, the large overlap in their areas of interest. Fisher's test

does for continuous variables what Pearson's chi-square test does for categorical
variables. It provides a measure of the probability that samples with different
means and variances are actually drawn from the same population.
There is a well-understood relationship between the variance of a sample
and the variance of the population from which it was drawn. (In fact, so long
as the samples are of reasonable size and randomly drawn from the population,
sample variance is a good estimate of population variance; very small
samples, with fewer than 30 or so observations, usually have higher variance
than their corresponding populations.) The F test looks at the relationship
between two estimates of the population variance: one derived by pooling all
the samples and calculating the variance of the combined sample, and one
derived from the between-sample variance, calculated as the variance of the
sample means. If the various samples are randomly drawn from the same
population, these two estimates should agree closely.
The F score is the ratio of the two estimates. It is calculated by dividing the
between-sample estimate by the pooled sample estimate. The larger the score,
the less likely it is that the samples are all randomly drawn from the same
population. In the decision tree context, a large F score indicates that a proposed
split has successfully split the population into subpopulations with
significantly different distributions.
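A sketch of the F score in its standard one-way ANOVA form follows. The text gives no worked example, so the two samples below are invented numeric target values standing in for a split's left and right children:

```python
def f_score(samples):
    """Ratio of the between-sample variance estimate to the pooled
    within-sample estimate, for a list of numeric samples."""
    k = len(samples)                       # number of samples (children)
    n = sum(len(s) for s in samples)       # total number of observations
    grand_mean = sum(sum(s) for s in samples) / n
    # Between-sample estimate: size-weighted variance of the sample means.
    between = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2
                  for s in samples) / (k - 1)
    # Pooled within-sample estimate.
    within = sum(sum((v - sum(s) / len(s)) ** 2 for v in s)
                 for s in samples) / (n - k)
    return between / within

left  = [10.0, 12.0, 11.0, 13.0, 14.0]   # hypothetical target values, left child
right = [20.0, 22.0, 21.0, 19.0, 23.0]   # hypothetical target values, right child
print(round(f_score([left, right]), 1))  # 81.0
```

A large ratio such as this one says the sample means differ far more than chance would allow if both children were drawn from one population, so the split is a good one.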

As previously described, the decision tree keeps growing as long as new splits
can be found that improve the ability of the tree to separate the records of the
training set into increasingly pure subsets. Such a tree has been optimized for
the training set, so eliminating any leaves would only increase the error rate of
the tree on the training set. Does this imply that the full tree will also do the
best job of classifying new datasets? Certainly not!
A decision tree algorithm makes its best split first, at the root node where
there is a large population of records. As the nodes get smaller, idiosyncrasies
of the particular training records at a node come to dominate the process. One
way to think of this is that the tree finds general patterns at the big nodes and
patterns specific to the training set in the smaller nodes; that is, the tree
overfits the training set. The result is an unstable tree that will not make good
predictions. The cure is to eliminate the unstable splits by merging smaller
leaves through a process called pruning; three general approaches to pruning
are discussed in detail below.

The CART Pruning Algorithm
CART is a popular decision tree algorithm first published by Leo Breiman,
Jerome Friedman, Richard Olshen, and Charles Stone in 1984. The acronym
stands for Classification and Regression Trees. The CART algorithm grows
binary trees and continues splitting as long as new splits can be found that
increase purity. As illustrated in Figure 6.6, inside a complex tree, there are
many simpler subtrees, each of which represents a different trade-off between
model complexity and training set misclassification rate. The CART algorithm
identifies a set of such subtrees as candidate models. These candidate subtrees
are applied to the validation set, and the tree with the lowest validation set
misclassification rate is selected as the final model.

Creating the Candidate Subtrees
The CART algorithm identifies candidate subtrees through a process of
repeated pruning. The goal is to prune first those branches providing the least
additional predictive power per leaf. In order to identify these least useful
branches, CART relies on a concept called the adjusted error rate. This is a
measure that increases each node's misclassification rate on the training set by
imposing a complexity penalty based on the number of leaves in the tree. The
adjusted error rate is used to identify weak branches (those whose misclassification
rate is not low enough to overcome the penalty) and mark them for pruning.

Figure 6.6 Inside a complex tree, there are simpler, more stable trees.
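The adjusted error rate idea can be sketched roughly as follows; the error rates, leaf counts, and the penalty value alpha are all invented for illustration and do not reflect CART's exact parameterization:

```python
def adjusted_error(error_rate, leaf_count, alpha):
    """Training-set error plus a complexity penalty per leaf."""
    return error_rate + alpha * leaf_count

# A branch is weak when collapsing it to a single leaf does not raise the
# adjusted error: its extra purity fails to pay for its extra leaves.
full_branch   = adjusted_error(0.10, leaf_count=8, alpha=0.01)  # 0.18
pruned_branch = adjusted_error(0.15, leaf_count=1, alpha=0.01)  # 0.16

print(pruned_branch < full_branch)  # True: mark this branch for pruning
```

Sweeping alpha from 0 upward prunes weaker and weaker branches in turn, which is what generates the nested sequence of candidate subtrees.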


The error rate on the validation set should be larger than the error rate on the
training set, because the training set was used to build the rules in the model.
A large difference in the misclassification error rate, however, is a symptom of
an unstable model. This difference can show up in several ways as shown by
the following three graphs generated by SAS Enterprise Miner. The graphs
represent the percent of records correctly classified by the candidate models in
a decision tree. Candidate subtrees with fewer nodes are on the left; those with
more nodes are on the right. These figures show the percent correctly classified
instead of the error rate, so they are upside down from the way similar charts
are shown elsewhere in this book.
As expected, the first chart shows the candidate trees performing better and
better on the training set as the trees have more and more nodes; the training
process stops when the performance no longer improves. On the validation set,
however, the candidate trees reach a peak, and then the performance starts to
decline as the trees get larger. The optimal tree is the one that works best on the
validation set, and the choice is easy because the peak is well-defined.

This chart shows a clear inflection point in the graph of the percent correctly classified
in the validation set.
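Selecting the final model then amounts to picking the peak of the validation curve; the leaf counts and accuracy figures below are invented for illustration:

```python
# Hypothetical candidate subtrees: leaf count -> validation-set accuracy.
candidates = {2: 0.71, 5: 0.78, 9: 0.81, 14: 0.79, 20: 0.76}

# Choose the subtree with the highest percent correctly classified.
best = max(candidates, key=candidates.get)
print(best)  # 9: accuracy peaks at 9 leaves, then declines as the tree overfits
```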


Sometimes, though, there is no clear demarcation point. That is, the
performance of the candidate models on the validation set never quite reaches
a maximum as the trees get larger. In this case, the pruning algorithm chooses
the entire tree (the largest possible subtree), as shown in the following chart.

