. 41
( 137 .)


0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580
Number of Leaves

In this chart, the percent correctly classified in the validation set levels off early and
remains far below the percent correctly classified in the training set.

The final example is perhaps the most interesting, because the results on the
validation set become unstable as the candidate trees become larger. The cause
of the instability is that the leaves are too small. In this tree, there is an
example of a leaf that has three records from the training set and all three have
a target value of 1 “ a perfect leaf. However, in the validation set, the one
record that falls there has the value 0. The leaf is 100 percent wrong. As the
tree grows more complex, more of these too-small leaves are included,
resulting in the instability seen below:
188 Chapter 6

Proportion of Event in Top Ranks (10%)





0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580
Number of Leaves

In this chart, the percent correctly classified on the validation set decreases with the
complexity of the tree and eventually becomes chaotic.

The last two figures are examples of unstable models. The simplest way to
avoid instability of this sort is to ensure that leaves are not allowed to become
too small.

The formula for the adjusted error rate is:

AE(T) = E(T) + ±leaf_count(T)

Where ± is an adjustment factor that is increased in gradual steps to create
new subtrees. When ± is zero, the adjusted error rate equals the error rate. To
find the first subtree, the adjusted error rates for all possible subtrees contain­
ing the root node are evaluated as ± is gradually increased. When the adjusted
error rate of some subtree becomes less than or equal to the adjusted error rate
for the complete tree, we have found the first candidate subtree, ±1. All
branches that are not part of ±1 are pruned and the process starts again. The ±1
tree is pruned to create an ±2 tree. The process ends when the tree has been
pruned all the way down to the root node. Each of the resulting subtrees (some­
times called the alphas) is a candidate to be the final model. Notice that all the
candidates contain the root node and the largest candidate is the entire tree.
Decision Trees 189

Picking the Best Subtree
The next task is to select, from the pool of candidate subtrees, the one that
works best on new data. That, of course, is the purpose of the validation set.
Each of the candidate subtrees is used to classify the records in the validation
set. The tree that performs this task with the lowest overall error rate is
declared the winner. The winning subtree has been pruned sufficiently to
remove the effects of overtraining, but not so much as to lose valuable infor­
mation. The graph in Figure 6.7 illustrates the effect of pruning on classifica­
tion accuracy. The technical aside goes into this in more detail.
Because this pruning algorithm is based solely on misclassification rate,
without taking the probability of each classification into account, it replaces
any subtree whose leaves all make the same classification with a common par­
ent that also makes that classification. In applications where the goal is to
select a small proportion of the records (the top 1 percent or 10 percent, for
example), this pruning algorithm may hurt the performance of the tree, since
some of the removed leaves contain a very high proportion of the target class.
Some tools, such as SAS Enterprise Miner, allow the user to prune trees
optimally for such situations.

Using the Test Set to Evaluate the Final Tree
The winning subtree was selected on the basis of its overall error rate when
applied to the task of classifying the records in the validation set. But, while we
expect that the selected subtree will continue to be the best performing subtree
when applied to other datasets, the error rate that caused it to be selected may
slightly overstate its effectiveness. There are likely to be a large number of sub-
trees that all perform about as well as the one selected. To a certain extent, the
one of these that delivered the lowest error rate on the validation set may
simply have “gotten lucky” with that particular collection of records. For that
reason, as explained in Chapter 3, the selected subtree is applied to a third
preclassified dataset that is disjoint with both the validation set and the train­
ing set. This third dataset is called the test set. The error rate obtained on the
test set is used to predict expected performance of the classification rules rep­
resented by the selected tree when applied to unclassified data.

WA R N I N G Do not evaluate the performance of a model by its lift or error
rate on the validation set. Like the training set, it has had a hand in creating the
model and so will overstate the model™s accuracy. Always measure the model™s
accuracy on a test set that is drawn from the same population as the training
and validation sets, but has not been used in any way to create the model.
190 Chapter 6

Error Rate

Prune here. Validation data

Training data

Depth of Tree
Figure 6.7 Pruning chooses the tree whose miscalculation rate is minimized on the
validation set.

The C5 Pruning Algorithm
C5 is the most recent version of the decision-tree algorithm that Australian
researcher, J. Ross Quinlan has been evolving and refining for many years. An
earlier version, ID3, published in 1986, was very influential in the field of
machine learning and its successors are used in several commercial data min­
ing products. (The name ID3 stands for “Iterative Dichotomiser 3.” We have
not heard an explanation for the name C5, but we can guess that Professor
Quinlan™s background is mathematics rather than marketing.) C5 is available
as a commercial product from RuleQuest (www.rulequest.com).
Decision Trees 191

The trees grown by C5 are similar to those grown by CART (although unlike
CART, C5 makes multiway splits on categorical variables). Like CART, the C5
algorithm first grows an overfit tree and then prunes it back to create a more
stable model. The pruning strategy is quite different, however. C5 does not
make use of a validation set to choose from among candidate subtrees; the
same data used to grow the tree is also used to decide how the tree should be
pruned. This may reflect the algorithm™s origins in the academic world, where
in the past, university researchers had a hard time getting their hands on sub­
stantial quantities of real data to use for training sets. Consequently, they spent
much time and effort trying to coax the last few drops of information from
their impoverished datasets”a problem that data miners in the business
world do not face.

Pessimistic Pruning
C5 prunes the tree by examining the error rate at each node and assuming that
the true error rate is actually substantially worse. If N records arrive at a node,
and E of them are classified incorrectly, then the error rate at that node is E/N.
Now the whole point of the tree-growing algorithm is to minimize this error
rate, so the algorithm assumes that E/N is the best than can be done.
C5 uses an analogy with statistical sampling to come up with an estimate of
the worst error rate likely to be seen at a leaf. The analogy works by thinking of
the data at the leaf as representing the results of a series of trials each of which
can have one of two possible results. (Heads or tails is the usual example.) As it
happens, statisticians have been studying this particular situation since at least
1713, the year that Jacques Bernoulli™s famous binomial formula was posthu­
mously published. So there are well-known formulas for determining what it
means to have observed E occurrences of some event in N trials.
In particular, there is a formula which, for a given confidence level, gives the
confidence interval”the range of expected values of E. C5 assumes that the
observed number of errors on the training data is the low end of this range,
and substitutes the high end to get a leaf™s predicted error rate, E/N on unseen
data. The smaller the node, the higher the error rate. When the high-end esti­
mate of the number of errors at a node is less than the estimate for the errors of
its children, then the children are pruned.

Stability-Based Pruning
The pruning algorithms used by CART and C5 (and indeed by all the com­
mercial decision tree tools that the authors have used) have a problem. They
fail to prune some nodes that are clearly unstable. The split highlighted in
Figure 6.8 is a good example. The picture was produced by SAS Enterprise
192 Chapter 6

Miner using its default settings for viewing a tree. The numbers on the left-
hand side of each node show what is happening on the training set. The num­
bers on the right-hand side of each node show what is happening on the
validation set. This particular tree is trying to identify churners. When only the
training data is taken into consideration, the highlighted branch seems to do
very well; the concentration of churners rises from 58.0 percent to 70.9 percent.
Unfortunately, when the very same rule is applied to the validation set, the
concentration of churners actually decreases from 56.6 percent to 52 percent.
One of the main purposes of a model is to make consistent predictions on
previously unseen records. Any rule that cannot achieve that goal should be
eliminated from the model. Many data mining tools allow the user to prune a
decision tree manually. This is a useful facility, but we look forward to data
mining software that provides automatic stability-based pruning as an option.

Such software would need to have a less subjective criterion for rejecting a

split than “the distribution of the validation set results looks different from the
distribution of the training set results.” One possibility would be to use a test
of statistical significance, such as the chi-Square Test or the difference of pro­
portions. The split would be pruned when the confidence level is less than
some user-defined threshold, so only splits that are, say, 99 percent confident
on the validation set would remain.

13.5% 13.8%
86.5% 86.2%
39,628 19,814
Handset Churn Rate

< 0.7% < 3.8% ≥ 3.8%
3.5% 3.0% 14.9% 15.6% 28.7% 29.3%
96.5% 97.0% 85.1% 84.4% 71.3% 70.7%
11,112 5,678 23,361 11,529 5,155 2,607
Call Trend
< 0.056 < 0.18 ≥ 0.18
58.0% 56.6% 39.2% 40.4% 27.0% 27.9%
42.0% 43.4% 60.8% 59.6% 73.0% 72.1%
219 99 148 57 440 218
Total Amt. Overdue
< 4,855 < 88,455 ≥ 88,455
67.3% 66.0% 70.9% 52.0% 25.9% 44.4%
32.7% 34.0% 29.1% 48.0% 74.1% 55.6%
110 47 55 25 54 27
Figure 6.8 An unstable split produces very different distributions on the training and
validation sets.

Decision Trees 193

WA R N I N G Small nodes cause big problems. A common cause of unstable
decision tree models is allowing nodes with too few records. Most decision tree
tools allow the user to set a minimum node size. As a rule of thumb, nodes that
receive fewer than about 100 training set records are likely to be unstable.


. 41
( 137 .)