second split yields one node that is 100 percent dark but has only 6 dots,

and another that has 14 dots and is 71.4 percent light.

Which of these two proposed splits increases purity the most?

EVALUATING THE TWO SPLITS USING GINI

As explained in the main text, the Gini score for each of the two children in the

first proposed split is 0.1² + 0.9² = 0.820. Since the children are the same size,

this is also the score for the split.

What about the second proposed split? The Gini score of the left child is 1

since only one class is represented. The Gini score of the right child is

Gini_right = (4/14)² + (10/14)² = 0.082 + 0.510 = 0.592

and the Gini score for the split is:

(6/20)Gini_left + (14/20)Gini_right = 0.3*1 + 0.7*0.592 = 0.714

Since the Gini score for the first proposed split (0.820) is greater than for the

second proposed split (0.714), a tree built using the Gini criterion will prefer the

split that yields two nearly pure children over the split that yields one

completely pure child along with a larger, less pure one.
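The arithmetic above can be checked with a short Python sketch. Note that, following this book's convention, the Gini score of a node is the sum of squared class proportions, so higher values mean purer nodes; many libraries instead report 1 minus this quantity as an impurity.

```python
def gini(counts):
    """Gini purity score of a node: sum of squared class proportions
    (the book's convention, where higher means purer)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def gini_split(children):
    """Size-weighted average of the children's Gini scores."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# First proposed split: two children of 10 dots each, split 9/1 and 1/9.
print(round(gini_split([(9, 1), (1, 9)]), 3))   # 0.82
# Second proposed split: a pure child of 6 dots and a 4/10 child of 14.
print(round(gini_split([(6, 0), (4, 10)]), 3))  # 0.714
```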


182 Chapter 6

COMPARING TWO SPLITS USING GINI AND ENTROPY (continued)

EVALUATING THE TWO SPLITS USING ENTROPY

As calculated in the main text, the entropy of the parent node is 1. The entropy

of the first proposed split is also calculated in the main text and found to be

0.47, so the information gain for the first proposed split is 0.53.

How much information is gained by the second proposed split? The left child

is pure and so has entropy of 0. As for the right child, the formula for entropy is

-(P(dark) log₂ P(dark) + P(light) log₂ P(light))

so the entropy of the right child is:

Entropy_right = -((4/14) log₂ (4/14) + (10/14) log₂ (10/14)) = 0.516 + 0.347 = 0.863


The entropy of the split is the weighted average of the entropies of the

resulting nodes. In this case,

0.3*Entropy_left + 0.7*Entropy_right = 0.3*0 + 0.7*0.863 = 0.604


Subtracting 0.604 from the entropy of the parent (which is 1) yields an

information gain of 0.396. This is less than 0.53, the information gain from the

first proposed split, so in this case, the entropy splitting criterion also prefers the

first split to the second. Compared to Gini, the entropy criterion does have a

stronger preference for nodes that are purer, even if smaller. This may be

appropriate in domains where there really are clear underlying rules, but it

tends to lead to less stable trees in “noisy” domains such as response to

marketing offers.
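The entropy calculations for both proposed splits can be verified with a similar sketch; the class counts are those from the example in the text.

```python
from math import log2

def entropy(counts):
    """Entropy of a node in bits: -sum of p*log2(p) over the classes."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(sum(c) for c in children)
    split_entropy = sum(sum(c) / total * entropy(c) for c in children)
    return entropy(parent) - split_entropy

# First split gains 0.53 bits; second split gains only 0.396 bits.
print(round(information_gain((10, 10), [(9, 1), (1, 9)]), 2))   # 0.53
print(round(information_gain((10, 10), [(6, 0), (4, 10)]), 3))  # 0.396
```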

Chi-Square Test

For example, suppose the target variable is a binary flag indicating whether

or not customers continued their subscriptions at the end of the introductory

offer period and the proposed split is on acquisition channel, a categorical

variable with three classes: direct mail, outbound call, and email. If the acquisition

channel had no effect on renewal rate, we would expect the number of

renewals in each class to be proportional to the number of customers acquired

through that channel. For each channel, the chi-square test subtracts that

expected number of renewals from the actual observed renewals, squares the

difference, and divides the squared difference by the expected number. The values for

each class are added together to arrive at the score. As described in Chapter 5,

the chi-square distribution provides a way to translate this chi-square score into

a probability. To measure the purity of a split in a decision tree, the score is

sufficient. A high score means that the proposed split successfully splits the

population into subpopulations with significantly different distributions.
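As a sketch of the calculation just described, the following uses hypothetical acquisition and renewal counts (not from the text) for the three channels; only the channel totals and the overall renewal rate are needed to form the expected values.

```python
def chi_square_score(observed, expected):
    """Chi-square score: sum over classes of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: customers acquired and renewals observed through
# each channel (direct mail, outbound call, email).
acquired = [5000, 3000, 2000]
renewed = [1600, 900, 500]

# If channel had no effect, renewals would be proportional to acquisitions.
overall_rate = sum(renewed) / sum(acquired)      # 0.30 overall renewal rate
expected = [overall_rate * n for n in acquired]  # [1500, 900, 600]

print(round(chi_square_score(renewed, expected), 2))  # 23.33
```

A score this large suggests the three channels' renewal rates really do differ.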

The chi-square test gives its name to CHAID, a well-known decision tree

algorithm first published by Gordon V. Kass in 1980. The full acronym stands

for Chi-square Automatic Interaction Detector. As the phrase "automatic

interaction detector" implies, the original motivation for CHAID was detecting



statistical relationships between variables. It does this by building a decision

tree, so the method has come to be used as a classification tool as well. CHAID

makes use of the chi-square test in several ways: first to merge classes that do

not have significantly different effects on the target variable; then to choose a

best split; and finally to decide whether it is worth performing any additional

splits on a node. In the research community, the current fashion is away from

methods that continue splitting only as long as it seems likely to be useful and

towards methods that involve pruning. Some researchers, however, still prefer

the original CHAID approach, which does not rely on pruning.

The chi-square test applies to categorical variables, so in the classic CHAID

algorithm, input variables must be categorical. Continuous variables must be

binned or replaced with ordinal classes such as high, medium, and low. Some

current decision tree tools, such as SAS Enterprise Miner, use the chi-square test

for creating splits on categorical variables, but use another statistical test,

the F test, for creating splits on continuous variables. Also, some implementations

of CHAID continue to build the tree even when the splits are not statistically

significant, and then apply pruning algorithms to prune the tree back.

Reduction in Variance

The four previous measures of purity all apply to categorical targets. When the

target variable is numeric, a good split should reduce the variance of the target

variable. Recall that variance is a measure of the tendency of the values in a

population to stay close to the mean value. In a sample with low variance,

most values are quite close to the mean; in a sample with high variance, many

values are quite far from the mean. The actual formula for the variance is the

mean of the squared deviations from the mean. Although the

reduction in variance split criterion is meant for numeric targets, the dark and

light dots in Figure 6.5 can still be used to illustrate it by considering the dark

dots to be 1 and the light dots to be 0. The mean value in the parent node is

clearly 0.5. Every one of the 20 observations differs from the mean by 0.5, so

the variance is (20 * 0.5²) / 20 = 0.25. After the split, the left child has 9 dark

dots and one light dot, so the node mean is 0.9. Nine of the observations differ

from the mean value by 0.1 and one observation differs from the mean

value by 0.9, so the variance is (0.9² + 9 * 0.1²) / 10 = 0.09. Since both nodes

resulting from the split have variance 0.09, the total variance after the split is

also 0.09. The reduction in variance due to the split is 0.25 - 0.09 = 0.16.
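This worked example translates directly into code, with the dark dots coded as 1 and the light dots as 0, as in the text.

```python
def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_after_split(children):
    """Size-weighted average variance of the child nodes."""
    total = sum(len(c) for c in children)
    return sum(len(c) / total * variance(c) for c in children)

parent = [1] * 10 + [0] * 10  # 10 dark, 10 light: variance 0.25
left = [1] * 9 + [0]          # 9 dark, 1 light: variance 0.09
right = [0] * 9 + [1]         # 9 light, 1 dark: variance 0.09

reduction = variance(parent) - variance_after_split([left, right])
print(round(reduction, 2))    # 0.16
```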

F Test

Another split criterion that can be used for numeric target variables is the F test,

named for another famous Englishman: the statistician, astronomer, and

geneticist Ronald A. Fisher. Fisher and Pearson reportedly did not get along despite,

or perhaps because of, the large overlap in their areas of interest. Fisher's test


does for continuous variables what Pearson's chi-square test does for categorical

variables. It provides a measure of the probability that samples with different

means and variances are actually drawn from the same population.

There is a well-understood relationship between the variance of a sample

and the variance of the population from which it was drawn. (In fact, so long

as the samples are of reasonable size and randomly drawn from the population,

sample variance is a good estimate of population variance; very small

samples, with fewer than 30 or so observations, usually have higher variance

than their corresponding populations.) The F test looks at the relationship

between two estimates of the population variance: one derived by pooling all

the samples and calculating the variance of the combined sample, and one

derived from the between-sample variance calculated as the variance of the

sample means. If the various samples are randomly drawn from the same

population, these two estimates should agree closely.

The F score is the ratio of the two estimates. It is calculated by dividing the

between-sample estimate by the pooled sample estimate. The larger the score,

the less likely it is that the samples are all randomly drawn from the same

population. In the decision tree context, a large F score indicates that a proposed

split has successfully split the population into subpopulations with

significantly different distributions.
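The description above is informal; one standard way to compute the score is the one-way ANOVA form of the F statistic, sketched below with made-up sample values for illustration.

```python
def f_score(samples):
    """One-way ANOVA F statistic: the between-sample variance estimate
    divided by the pooled within-sample variance estimate."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    means = [sum(s) / len(s) for s in samples]
    grand_mean = sum(sum(s) for s in samples) / n

    # Between-sample estimate (k - 1 degrees of freedom).
    between = sum(len(s) * (m - grand_mean) ** 2
                  for s, m in zip(samples, means)) / (k - 1)
    # Pooled within-sample estimate (n - k degrees of freedom).
    within = sum((v - m) ** 2
                 for s, m in zip(samples, means) for v in s) / (n - k)
    return between / within

# Two samples with clearly different means produce a large F score,
# suggesting they were not drawn from the same population.
print(f_score([[1, 2, 3], [4, 5, 6]]))  # 13.5
```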

Pruning

As previously described, the decision tree keeps growing as long as new splits

can be found that improve the ability of the tree to separate the records of the

training set into increasingly pure subsets. Such a tree has been optimized for

the training set, so eliminating any leaves would only increase the error rate of

the tree on the training set. Does this imply that the full tree will also do the

best job of classifying new datasets? Certainly not!

A decision tree algorithm makes its best split first, at the root node where

there is a large population of records. As the nodes get smaller, idiosyncrasies

of the particular training records at a node come to dominate the process. One

way to think of this is that the tree finds general patterns at the big nodes and

patterns specific to the training set in the smaller nodes; that is, the tree overfits

the training set. The result is an unstable tree that will not make good

predictions. The cure is to eliminate the unstable splits by merging smaller

leaves through a process called pruning; three general approaches to pruning

are discussed in detail.


The CART Pruning Algorithm

CART is a popular decision tree algorithm first published by Leo Breiman,

Jerome Friedman, Richard Olshen, and Charles Stone in 1984. The acronym

stands for Classification and Regression Trees. The CART algorithm grows

binary trees and continues splitting as long as new splits can be found that

increase purity. As illustrated in Figure 6.6, inside a complex tree, there are

many simpler subtrees, each of which represents a different trade-off between

model complexity and training set misclassification rate. The CART algorithm

identifies a set of such subtrees as candidate models. These candidate subtrees

are applied to the validation set, and the tree with the lowest validation set

misclassification rate is selected as the final model.

Creating the Candidate Subtrees

The CART algorithm identifies candidate subtrees through a process of

repeated pruning. The goal is to prune first those branches providing the least

additional predictive power per leaf. In order to identify these least useful

branches, CART relies on a concept called the adjusted error rate. This is a

measure that increases each node's misclassification rate on the training set by

imposing a complexity penalty based on the number of leaves in the tree. The

adjusted error rate is used to identify weak branches (those whose misclassification

rate is not low enough to overcome the penalty) and mark them for

pruning.
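A minimal sketch of the idea, using a hypothetical list of candidate subtrees and a made-up complexity parameter (here called alpha); CART's actual procedure for choosing penalty values is more involved.

```python
def adjusted_error(error_rate, leaf_count, alpha):
    """Training-set error rate plus a complexity penalty per leaf.

    alpha is the complexity parameter; larger values favor smaller trees.
    """
    return error_rate + alpha * leaf_count

# Hypothetical candidates: (training error rate, number of leaves).
candidates = [(0.25, 1), (0.15, 4), (0.10, 9), (0.09, 20)]

# With a modest penalty, the 9-leaf subtree wins: the 20-leaf tree's
# extra leaves do not lower the error enough to pay for themselves.
best = min(candidates, key=lambda c: adjusted_error(c[0], c[1], alpha=0.005))
print(best)  # (0.1, 9)
```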

Figure 6.6 Inside a complex tree, there are simpler, more stable trees.


COMPARING MISCLASSIFICATION RATES ON TRAINING AND

VALIDATION SETS

The error rate on the validation set should be larger than the error rate on the

training set, because the training set was used to build the rules in the model.

A large difference in the misclassification error rate, however, is a symptom of

an unstable model. This difference can show up in several ways as shown by

the following three graphs generated by SAS Enterprise Miner. The graphs

represent the percent of records correctly classified by the candidate models in

a decision tree. Candidate subtrees with fewer nodes are on the left; those with

more nodes are on the right. These figures show the percent correctly classified

instead of the error rate, so they are upside down from the way similar charts

are shown elsewhere in this book.

As expected, the first chart shows the candidate trees performing better and

better on the training set as the trees have more and more nodes; the training

process stops when the performance no longer improves. On the validation set,

however, the candidate trees reach a peak and then the performance starts to

decline as the trees get larger. The optimal tree is the one that works best on the

validation set, and the choice is easy because the peak is well-defined.

This chart shows a clear inflection point in the graph of the percent correctly classified

in the validation set.



Sometimes, though, there is no clear demarcation point. That is, the

performance of the candidate models on the validation set never quite reaches

a maximum as the trees get larger. In this case, the pruning algorithm chooses

the entire tree (the largest possible subtree), as shown in the following

illustration:

[Chart: proportion correctly classified (y-axis, 0.50 to 0.88) for candidate subtrees of increasing size; the validation-set curve rises without reaching a clear peak.]