
the depth of the tree reaches some preset limit, the split search for that branch

is abandoned and the node is labeled as a leaf node.

Eventually, it is not possible to find any more splits anywhere in the tree and

the full decision tree has been grown. As we will see, this full tree is generally

not the tree that does the best job of classifying a new set of records.

Decision-tree-building algorithms begin by trying to find the input variable

that does the best job of splitting the data among the desired categories. At each

succeeding level of the tree, the subsets created by the preceding split are themselves split according to whatever rule works best for them. The tree continues

to grow until it is no longer possible to find better ways to split up incoming

records. If there were a completely deterministic relationship between the input

variables and the target, this recursive splitting would eventually yield a tree

with completely pure leaves. It is easy to manufacture examples of this sort, but

they do not occur very often in marketing or CRM applications.
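The recursive procedure just described can be sketched in a few lines of Python. This is illustrative only: find_best_split stands in for whichever splitting criterion is in use, and the stopping thresholds are arbitrary defaults, not values from any particular tool.

from collections import Counter

def majority_class(records):
    """Label for a leaf: the most common class among its records."""
    classes = [cls for _, cls in records]
    return Counter(classes).most_common(1)[0][0]

def grow_tree(records, find_best_split, depth=0, max_depth=5, min_size=10):
    """Grow a tree from (features, class) pairs. find_best_split is any
    splitting criterion returning (test, child_record_lists) or None."""
    found = find_best_split(records)
    if depth >= max_depth or len(records) < min_size or found is None:
        # No worthwhile split, or a preset limit reached: make a leaf.
        return {"leaf": True, "label": majority_class(records)}
    test, children = found
    return {"leaf": False,
            "test": test,
            "children": [grow_tree(child, find_best_split, depth + 1,
                                   max_depth, min_size)
                         for child in children]}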

Customer behavior data almost never contains such clear, deterministic

relationships between inputs and outputs. The fact that two customers have

the exact same description in terms of the available input variables does not

ensure that they will exhibit the same behavior. A decision tree for a catalog

response model might include a leaf representing females with age greater

than 50, three or more purchases within the last year, and total lifetime spending of over $145. The customers reaching this leaf will typically be a mix of

responders and nonresponders. If the leaf in question is labeled "responder," then the proportion of nonresponders is the error rate for this leaf. The ratio of

the proportion of responders in this leaf to the proportion of responders in the

population is the lift at this leaf.
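In numbers (the counts here are invented for illustration): suppose that leaf held 60 responders and 40 nonresponders, drawn from a population that is 5 percent responders.

leaf_responders, leaf_total = 60, 100     # hypothetical counts at the leaf
population_response_rate = 0.05           # assumed overall response rate

leaf_response_rate = leaf_responders / leaf_total        # 0.60
error_rate = 1 - leaf_response_rate                      # 0.40, if labeled "responder"
lift = leaf_response_rate / population_response_rate     # 0.60 / 0.05 = 12.0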

One circumstance where deterministic rules are likely to be discovered is

when patterns in data reflect business rules. The authors had this fact driven

home to them by an experience at Caterpillar, a manufacturer of diesel

engines. We built a decision tree model to predict which warranty claims

would be approved. At the time, the company had a policy by which certain


claims were paid automatically. The results were startling: The model was 100

percent accurate on unseen test data. In other words, it had discovered the

exact rules used by Caterpillar to classify the claims. On this problem, a neural

network tool was less successful. Of course, discovering known business rules

may not be particularly useful; it does, however, underline the effectiveness of

decision trees on rule-oriented problems.

Many domains, ranging from genetics to industrial processes, really do have underlying rules, though these may be quite complex and obscured by noisy

data. Decision trees are a natural choice when you suspect the existence of

underlying rules.

Measuring the Effectiveness of a Decision Tree

The effectiveness of a decision tree, taken as a whole, is determined by applying it to the test set, a collection of records not used to build the tree, and

observing the percentage classified correctly. This provides the classification

error rate for the tree as a whole, but it is also important to pay attention to the

quality of the individual branches of the tree. Each path through the tree represents a rule, and some rules are better than others.

At each node, whether a leaf node or a branching node, we can measure:

■■ The number of records entering the node

■■ The proportion of records in each class

■■ How those records would be classified if this were a leaf node

■■ The percentage of records classified correctly at this node

■■ The variance in distribution between the training set and the test set

Of particular interest is the percentage of records classified correctly at this

node. Surprisingly, sometimes a node higher up in the tree does a better job of

classifying the test set than nodes lower down.
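The first four of these measures can be computed directly from the class labels of the records reaching a node. A minimal Python sketch (the function name and dictionary keys are illustrative, not taken from any product):

from collections import Counter

def node_statistics(labels):
    """Per-node measures for the class labels reaching one node."""
    n = len(labels)
    counts = Counter(labels)
    label, majority = counts.most_common(1)[0]
    return {
        "records": n,                                   # records entering the node
        "proportions": {c: k / n for c, k in counts.items()},  # share of each class
        "label": label,                                 # class assigned if a leaf
        "pct_correct": majority / n,                    # records classified correctly
    }

# node_statistics(["resp", "resp", "nonresp"]) labels the node "resp"
# with two-thirds of its records classified correctly.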

Tests for Choosing the Best Split

A number of different measures are available to evaluate potential splits. Algorithms developed in the machine learning community focus on the increase in purity resulting from a split, while those developed in the statistics community focus on the statistical significance of the difference between the distributions of the child nodes. Alternate splitting criteria often lead to trees that look quite different from one another, but have similar performance. That is because there are usually many candidate splits with very similar performance. Different purity measures lead to different candidates being selected, but since all of the measures are trying to capture the same idea, the resulting models tend to behave similarly.


Purity and Diversity

The first edition of this book described splitting criteria in terms of the decrease

in diversity resulting from the split. In this edition, we refer instead to the

increase in purity, which seems slightly more intuitive. The two phrases refer to

the same idea. A purity measure that ranges from 0 (when no two items in the

sample are in the same class) to 1 (when all items in the sample are in the same

class) can be turned into a diversity measure by subtracting it from 1. Some of

the measures used to evaluate decision tree splits assign the lowest score to a

pure node; others assign the highest score to a pure node. This discussion

refers to all of them as purity measures, and the goal is to optimize purity by

minimizing or maximizing the chosen measure.

Figure 6.5 shows a good split. The parent node contains equal numbers of

light and dark dots. The left child contains nine light dots and one dark dot.

The right child contains nine dark dots and one light dot. Clearly, the purity

has increased, but how can the increase be quantified? And how can this split

be compared to others? That requires a formal definition of purity, several of

which are listed below.

Figure 6.5 A good split on a binary categorical variable increases purity.


Purity measures for evaluating splits for categorical target variables include:

■■ Gini (also called population diversity)

■■ Entropy (also called information gain)

■■ Information gain ratio

■■ Chi-square test

When the target variable is numeric, one approach is to bin the value and use one of the above measures. There are, however, two measures in common use for numeric targets:

■■ Reduction in variance

■■ F test

Note that the choice of an appropriate purity measure depends on whether

the target variable is categorical or numeric. The type of the input variable does

not matter, so an entire tree is built with the same purity measure. The split

illustrated in Figure 6.5 might be provided by a numeric input variable (AGE > 46) or

by a categorical variable (STATE is a member of CT, MA, ME, NH, RI, VT). The

purity of the children is the same regardless of the type of split.
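Before turning to the individual measures, a quick sketch of the numeric-target case: reduction in variance compares the variance of the target before the split with the size-weighted variance of the children after it. The spending amounts below are invented for illustration.

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent_values, child_value_lists):
    """Parent variance minus the size-weighted variance of the children."""
    n = len(parent_values)
    weighted = sum(len(c) / n * variance(c) for c in child_value_lists)
    return variance(parent_values) - weighted

# Made-up lifetime spending split by AGE > 46: low spenders land on one
# side and high spenders on the other, so the split removes most variance.
left, right = [10, 12, 11], [50, 55, 52]
print(variance_reduction(left + right, [left, right]))  # roughly 427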

Gini or Population Diversity

One popular splitting criterion is named Gini, after the Italian statistician and economist Corrado Gini. This measure, which is also used by biologists and

ecologists studying population diversity, gives the probability that two items

chosen at random from the same population are in the same class. For a pure

population, this probability is 1.

The Gini measure of a node is simply the sum of the squares of the proportions of the classes. For the split shown in Figure 6.5, the parent population has an equal number of light and dark dots. A node with equal numbers of each of 2 classes has a score of 0.5² + 0.5² = 0.5, which is expected because the chance of picking the same class twice by random selection with replacement is one out of two. The Gini score for either of the resulting nodes is 0.1² + 0.9² = 0.82. A perfectly pure node would have a Gini score of 1. A node that is evenly balanced would have a Gini score of 0.5. Sometimes the score is doubled and then 1 is subtracted, so that it falls between 0 and 1. However, such a manipulation makes no difference when comparing different scores to optimize purity.

To calculate the impact of a split, take the Gini score of each child node and

multiply it by the proportion of records that reach that node and then sum the

resulting numbers. In this case, since the records are split evenly between

the two nodes resulting from the split and each node has the same Gini score,

the score for the split is the same as for either of the two nodes.
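Both calculations fit in a few lines of Python. This is a sketch of the arithmetic, not any product's implementation; the class labels mirror the dots in Figure 6.5.

from collections import Counter

def gini_score(labels):
    """Sum of squared class proportions; 1.0 for a pure node."""
    n = len(labels)
    return sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(children):
    """Size-weighted Gini score of the nodes produced by a split."""
    total = sum(len(c) for c in children)
    return sum(len(c) / total * gini_score(c) for c in children)

left  = ["light"] * 9 + ["dark"] * 1   # nine light dots, one dark
right = ["dark"] * 9 + ["light"] * 1   # nine dark dots, one light
print(gini_score(left + right))        # 0.5  (balanced parent)
print(gini_of_split([left, right]))    # 0.82 (after the split)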


Entropy Reduction or Information Gain

Information gain uses a clever idea for defining purity. If a leaf is entirely pure, then the classes in the leaf can be easily described; they all fall in the same class. On the other hand, if a leaf is highly impure, then describing it is much more complicated. Information theory, a part of computer science, has devised a measure for this situation called entropy. In information theory, entropy is a measure of how disorganized a system is. A comprehensive introduction to information theory is far beyond the scope of this book. For our purposes, the intuitive notion is that the number of bits required to describe a particular situation or outcome depends on the size of the set of possible outcomes. Entropy can be thought of as a measure of the number of yes/no questions it would take to determine the state of the system. If there are 16 possible states, it takes log2(16), or four bits, to enumerate them or identify a particular one. Additional information reduces the number of questions needed to determine the state of the system, so information gain means the same thing as entropy reduction. Both terms are used to describe decision tree algorithms.

The entropy of a particular decision tree node is the sum, over all the classes represented in the node, of the proportion of records belonging to a particular class multiplied by the base two logarithm of that proportion. (Actually, this sum is usually multiplied by -1 in order to obtain a positive number.) The entropy of a split is simply the sum of the entropies of all the nodes resulting from the split weighted by each node's proportion of the records. When entropy reduction is chosen as a splitting criterion, the algorithm searches for the split that reduces entropy (or, equivalently, increases information) by the greatest amount.

For a binary target variable such as the one shown in Figure 6.5, the formula

for the entropy of a single node is

-1 * ( P(dark)log2P(dark) + P(light)log2P(light) )

In this example, P(dark) and P(light) are both one half. Plugging 0.5 into the

entropy formula gives:

-1 * (0.5 log2(0.5) + 0.5 log2(0.5))

The first term is for the light dots and the second term is for the dark dots, but since there are equal numbers of light and dark dots, the expression simplifies to -1 * log2(0.5), which is +1. What is the entropy of the nodes resulting from the split? One of them has one dark dot and nine light dots, while the other has nine dark dots and one light dot. Clearly, they each have the same level of entropy. Namely,

-1 * (0.1 log2(0.1) + 0.9 log2(0.9)) = 0.33 + 0.14 = 0.47


To calculate the total entropy of the system after the split, multiply the

entropy of each node by the proportion of records that reach that node and

add them up to get an average. In this example, each of the new nodes receives

half the records, so the total entropy is the same as the entropy of each of the

nodes, 0.47. The total entropy reduction or information gain due to the split is

therefore 0.53. This is the figure that would be used to compare this split with

other candidates.
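The same arithmetic as a short Python sketch, again using the dot counts from Figure 6.5:

from collections import Counter
from math import log2

def entropy(labels):
    """-sum(p * log2(p)) over the classes present in a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction: parent entropy minus weighted child entropy."""
    n = len(parent)
    after = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - after

left  = ["light"] * 9 + ["dark"]
right = ["dark"] * 9 + ["light"]
print(entropy(left + right))                          # 1.0 for the parent
print(information_gain(left + right, [left, right]))  # about 0.53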

Information Gain Ratio

The entropy split measure can run into trouble when combined with a splitting methodology that handles categorical input variables by creating a separate branch for each value. This was the case for ID3, a decision tree tool developed by the Australian researcher J. Ross Quinlan in the 1980s, which became part of several commercial data mining software packages. The problem is that just by breaking the larger data set into many small subsets, the number of classes represented in each node tends to go down, and with it, the entropy. The decrease in entropy due solely to the number of branches is called the intrinsic information of a split. (Recall that entropy is defined as -1 times the sum over all the branches of the probability of each branch times the log base 2 of that probability. For a random n-way split, the probability of each branch is 1/n. Therefore, the intrinsic information of an n-way split is simply -1 * n * (1/n)log2(1/n), or log2(n), which grows with the number of branches.) Because of the intrinsic information of many-way splits, decision trees built using the entropy reduction splitting criterion without any correction for the intrinsic information due to the split tend to be quite bushy. Bushy trees with many multi-way splits are undesirable as these splits lead to small numbers of records in each node, a recipe for unstable models.

In reaction to this problem, C5 and other descendants of ID3 that once used information gain now use, as the criterion for evaluating proposed splits, the ratio of the total information gain due to a proposed split to the intrinsic information attributable solely to the number of branches created. This test reduces the tendency towards very bushy trees that was a problem in earlier decision tree software packages.
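A minimal sketch of the gain ratio criterion follows; it captures the idea behind C5's test rather than its actual implementation.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, child_label_lists):
    """Information gain divided by the intrinsic information of the split,
    which penalizes splits that fan out into many small branches."""
    n = len(parent_labels)
    gain = entropy(parent_labels) - sum(
        len(c) / n * entropy(c) for c in child_label_lists)
    intrinsic = -sum(
        len(c) / n * log2(len(c) / n) for c in child_label_lists)
    return gain / intrinsic

# For an even n-way split, intrinsic information is log2(n), so a 16-way
# split must deliver four times the raw gain of a 2-way split to score as well.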

Chi-Square Test

As described in Chapter 5, the chi-square (χ²) test is a test of statistical significance developed by the English statistician Karl Pearson in 1900. Chi-square is defined as the sum of the squares of the standardized differences between the expected and observed frequencies of some occurrence between multiple disjoint samples. In other words, the test is a measure of the probability that an observed difference between samples is due only to chance. When used to measure the purity of decision tree splits, higher values of chi-square mean that the variation is more significant, and not due merely to chance.
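As a sketch, the contingency table for the Figure 6.5 split can be tested with scipy; the raw Pearson chi-square for this table works out to 12.8.

from scipy.stats import chi2_contingency

# Rows are the two child nodes; columns are the classes (dark, light).
observed = [[9, 1],
            [1, 9]]

# correction=False skips the Yates small-sample adjustment so the raw
# Pearson chi-square is reported.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value)  # about 12.8 and 0.0003: far too lopsided to be chance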


COMPARING TWO SPLITS USING GINI AND ENTROPY

Consider the following two splits, illustrated in the figure below. In both cases,

the population starts out perfectly balanced between dark and light dots with

ten of each type. One proposed split is the same as in Figure 6.5, yielding two
