ñòð. 38 |

merely splits records into two groups. A score allows the records to be sorted

from most likely to least likely to be members of the desired class.

Figure 6.2 A decision tree annotated with the proportion of records in class 1 at each

node shows the probability of the classification.

170 Chapter 6

For many applications, a score capable of rank-ordering a list is all that is

required. This is sufficient to choose the top N percent for a mailing and to calÂ

culate lift at various depths in the list. For some applications, however, it is not

sufficient to know that A is more likely to respond than B; we want to know

that actual likelihood of a response from A. Assuming that the prior probabilÂ

ity of a response is known, it can be used to calculate the probability of

response from the score generated on the oversampled data used to build the

tree. Alternatively, the model can be applied to preclassified data that has a

distribution of responses that reflects the true population. This method, called

backfitting, creates scores using the class proportions at the treeâ€™s leaves to repÂ

resent the probability that a record drawn from a similar population is a memÂ

ber of the class. These, and related issues, are discussed in detail in Chapter 3.

Estimation

Suppose the important business question is not who will respond but what will

be the size of the customerâ€™s next order? The decision tree can be used to answer

that question too. Assuming that order amount is one of the variables available

in the preclassified model set, the average order size in each leaf can be used as

the estimated order size for any unclassified record that meets the criteria for

that leaf. It is even possible to use a numeric target variable to build the tree;

such a tree is called a regression tree. Instead of increasing the purity of a cateÂ

gorical variable, each split in the tree is chosen to decrease the variance in the

values of the target variable within each child node.

The fact that trees can be (and sometimes are) used to estimate continuous

values does not make it a good idea. A decision tree estimator can only generÂ

ate as many discrete values as there are leaves in the tree. To estimate a continÂ

uous variable, it is preferable to use a continuous function. Regression models

and neural network models are generally more appropriate for estimation.

Trees Grow in Many Forms

The tree in Figure 6.1 is a binary tree of nonuniform depth; that is, each nonleaf

node has two children and leaves are not all at the same distance from the root.

In this case, each node represents a yes-or-no question, whose answer deterÂ

mines by which of two paths a record proceeds to the next level of the tree. Since

any multiway split can be expressed as a series of binary splits, there is no real

need for trees with higher branching factors. Nevertheless, many data mining

tools are capable of producing trees with more than two branches. For example,

some decision tree algorithms split on categorical variables by creating a branch

for each class, leading to trees with differing numbers of branches at different

nodes. Figure 6.3 illustrates a tree that uses both three-way and two-way splits

for the same classification problem as the tree in Figures 6.1 and 6.2.

Decision Trees 171

50%

tot units demand

< 4.5 â‰¥ 15.5

[4.5, 15.5]

40% 47% 66%

num VOM orders days since last days since last

<1 â‰¥1 â‰¥ 725 < 756 â‰¥ 756

1 221.5

39% 42% 58% 52% 37% 72% 40%

average $ demand GMM buyer flag

< 47.6 â‰¥ 47.6 0 1

40% 33% 52% 54%

avg $/month tot $ 9604

<2 2.10132,4. â‰¥ 4.116 9.15, 4 â‰¥ 41.255

[

37% 48% 85% 49% 69% 50%

Figure 6.3 This ternary decision tree is applied to the same the same classification

problem as in Figure 6.1.

T I P There is no relationship between the number of branches allowed at a

node and the number of classes in the target variable. A binary tree (that is,

one with two-way splits) can be used to classify records into any number of

categories, and a tree with multiway splits can be used to classify a binary

target variable.

How a Decision Tree Is Grown

Although there are many variations on the core decision tree algorithm, all of

them share the same basic procedure: Repeatedly split the data into smaller

and smaller groups in such a way that each new generation of nodes has

greater purity than its ancestors with respect to the target variable. For most of

this discussion, we assume a binary, categorical target variable, such as

responder/nonresponder. This simplifies the explanations without much loss

of generality.

172 Chapter 6

Finding the Splits

At the start of the process, there is a training set consisting of preclassified

recordsâ€”that is, the value of the target variable is known for all cases. The goal

is to build a tree that assigns a class (or a likelihood of membership in each class)

to the target field of a new record based on the values of the input variables.

The tree is built by splitting the records at each node according to a function

of a single input field. The first task, therefore, is to decide which of the input

fields makes the best split. The best split is defined as one that does the best job

of separating the records into groups where a single class predominates in

each group.

The measure used to evaluate a potential split is purity. The next section

talks about specific methods for calculating purity in more detail. However,

Y

they are all trying to achieve the same effect. With all of them, low purity

FL

means that the set contains a representative distribution of classes (relative to

the parent node), while high purity means that members of a single class preÂ

dominate. The best split is the one that increases the purity of the record sets

AM

by the greatest amount. A good split also creates nodes of similar size, or at

least does not create nodes containing very few records.

These ideas are easy to see visually. Figure 6.4 illustrates some good and bad

TE

splits.

Original Data

Poor Split Poor Split

Good Split

Figure 6.4 A good split increases purity for all the children.

Team-FlyÂ®

Decision Trees 173

The first split is a poor one because there is no increase in purity. The initial

population contains equal numbers of the two sorts of dot; after the split, so

does each child. The second split is also poor, because all though purity is

increased slightly, the pure node has few members and the purity of the larger

child is only marginally better than that of the parent. The final split is a good

one because it leads to children of roughly same size and with much higher

purity than the parent.

Tree-building algorithms are exhaustive. They proceed by taking each input

variable in turn and measuring the increase in purity that results from every

split suggested by that variable. After trying all the input variables, the one

that yields the best split is used for the initial split, creating two or more chilÂ

dren. If no split is possible (because there are too few records) or if no split

makes an improvement, then the algorithm is finished with that node and the

node become a leaf node. Otherwise, the algorithm performs the split and

repeats itself on each of the children. An algorithm that repeats itself in this

way is called a recursive algorithm.

Splits are evaluated based on their effect on node purity in terms of the tarÂ

get variable. This means that the choice of an appropriate splitting criterion

depends on the type of the target variable, not on the type of the input variable.

With a categorical target variable, a test such as Gini, information gain, or chi-

square is appropriate whether the input variable providing the split is numeric

or categorical. Similarly, with a continuous, numeric variable, a test such as

variance reduction or the F-test is appropriate for evaluating the split regardÂ

less of whether the input variable providing the split is categorical or numeric.

Splitting on a Numeric Input Variable

When searching for a binary split on a numeric input variable, each value that

the variable takes on in the training set is treated as a candidate value for

the split. Splits on a numeric variable take the form X<N. All records where the

value of X (the splitting variable) is less than some constant N are sent to one

child and all records where the value of X is greater than or equal to N are sent

to the other. After each trial split, the increase in purity, if any, due to the

split is measured. In the interests of efficiency, some implementations of

the splitting algorithm do not actually evaluate every value; they evaluate a

sample of the values instead.

When the decision tree is scored, the only use that it makes of numeric

inputs is to compare their values with the split points. They are never multiÂ

plied by weights or added together as they are in many other types of models.

This has the important consequence that decision trees are not sensitive to outÂ

liers or skewed distributions of numeric variables, because the tree only uses

the rank of numeric variables and not their absolute values.

174 Chapter 6

Splitting on a Categorical Input Variable

The simplest algorithm for splitting on a categorical input variable is simply to

create a new branch for each class that the categorical variable can take on. So,

if color is chosen as the best field on which to split the root node, and the trainÂ

ing set includes records that take on the values red, orange, yellow, green, blue,

indigo, and violet, then there will be seven nodes in the next level of the tree.

This approach is actually used by some software packages, but it often yields

poor results. High branching factors quickly reduce the population of training

records available at each node in lower levels of the tree, making further splits

less reliable.

A more common approach is to group together the classes that, taken indiÂ

vidually, predict similar outcomes. More precisely, if two classes of the input

variable yield distributions of the classes of the output variable that do not

differ significantly from one another, the two classes can be merged. The usual

test for whether the distributions differ significantly is the chi-square test.

Splitting in the Presence of Missing Values

One of the nicest things about decision trees is their ability to handle missing

values in either numeric or categorical input fields by simply considering null

to be a possible value with its own branch. This approach is preferable to

throwing out records with missing values or trying to impute missing values.

Throwing out records due to missing values is likely to create a biased training

set because the records with missing values are not likely to be a random samÂ

ple of the population. Replacing missing values with imputed values has the

risk that important information provided by the fact that a value is missing will

be ignored in the model. We have seen many cases where the fact that a particÂ

ular value is null has predictive value. In one such case, the count of non-null

values in appended household-level demographic data was positively correÂ

lated with response to an offer of term life insurance. Apparently, people who

leave many traces in Acxiomâ€™s household database (by buying houses, getting

married, registering products, and subscribing to magazines) are more likely to

be interested in life insurance than those whose lifestyles leave more fields null.

T I P Decision trees can produce splits based on missing values of an input

variable. The fact that a value is null can often have predictive value so do not

be hasty to filter out records with missing values or to try to replace them with

imputed values.

Although splitting on null as a separate class is often quite valuable, at least

one data mining product offers an alternative approach as well. In Enterprise

Miner, each node stores several possible splitting rules, each one based on a

different input field. When a null value is encountered in the field that yields

Decision Trees 175

the best splits, the software uses the surrogate split based on the next best availÂ

able input variable.

Growing the Full Tree

The initial split produces two or more child nodes, each of which is then split in

the same manner as the root node. Once again, all input fields are considered as

candidate splitters, even fields already used for splits. However, fields that take

on only one value are eliminated from consideration since there is no way that

they can be used to create a split. A categorical field that has been used as a

splitter higher up in the tree is likely to become single-valued fairly quickly. The

best split for each of the remaining fields is determined. When no split can be

found that significantly increases the purity of a given node, or when the

ñòð. 38 |