. 42
( 137 .)


Extracting Rules from Trees
When a decision tree is used primarily to generate scores, it is easy to forget
that a decision tree is actually a collection of rules. If one of the purposes of the
data mining effort is to gain understanding of the problem domain, it can be
useful to reduce the huge tangle of rules in a decision tree to a smaller, more
comprehensible collection.
There are other situations where the desired output is a set of rules. In
Mastering Data Mining, we describe the application of decision trees to an
industrial process improvement problem, namely the prevention of a certain
type of printing defect. In that case, the end product of the data mining project
was a small collection of simple rules that could be posted on the wall next to
each press.
When a decision tree is used for producing scores, having a large number of
leaves is advantageous because each leaf generates a different score. When the
object is to generate rules, the fewer rules the better. Fortunately, it is often pos­
sible to collapse a complex tree into a smaller set of rules.
The first step in that direction is to combine paths that lead to leaves that
make the same classification. The partial decision tree in Figure 6.9 yields the
following rules:
Watch the game and home team wins and out with friends then beer.
Watch the game and home team wins and sitting at home then diet soda.
Watch the game and home team loses and out with friends then beer.
Watch the game and home team loses and sitting at home then milk.
The two rules that predict beer can be combined by eliminating the test for
whether the home team wins or loses. That test is important for discriminating
between milk and diet soda, but has no bearing on beer consumption. The
new, simpler rule is:
Watch the game and out with friends then beer.
194 Chapter 6

Watch the game?

No Yes
Home team wins?

No Yes
Out with friends? Out with friends?

No Yes
No Yes

Diet soda Beer Milk Beer
Figure 6.9 Multiple paths lead to the same conclusion.

Up to this point, nothing is controversial because no information has been
lost, but C5™s rule generator goes farther. It attempts to generalize each rule by
removing clauses, then comparing the predicted error rate of the new, briefer
rule to that of the original using the same pessimistic error rate assumption
used for pruning the tree in the first place. Often, the rules for several different
leaves generalize to the same rule, so this process results in fewer rules than
the decision tree had leaves.
In the decision tree, every record ends up at exactly one leaf, so every record
has a definitive classification. After the rule-generalization process, however,
there may be rules that are not mutually exclusive and records that are not cov­
ered by any rule. Simply picking one rule when more than one is applicable
can solve the first problem. The second problem requires the introduction of a
default class assigned to any record not covered by any of the rules. Typically,
the most frequently occurring class is chosen as the default.
Once it has created a set of generalized rules, Quinlan™s C5 algorithm
groups the rules for each class together and eliminates those that do not seem
to contribute much to the accuracy of the set of rules as a whole. The end result
is a small number of easy to understand rules.
Decision Trees 195

Taking Cost into Account

In the discussion so far, the error rate has been the sole measure for evaluating
the fitness of rules and subtrees. In many applications, however, the costs of
misclassification vary from class to class. Certainly, in a medical diagnosis, a
false negative can be more harmful than a false positive; a scary Pap smear
result that, on further investigation, proves to have been a false positive, is
much preferable to an undetected cancer. A cost function multiplies the prob­
ability of misclassification by a weight indicating the cost of that misclassifica­
tion. Several tools allow the use of such a cost function instead of an error
function for building decision trees.

Further Refinements to the Decision Tree Method
Although they are not found in most commercial data mining software pack­
ages, there are some interesting refinements to the basic decision tree method
that are worth discussing.

Using More Than One Field at a Time
Most decision tree algorithms test a single variable to perform each split. This
approach can be problematic for several reasons, not least of which is that it
can lead to trees with more nodes than necessary. Extra nodes are cause for
concern because only the training records that arrive at a given node are avail­
able for inducing the subtree below it. The fewer training examples per node,
the less stable the resulting model.
Suppose that we are interested in a condition for which both age and gender
are important indicators. If the root node split is on age, then each child node
contains only about half the women. If the initial split is on gender, then each
child node contains only about half the old folks.
Several algorithms have been developed to allow multiple attributes to be
used in combination to form the splitter. One technique forms Boolean con­
junctions of features in order to reduce the complexity of the tree. After find­
ing the feature that forms the best split, the algorithm looks for the feature
which, when combined with the feature chosen first, does the best job of
improving the split. Features continue to be added as long as there continues
to be a statistically significant improvement in the resulting split.
This procedure can lead to a much more efficient representation of classifi­
cation rules. As an example, consider the task of classifying the results of a
vote according to whether the motion was passed unanimously. For simplicity,
consider the case where there are only three votes cast. (The degree of simpli­
fication to be made only increases with the number of voters.)
Table 6.1 contains all possible combinations of three votes and an added col­
umn to indicate the unanimity of the result.
196 Chapter 6

Table 6.1 All Possible Combinations of Votes by Three Voters


Nay Nay Nay TRUE

Nay Nay Aye FALSE

Nay Aye Nay FALSE

Nay Aye Aye FALSE

Aye Nay Nay FALSE

Aye Nay Aye FALSE

Aye Aye Nay FALSE

Aye Aye Aye TRUE

Figure 6.10 shows a tree that perfectly classifies the training data, requiring
five internal splitting nodes. Do not worry about how this tree is created, since
that is unnecessary to the point we are making.
Allowing features to be combined using the logical and function to form
conjunctions yields the much simpler tree in Figure 6.11. The second tree illus­
trates another potential advantage that can arise from using combinations of
fields. The tree now comes much closer to expressing the notion of unanimity
that inspired the classes: “When all voters agree, the decision is unanimous.”

Voter #1

Yes No
Voter #2 Voter #2

Yes No Yes No

Voter #3 Voter #3
False False

Yes No Yes No

True False False True
Figure 6.10 The best binary tree for the unanimity function when splitting on single fields.
Decision Trees 197

Voter #1 and Voter #2 and Voter #3 all vote yes?

Yes No Voter #1 and Voter #2 and
True Voter #3 all vote no?

Yes No

True False
Figure 6.11 Combining features simplifies the tree for defining unanimity.

A tree that can be understood all at once is said, by machine learning
researchers, to have good “mental fit.” Some researchers in the machine learn­
ing field attach great importance to this notion, but that seems to be an artifact
of the tiny, well-structured problems around which they build their studies. In
the real world, if a classification task is so simple that you can get your mind
around the entire decision tree that represents it, you probably don™t need to
waste your time with powerful data mining tools to discover it. We believe
that the ability to understand the rule that leads to any particular leaf is very
important; on the other hand, the ability to interpret an entire decision tree at
a glance is neither important nor likely to be possible outside of the laboratory.

Tilting the Hyperplane
Classification problems are sometimes presented in geometric terms. This way
of thinking is especially natural for datasets having continuous variables for
all fields. In this interpretation, each record is a point in a multidimensional
space. Each field represents the position of the record along one axis of the
space. Decision trees are a way of carving the space into regions, each of which
is labeled with a class. Any new record that falls into one of the regions is clas­
sified accordingly.
Traditional decision trees, which test the value of a single field at each node,
can only form rectangular regions. In a two-dimensional space, a test of the form
Y less than some constant forms a region bounded by a line perpendicular to
the Y-axis and parallel to the X-axis. Different values for the constant cause the
line to move up and down, but the line remains horizontal. Similarly, in a space
of higher dimensionality, a test on a single field defines a hyperplane that is per­
pendicular to the axis represented by the field used in the test and parallel to all
the other axes. In a two-dimensional space, with only horizontal and vertical
lines to work with, the resulting regions are rectangular. In three-dimensional
198 Chapter 6

space, the corresponding shapes are rectangular solids, and in any multidi­
mensional space, there are hyper-rectangles.
The problem is that some things don™t fit neatly into rectangular boxes.
Figure 6.12 illustrates the problem: The two regions are really divided by a
diagonal line; it takes a deep tree to generate enough rectangles to approxi­
mate it adequately.
In this case, the true solution can be found easily by allowing linear combi­
nations of the attributes to be considered. Some software packages attempt to
tilt the hyperplanes by basing their splits on a weighted sum of the values of the
fields. There are a variety of hill-climbing approaches for selecting the weights.
Of course, it is easy to come up with regions that are not captured easily
even when diagonal lines are allowed. Regions may have curved boundaries
and fields may have to be combined in more complex ways (such as multiply­
ing length by width to get area). There is no substitute for the careful selection
of fields to be inputs to the tree-building process and, where necessary, the cre­
ation of derived fields that capture relationships known or suspected by
domain experts. These derived fields may be functions of several other fields.
Such derived fields inserted manually serve the same purpose as automati­
cally combining fields to tilt the hyperplane.

Figure 6.12 The upper-left and lower-right quadrants are easily classified, while the other
two quadrants must be carved up into many small boxes to approximate the boundary
between the regions.
Decision Trees 199

Neural Trees
One way of combining input from many fields at every node is to have each
node consist of a small neural network. For domains where rectangular
regions do a poor job describing the true shapes of the classes, neural trees can
produce more accurate classifications, while being quicker to train and to score
than pure neural networks.
From the point of view of the user, this hybrid technique has more in com­
mon with neural-network variants than it does with decision-tree variants
because, in common with other neural-network techniques, it is not capable of
explaining its decisions. The tree still produces rules, but these are of the form
F(w1x1, w2x2,w3x3, . . .) ¤ N, where F is the combining function used by the
neural network. Such rules make more sense to neural network software than
to people.


. 42
( 137 .)