

Piecewise Regression Using Trees
Another example of combining trees with other modeling methods is a form of
piecewise linear regression in which each split in a decision tree is chosen so as
to minimize the error of a simple regression model on the data at that node.
The same method can be applied to logistic regression for categorical target
variables.
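
As a rough illustration of the idea, the sketch below picks the split point that minimizes the combined error of two simple regression lines, one fitted on each side of the split. The simplifications are ours, not the text's: a single numeric input serves as both the split variable and the regressor, the simple model is an ordinary least-squares line, and error is the sum of squared residuals. The function names and sample data are hypothetical.

```python
def fit_line_sse(xs, ys):
    """Fit a least-squares line to (xs, ys) and return its sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0  # constant x: fall back to the mean
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def best_regression_split(xs, ys, min_leaf=3):
    """Return (threshold, total_sse) for the split whose two line fits have the
    lowest combined error, or (None, sse_without_split) if no split helps."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_sse = None, fit_line_sse(xs, ys)
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left_x, left_y = zip(*pairs[:i])
        right_x, right_y = zip(*pairs[i:])
        sse = fit_line_sse(left_x, left_y) + fit_line_sse(right_x, right_y)
        if sse < best_sse:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_sse = sse
    return best_threshold, best_sse


# Hypothetical data: rises along one line, then falls along another.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 1, 2, 3, 4, 10, 9, 8, 7, 6]
threshold, sse = best_regression_split(xs, ys)
print(threshold, sse)
```

A full tree would apply the same search recursively to each side and over every candidate input field.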

Alternate Representations for Decision Trees
The traditional tree diagram is a very effective way of representing the actual
structure of a decision tree. Other representations are sometimes more useful
when the focus is more on the relative sizes and concentrations of the nodes.

Box Diagrams
While the tree diagram and Twenty Questions analogy are helpful in visualizing
certain properties of decision-tree methods, in some cases, a box diagram
is more revealing. Figure 6.13 shows the box diagram representation of a decision
tree that tries to classify people as male or female based on their ages and
the movies they have seen recently. The diagram may be viewed as a sort of
nested collection of two-dimensional scatter plots.
At the root node of the decision tree, the first three-way split is based on which
of three groups the survey respondent's most recently seen movie falls into. In the
outermost box of the diagram, the horizontal axis represents that field. The
outermost box is divided into sections, one for each node at the next level of the tree.
The size of each section is proportional to the number of records that fall into it.
Next, the vertical axis of each box is used to represent the field that is used as the
next splitter for that node. In general, this will be a different field for each box.
200 Chapter 6

[Figure 6.13: box diagram. Box labels include Last Movie in Group 1, 2, and 3, and age splits at 27 and 41.]

Figure 6.13 A box diagram represents a decision tree. Shading is proportional to the
purity of the box; size is proportional to the number of records that land there.

There is now a new set of boxes, each of which represents a node at the third
level of the tree. This process continues, dividing boxes until the leaves of the
tree each have their own box. Since decision trees often have nonuniform
depth, some boxes may be subdivided more often than others. Box diagrams
make it easy to represent classification rules that depend on any number of
variables on a two-dimensional chart.
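
The two quantities a box diagram encodes can be computed directly from the class counts in each leaf. The following is a minimal sketch, assuming that purity means the share of the majority class in a box (other purity measures would serve equally well); the leaf names and counts are hypothetical and are not taken from Figure 6.13.

```python
def box_geometry(leaf_counts):
    """For each leaf, return its share of all records (drawn as box size)
    and its purity (drawn as shading), plus the majority class."""
    total = sum(sum(counts.values()) for counts in leaf_counts.values())
    boxes = {}
    for leaf, counts in leaf_counts.items():
        n = sum(counts.values())
        boxes[leaf] = {
            "area_share": n / total,              # size of the box
            "purity": max(counts.values()) / n,   # shading of the box
            "majority": max(counts, key=counts.get),
        }
    return boxes


# Hypothetical leaves of a male/female classification tree.
leaves = {
    "group 1, age <= 27": {"male": 30, "female": 10},
    "group 1, age > 27": {"male": 5, "female": 15},
    "group 2": {"male": 18, "female": 22},
}
for leaf, box in box_geometry(leaves).items():
    print(leaf, box)
```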
The resulting diagram is very expressive. As we toss records onto the grid,
they fall into a particular box and are classified accordingly. A box chart allows
us to look at the data at several levels of detail. Figure 6.13 shows at a glance
that the bottom left contains a high concentration of males.
Taking a closer look, we find some boxes that seem to do a particularly good
job at classification or collect a large number of records. Viewed this way, it is
natural to think of decision trees as a way of drawing boxes around groups of
similar points. All of the points within a particular box are classified the same
way because they all meet the rule defining that box. This is in contrast to
classical statistical classification methods such as linear, logistic, and quadratic
discriminants that attempt to partition data into classes by drawing a line or
elliptical curve through the data space. This is a fundamental distinction:
Statistical approaches that use a single line to find the boundary between classes
are weak when there are several very different ways for a record to become
part of the target class. Figure 6.14 illustrates this point using two species of
dinosaur. The decision tree (represented as a box diagram) has successfully
isolated the stegosaurs from the triceratops.
In the credit card industry, for example, there are several ways for customers
to be profitable. Some profitable customers have low transaction rates, but
keep high revolving balances without defaulting. Others pay off their balance
in full each month, but are profitable due to the high transaction volume they
generate. Yet others have few transactions, but occasionally make a large
purchase and take several months to pay it off. Two very dissimilar customers
may be equally profitable. A decision tree can find each separate group, label
it, and by providing a description of the box itself, suggest the reason for each
group's profitability.
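
A one-dimensional toy case makes the contrast concrete. In the sketch below, a single cut point stands in for the single line of a linear discriminant, and a pair of cuts stands in for a small tree with two splits. The data are hypothetical: customers are marked profitable when their monthly transaction count is either very low or very high, a deliberate oversimplification of the groups described above.

```python
def best_single_cut_accuracy(xs, labels):
    """Best accuracy achievable by one cut point on x, in either direction."""
    n = len(xs)
    positives = sum(labels)
    best_correct = max(positives, n - positives)  # no cut: predict the majority
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2
        correct = sum((x > cut) == bool(y) for x, y in zip(xs, labels))
        # n - correct is the score of the mirror-image rule (x <= cut).
        best_correct = max(best_correct, correct, n - correct)
    return best_correct / n


def two_cut_accuracy(xs, labels, low_cut, high_cut):
    """Accuracy of a two-split rule: target class if x <= low_cut or x > high_cut."""
    correct = sum(
        ((x <= low_cut) or (x > high_cut)) == bool(y) for x, y in zip(xs, labels)
    )
    return correct / len(xs)


# Hypothetical data: 1 = profitable, 0 = not profitable.
monthly_transactions = [1, 2, 3, 10, 11, 12, 13, 30, 31, 32]
profitable = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]

print(best_single_cut_accuracy(monthly_transactions, profitable))
print(two_cut_accuracy(monthly_transactions, profitable, 6.5, 21.5))
```

No single boundary classifies every customer correctly here, because the target class sits on both sides of the non-target class; two boxes drawn around the separate groups do.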

Tree Ring Diagrams
Another clever representation of a decision tree is used by the Enterprise
Miner product from SAS Institute. The diagram in Figure 6.15 looks as though
the tree has been cut down and we are looking at the stump.

Figure 6.14 Often a simple line or curve cannot separate the regions and a decision tree
does better.

Figure 6.15 A tree ring diagram produced by SAS Enterprise Miner summarizes the
different levels of the tree.

The circle at the center of the diagram represents the root node, before any
splits have been made. Moving out from the center, each concentric ring
represents a new level in the tree. The ring closest to the center represents the root
node split. The arc length is proportional to the number of records taking each
of the two paths, and the shading represents the node's purity. The first split in
the model represented by this diagram is fairly unbalanced. It divides the
records into two groups, a large one where the concentration is little different
from the parent population, and a small one with a high concentration of the
target class. At the next level, this smaller node is again split and one branch,
represented by the thin, dark pie slice that extends all the way through to the
outermost ring of the diagram, is a leaf node.
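
The geometry of such a diagram is simple to derive from node counts. The sketch below is our own illustration, not a description of how Enterprise Miner draws the figure: each node's arc sweeps a share of 360 degrees equal to its share of the records, each node carries its concentration of the target class for shading, and a leaf is repeated in every deeper ring so that its slice extends to the outer edge. The `Node` class and the counts are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    records: int   # number of records reaching this node
    target: int    # how many of them belong to the target class
    children: list = field(default_factory=list)


def depth(node):
    """Number of levels in the tree, counting the root as level one."""
    return 1 + max((depth(child) for child in node.children), default=0)


def ring_levels(root):
    """Return one list per ring, center first. Each entry is
    (node name, arc sweep in degrees, concentration of the target class)."""
    rings = []
    level = [root]
    for _ in range(depth(root)):
        rings.append(
            [(n.name, 360.0 * n.records / root.records, n.target / n.records)
             for n in level]
        )
        # A leaf has no children, so it carries through to the next ring.
        level = [child for n in level for child in (n.children or [n])]
    return rings


# Hypothetical unbalanced tree: a large dilute branch and a small rich one.
tree = Node("root", 1000, 100, [
    Node("large", 900, 60),
    Node("small", 100, 40, [
        Node("small-leaf", 20, 18),
        Node("small-rest", 80, 22, [Node("a", 30, 15), Node("b", 50, 7)]),
    ]),
])
for ring in ring_levels(tree):
    print(ring)
```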
The ring diagram shows the tree's depth and complexity at a glance and
indicates the location of high concentrations of the target class. What it does
not show directly are the rules defining the nodes. The software reveals these
when a user clicks on a particular section of the diagram.

Decision Trees in Practice
Decision trees can be applied in many different situations:
  To explore a large dataset to pick out useful variables
  To predict future states of important variables in an industrial process
  To form directed clusters of customers for a recommendation system

This section includes examples of decision trees being used in all of these ways.

Decision Trees as a Data Exploration Tool
During the data exploration phase of a data mining project, decision trees are a
useful tool for picking the variables that are likely to be important for predicting
particular targets. One of our newspaper clients, The Boston Globe, was interested
in estimating a town's expected home delivery circulation level based on
various demographic and geographic characteristics. Armed with such estimates,
they would, among other things, be able to spot towns with untapped
potential where the actual circulation was lower than the expected circulation.
The final model would be a regression equation based on a handful of variables.
But which variables? And what exactly would the regression attempt to
estimate? Before building the regression model, we used decision trees to help
explore these questions.
Although the newspaper was ultimately interested in predicting the actual
number of subscribing households in a given city or town, that number does
not make a good target for a regression model because towns and cities vary
so much in size. It is not useful to waste modeling power on discovering that
there are more subscribers in large towns than in small ones. A better target is
the penetration, that is, the proportion of households that subscribe to the paper.
Multiplying this number by the number of households in a town yields an estimate
of the total number of subscribing households. Factoring out town
size yields a target variable with values that range from zero to somewhat less
than one.
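
The arithmetic is simple enough to sketch. The functions below compute penetration, convert a predicted penetration back into an expected number of subscribing households, and measure the gap that marks a town with untapped potential. All names and figures are hypothetical; none come from the newspaper's data.

```python
def penetration(subscribers, households):
    """Proportion of households in a town that subscribe."""
    return subscribers / households


def expected_subscribers(predicted_penetration, households):
    """Turn a modeled penetration back into a household count."""
    return predicted_penetration * households


def untapped_potential(actual_subscribers, predicted_penetration, households):
    """Positive when actual circulation falls short of expected circulation."""
    return expected_subscribers(predicted_penetration, households) - actual_subscribers


# Hypothetical town: 8,000 households, 2,000 subscribers, model predicts 0.5.
print(penetration(2000, 8000))
print(expected_subscribers(0.5, 8000))
print(untapped_potential(2000, 0.5, 8000))
```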
The next step was to figure out which factors, from among the hundreds in
the town signature, separate towns with high penetration (the “good” towns)
from those with low penetration (the “bad” towns). Our approach was to
build a decision tree with a binary good/bad target variable. This involved sorting
the towns by home delivery penetration and labeling the top one third
“good” and the bottom one third “bad.” Towns in the middle third (those that
are not clearly good or bad) were left out of the training set. The screen shot
in Figure 6.16 shows the top few levels of one of the resulting trees.
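
The labeling step can be sketched as follows. The function name, the town names, and the penetration values are hypothetical, and rounding the size of a third down when the number of towns is not divisible by three is our assumption.

```python
def label_thirds(penetration_by_town):
    """Label the top third of towns "good" and the bottom third "bad".
    Towns in the middle third are omitted from the result entirely."""
    ranked = sorted(penetration_by_town, key=penetration_by_town.get, reverse=True)
    third = len(ranked) // 3
    labels = {town: "good" for town in ranked[:third]}
    labels.update({town: "bad" for town in ranked[len(ranked) - third:]})
    return labels


# Hypothetical towns and home delivery penetration values.
towns = {
    "Town A": 0.42, "Town B": 0.05, "Town C": 0.21,
    "Town D": 0.33, "Town E": 0.11, "Town F": 0.18,
}
training_labels = label_thirds(towns)
print(training_labels)
```

Dropping the middle third trades training set size for a cleaner contrast between the two classes.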

Figure 6.16 A decision tree separates good towns from the bad, as visualized by Insightful

The tree shows that median home value is the best first split. Towns where
the median home value (in a region with some of the most expensive housing
in the country) is less than $226,000 are poor prospects for this paper.
The split at the next level is more surprising. The variable chosen for the split
is one of a family of derived variables comparing the subscriber base in the
town to the town population as a whole. Towns where the subscribers are similar
to the general population are better, in terms of home delivery penetration,
than towns where the subscribers are farther from the mean. Other variables
that were important for distinguishing good from bad towns included the
mean years of school completed, the percentage of the population in blue-collar
occupations, and the percentage of the population in high-status occupations.
All of these ended up as inputs to the regression model.
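
One way to see how a tree singles out a variable is to score every candidate field by the best single split it offers and rank the fields by that score. The sketch below uses reduction in Gini impurity as the score; that choice is an assumption on our part, since the text does not say which splitting criterion the software used. The field names and values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_gain(values, labels):
    """Largest reduction in Gini impurity from any single binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini([label for _, label in pairs])
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_variables(table, labels):
    """Rank fields by the purity gain of their best single split."""
    gains = {name: best_split_gain(column, labels) for name, column in table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical town signature: 1 = "good" town, 0 = "bad" town.
good_town = [1, 1, 1, 0, 0, 0]
signature = {
    "median_home_value": [400, 350, 300, 200, 180, 150],   # thousands of dollars
    "distance_from_boston": [5, 40, 20, 10, 45, 25],       # miles
}
print(rank_variables(signature, good_town))
```

A ranking like this only reflects the first level of the tree; variables that matter only within a branch show up when the same scoring is repeated on each child node.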
Some other variables that we had expected to be important, such as distance
from Boston and household income, turned out to be less powerful. Once the
decision tree has thrown a spotlight on a variable by either including it or failing
to use it, the reason often becomes clear with a little thought. The problem
with distance from Boston, for instance, is that as one first drives out into the
suburbs, home delivery penetration goes up with distance from Boston. After a while,
however, distance from Boston becomes negatively correlated with penetration
as people far from Boston do not care as much about what goes on there.
Home price is a better predictor because its distribution resembles that of the
target variable, increasing in the first few miles and then declining. The decision
tree provides guidance about which variables to think about as well as
which variables to use.

Applying Decision-Tree Methods to Sequential Events
Predicting the future is one of the most important applications of data mining.
The task of analyzing trends in historical data in order to predict future behavior
recurs in every domain we have examined.
One of our clients, a major bank, looked at the detailed transaction data from
its customers in order to spot earlier warning signs for attrition in its checking
accounts. ATM withdrawals, payroll-direct deposits, balance inquiries, visits to
the teller, and hundreds of other transaction types and customer attributes were
tracked over time to find signatures that allow the bank to recognize that a
customer's loyalty is beginning to weaken while there is still time to take corrective
action.
Another client, a manufacturer of diesel engines, used the decision tree component
of SPSS's Clementine data mining suite to forecast diesel engine sales
based on historical truck registration data. The goal was to identify individual
owner-operators who were likely to be ready to trade in the engines of their
big rigs.
Sales, profits, failure modes, fashion trends, commodity prices, operating
temperatures, interest rates, call volumes, response rates, and return rates:
People are trying to predict them all. In some fields, notably economics, the analysis
of time-series data is a central preoccupation of statistical analysts, so you
might expect there to be a large collection of ready-made techniques available
to be applied to predictive data mining on time-ordered data. Unfortunately,
this is not the case.
For one thing, much of the time-series analysis work in other fields focuses
on analyzing patterns in a single variable such as the dollar-yen exchange rate
or unemployment in isolation. Corporate data warehouses may well contain
data that exhibits cyclical patterns. Certainly, average daily balances in checking
accounts reflect that rents are typically due on the first of the month and
that many people are paid on Fridays, but, for the most part, these sorts of patterns
are not of interest because they are neither unexpected nor actionable.
In commercial data mining, our focus is on how a large number of independent
variables combine to predict some future outcome. Chapter 9 discusses
how time can be integrated into association rules in order to find sequential
patterns. Decision-tree methods have also been applied very successfully in
this domain, but it is generally necessary to enrich the data with trend information
by including fields such as differences and rates of change that explicitly
represent change over time. Chapter 17 discusses these data preparation
issues in more detail. The following section describes an application that
automatically generates these derived fields and uses them to build a tree-based
simulator that can be used to project an entire database into the future.
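
To make the idea of trend fields concrete, the sketch below derives differences and rates of change from a short series of monthly values so that a tree can split on them like any other field. It illustrates the kind of enrichment described above and is not the application discussed in the next section; the function name, the output field names, and the sample balances are hypothetical.

```python
def trend_fields(series):
    """Derive fields that explicitly represent change over time
    from a time-ordered list of at least two values."""
    diffs = [later - earlier for earlier, later in zip(series, series[1:])]
    rates = [
        (later - earlier) / earlier if earlier != 0 else None  # undefined from zero
        for earlier, later in zip(series, series[1:])
    ]
    return {
        "last_value": series[-1],
        "last_diff": diffs[-1],
        "last_rate": rates[-1],
        "mean_diff": sum(diffs) / len(diffs),
    }


# Hypothetical month-end checking balances for one customer.
balances = [1000, 800, 600, 300]
print(trend_fields(balances))
```

Each derived field becomes one more column in the customer's record, which is what allows a method with no built-in notion of time to respond to a declining balance.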

Simulating the Future
This discussion is largely based on discussions with Marc Goodman and on

