

his 1995 doctoral dissertation on a technique called projective visualization. Pro­
jective visualization uses a database of snapshots of historical data to develop
a simulator. The simulation can be run to project the values of all variables into
the future. The result is an extended database whose new records have exactly
the same fields as the original, but with values supplied by the simulator
rather than by observation. The approach is described in more detail in the
technical aside.

Case Study: Process Control in a Coffee-Roasting Plant
Nestlé, one of the largest food and beverages companies in the world, used a
number of continuous-feed coffee roasters to produce a variety of coffee
products including Nescafé Granules, Gold Blend, Gold Blend Decaf, and Blend 37.
Each of these products has a “recipe” that specifies target values for a plethora
of roaster variables such as the temperature of the air at various exhaust
points, the speed of various fans, the rate that gas is burned, the amount of
water introduced to quench the beans, and the positions of various flaps and
valves. There are a lot of ways for things to go wrong when roasting coffee,
ranging from a roast coming out too light in color to a costly and damaging
roaster fire. A bad batch of roasted coffee incurs a big cost; damage to equip­
ment is even more expensive.
To help operators keep the roaster running properly, data is collected from
about 60 sensors. Every 30 seconds, this data, along with control information,
is written to a log and made available to operators in the form of graphs. The
project described here took place at a Nestlé research laboratory in York,
England. Nestlé used projective visualization to build a coffee roaster
simulation based on the sensor logs.

Goals for the Simulator
Nestlé saw several ways that a coffee roaster simulator could improve its
operations:

- By using the simulator to try out new recipes, a large number of new
  recipes could be evaluated without interrupting production. Furthermore,
  recipes that might lead to roaster fires or other damage could be
  eliminated in advance.
- The simulator could be used to train new operators and expose them to
  routine problems and their solutions. Using the simulator, operators
  could try out different approaches to resolving a problem.
Decision Trees 207


Using Goodman's terminology, which comes from the machine learning field,
each snapshot of a moment in time is called a case. A case is made up of
attributes, which are the fields in the case record. Attributes may be of any data
type and may be continuous or categorical. The attributes are used to form
features. Features are Boolean (yes/no) variables that are combined in various
ways to form the internal nodes of a decision tree. For example, if the database
contains a numeric salary field, a continuous attribute, then that might lead to
creation of a feature such as salary < 38,500.
For a continuous variable like salary, a feature of the form attribute ≤ value is
generated for every value observed in the training set. This means that there
are potentially as many features derived from an attribute as there are cases in
the training set. Features based on equality or set membership are generated
for symbolic attributes and literal attributes such as names of people or places.
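The feature-generation step described above can be sketched in a few lines. This is a minimal illustration, not Goodman's actual software: it builds one Boolean threshold predicate per distinct observed value of a continuous attribute, and one equality predicate per value of a symbolic attribute. All names here are illustrative.

```python
# One Boolean feature per observed value of a continuous attribute,
# and one equality feature per value of a symbolic attribute.

def threshold_features(cases, attribute):
    """Return (name, predicate) pairs of the form attribute <= value."""
    values = sorted({case[attribute] for case in cases})
    return [(f"{attribute} <= {v}",
             lambda case, v=v: case[attribute] <= v) for v in values]

def equality_features(cases, attribute):
    """Return (name, predicate) pairs of the form attribute == value."""
    values = sorted({case[attribute] for case in cases})
    return [(f"{attribute} == {v}",
             lambda case, v=v: case[attribute] == v) for v in values]

cases = [{"salary": 30000, "region": "north"},
         {"salary": 38500, "region": "south"},
         {"salary": 52000, "region": "north"}]

feats = threshold_features(cases, "salary")
# Three distinct salaries yield three candidate features,
# including "salary <= 38500".
```

As the text notes, this generates as many candidate features per continuous attribute as there are distinct values in the training set, which is why tree-growing software must search over them efficiently.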
The attributes are also used to generate interpretations; these are new
attributes derived from the given ones. Interpretations generally reflect knowl­
edge of the domain and what sorts of relationships are likely to be important.
In the current problem, finding patterns that occur over time, the amount,
direction, and rate of change in the value of an attribute from one time period
to the next are likely to be important. Therefore, for each numeric attribute, the
software automatically generates interpretations for the difference and the
discrete first and second derivatives of the attribute.
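A rough sketch of those automatically generated time-based interpretations, assuming snapshots arrive at a fixed interval (30 seconds in the roaster logs). The function name and representation are illustrative, not from the original system.

```python
# For each numeric attribute, derive the difference between consecutive
# snapshots plus discrete first and second derivatives. Entries are None
# where not enough history exists.

def add_time_interpretations(series, dt=30.0):
    diff = [None] + [series[i] - series[i - 1] for i in range(1, len(series))]
    d1 = [None if d is None else d / dt for d in diff]          # first derivative
    d2 = [None, None] + [(d1[i] - d1[i - 1]) / dt              # second derivative
                         for i in range(2, len(series))]
    return diff, d1, d2

temps = [180.0, 183.0, 187.5, 190.5]       # one product temperature per snapshot
diff, d1, d2 = add_time_interpretations(temps)
# diff[1] == 3.0 degrees; d1[1] == 0.1 degrees per second
```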
In general, however, the user supplies interpretations. For example, in a
credit risk model, it is likely that the ratio of debt to income is more predictive
than the magnitude of either. With this knowledge we might add an inter­
pretation that was the ratio of those two attributes. Often, user-supplied inter­
pretations combine attributes in ways that the program would not come up
with automatically. Examples include calculating a great-circle distance from
changes in latitude and longitude or taking the product of three linear
measurements to get a volume.
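Two of the user-supplied interpretations mentioned above can be sketched as follows; the function names are hypothetical, and the great-circle distance uses the standard haversine formula.

```python
import math

def debt_to_income(case):
    """Ratio interpretation for a credit risk model."""
    return case["debt"] / case["income"] if case["income"] else None

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine formula: distance between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```

Neither combination would be found by the automatic feature generator, which only considers attributes one at a time.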
The central idea behind projective visualization is to use the historical cases to
generate a set of rules for generating case n+1 from case n. When this model is
applied to the final observed case, it generates a new projected case. To project
more than one time step into the future, we continue to apply the model to the
most recently created case. Naturally, confidence in the projected values de­
creases as the simulation is run for more and more time steps.
The figure illustrates the way a single attribute is projected using a decision
tree based on the features generated from all the other attributes and
interpretations in the previous case. During the training process, a separate
decision tree is grown for each attribute. This entire forest is evaluated in order
to move from one simulation step to the next.
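The projection loop itself is simple; the sketch below stands in for the trained forest with plain callables, one per attribute, since the per-attribute decision trees are the part this illustration assumes rather than implements.

```python
# Roll the last observed case forward: each step applies every
# per-attribute model to the previous (observed or projected) case.

def project(last_case, models, steps):
    """models maps attribute name -> function(previous_case) -> new value."""
    trajectory = [last_case]
    for _ in range(steps):
        prev = trajectory[-1]
        trajectory.append({attr: f(prev) for attr, f in models.items()})
    return trajectory[1:]   # the projected cases only

# Toy stand-ins for trained trees: temperature drifts up, fan speed holds.
models = {"temp": lambda c: c["temp"] + 0.5,
          "fan":  lambda c: c["fan"]}
future = project({"temp": 200.0, "fan": 90}, models, steps=3)
# future[-1] == {"temp": 201.5, "fan": 90}
```

Because each step feeds on the previous step's projections, errors compound, which is why confidence falls as the horizon lengthens.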


[Figure: a decision tree whose nodes test fields of one snapshot. One snapshot uses decision trees to create the next snapshot in time.]

- The simulator could track the operation of the actual roaster and project
  it several minutes into the future. When the simulation ran into a problem,
  an alert could be generated while the operators still had time to avert
  trouble.

Evaluation of the Roaster Simulation
The simulation was built using a training set of 34,000 cases. The simulation
was then evaluated using a test set of around 40,000 additional cases that had
not been part of the training set. For each case in the test set, the simulator gen­
erated projected snapshots 60 steps into the future. At each step the projected
values of all variables were compared against the actual values. As expected,
the size of the error increases with time. For example, the error rate for prod­
uct temperature turned out to be 2/3°C per minute of projection, but even 30
minutes into the future the simulator is doing considerably better than ran­
dom guessing.
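The evaluation protocol described above, projecting each test case a fixed number of steps and comparing against the actuals, can be sketched as follows. The data here is toy data; only the shape of the computation mirrors the text.

```python
# Mean absolute error at each projection step, so error growth with
# horizon (such as the 2/3 degree C per minute figure for product
# temperature) can be read off directly.

def error_by_horizon(projections, actuals):
    """Each argument is a list of trajectories; each trajectory is a list
    of floats, one value per projection step."""
    steps = len(actuals[0])
    errs = []
    for t in range(steps):
        abs_errs = [abs(p[t] - a[t]) for p, a in zip(projections, actuals)]
        errs.append(sum(abs_errs) / len(abs_errs))
    return errs

proj = [[200.0, 201.0, 202.0], [180.0, 181.0, 182.0]]
act  = [[200.0, 200.5, 201.0], [180.0, 180.0, 180.5]]
print(error_by_horizon(proj, act))   # errors grow with horizon
```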
The roaster simulator turned out to be more accurate than all but the most
experienced operators at projecting trends, and even the most experienced
operators were able to do a better job with the aid of the simulator. Operators

enjoyed using the simulator and reported that it gave them new insight into
corrective actions.

Lessons Learned
Decision-tree methods have wide applicability for data exploration, classifica­
tion, and scoring. They can also be used for estimating continuous values
although they are rarely the first choice since decision trees generate “lumpy”
estimates: all records reaching the same leaf are assigned the same estimated
value. They are a good choice when the data mining task is classification of
records or prediction of discrete outcomes. Use decision trees when your goal
is to assign each record to one of a few broad categories. Theoretically, decision
trees can assign records to an arbitrary number of classes, but they are error-
prone when the number of training examples per class gets small. This can
happen rather quickly in a tree with many levels and/or many branches per
node. In many business contexts, problems naturally resolve to a binary
classification such as responder/nonresponder or good/bad, so this is not a
large problem in practice.
Decision trees are also a natural choice when the goal is to generate under­
standable and explainable rules. The ability of decision trees to generate rules
that can be translated into comprehensible natural language or SQL is one of
the greatest strengths of the technique. Even in complex decision trees, it is
generally fairly easy to follow any one path through the tree to a particular
leaf. So the explanation for any particular classification or prediction is rela­
tively straightforward.
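Translating a root-to-leaf path into a readable rule can be sketched as below. The tree representation is an assumption made for illustration: internal nodes are (attribute, threshold, left, right) tuples and leaves are class labels.

```python
# Follow a record down the tree, collecting the tests along the path,
# and emit them as a single human-readable rule.

def path_to_rule(tree, case):
    conditions = []
    node = tree
    while isinstance(node, tuple):
        attr, threshold, left, right = node
        if case[attr] <= threshold:
            conditions.append(f"{attr} <= {threshold}")
            node = left
        else:
            conditions.append(f"{attr} > {threshold}")
            node = right
    return f"IF {' AND '.join(conditions)} THEN class = {node}"

tree = ("temp", 210, ("fan", 80, "normal", "check fan"), "overheat")
print(path_to_rule(tree, {"temp": 215, "fan": 85}))
# -> IF temp > 210 THEN class = overheat
```

The same conjunction of conditions maps directly onto a SQL WHERE clause, which is what makes tree models easy to deploy and to explain.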
Decision trees require less data preparation than many other techniques
because they are equally adept at handling continuous and categorical vari­
ables. Categorical variables, which pose problems for neural networks and sta­
tistical techniques, are split by forming groups of classes. Continuous variables
are split by dividing their range of values. Because decision trees do not make
use of the actual values of numeric variables, they are not sensitive to outliers
and skewed distributions. This robustness comes at the cost of throwing away
some of the information that is available in the training data, so a well-tuned
neural network or regression model will often make better use of the same
fields than a decision tree. For that reason, decision trees are often used to pick
a good set of variables to be used as inputs to another modeling technique.
Time-oriented data does require a lot of data preparation. Time series data must
be enhanced so that trends and sequential patterns are made visible.
Decision trees reveal so much about the data to which they are applied
that the authors make use of them in the early phases of nearly every data
mining project, even when the final models are to be created using some other
technique.


Artificial Neural Networks

Artificial neural networks are popular because they have a proven track record
in many data mining and decision-support applications. Neural networks
(the “artificial” is usually dropped) are a class of powerful, general-purpose
tools readily applied to prediction, classification, and clustering. They have
been applied across a broad range of industries, from predicting time series in
the financial world to diagnosing medical conditions, from identifying clus­
ters of valuable customers to identifying fraudulent credit card transactions,
from recognizing numbers written on checks to predicting the failure rates of
engines.
The most powerful neural networks are, of course, the biological kind. The
human brain makes it possible for people to generalize from experience; com­
puters, on the other hand, usually excel at following explicit instructions over
and over. The appeal of neural networks is that they bridge this gap by mod­
eling, on a digital computer, the neural connections in human brains. When
used in well-defined domains, their ability to generalize and learn from data
mimics, in some sense, our own ability to learn from experience. This ability is
useful for data mining, and it also makes neural networks an exciting area for
research, promising new and better results in the future.
There is a drawback, though. The results of training a neural network are
internal weights distributed throughout the network. These weights provide
no more insight into why the solution is valid than dissecting a human brain
explains our thought processes. Perhaps one day, sophisticated techniques for


probing neural networks may help provide some explanation. In the mean­
time, neural networks are best approached as black boxes with internal work­
ings as mysterious as the workings of our brains. Like the responses of the
Oracle at Delphi worshipped by the ancient Greeks, the answers produced by
neural networks are often correct. They have business value, in many cases a
more important feature than providing an explanation.
This chapter starts with a bit of history; the origins of neural networks grew
out of actual attempts to model the human brain on computers. It then dis­
cusses an early case history of using this technique for real estate appraisal,
before diving into technical details. Most of the chapter presents neural net­
works as predictive modeling tools. At the end, we see how they can be used
for undirected data mining as well. A good place to begin is, as always, at the
beginning, with a bit of history.

A Bit of History
Neural networks have an interesting history in the annals of computer science.
The original work on the functioning of neurons (biological neurons) took
place in the 1930s and 1940s, before digital computers really even existed.
In 1943, Warren McCulloch, a neurophysiologist at Yale University, and Walter
Pitts, a logician, postulated a simple model to explain how biological neurons
work and published it in a paper called “A Logical Calculus of the Ideas
Immanent in Nervous Activity.” While their focus was on understanding the
anatomy of the
brain, it turned out that this model provided inspiration for the field of artifi­
cial intelligence and would eventually provide a new approach to solving cer­
tain problems outside the realm of neurobiology.
In the 1950s, when digital computers first became available, computer
scientists implemented models called perceptrons based on the work of
McCulloch and Pitts. An example of a problem solved by these early networks
was how to balance a broom standing upright on a moving cart by controlling
the motions of the cart back and forth. As the broom starts falling to the left,
the cart learns to move to the left to keep it upright. Although there were some
limited successes with perceptrons in the laboratory, the results were
disappointing.
