Step Seven: Build Models
The details of this step vary from technique to technique and are described in the chapters devoted to each data mining method. In general terms, this is the step where most of the work of creating a model occurs. In directed data mining, the training set is used to generate an explanation of the dependent or target variable in terms of the independent or input variables. This explanation may take the form of a neural network, a decision tree, a linkage graph, or some other representation of the relationship between the target and the other fields in the database. In undirected data mining, there is no target variable. The model finds relationships between records and expresses them as association rules or by assigning them to common clusters.
Building models is the one step of the data mining process that has been truly automated by modern data mining software. For that reason, it takes up relatively little of the time in a data mining project.
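As a concrete illustration (not from the text), the sketch below shows what this step often looks like in practice when scikit-learn is available: a directed model fit against a target column, and an undirected model that simply groups similar records. The file name, column names, and parameter values are placeholders.

```python
# A minimal sketch of the model-building step, assuming scikit-learn and a
# prepared training file; names and parameters below are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

train = pd.read_csv("training_set.csv")        # hypothetical training set
inputs = train.drop(columns=["responded"])     # independent (input) variables
target = train["responded"]                    # dependent (target) variable

# Directed data mining: explain the target in terms of the inputs.
directed_model = DecisionTreeClassifier(max_depth=5).fit(inputs, target)

# Undirected data mining: no target; assign records to common clusters.
undirected_model = KMeans(n_clusters=4, random_state=0, n_init=10).fit(inputs)
cluster_labels = undirected_model.labels_
```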


Step Eight: Assess Models

This step determines whether or not the models are working. A model assessment should answer questions such as:
- How accurate is the model?
- How well does the model describe the observed data?
- How much confidence can be placed in the model's predictions?
- How comprehensible is the model?

Of course, the answer to these questions depends on the type of model that
was built. Assessment here refers to the technical merits of the model, rather
than the measurement phase of the virtuous cycle.


Assessing Descriptive Models
The rule, If (state='MA') then heating source is oil, seems more descriptive than the rule, If (area=339 OR area=351 OR area=413 OR area=508 OR area=617 OR area=774 OR area=781 OR area=857 OR area=978) then heating source is oil. Even if the two rules turn out to be equivalent, the first one seems more expressive.
Expressive power may seem purely subjective, but there is, in fact, a theoretical way to measure it, called the minimum description length or MDL. The minimum description length for a model is the number of bits it takes to encode both the rule and the list of all exceptions to the rule. The fewer bits required, the better the rule. Some data mining tools use MDL to decide which sets of rules to keep and which to weed out.
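The toy calculation below (an illustration of the idea, not how any particular tool implements MDL) makes this concrete: score each rule by the bits needed to state its conditions plus the bits needed to point at every exception. The bit costs and the exception count are invented for the example.

```python
# Toy MDL-style score: bits to encode the rule plus bits to flag exceptions.
# The constants here are assumptions chosen only to illustrate the idea.
import math

def description_length(num_conditions, num_exceptions, num_records,
                        bits_per_condition=16):
    rule_bits = num_conditions * bits_per_condition
    # Each exception is identified by its record number: log2(N) bits apiece.
    exception_bits = num_exceptions * math.ceil(math.log2(num_records))
    return rule_bits + exception_bits

N = 100_000  # hypothetical number of households

# "If state='MA' then heating source is oil": one condition.
single_condition = description_length(1, num_exceptions=4_000, num_records=N)

# The equivalent nine-area-code rule: nine conditions, same exceptions.
nine_conditions = description_length(9, num_exceptions=4_000, num_records=N)

print(single_condition, nine_conditions)  # the shorter rule needs fewer bits
```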


Assessing Directed Models
Directed models are assessed on their accuracy on previously unseen data.
Different data mining tasks call for different ways of assessing performance of
the model as a whole and different ways of judging the likelihood that the
model yields accurate results for any particular record.
Any model assessment is dependent on context; the same model can look good according to one measure and bad according to another. In the academic field of machine learning (the source of many of the algorithms used for data mining), researchers have a goal of generating models that can be understood in their entirety. An easy-to-understand model is said to have good “mental fit.” In the interest of obtaining the best mental fit, these researchers often prefer models that consist of a few simple rules to models that contain many such rules, even when the latter are more accurate. In a business setting, such explicability may not be as important as performance, or it may be more important.
Model assessment can take place at the level of the whole model or at the
level of individual predictions. Two models with the same overall accuracy
may have quite different levels of variance among the individual predictions.
A decision tree, for instance, has an overall classification error rate, but each branch and leaf of the tree has an error rate as well.

Assessing Classifiers and Predictors
For classification and prediction tasks, accuracy is measured in terms of the
error rate, the percentage of records classified incorrectly. The classification
error rate on the preclassified test set is used as an estimate of the expected error
rate when classifying new records. Of course, this procedure is only valid if the
test set is representative of the larger population.
Our recommended method of establishing the error rate for a model is to measure it on a test dataset taken from the same population as the training and validation sets, but disjoint from them. In the ideal case, such a test set would be from a more recent time period than the data in the model set; however, this is not often possible in practice.
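For instance (a sketch under the assumption that the model set sits in a single file and that scikit-learn is available), a 60/20/20 random partition gives disjoint training, validation, and test sets; splitting by date instead would be closer to the ideal of a more recent test period.

```python
# A sketch of carving disjoint training, validation, and test sets from one
# model set; the file name and split proportions are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

model_set = pd.read_csv("model_set.csv")
train, rest = train_test_split(model_set, test_size=0.4, random_state=0)
validation, test = train_test_split(rest, test_size=0.5, random_state=0)
# train: 60%, validation: 20%, test: 20% -- all disjoint from one another.
```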
A problem with error rate as an assessment tool is that some errors are
worse than others. A familiar example comes from the medical world where a
false negative on a test for a serious disease causes the patient to go untreated
with possibly life-threatening consequences whereas a false positive only
leads to a second (possibly more expensive or more invasive) test. A confusion
matrix or correct classification matrix, shown in Figure 3.11, can be used to sort
out false positives from false negatives. Some data mining tools allow costs to
be associated with each type of misclassification so models can be built to minimize the cost rather than the misclassification rate.
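The sketch below (not tied to any particular data mining tool) tallies the four cells of a confusion matrix by hand and then applies unequal costs, with a false negative assumed to be far more expensive than a false positive; the labels and cost values are invented.

```python
# Confusion-matrix counts and a cost-weighted error; data and costs are invented.
import numpy as np

actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])  # 1 = has the disease
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

true_pos  = np.sum((predicted == 1) & (actual == 1))
false_pos = np.sum((predicted == 1) & (actual == 0))
false_neg = np.sum((predicted == 0) & (actual == 1))
true_neg  = np.sum((predicted == 0) & (actual == 0))

error_rate = (false_pos + false_neg) / len(actual)

# Assumed costs: an untreated disease (false negative) costs far more than an
# unnecessary follow-up test (false positive).
total_cost = false_neg * 100 + false_pos * 5
print(f"error rate {error_rate:.2f}, total misclassification cost {total_cost}")
```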

Assessing Estimators
For estimation tasks, accuracy is expressed in terms of the difference between
the predicted score and the actual measured result. Both the accuracy of any
one estimate and the accuracy of the model as a whole are of interest. A model
may be quite accurate for some ranges of input values and quite inaccurate for
others. Figure 3.12 shows a linear model that estimates total revenue based on a product's unit price. This simple model works reasonably well in one price range but goes badly wrong when the price reaches the level where the elasticity of demand for the product (the ratio of the percent change in quantity sold to the percent change in price) is greater than one. An elasticity greater than one means that any further price increase results in a decrease in revenue because the increased revenue per unit is more than offset by the drop in the number of units sold.
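A small numeric sketch (not from the text) of that last point: with a constant-elasticity demand curve q = c * p^(-e), revenue p * q grows with price when e < 1, stays flat at e = 1, and falls once e > 1. The demand constant and the prices are arbitrary.

```python
# Revenue under a constant-elasticity demand curve; constants are illustrative.
def revenue(price, elasticity, demand_constant=1_000.0):
    quantity = demand_constant * price ** (-elasticity)
    return price * quantity

for e in (0.5, 1.0, 2.0):
    r_low, r_high = revenue(10.0, e), revenue(12.0, e)
    trend = "rises" if r_high > r_low else "falls" if r_high < r_low else "is flat"
    print(f"elasticity {e}: revenue {trend} as price goes from 10 to 12")
```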


Figure 3.11 A confusion matrix cross-tabulates predicted outcomes with actual outcomes.
Figure 3.12 The accuracy of an estimator may vary considerably over the range of inputs.


The standard way of describing the accuracy of an estimation model is by
measuring how far off the estimates are on average. But, simply subtracting the
estimated value from the true value at each point and taking the mean results
in a meaningless number. To see why, consider the estimates in Table 3.1.
The average difference between the true values and the estimates is zero;
positive differences and negative differences have canceled each other out.
The usual way of solving this problem is to sum the squares of the differences
rather than the differences themselves. The average of the squared differences
is called the variance. The estimates in this table have a variance of 10.

((-5)^2 + 2^2 + (-2)^2 + 1^2 + 4^2) / 5 = (25 + 4 + 4 + 1 + 16) / 5 = 50 / 5 = 10

The smaller the variance, the more accurate the estimate. A drawback to variance as a measure is that it is not expressed in the same units as the estimates themselves. For estimated prices in dollars, it is more useful to know how far off the estimates are in dollars rather than square dollars! For that reason, it is usual to take the square root of the variance to get a measure called the standard deviation. The standard deviation of these estimates is the square root of 10 or about 3.16. For our purposes, all you need to know about the standard deviation is that it is a measure of how widely the estimated values vary from the true values.
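A quick check of this arithmetic, using the errors from Table 3.1:

```python
# Mean error, variance, and standard deviation of the Table 3.1 errors.
errors = [-5, 2, -2, 1, 4]

mean_error = sum(errors) / len(errors)                 # 0.0 -- errors cancel
variance = sum(e ** 2 for e in errors) / len(errors)   # 10.0
std_dev = variance ** 0.5                              # about 3.16

print(mean_error, variance, std_dev)
```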


Comparing Models Using Lift
Directed models, whether created using neural networks, decision trees,
genetic algorithms, or Ouija boards, are all created to accomplish some task.
Why not judge them on their ability to classify, estimate, and predict? The
most common way to compare the performance of classification models is to
use a ratio called lift. This measure can be adapted to compare models
designed for other tasks as well. What lift actually measures is the change in
concentration of a particular class when the model is used to select a group
from the general population.

lift = P(class_t | sample) / P(class_t | population)




Table 3.1 Countervailing Errors

TRUE VALUE    ESTIMATED VALUE    ERROR
127           132                -5
78            76                  2
120           122                -2
130           129                 1
95            91                  4


An example helps to explain this. Suppose that we are building a model to predict who is likely to respond to a direct mail solicitation. As usual, we build the model using a preclassified training dataset and, if necessary, a preclassified validation set as well. Now we are ready to use the test set to calculate the model's lift.
The classifier scores the records in the test set as either “predicted to respond” or “not predicted to respond.” Of course, it is not correct every time, but if the model is any good at all, the group of records marked “predicted to respond” contains a higher proportion of actual responders than the test set as a whole. Consider this flagged group as the sample. If the test set contains 5 percent actual responders and the sample contains 50 percent actual responders, the model provides a lift of 10 (50 divided by 5).
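In code, the same calculation might look like the following sketch; the counts are chosen so the arithmetic mirrors the 5 percent / 50 percent example.

```python
# Lift = concentration of responders in the flagged sample divided by their
# concentration in the whole test set; counts below are illustrative.
def lift(responders_in_sample, sample_size,
         responders_in_population, population_size):
    p_sample = responders_in_sample / sample_size
    p_population = responders_in_population / population_size
    return p_sample / p_population

print(lift(responders_in_sample=500, sample_size=1_000,
           responders_in_population=5_000, population_size=100_000))  # 10.0
```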
Is the model that produces the highest lift necessarily the best model? Surely a list of people half of whom will respond is preferable to a list where only a quarter will respond, right? Not necessarily; not if the first list has only 10 names on it!
The point is that lift is a function of sample size. If the classifier only picks out 10 likely respondents, and it is right 100 percent of the time, it will achieve a lift of 20, the highest lift possible when the population contains 5 percent responders. As the confidence level required to classify someone as likely to respond is relaxed, the mailing list gets longer, and the lift decreases.
Charts like the one in Figure 3.13 will become very familiar as you work with data mining tools. Such a chart is created by sorting all the prospects according to their likelihood of responding as predicted by the model. As the size of the mailing list increases, we reach farther and farther down the list. The X-axis shows the percentage of the population getting our mailing. The Y-axis shows the percentage of all responders we reach.
If no model were used, mailing to 10 percent of the population would reach
10 percent of the responders, mailing to 50 percent of the population would
reach 50 percent of the responders, and mailing to everyone would reach all
the responders. This mass-mailing approach is illustrated by the line slanting
upwards. The other curve shows what happens if the model is used to select
recipients for the mailing. The model finds 20 percent of the responders by
mailing to only 10 percent of the population. Soliciting half the population
reaches over 70 percent of the responders.
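A sketch of how such a cumulative response (gains) curve can be computed, assuming per-record model scores and actual outcomes are available; the synthetic data here only stands in for a real test set.

```python
# Build the points of a cumulative response chart: sort prospects by score,
# then track the share of all responders reached at each mailing depth.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(10_000)                 # model's predicted likelihoods
actual = rng.random(10_000) < scores * 0.1  # synthetic responses, roughly 5 percent

order = np.argsort(-scores)                 # best prospects first
cum_responders = np.cumsum(actual[order])

pct_mailed = np.arange(1, len(scores) + 1) / len(scores)
pct_responders_reached = cum_responders / actual.sum()

# Share of responders reached by mailing to the top 10 percent of the list:
print(pct_responders_reached[int(0.1 * len(scores)) - 1])
```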
Charts like the one in Figure 3.13 are often referred to as lift charts, although what is really being graphed is cumulative response or concentration. Figure 3.14 shows the actual lift chart corresponding to the response chart in Figure 3.13. The chart shows clearly that lift decreases as the size of the target list increases.
