Step Seven: Build Models

The details of this step vary from technique to technique and are described in the chapters devoted to each data mining method. In general terms, this is the step where most of the work of creating a model occurs. In directed data mining, the training set is used to generate an explanation of the dependent or target variable in terms of the independent or input variables. This explanation may take the form of a neural network, a decision tree, a linkage graph, or some other representation of the relationship between the target and the other fields in the database. In undirected data mining, there is no target variable. The model finds relationships between records and expresses them as association rules or by assigning them to common clusters.

Building models is the one step of the data mining process that has been

truly automated by modern data mining software. For that reason, it takes up

relatively little of the time in a data mining project.

78 Chapter 3

Step Eight: Assess Models

This step determines whether or not the models are working. A model assessment should answer questions such as:

- How accurate is the model?
- How well does the model describe the observed data?
- How much confidence can be placed in the model's predictions?
- How comprehensible is the model?

Of course, the answers to these questions depend on the type of model that was built. Assessment here refers to the technical merits of the model, rather than the measurement phase of the virtuous cycle.

Assessing Descriptive Models

The rule, If (state = 'MA') then heating source is oil, seems more descriptive than the rule, If (area=339 OR area=351 OR area=413 OR area=508 OR area=617 OR area=774 OR area=781 OR area=857 OR area=978) then heating source is oil. Even if the two rules turn out to be equivalent, the first one seems more expressive.

Expressive power may seem purely subjective, but there is, in fact, a theoretical way to measure it, called the minimum description length or MDL. The minimum description length for a model is the number of bits it takes to encode both the rule and the list of all exceptions to the rule. The fewer bits required, the better the rule. Some data mining tools use MDL to decide which sets of rules to keep and which to weed out.
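The idea can be illustrated with a toy calculation. The encoding scheme below is a hypothetical illustration, not any tool's actual scheme: it charges a fixed number of bits per condition in a rule, plus enough bits to point at each exception among the records.

```python
import math

def description_length(num_conditions, num_exceptions, num_records,
                       bits_per_condition=16):
    """Toy MDL score: bits to encode the rule itself, plus bits to
    identify each exception record by its index. The per-condition
    cost of 16 bits is an arbitrary illustrative choice."""
    rule_bits = num_conditions * bits_per_condition
    index_bits = math.ceil(math.log2(num_records)) if num_records > 1 else 1
    return rule_bits + num_exceptions * index_bits

# One-condition rule (state = 'MA') with 40 exceptions in 10,000 records
simple = description_length(1, 40, 10_000)
# Nine-condition rule (the list of area codes) with the same 40 exceptions
verbose = description_length(9, 40, 10_000)
print(simple, verbose)  # the simpler rule costs fewer bits
```

With equally accurate rules, the shorter description wins; a rule could also buy a lower exception count at the price of more conditions, and MDL weighs that trade-off in a single number.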

Assessing Directed Models

Directed models are assessed on their accuracy on previously unseen data.

Different data mining tasks call for different ways of assessing performance of

the model as a whole and different ways of judging the likelihood that the

model yields accurate results for any particular record.

Any model assessment is dependent on context; the same model can look good according to one measure and bad according to another. In the academic field of machine learning, the source of many of the algorithms used for data mining, researchers have a goal of generating models that can be understood in their entirety. An easy-to-understand model is said to have good "mental fit." In the interest of obtaining the best mental fit, these researchers often prefer models that consist of a few simple rules to models that contain many such rules, even when the latter are more accurate. In a business setting, such

Data Mining Methodology and Best Practices 79

explicability may not be as important as performance, or it may be more important.

Model assessment can take place at the level of the whole model or at the level of individual predictions. Two models with the same overall accuracy may have quite different levels of variance among the individual predictions. A decision tree, for instance, has an overall classification error rate, but each branch and leaf of the tree has an error rate as well.

Assessing Classifiers and Predictors

For classification and prediction tasks, accuracy is measured in terms of the

error rate, the percentage of records classified incorrectly. The classification

error rate on the preclassified test set is used as an estimate of the expected error

rate when classifying new records. Of course, this procedure is only valid if the

test set is representative of the larger population.

Our recommended method of establishing the error rate for a model is to measure it on a test dataset taken from the same population as the training and validation sets, but disjoint from them. In the ideal case, such a test set would be from a more recent time period than the data in the model set; however, this is not often possible in practice.

A problem with error rate as an assessment tool is that some errors are worse than others. A familiar example comes from the medical world, where a false negative on a test for a serious disease causes the patient to go untreated, with possibly life-threatening consequences, whereas a false positive only leads to a second (possibly more expensive or more invasive) test. A confusion matrix or correct classification matrix, shown in Figure 3.11, can be used to sort out false positives from false negatives. Some data mining tools allow costs to be associated with each type of misclassification, so models can be built to minimize the cost rather than the misclassification rate.
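As a sketch of the idea in plain Python (no particular data mining tool assumed, and the costs are made-up numbers for illustration), the function below tallies a two-class confusion matrix and applies a different cost to each kind of error:

```python
def cost_weighted_error(actual, predicted,
                        cost_false_negative=10.0, cost_false_positive=1.0):
    """Tally a 2x2 confusion matrix and total misclassification cost.
    Labels are 1 (positive, e.g. 'has disease') and 0 (negative).
    The default costs are illustrative: a missed disease is assumed
    ten times as costly as an unnecessary follow-up test."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    cost = (counts["FN"] * cost_false_negative
            + counts["FP"] * cost_false_positive)
    return counts, cost

actual    = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0]
counts, cost = cost_weighted_error(actual, predicted)
print(counts, cost)
```

Here there is one false negative and one false positive; they contribute equally to the error rate but very differently to the cost, which is the point of cost-sensitive assessment.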

Assessing Estimators

For estimation tasks, accuracy is expressed in terms of the difference between the predicted score and the actual measured result. Both the accuracy of any one estimate and the accuracy of the model as a whole are of interest. A model may be quite accurate for some ranges of input values and quite inaccurate for others. Figure 3.12 shows a linear model that estimates total revenue based on a product's unit price. This simple model works reasonably well in one price range but goes badly wrong when the price reaches the level where the elasticity of demand for the product (the ratio of the percent change in quantity sold to the percent change in price) is greater than one. An elasticity greater than one means that any further price increase results in a decrease in revenue, because the increased revenue per unit is more than offset by the drop in the number of units sold.
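A quick numeric check of that claim, using made-up figures and the simplifying assumption that elasticity is constant over the price change:

```python
def revenue_after_price_change(price, quantity, pct_price_change, elasticity):
    """Apply a percentage price change and the quantity change implied
    by a constant elasticity of demand, then return the new revenue."""
    new_price = price * (1 + pct_price_change)
    pct_quantity_change = -elasticity * pct_price_change
    new_quantity = quantity * (1 + pct_quantity_change)
    return new_price * new_quantity

base = 10.0 * 1000  # $10 each, 1,000 units: $10,000 revenue
# Elasticity of 2 (> 1): a 10% price rise cuts quantity 20%, losing revenue
assert revenue_after_price_change(10.0, 1000, 0.10, 2.0) < base
# Elasticity of 0.5 (< 1): the same price rise still gains revenue
assert revenue_after_price_change(10.0, 1000, 0.10, 0.5) > base
```

At elasticity 2, the new revenue is $11 x 800 = $8,800, down from $10,000, exactly the effect the linear model in Figure 3.12 fails to capture.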


[Figure 3.11 appears here: a cross-tabulation of predicted class (Into: WClass) against actual class (From: WClass), shown as percent of row frequency.]

Figure 3.11 A confusion matrix cross-tabulates predicted outcomes with actual outcomes.

[Figure 3.12 appears here: estimated revenue and total revenue plotted against unit price.]

Figure 3.12 The accuracy of an estimator may vary considerably over the range of inputs.


The standard way of describing the accuracy of an estimation model is by

measuring how far off the estimates are on average. But, simply subtracting the

estimated value from the true value at each point and taking the mean results

in a meaningless number. To see why, consider the estimates in Table 3.1.

The average difference between the true values and the estimates is zero;

positive differences and negative differences have canceled each other out.

The usual way of solving this problem is to sum the squares of the differences

rather than the differences themselves. The average of the squared differences

is called the variance. The estimates in this table have a variance of 10.

((-5)^2 + 2^2 + (-2)^2 + 1^2 + 4^2)/5 = (25 + 4 + 4 + 1 + 16)/5 = 50/5 = 10

The smaller the variance, the more accurate the estimate. A drawback to variance as a measure is that it is not expressed in the same units as the estimates themselves. For estimated prices in dollars, it is more useful to know how far off the estimates are in dollars rather than square dollars! For that reason, it is usual to take the square root of the variance to get a measure called the standard deviation. The standard deviation of these estimates is the square root of 10, or about 3.16. For our purposes, all you need to know about the standard deviation is that it is a measure of how widely the estimated values vary from the true values.
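Using the errors from Table 3.1, the calculation looks like this in plain Python (note that "variance" here follows the book's usage: the mean of the squared errors):

```python
import math

# Errors (true value minus estimated value) from Table 3.1
errors = [-5, 2, -2, 1, 4]

mean_error = sum(errors) / len(errors)              # 0.0: errors cancel out
variance = sum(e * e for e in errors) / len(errors) # 10.0: squares cannot cancel
std_dev = math.sqrt(variance)                       # about 3.16, back in dollars
print(mean_error, variance, std_dev)
```

The mean error of zero shows why the raw average is meaningless, while the standard deviation of about 3.16 gives a typical miss in the same units as the estimates.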

Comparing Models Using Lift

Directed models, whether created using neural networks, decision trees,

genetic algorithms, or Ouija boards, are all created to accomplish some task.

Why not judge them on their ability to classify, estimate, and predict? The

most common way to compare the performance of classification models is to

use a ratio called lift. This measure can be adapted to compare models

designed for other tasks as well. What lift actually measures is the change in

concentration of a particular class when the model is used to select a group

from the general population.

lift = P(class_t | sample) / P(class_t | population)

Table 3.1 Countervailing Errors

TRUE VALUE    ESTIMATED VALUE    ERROR
127           132                -5
78            76                 2
120           122                -2
130           129                1
95            91                 4


An example helps to explain this. Suppose that we are building a model to predict who is likely to respond to a direct mail solicitation. As usual, we build the model using a preclassified training dataset and, if necessary, a preclassified validation set as well. Now we are ready to use the test set to calculate the model's lift.

The classifier scores the records in the test set as either “predicted to respond”

or “not predicted to respond.” Of course, it is not correct every time, but if the

model is any good at all, the group of records marked “predicted to respond”

contains a higher proportion of actual responders than the test set as a whole.

Consider these records. If the test set contains 5 percent actual responders and

the sample contains 50 percent actual responders, the model provides a lift of 10

(50 divided by 5).
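The arithmetic of that example can be written out directly; the counts below are made-up numbers chosen to match the text (5 percent responders overall, 50 percent among the model's picks):

```python
def lift(responders_in_sample, sample_size,
         responders_in_population, population_size):
    """Lift: concentration of responders in the selected sample
    divided by their concentration in the whole population."""
    p_sample = responders_in_sample / sample_size
    p_population = responders_in_population / population_size
    return p_sample / p_population

# Test set of 10,000 with 500 responders (5%); the model flags 200
# records, of which 100 actually respond (50%) -> a lift of 10
print(lift(100, 200, 500, 10_000))
```

As the next paragraphs discuss, this number only means something alongside the sample size: a tiny, perfectly chosen list can post a spectacular lift while reaching almost nobody.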

Is the model that produces the highest lift necessarily the best model? Surely a list of people half of whom will respond is preferable to a list where only a quarter will respond, right? Not necessarily; not if the first list has only 10 names on it!

The point is that lift is a function of sample size. If the classifier only picks out 10 likely respondents, and it is right 100 percent of the time, it will achieve a lift of 20, the highest lift possible when the population contains 5 percent responders. As the confidence level required to classify someone as likely to respond is relaxed, the mailing list gets longer, and the lift decreases.

Charts like the one in Figure 3.13 will become very familiar as you work

with data mining tools. It is created by sorting all the prospects according to

their likelihood of responding as predicted by the model. As the size of the

mailing list increases, we reach farther and farther down the list. The X-axis

shows the percentage of the population getting our mailing. The Y-axis shows

the percentage of all responders we reach.

If no model were used, mailing to 10 percent of the population would reach

10 percent of the responders, mailing to 50 percent of the population would

reach 50 percent of the responders, and mailing to everyone would reach all

the responders. This mass-mailing approach is illustrated by the line slanting

upwards. The other curve shows what happens if the model is used to select

recipients for the mailing. The model finds 20 percent of the responders by

mailing to only 10 percent of the population. Soliciting half the population

reaches over 70 percent of the responders.

Charts like the one in Figure 3.13 are often referred to as lift charts, although what is really being graphed is cumulative response or concentration. Figure 3.14 shows the actual lift chart corresponding to the response chart in Figure 3.13. The chart shows clearly that lift decreases as the size of the target list increases.
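The curves in these charts can be computed with a short routine like the one below. It is a sketch under simple assumptions: each prospect has a model score and an actual-response flag, and the data here are invented so that the model's picks are visibly better than random.

```python
def cumulative_gains(scores, responded, percentiles=(10, 50, 100)):
    """Sort prospects by descending model score, then report what
    percentage of all responders falls in the top X% of the list.
    This is the Y-axis of a cumulative response chart."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_responders = sum(responded)
    results = {}
    for pct in percentiles:
        cutoff = round(len(order) * pct / 100)
        reached = sum(responded[i] for i in order[:cutoff])
        results[pct] = 100.0 * reached / total_responders
    return results

# 20 made-up prospects, 5 of whom responded; the model tends to
# score the actual responders highly
scores    = [0.9, 0.8, 0.2, 0.7, 0.1, 0.6, 0.3, 0.4, 0.5, 0.15,
             0.05, 0.25, 0.35, 0.45, 0.55, 0.65, 0.12, 0.18, 0.22, 0.28]
responded = [1,   1,   0,   0,   0,   1,   0,   0,   0,   0,
             0,   0,   0,   0,   1,   1,   0,   0,   0,   0]
print(cumulative_gains(scores, responded))
```

Dividing each cumulative-response figure by the percentage of the population mailed gives the lift at that depth, which is why the lift curve falls toward 1 as the list grows to include everyone.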