

square calculation for this cell is the same as the chi-square for the cell in the
original calculation, although the other values do not match anything. This
makes it unnecessary to do additional calculations.
This means that an estimate of the effect of each combination of variables
can be obtained using the chi-square value in the cell, with one degree of
freedom. The result is a table of p-values measuring how likely each cell's
deviation is to be caused by chance, as shown in Table 5.10.
However, there is a second correction that needs to be made because there
are many comparisons taking place at the same time. Bonferroni's adjustment
takes care of this by multiplying each p-value by the number of comparisons,
which is the number of cells in the table. For final presentation purposes,
convert the p-values to their opposite, the confidence, and multiply by the sign
of the deviation to get a signed confidence. Figure 5.10 illustrates the result.
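The two steps just described can be sketched in Python. This is a minimal sketch, not code from any particular package, and the function names are illustrative; it relies on the fact that for one degree of freedom the chi-square p-value reduces to erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math

def chi2_pvalue_1df(chi2):
    """P-value for a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def signed_confidence(observed, expected, num_cells):
    """Bonferroni-adjusted signed confidence for one cell of the table."""
    chi2 = (observed - expected) ** 2 / expected
    p = chi2_pvalue_1df(chi2)
    p_adjusted = min(1.0, p * num_cells)    # Bonferroni: scale by number of comparisons
    sign = 1.0 if observed >= expected else -1.0
    return sign * (1.0 - p_adjusted)        # confidence is 1 - p, signed by the deviation

# A cell far above its expected count, in a table with 24 cells,
# gets a signed confidence near +100%; one far below gets a value near -100%.
high = signed_confidence(150, 100, 24)
low = signed_confidence(60, 100, 24)
```

A cell whose observed count equals its expected count gets a signed confidence of exactly zero, since its chi-square value is zero and its adjusted p-value is one.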

Table 5.10 Estimated P-Value for Each Combination of County and Channel, without
Correcting for Number of Comparisons


COUNTY         P-VALUES BY CHANNEL
BRONX           0.00%    0.00%    0.00%
KINGS           0.00%    0.00%    0.00%
NASSAU          0.00%    0.00%    0.00%
NEW YORK        0.00%    0.00%    0.00%
QUEENS          0.00%    0.74%    0.00%
RICHMOND       59.79%    0.07%   39.45%
SUFFOLK         0.01%    0.00%   42.91%
WESTCHESTER     0.00%    0.00%    0.00%
158 Chapter 5








Figure 5.10 This chart shows the signed confidence values for each county and channel
combination; the preponderance of values near 100% and -100% indicates that the
observed differences are statistically significant.

The result is interesting. First, almost all the values are near 100 percent or
-100 percent, meaning that there are statistically significant differences among
the counties. In fact, telemarketing (the diamond) and direct mail (the square)
are always at opposite ends; there is a direct inverse relationship between the
two. Direct mail is high and telemarketing low in three counties: Manhattan,
Nassau, and Suffolk. There are many wealthy areas in these counties, suggesting
that wealthy customers are more likely to respond to direct mail than to
telemarketing. Of course, this could also mean that direct mail campaigns are
directed to these areas and telemarketing to other areas, so the geography was
determined by the business operations. To determine which of these possibilities
is correct, we would need to know who was contacted as well as who responded.

Data Mining and Statistics
Many of the data mining techniques discussed in the next eight chapters
were invented by statisticians or have now been integrated into statistical
software; they are extensions of standard statistics. Although data miners and
statisticians use similar techniques to solve similar problems, the data mining
approach differs from the standard statistical approach in several areas:
- Data miners tend to ignore measurement error in raw data.
- Data miners assume that there is more than enough data and processing power.
- Data mining assumes dependency on time everywhere.
- It can be hard to design experiments in the business world.
- Data is truncated and censored.

These are differences of approach, rather than opposites. As such, they shed
some light on how the business problems addressed by data miners differ
from the scientific problems that spurred the development of statistics.

No Measurement Error in Basic Data
Statistics originally derived from measuring scientific quantities, such as the
width of a skull or the brightness of a star. These measurements are
quantitative, and the precise measured value depends on factors such as the
type of measuring device and the ambient temperature. In particular, two people
taking the same measurement at the same time are going to produce slightly
different results. The results might differ by 5 percent or 0.05 percent, but
there is a difference. Traditionally, statistics looks at observed values as falling into a
confidence interval.
On the other hand, the amount of money a customer paid last January is
quite well understood, down to the last penny. The definition of customer
may be a little bit fuzzy; the definition of January may be fuzzy (consider
5-4-4 accounting cycles). However, the amount of the payment is precise. There
is no measurement error.
There are sources of error in business data. Of particular concern is
operational error, which can cause systematic bias in what is being collected.
For instance, clock skew may mean that two events that seem to happen in one
sequence actually happened in the other. A database record may have a Tuesday
update date when it was really updated on Monday, because the updating process
runs just after midnight. Such forms of bias are systematic, and potentially
represent spurious patterns that might be picked up by data mining algorithms.
One major difference between business data and scientific data is that the
latter has many continuous values and the former has many discrete values.
Even monetary amounts are discrete: two values can differ only by multiples
of pennies (or some similar amount), even though the values might be
represented by real numbers.
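A small Python illustration of why this matters in practice: storing amounts as floating-point dollars introduces representation error that integer pennies avoid, which is why the exact, discrete nature of payment data is worth preserving.

```python
# Three ten-cent payments stored as floating-point dollars do not sum
# to exactly 0.30; binary floating point cannot represent 0.1 exactly.
payments_dollars = [0.10, 0.10, 0.10]
float_total = sum(payments_dollars)     # 0.30000000000000004

# The same payments stored as integer pennies sum exactly.
payments_cents = [10, 10, 10]
cent_total = sum(payments_cents)        # exactly 30
```

This is a property of floating-point representation, not of any particular database or language; storing discrete business amounts as integers (or decimals) keeps them exact.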

There Is a Lot of Data
Traditionally, statistics has been applied to smallish data sets (at most a few
thousand rows) with few columns (fewer than a dozen). The goal has been to
squeeze as much information as possible out of the data. This is still important
in problems where collecting data is expensive or arduous, such as market
research, crash testing cars, or tests of the chemical composition of Martian soil.
Business data, on the other hand, is very voluminous. The challenge is
understanding anything about what is happening, rather than every possible
thing. Fortunately, there is also enough computing power available to handle
the large volumes of data.
Sampling theory is an important part of statistics. This area explains how
results on a subset of data (a sample) relate to the whole. This is very important
when planning a poll, because it is not possible to ask everyone a question;
rather, pollsters ask a very small sample and derive overall opinion from it.
However, sampling is much less important when all the data is available.
Usually, it is best to use all the data available, rather than a small subset of it.
There are a few cases when this is not necessarily true. There might simply
be too much data: instead of building models on tens of millions of customers,
build models on hundreds of thousands, at least to learn how to build better
models. Another reason is to get an unrepresentative sample. Such a sample, for
instance, might have an equal number of churners and nonchurners, although
the original data had different proportions. However, it is generally better to
use more data rather than sample down and use less, unless there is a good
reason for sampling down.
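One way to build such an unrepresentative (balanced) sample is to sample each class separately and combine the results. The following is a minimal sketch in Python; the records and the boolean churned field are hypothetical stand-ins for whatever flag marks the outcome in real data.

```python
import random

def balanced_sample(records, size, key=lambda r: r["churned"]):
    """Draw a sample with equal numbers of churners and nonchurners,
    regardless of their proportions in the original data."""
    positives = [r for r in records if key(r)]
    negatives = [r for r in records if not key(r)]
    per_class = size // 2
    # Sample each class separately, then combine.
    return random.sample(positives, per_class) + random.sample(negatives, per_class)

# 10 percent churners in the raw data, 50 percent in the balanced sample.
raw = [{"churned": i < 100} for i in range(1000)]
sample = balanced_sample(raw, 200)
```

Models built on such a sample must have their predicted probabilities adjusted back to the original proportions before being applied to the full customer base.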

Time Dependency Pops Up Everywhere
Almost all data used in data mining has a time dependency associated with it.
Customers' reactions to marketing efforts change over time. Prospects'
reactions to competitive offers change over time. Comparing results from a
marketing campaign one year to the previous year is rarely going to yield
exactly the same result. We do not expect the same results.
On the other hand, we do expect scientific experiments to yield similar results
regardless of when the experiment takes place. The laws of science are
considered immutable; they do not change over time. By contrast, the business
climate changes daily. Statistics often considers repeated observations to be
independent observations; that is, one observation does not resemble another.
Data mining, on the other hand, must often consider the time component of the data.

Experimentation Is Hard
Data mining has to work within the constraints of existing business practices.
This can make it difficult to set up experiments, for several reasons:

- Businesses may not be willing to invest in efforts that reduce short-term gain for long-term learning.
- Business processes may interfere with well-designed experimental methodologies.
- Factors that may affect the outcome of the experiment may not be obvious.
- Timing plays a critical role and may render results useless.

Of these, the first two are the most difficult. The first simply says that tests
do not get done, or they are done so poorly that the results are useless. The
second poses the problem that a seemingly well-designed experiment may not
be executed correctly. There are always hitches when planning a test; sometimes
these hitches make it impossible to read the results.

Data Is Censored and Truncated
The data used for data mining is often incomplete, in one of two special ways.
Censored values are incomplete because whatever is being measured is not
complete. One example is customer tenure. For active customers, we know only
that the eventual tenure is greater than the current tenure; we do not know
which customers are going to stop tomorrow and which are going to stop 10
years from now. The actual tenure is greater than the observed value and cannot
be known until the customer actually stops at some unknown point in the future.
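The distinction can be made concrete with a small sketch in Python (the field layout is invented for illustration): an active customer's tenure is censored at the observation date, so the record should carry a flag saying the true value is only known to be at least that large.

```python
from datetime import date

def observed_tenure(start, stop, as_of):
    """Return (tenure_in_days, is_censored) for one customer.

    For an active customer (stop is None) the true tenure is unknown;
    we only know it is at least the tenure observed as of `as_of`."""
    end = stop if stop is not None else as_of
    return (end - start).days, stop is None

# A stopped customer has a complete, uncensored tenure.
stopped = observed_tenure(date(2020, 1, 1), date(2020, 12, 31), date(2024, 1, 1))

# An active customer's tenure is censored at the observation date.
active = observed_tenure(date(2023, 1, 1), None, date(2023, 4, 11))
```

Keeping the censoring flag alongside the tenure lets downstream analysis treat the two kinds of observations differently rather than mistaking a censored tenure for a complete one.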




Figure 5.11 A time series of product sales and inventory (two series, inventory
units and lost sales, plotted over periods 0 through 40) illustrates the problem of

