


















Figure 5.5 Based on the same data from Figures 5.2 and 5.3, this chart shows the
signed confidence (q-values) of the observed value based on the average and standard
deviation. This sign is positive when the observed value is too high, negative when it is too low.
136 Chapter 5

Time series are an example of cross-tabulation: looking at the values of two or
more variables at one time. For time series, the second variable is the time
something occurred.
Table 5.1 shows an example used later in this chapter. The cross-tabulation
shows the number of new customers from counties in southeastern New York
state by three channels: telemarketing, direct mail, and other. This table shows
both the raw counts and the relative frequencies.
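As a minimal sketch of how the relative frequencies follow from the raw counts, the snippet below divides each cell by the grand total. The counts are copied from Table 5.1; the channel order (telemarketing, direct mail, other) and the names counts and relative_frequencies are assumptions made for illustration.

```python
# Counts copied from Table 5.1. The column order
# (telemarketing, direct mail, other) is an assumption.
counts = {
    "BRONX": (3212, 413, 2936),
    "KINGS": (9773, 1393, 11025),
    "NASSAU": (3135, 1573, 10367),
    "NEW YORK": (7194, 2867, 28965),
    "QUEENS": (6266, 1380, 10954),
    "RICHMOND": (784, 277, 1772),
    "SUFFOLK": (2911, 1042, 7159),
    "WESTCHESTER": (2711, 1230, 8271),
}


def relative_frequencies(table):
    """Divide every cell by the grand total of all cells."""
    total = sum(sum(row) for row in table.values())
    return {
        county: tuple(cell / total for cell in row)
        for county, row in table.items()
    }


grand_total = sum(sum(row) for row in counts.values())
channel_totals = [sum(column) for column in zip(*counts.values())]
freqs = relative_frequencies(counts)

for county, row in freqs.items():
    # Format each relative frequency as a percentage with one decimal place.
    print(f"{county:<12}", " ".join(f"{value:6.1%}" for value in row))
```

Each relative frequency is a share of all starts, not of the county's own starts, which is why the frequencies in a single row do not add up to 100 percent.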
It is possible to visualize cross-tabulations as well. However, there is a lot of
data being presented, and some people do not follow complicated pictures.
Figure 5.6 shows a surface plot for the counts shown in the table. A surface plot
often looks a bit like hilly terrain. The counts are the height of the hills; the
counties go up one side and the channels make the third dimension. This surface
plot shows that the other channel is quite high for Manhattan (New York
county). Although not a problem in this case, such peaks can hide other hills
and valleys on the surface plot.

Looking at Continuous Variables
Statistics originated to understand the data collected by scientists, most of
which took the form of continuous measurements. In data mining, we
encounter continuous data less often, because there is a wealth of descriptive
data as well. This section talks about continuous data from the perspective of
descriptive statistics.

Table 5.1 Cross-tabulation of Starts by County and Channel

                          COUNTS                        FREQUENCIES
COUNTY            TM      DM   OTHER    TOTAL      TM     DM  OTHER   TOTAL
BRONX          3,212     413   2,936    6,561    2.5%   0.3%   2.3%    5.1%
KINGS          9,773   1,393  11,025   22,191    7.7%   1.1%   8.6%   17.4%
NASSAU         3,135   1,573  10,367   15,075    2.5%   1.2%   8.1%   11.8%
NEW YORK       7,194   2,867  28,965   39,026    5.6%   2.2%  22.7%   30.6%
QUEENS         6,266   1,380  10,954   18,600    4.9%   1.1%   8.6%   14.6%
RICHMOND         784     277   1,772    2,833    0.6%   0.2%   1.4%    2.2%
SUFFOLK        2,911   1,042   7,159   11,112    2.3%   0.8%   5.6%    8.7%
WESTCHESTER    2,711   1,230   8,271   12,212    2.1%   1.0%   6.5%    9.6%
TOTAL         35,986  10,175  81,449  127,610   28.2%   8.0%  63.8%  100.0%

TM = telemarketing; DM = direct mail.
The Lure of Statistics: Data Mining Using Familiar Tools 137










Figure 5.6 A surface plot provides a visual interface for cross-tabulated data.

Statistical Measures for Continuous Variables
The most basic statistical measures describe a set of data with just a single
value. The most commonly used statistic is the mean or average value (the sum
of all the values divided by the number of them). Some other important things
to look at are:
Range. The range is the difference between the smallest and largest
observation in the sample. The range is often looked at along with the
minimum and maximum values themselves.
Mean. This is what is called an average in everyday speech.
Median. The median value is the one which splits the observations into
two equally sized groups, one having observations smaller than the
median and another containing observations larger than the median.
Mode. This is the value that occurs most often.
The median can be used in some situations where it is impossible to calculate
the mean, such as when incomes are reported in ranges of $10,000
with a final category “over $100,000.” The number of observations is known
in each group, but not the actual values. In addition, the median is less affected
by a few observations that are out of line with the others. For instance, if Bill
Gates moves onto your block, the average net worth of the neighborhood will
dramatically increase. However, the median net worth may not change at all.
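The effect of one extreme observation can be sketched with Python's standard statistics module. The net worth figures below are hypothetical values invented for illustration, not real data.

```python
from statistics import mean, median, mode

# Hypothetical net worths (in dollars) for five households on a block.
neighbors = [150_000, 200_000, 200_000, 200_000, 300_000]

# The same block after one extremely wealthy household moves in
# (again a made-up figure).
with_newcomer = neighbors + [50_000_000_000]

mean_before, median_before = mean(neighbors), median(neighbors)
mean_after, median_after = mean(with_newcomer), median(with_newcomer)
most_common = mode(neighbors)

print(f"before: mean={mean_before:,.0f} median={median_before:,.0f}")
print(f"after:  mean={mean_after:,.0f} median={median_after:,.0f}")
print(f"mode before: {most_common:,}")
```

With these made-up numbers the mean jumps from about $210,000 to more than $8 billion, while the median stays at $200,000.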

In addition, various ways of characterizing the range are useful. The range
itself is defined by the minimum and maximum value. It is often worth looking
at percentile information, such as the 25th and 75th percentiles, to understand the
limits of the middle half of the values as well.
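A short sketch of these range measures, again using the standard statistics module. The order amounts are hypothetical, and the "inclusive" quartile method is one of several conventions for computing percentiles; other tools may give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical order amounts in dollars; not taken from the book's data.
order_amounts = [10, 25, 40, 55, 70, 90, 150, 400, 1000]

smallest, largest = min(order_amounts), max(order_amounts)
value_range = largest - smallest

# n=4 splits the data into quartiles, returning the
# 25th, 50th, and 75th percentile cut points.
p25, p50, p75 = quantiles(order_amounts, n=4, method="inclusive")

print(f"min={smallest} max={largest} range={value_range}")
print(f"middle half of the values lies between {p25} and {p75}")
```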
Figure 5.7 shows a chart where the range and average are displayed for order
amount by day. This chart uses a logarithmic (log) scale for the vertical axis,
because the minimum order is under $10 and the maximum over $1,000. In fact,
the minimum is consistently around $10, the average around $70, and the maximum
around $1,000. As with discrete variables, it is valuable to use a time
chart for continuous values to see when unexpected things are happening.

Variance and Standard Deviation
Variance is a measure of the dispersion of a sample or how closely the
observations cluster around the average. The range is not a good measure of
dispersion because it takes only two values into account: the extremes.
Removing one extreme can, sometimes, dramatically change the range. The
variance, on the other hand, takes every value into account. The difference
between a given observation and the mean of the sample is called its deviation.
The variance is defined as the average of the squares of the deviations.
Standard deviation, the square root of the variance, is the most frequently
used measure of dispersion. It is more convenient than variance because it is
expressed in the same units as the observations rather than in terms of those
units squared. This allows the standard deviation itself to be used as a unit of
measurement. The z-score, which we used earlier, is an observation's distance
from the mean measured in standard deviations. Using the normal distribution,
the z-score can be converted to a probability or confidence level.
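The definitions above can be sketched as follows. The observations are hypothetical, chosen so the arithmetic is easy to check by hand. The sketch uses the population form of the variance (dividing by the number of observations), which matches the definition given here; the sample form, which divides by one less than the number of observations, would give a slightly larger value.

```python
from statistics import NormalDist, fmean, pstdev, pvariance

# Hypothetical observations, chosen for easy hand checking.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

average = fmean(observations)

# Deviation: the difference between an observation and the mean.
deviations = [x - average for x in observations]

# Variance: the average of the squared deviations.
variance_by_hand = sum(d * d for d in deviations) / len(observations)
variance = pvariance(observations)

# Standard deviation: the square root of the variance.
std_dev = pstdev(observations)


def z_score(value):
    """Distance from the mean, measured in standard deviations."""
    return (value - average) / std_dev


z = z_score(9)

# Under a standard normal distribution, the probability of
# seeing a value at or below this z-score.
probability_below = NormalDist().cdf(z)

print(f"mean={average} variance={variance} std_dev={std_dev}")
print(f"z-score of 9 is {z}; P(Z <= z) = {probability_below:.4f}")
```

Here the mean is 5, the variance is 4, and the standard deviation is 2, so the observation 9 sits two standard deviations above the mean.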

[Chart: vertical axis Order Amount (log scale), marked at $1,000; horizontal axis January through July; labeled series Max Order and Min Order.]

Figure 5.7 A time chart can also be used for continuous values; this one shows the range
and average for order amounts each day.

A Couple More Statistical Ideas
Correlation is a measure of the extent to which a change in one variable is
related to a change in another. Correlation ranges from -1 to 1. A correlation of
0 means that the two variables have no linear relationship. A correlation of 1 means that as
the first variable changes, the second is guaranteed to change in the same
direction, though not necessarily by the same amount. Another measure of
correlation is the R² value, which is the correlation squared and goes from 0
(no relationship) to 1 (complete relationship). For instance, the radius and the
circumference of a circle are perfectly correlated, although the latter grows
faster than the former. A negative correlation means that the two variables
move in opposite directions. For example, altitude is negatively correlated to
air pressure.
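A minimal sketch of these ideas computes the usual (Pearson) correlation coefficient by hand for the circle example. The function name pearson and the list of radii are chosen for illustration, and the "falling" series is simply a made-up variable that moves in the opposite direction.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(spread_x * spread_y)


radii = [1.0, 2.0, 3.0, 4.0, 5.0]
circumferences = [2 * math.pi * r for r in radii]

# Circumference grows faster than radius, yet the two move in lockstep.
r_circle = pearson(radii, circumferences)
r_squared = r_circle ** 2

# A hypothetical variable that falls as the radius rises.
falling = [100.0 - c for c in circumferences]
r_opposite = pearson(radii, falling)

print(f"radius vs circumference: r={r_circle:.3f}, R^2={r_squared:.3f}")
print(f"radius vs falling series: r={r_opposite:.3f}")
```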
Regression is the process of using the value of one of a pair of correlated
variables in order to predict the value of the second. The most common form of
regression is linear regression, so called because it attempts to fit a straight line
through the observed X and Y pairs in a sample. Once the line has been
established, it can be used to predict a value for Y given any X and for X given any Y.
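As a sketch of fitting such a line, the snippet below uses ordinary least squares, one common way to fit a straight line and an assumption here, since the text does not name a fitting method. The X and Y pairs are hypothetical, and predicting X from Y is done simply by inverting the fitted line.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observed pairs, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)


def predict_y(x):
    """Predict Y from X using the fitted line."""
    return slope * x + intercept


def predict_x(y):
    """Predict X from Y by inverting the fitted line."""
    return (y - intercept) / slope


print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"predicted y at x=6: {predict_y(6):.2f}")
print(f"predicted x at y=12: {predict_x(12):.2f}")
```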

Measuring Response
This section looks at statistical ideas in the context of a marketing campaign.
The champion-challenger approach to marketing tries out different ideas
against the business as usual. For instance, assume that a company sends out
a million billing inserts each month to entice customers to do something. They
have settled on one approach to the bill inserts, which is the champion offer.
Another offer is a challenger to this offer. Their approach to comparing these is:
Send the champion offer to 900,000 customers.

Send the challenger offer to 100,000 customers.

Determine which is better.

The question is, how do we know when one offer is better than another? This
section introduces the idea of confidence to understand this in more detail.

Standard Error of a Proportion
The approach to answering this question uses the idea of a confidence interval.

