Figure 5.5 Based on the same data from Figures 5.2 and 5.3, this chart shows the signed confidence (q-values) of the observed value based on the average and standard deviation. This sign is positive when the observed value is too high, negative when it is too low.

136 Chapter 5

Cross-Tabulations

Time series are an example of cross-tabulation: looking at the values of two or

more variables at one time. For time series, the second variable is the time

something occurred.

Table 5.1 shows an example used later in this chapter. The cross-tabulation

shows the number of new customers from counties in southeastern New York

state by three channels: telemarketing, direct mail, and other. This table shows

both the raw counts and the relative frequencies.
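
As a minimal sketch of how the relative frequencies follow from the raw counts, the Python below divides each cell by the grand total. The counts are copied from Table 5.1; the dictionary layout and the function name relative_frequencies are our own choices for illustration.

```python
# Counts copied from Table 5.1 (new customers by county and channel).
counts = {
    "BRONX":       {"TM": 3212, "DM": 413,  "OTHER": 2936},
    "KINGS":       {"TM": 9773, "DM": 1393, "OTHER": 11025},
    "NASSAU":      {"TM": 3135, "DM": 1573, "OTHER": 10367},
    "NEW YORK":    {"TM": 7194, "DM": 2867, "OTHER": 28965},
    "QUEENS":      {"TM": 6266, "DM": 1380, "OTHER": 10954},
    "RICHMOND":    {"TM": 784,  "DM": 277,  "OTHER": 1772},
    "SUFFOLK":     {"TM": 2911, "DM": 1042, "OTHER": 7159},
    "WESTCHESTER": {"TM": 2711, "DM": 1230, "OTHER": 8271},
}


def relative_frequencies(table):
    """Divide every cell by the grand total, so all cells together sum to 1."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {
        county: {channel: n / grand_total for channel, n in row.items()}
        for county, row in table.items()
    }


grand_total = sum(sum(row.values()) for row in counts.values())
freqs = relative_frequencies(counts)

print(grand_total)                            # 127610
print(f"{freqs['NEW YORK']['OTHER']:.1%}")    # 22.7%
```

Dividing by the grand total, rather than by row or column totals, is what makes the frequency half of the table add up to 100 percent overall.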

It is possible to visualize cross-tabulations as well. However, there is a lot of

data being presented, and some people do not follow complicated pictures.

Figure 5.6 shows a surface plot for the counts shown in the table. A surface plot

often looks a bit like hilly terrain. The counts are the height of the hills; the

counties go up one side and the channels make the third dimension. This surface

plot shows that the other channel is quite high for Manhattan (New York

county). Although not a problem in this case, such peaks can hide other hills

and valleys on the surface plot.

Looking at Continuous Variables

Statistics originated to understand the data collected by scientists, most of

which took the form of continuous measurements. In data mining, we

encounter continuous data less often, because there is a wealth of descriptive

data as well. This section talks about continuous data from the perspective of

descriptive statistics.

Table 5.1 Cross-tabulation of Starts by County and Channel

                                       COUNTS                   FREQUENCIES
COUNTY            TM      DM   OTHER    TOTAL      TM     DM  OTHER   TOTAL
BRONX          3,212     413   2,936    6,561    2.5%   0.3%   2.3%    5.1%
KINGS          9,773   1,393  11,025   22,191    7.7%   1.1%   8.6%   17.4%
NASSAU         3,135   1,573  10,367   15,075    2.5%   1.2%   8.1%   11.8%
NEW YORK       7,194   2,867  28,965   39,026    5.6%   2.2%  22.7%   30.6%
QUEENS         6,266   1,380  10,954   18,600    4.9%   1.1%   8.6%   14.6%
RICHMOND         784     277   1,772    2,833    0.6%   0.2%   1.4%    2.2%
SUFFOLK        2,911   1,042   7,159   11,112    2.3%   0.8%   5.6%    8.7%
WESTCHESTER    2,711   1,230   8,271   12,212    2.1%   1.0%   6.5%    9.6%
TOTAL         35,986  10,175  81,449  127,610   28.2%   8.0%  63.8%  100.0%

The Lure of Statistics: Data Mining Using Familiar Tools 137

[Surface plot: vertical axis, count (0 to 30,000, shaded in bands of 5,000); horizontal axes, county (BRONX to WESTCHESTER) and channel (TM to OTHER).]

Figure 5.6 A surface plot provides a visual interface for cross-tabulated data.

Statistical Measures for Continuous Variables

The most basic statistical measures describe a set of data with just a single

value. The most commonly used statistic is the mean or average value (the sum

of all the values divided by the number of them). Some other important things

to look at are:

Range. The range is the difference between the smallest and largest

observation in the sample. The range is often looked at along with the

minimum and maximum values themselves.

Mean. This is what is called an average in everyday speech.

Median. The median value is the one which splits the observations into

two equally sized groups, one having observations smaller than the

median and another containing observations larger than the median.

Mode. This is the value that occurs most often.

The median can be used in some situations where it is impossible to calculate

the mean, such as when incomes are reported in ranges of $10,000

with a final category “over $100,000.” The number of observations is known

in each group, but not the actual values. In addition, the median is less affected

by a few observations that are out of line with the others. For instance, if Bill

Gates moves onto your block, the average net worth of the neighborhood will

dramatically increase. However, the median net worth may not change at all.
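
The effect of one extreme value on the mean and the median can be sketched in a few lines of Python. The net worth figures below are invented for illustration (they are not from any data set used in this chapter), and the code relies only on the standard statistics module.

```python
from statistics import mean, median, mode

# Hypothetical net worths in dollars for five households on one block.
net_worths = [120_000, 150_000, 150_000, 150_000, 200_000]

print(mean(net_worths))                     # average: sum divided by count
print(median(net_worths))                   # middle value of the sorted list
print(mode(net_worths))                     # most frequent value
print(max(net_worths) - min(net_worths))    # range

# A hypothetical, extremely wealthy newcomer moves onto the block.
with_newcomer = net_worths + [100_000_000_000]

print(mean(with_newcomer))      # pulled up enormously by the single outlier
print(median(with_newcomer))    # still 150,000 for these particular values
```

With these values the median does not move at all, while the mean jumps by several orders of magnitude. With other values the median might shift slightly, but never by much.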


In addition, various ways of characterizing the range are useful. The range

itself is defined by the minimum and maximum value. It is often worth looking

at percentile information, such as the 25th and 75th percentile, to understand the

limits of the middle half of the values as well.
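
As a small illustration of percentile information, the sketch below computes the 25th, 50th, and 75th percentiles with the quantiles function from the Python standard library. The order amounts are hypothetical, and the "inclusive" method is only one of several common conventions for interpolating percentiles; other conventions can give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical order amounts in dollars; not taken from the chapter's data.
orders = [12, 25, 40, 55, 70, 85, 110, 240, 950]

# n=4 asks for quartiles; the three cut points are the 25th, 50th, and 75th
# percentiles. "inclusive" treats the data as the whole population.
q1, q2, q3 = quantiles(orders, n=4, method="inclusive")

print(min(orders), max(orders))   # the full range: 12 to 950
print(q1, q3)                     # limits of the middle half of the values
```

Here the full range runs from $12 to $950, but the middle half of the orders falls in the much narrower band between the 25th and 75th percentiles.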

Figure 5.7 shows a chart where the range and average are displayed for order

amount by day. This chart uses a logarithmic (log) scale for the vertical axis,

because the minimum order is under $10 and the maximum over $1,000. In fact,

the minimum is consistently around $10, the average around $70, and the

maximum around $1,000. As with discrete variables, it is valuable to use a time

chart for continuous values to see when unexpected things are happening.

Variance and Standard Deviation

Variance is a measure of the dispersion of a sample, or how closely the

observations cluster around the average. The range is not a good measure of

dispersion because it takes only two values into account: the extremes.

Removing one extreme can, sometimes, dramatically change the range. The

variance, on the other hand, takes every value into account. The difference

between a given observation and the mean of the sample is called its deviation.

The variance is defined as the average of the squares of the deviations.

Standard deviation, the square root of the variance, is the most frequently

used measure of dispersion. It is more convenient than variance because it is

expressed in the same units as the observations rather than in terms of those

units squared. This allows the standard deviation itself to be used as a unit of

measurement. The z-score, which we used earlier, is an observation's distance

from the mean measured in standard deviations. Using the normal

distribution, the z-score can be converted to a probability or confidence level.
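
These definitions can be traced step by step in a short Python sketch. The observations are made up for illustration. The variance follows the definition given here (the average of the squared deviations, dividing by the number of observations); many statistics packages instead divide by one less than the number of observations when estimating from a sample.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical observations; not taken from the chapter's data.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Deviation: the difference between each observation and the mean.
deviations = [v - mean for v in values]

# Variance as defined in the text: the average of the squared deviations.
variance = sum(d * d for d in deviations) / len(values)

# Standard deviation: square root of the variance, in the original units.
std_dev = sqrt(variance)


def z_score(x):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std_dev


z = z_score(9)

# Under the normal distribution, convert the z-score to the probability of
# seeing a value this low or lower.
p_below = NormalDist().cdf(z)

print(mean, variance, std_dev, z)   # 5.0 4.0 2.0 2.0
print(round(p_below, 3))            # 0.977
```

An observation two standard deviations above the mean has a z-score of 2, and under the normal distribution roughly 97.7 percent of values would be expected to fall below it.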

[Time chart: vertical axis, order amount on a log scale ($1 to $10,000); horizontal axis, January through July; series: Max Order, Average, Min Order.]

Figure 5.7 A time chart can also be used for continuous values; this one shows the range and average for order amounts each day.


A Couple More Statistical Ideas

Correlation is a measure of the extent to which a change in one variable is

related to a change in another. Correlation ranges from -1 to 1. A correlation of

0 means that the two variables have no linear relationship. A correlation of 1 means that as

the first variable changes, the second is guaranteed to change in the same

direction, though not necessarily by the same amount. Another measure of

correlation is the R² value, which is the correlation squared and goes from 0

(no relationship) to 1 (complete relationship). For instance, the radius and the

circumference of a circle are perfectly correlated, although the latter grows

faster than the former. A negative correlation means that the two variables

move in opposite directions. For example, altitude is negatively correlated to

air pressure.
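
The circle example can be checked numerically. The sketch below assumes the usual Pearson formula for correlation, which this chapter does not spell out, and the function name correlation is our own. The falling series is an invented stand-in for a negatively correlated pair, not real altitude and air pressure data.

```python
from math import pi, sqrt


def correlation(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


radii = [1.0, 2.0, 3.0, 4.0, 5.0]
circumferences = [2 * pi * r for r in radii]

r = correlation(radii, circumferences)
print(round(r, 6))        # 1.0: perfectly correlated
print(round(r ** 2, 6))   # 1.0: the R-squared value

# An invented series that falls steadily as the first variable rises.
falling = [10 - 2 * x for x in radii]
print(round(correlation(radii, falling), 6))   # -1.0
```

The circumference grows about 6.28 times as fast as the radius, yet the correlation is still 1, because correlation measures how consistently the two move together, not by how much.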

Regression is the process of using the value of one of a pair of correlated

variables in order to predict the value of the second. The most common form of

regression is linear regression, so called because it attempts to fit a straight line

through the observed X and Y pairs in a sample. Once the line has been

established, it can be used to predict a value for Y given any X and for X given any Y.
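
A minimal sketch of linear regression follows, assuming the ordinary least-squares method for choosing the line, which is the usual choice although the text does not name it. The X and Y pairs are invented, and fit_line, predict_y, and predict_x are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)


def predict_y(x):
    """Read a Y value off the fitted line for a given X."""
    return slope * x + intercept


def predict_x(y):
    """Read an X value off the same line for a given Y (requires slope != 0)."""
    return (y - intercept) / slope


print(round(slope, 2), round(intercept, 2))   # 1.99 0.05
print(round(predict_y(6), 2))                 # 11.99
print(round(predict_x(8.0), 2))
```

Reading X off the line for a given Y simply inverts the fitted equation, which only works when the slope is not zero.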

Measuring Response

This section looks at statistical ideas in the context of a marketing campaign.

The champion-challenger approach to marketing tries out different ideas

against the business as usual. For instance, assume that a company sends out

a million billing inserts each month to entice customers to do something. They

have settled on one approach to the bill inserts, which is the champion offer.

Another offer is a challenger to this offer. Their approach to comparing these is:

- Send the champion offer to 900,000 customers.

- Send the challenger offer to 100,000 customers.

- Determine which is better.

The question is, how do we know when one offer is better than another? This

section introduces the idea of confidence to understand this in more detail.

Standard Error of a Proportion

The approach to answering this question uses the idea of a confidence interval.