

Chi-Square Value
The deviation is a good tool for looking at values. However, it does not
provide information as to whether the deviation is expected or not expected.
Doing this requires some more tools from statistics, namely, the chi-square
distribution developed by the English statistician Karl Pearson in 1900.
The chi-square value for each cell is simply the calculation:

$$\text{Chi-square}(x) = \frac{(x - \text{expected}(x))^2}{\text{expected}(x)}$$

The chi-square value for the entire table is the sum of the chi-square values of
all the cells in the table. Notice that the chi-square value is always 0 or positive.
Also, when the values in the table match the expected value, then the overall
chi-square is 0. This is the best that we can do. As the deviations from the
expected value get larger in magnitude, the chi-square value also gets larger.
Unfortunately, chi-square values do not follow a normal distribution. This is
actually obvious, because the chi-square value is never negative, whereas the
normal distribution is symmetric and extends in both directions. The good news
is that chi-square values follow
another distribution, which is also well understood. However, the chi-square
152 Chapter 5


distribution depends not only on the value itself but also on the size of the table.
Figure 5.9 shows the density functions for several chi-square distributions.
What the chi-square depends on is the degrees of freedom. Unlike many
ideas in probability and statistics, degrees of freedom is easier to calculate than
to explain. The number of degrees of freedom of a table is calculated by
subtracting one from the number of rows and one from the number of columns and
multiplying the two results together. The 2 × 2 table in the previous example
has 1 degree of freedom. A 5 × 7 table would have 24 (4 * 6) degrees of freedom.
The aside “Degrees of Freedom” discusses this in a bit more detail.
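
The rule above is short enough to write as a one-line function. This is only a sketch; the function name `degrees_of_freedom` is chosen here for illustration and does not come from the text.

```python
def degrees_of_freedom(rows: int, cols: int) -> int:
    """Degrees of freedom for a table with the given numbers of rows and columns."""
    if rows < 2 or cols < 2:
        raise ValueError("the table needs at least 2 rows and 2 columns")
    return (rows - 1) * (cols - 1)


print(degrees_of_freedom(2, 2))  # the 2 x 2 table from the previous example
print(degrees_of_freedom(5, 7))  # a 5 x 7 table
```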

WARNING The chi-square test does not work when the expected value in any
cell is less than 5 (and we prefer a slightly higher bound). Although this is
not an issue for large data mining problems, it can be an issue when analyzing
results from a small test.

The process for using the chi-square test is:
–– Calculate the expected values.
–– Calculate the deviations from expected.
–– Calculate the chi-square (square the deviations and divide by the
   expected).
–– Sum for an overall chi-square value for the table.
–– Calculate the probability that the observed values are due to chance
   (in Excel, you can use the CHIDIST function).
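
The steps above can be sketched in Python using only the standard library. Three things here are assumptions rather than part of the text: the function names, the expected-value rule (row total times column total divided by grand total), and the tail-probability routine, which stands in for Excel's CHIDIST. The sample counts are the challenger and champion figures from one row of Table 5.7.

```python
import math


def expected_values(observed):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_cells(observed):
    """Per-cell chi-square: squared deviation divided by the expected value."""
    expected = expected_values(observed)
    return [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(observed, expected)]


def chi_square_p_value(chi2, dof):
    """Upper-tail probability of the chi-square distribution."""
    x = chi2 / 2.0
    if x <= 0.0:
        return 1.0
    # Closed-form starting point: dof = 2 for even dof, dof = 1 for odd dof.
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    # Recurrence Q(a + 1, x) = Q(a, x) + x^a * e^(-x) / Gamma(a + 1).
    while a < dof / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Rows: challenger, champion. Columns: responders, non-responders (Table 5.7).
observed = [[5000, 95000], [43200, 856800]]
cells = chi_square_cells(observed)
chi2 = sum(sum(row) for row in cells)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi_square_p_value(chi2, dof)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p-value = {p_value:.2%}")
```

For this row the sketch should come out close to the 7.85 chi-square value and 0.51% p-value that Table 5.7 reports.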



[Figure 5.9: probability density (vertical axis, 0% to 5%) plotted against chi-square value (horizontal axis, 0 to 35), with one curve each for dof = 2, 3, 10, and 20.]

Figure 5.9 The chi-square distribution depends on something called the degrees of
freedom. In general, though, it starts low, peaks early, and gradually descends.




The Lure of Statistics: Data Mining Using Familiar Tools 153


DEGREES OF FREEDOM

The idea behind the degrees of freedom is how many different variables are
needed to describe the table of expected values. This is a measure of how
constrained the data is in the table.
If the table has r rows and c columns, then there are r * c cells in the table.
With no constraints on the table, this is the number of variables that would be
needed. However, the calculation of the expected values has imposed some
constraints. In particular, the sum of the values in each row is the same for the
expected values as for the original table, because the sum of each row is fixed.
That is, if one value were missing, we could recalculate it by taking the constraint
into account by subtracting the sum of the rest of the values in the row from the sum
for the whole row. This suggests that the degrees of freedom is r * c - r. The same
situation exists for the columns, yielding an estimate of r * c - r - c.
However, there is one additional constraint. The sum of all the row sums and
the sum of all the column sums must be the same. It turns out, we have
overcounted the constraints by one, so the degrees of freedom is really
r * c - r - c + 1. Another way of writing this is (r - 1) * (c - 1).
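
One way to see the constraint counting in the aside is to rebuild a whole table from just (r - 1) * (c - 1) freely chosen cells plus the row and column sums. The sketch below does that. The helper name `fill_from_margins` is hypothetical, and the numbers are the expected values that appear in Table 5.9 later in this section.

```python
def fill_from_margins(free_cells, row_sums, col_sums):
    """Rebuild an r x c table from its top-left (r-1) x (c-1) block plus the margins."""
    r, c = len(row_sums), len(col_sums)
    # Start with the freely chosen cells and leave room for the last row and column.
    table = [list(row) + [0.0] for row in free_cells]
    table.append([0.0] * c)
    # The last cell of each row is fixed by that row's sum.
    for i in range(r - 1):
        table[i][c - 1] = row_sums[i] - sum(free_cells[i])
    # Every cell of the last row is fixed by its column's sum.
    for j in range(c):
        table[r - 1][j] = col_sums[j] - sum(table[i][j] for i in range(r - 1))
    return table


# A 2 x 2 table has one free cell (Bronx, TM); the margins determine the other three.
table = fill_from_margins([[1850.2]],
                          row_sums=[6561.0, 121049.0],
                          col_sums=[35986.0, 91624.0])
for row in table:
    print([round(value, 1) for value in row])
```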



The result is the probability that the distribution of values in the table is due
to random fluctuations rather than some external criteria. As Occam's Razor
suggests, the simplest explanation is that there is no difference at all due to the
various factors; that observed differences from expected values are entirely
within the range of expectation.


Comparison of Chi-Square to Difference of Proportions
Chi-square and difference of proportions can be applied to the same problems.
Although the results are not exactly the same, the results are similar enough
for comfort. Earlier, in Table 5.4, we determined the likelihood of champion
and challenger results being the same using the difference of proportions
method for a range of champion response rates. Table 5.7 repeats this using
the chi-square calculation instead of the difference of proportions. The
results from the chi-square test are very similar to the results from the
difference of proportions, a remarkable result considering how different the
two methods are.

Table 5.7 Chi-Square Calculation for Difference of Proportions Example in Table 5.4

CHALLENGER       CHAMPION          OVERALL  CHALLENGER EXP   CHAMPION EXP      CHAL CHI-SQUARE  CHAMP CHI-SQUARE CHI-SQUARE  CHI-SQUARE  DIFF PROP
RESP   NON-RESP  RESP    NON-RESP  RESP     RESP   NON-RESP  RESP    NON-RESP  RESP   NON-RESP  RESP   NON-RESP  VALUE       P-VALUE     P-VALUE

5,000  95,000    40,500  859,500   4.55%    4,550  95,450    40,950  859,050   44.51  2.12      4.95   0.24      51.81       0.00%       0.00%
5,000  95,000    41,400  858,600   4.64%    4,640  95,360    41,760  858,240   27.93  1.36      3.10   0.15      32.54       0.00%       0.00%
5,000  95,000    42,300  857,700   4.73%    4,730  95,270    42,570  857,430   15.41  0.77      1.71   0.09      17.97       0.00%       0.00%
5,000  95,000    43,200  856,800   4.82%    4,820  95,180    43,380  856,620   6.72   0.34      0.75   0.04      7.85        0.51%       0.58%
5,000  95,000    44,100  855,900   4.91%    4,910  95,090    44,190  855,810   1.65   0.09      0.18   0.01      1.93        16.50%      16.83%
5,000  95,000    45,000  855,000   5.00%    5,000  95,000    45,000  855,000   0.00   0.00      0.00   0.00      0.00        100.00%     100.00%
5,000  95,000    45,900  854,100   5.09%    5,090  94,910    45,810  854,190   1.59   0.09      0.18   0.01      1.86        17.23%      16.91%
5,000  95,000    46,800  853,200   5.18%    5,180  94,820    46,620  853,380   6.25   0.34      0.69   0.04      7.33        0.68%       0.60%
5,000  95,000    47,700  852,300   5.27%    5,270  94,730    47,430  852,570   13.83  0.77      1.54   0.09      16.23       0.01%       0.00%
5,000  95,000    48,600  851,400   5.36%    5,360  94,640    48,240  851,760   24.18  1.37      2.69   0.15      28.39       0.00%       0.00%
5,000  95,000    49,500  850,500   5.45%    5,450  94,550    49,050  850,950   37.16  2.14      4.13   0.24      43.66       0.00%       0.00%
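
To make the comparison in Table 5.7 concrete, the sketch below computes both p-values for the row with a 4.8% champion response rate. The difference of proportions formula used here (a two-sided test with an unpooled standard error) is an assumption about how the earlier Table 5.4 figures were produced, and the function names are illustrative.

```python
import math


def chi_square_p_2x2(observed):
    """Chi-square value and p-value for a 2 x 2 table (1 degree of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (count - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail reduces to erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def difference_of_proportions_p(resp_a, n_a, resp_b, n_b):
    """Two-sided p-value for a difference of proportions (unpooled standard error)."""
    p_a, p_b = resp_a / n_a, resp_b / n_b
    std_err = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = abs(p_a - p_b) / std_err
    return math.erfc(z / math.sqrt(2.0))


# Challenger: 5,000 responders of 100,000. Champion: 43,200 of 900,000.
chi2, p_chi = chi_square_p_2x2([[5000, 95000], [43200, 856800]])
p_diff = difference_of_proportions_p(5000, 100000, 43200, 900000)
print(f"chi-square = {chi2:.2f}, chi-square p = {p_chi:.2%}, diff-prop p = {p_diff:.2%}")
```

Under these assumptions the two p-values should land near the 0.51% and 0.58% shown in that row of the table: close to each other, but not identical.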


An Example: Chi-Square for Regions and Starts

A large consumer-oriented company has been running acquisition campaigns
in the New York City area. The purpose of this analysis is to look at their
acquisition channels to try to gain an understanding of different parts of the area.
For the purposes of this analysis, three channels are of interest:
Telemarketing. Customers who are acquired through outbound telemarketing
calls (note that this data was collected before the national do-not-call
list went into effect).
Direct mail. Customers who respond to direct mail pieces.
Other. Customers who come in through other means.
The area of interest consists of eight counties in New York State. Five of
these counties are the boroughs of New York City, two others (Nassau and
Suffolk counties) are on Long Island, and one (Westchester) lies just north of the
city. This data was shown earlier in Table 5.1. The purpose of this analysis is to
determine whether the breakdown of starts by channel and county is due to
chance or whether some other factors might be at work.
This problem is particularly suitable for chi-square because the data can be
laid out in rows and columns, with no customer being counted in more than
one cell. Table 5.8 shows the deviation, expected values, and chi-square values
for each combination in the table. Notice that the chi-square values are often
quite large in this example. The overall chi-square score for the table is 7,200,
which is very large; the probability that the overall score is due to chance is
basically 0. That is, the variation among starts by channel and by region is not
due to sample variation. There are other factors at work.
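
As a check on the overall score, the per-cell chi-square values from Table 5.8 can be summed and converted to a probability. This is a sketch only: the dictionary layout and the function name are illustrative, and the tail-probability formula is a closed form that holds for an even number of degrees of freedom (the 8 × 3 table has 14).

```python
import math

# Per-cell chi-square values (TM, DM, Other) copied from Table 5.8.
chi_square_by_county = {
    "Bronx": (1002.3, 23.2, 374.1),
    "Kings": (1974.5, 80.1, 695.6),
    "Nassau": (293.0, 114.5, 57.7),
    "New York": (1319.9, 19.2, 660.5),
    "Queens": (198.7, 7.2, 70.9),
    "Richmond": (0.3, 11.6, 0.7),
    "Suffolk": (15.8, 27.5, 0.6),
    "Westchester": (155.9, 67.4, 29.1),
}


def chi_square_p_value_even_dof(chi2, dof):
    """Upper-tail chi-square probability; this closed form is valid for even dof only."""
    if dof % 2 != 0:
        raise ValueError("this sketch only handles an even number of degrees of freedom")
    if chi2 <= 0.0:
        return 1.0
    x = chi2 / 2.0
    # exp(-x) * sum over k < dof/2 of x^k / k!, evaluated in log space to avoid overflow.
    return sum(math.exp(k * math.log(x) - x - math.lgamma(k + 1.0))
               for k in range(dof // 2))


total = sum(sum(cells) for cells in chi_square_by_county.values())
dof = (len(chi_square_by_county) - 1) * (3 - 1)
p_value = chi_square_p_value_even_dof(total, dof)
print(f"overall chi-square = {total:.1f}, dof = {dof}, p-value = {p_value}")
```

The sum comes to roughly 7,200, and the probability is so small that it underflows double-precision arithmetic, which is consistent with "basically 0".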
The next step is to determine which of the values are too high and too low
and with what probability. It is tempting to convert each chi-square value in
each cell into a probability, using the degrees of freedom for the table. The
table is 8 × 3, so it has 14 degrees of freedom. However, this is not an
appropriate thing to do. The chi-square result is for the entire table; inverting
the individual scores to get a probability does not produce valid results.
Chi-square scores are not additive.
An alternative approach proves more accurate. The idea is to compare each
cell to everything else. The result is a table that has two columns and two rows,
as shown in Table 5.9. One column is the column of the original cell; the other
column is everything else. One row is the row of the original cell; the other row
is everything else.

Table 5.8 Chi-Square Calculation for Counties and Channels Example

             EXPECTED                     DEVIATION             CHI-SQUARE
COUNTY       TM        DM       OTHER     TM      DM    OTHER   TM       DM     OTHER

BRONX        1,850.2   523.1    4,187.7   1,362   -110  -1,252  1,002.3  23.2   374.1
KINGS        6,257.9   1,769.4  14,163.7  3,515   -376  -3,139  1,974.5  80.1   695.6
NASSAU       4,251.1   1,202.0  9,621.8   -1,116  371   745     293.0    114.5  57.7
NEW YORK     11,005.3  3,111.7  24,908.9  -3,811  -245  4,056   1,319.9  19.2   660.5
QUEENS       5,245.2   1,483.1  11,871.7  1,021   -103  -918    198.7    7.2    70.9
RICHMOND     798.9     225.9    1,808.2   -15     51    -36     0.3      11.6   0.7
SUFFOLK      3,133.6   886.0    7,092.4   -223    156   67      15.8     27.5   0.6
WESTCHESTER  3,443.8   973.7    7,794.5   -733    256   477     155.9    67.4   29.1


Table 5.9 Chi-Square Calculation for Bronx and TM

           EXPECTED            DEVIATION           CHI-SQUARE
COUNTY     TM        NOT_TM    TM        NOT_TM    TM       NOT_TM

BRONX      1,850.2   4,710.8   1,361.8   -1,361.8  1,002.3  393.7
NOT BRONX  34,135.8  86,913.2  -1,361.8  1,361.8   54.3     21.3
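
The cell-versus-everything-else comparison can be sketched as follows. The helper names are hypothetical. The inputs are derived from Table 5.9 rather than taken from the original counts: the observed Bronx-TM count is the expected value plus its deviation, and the totals are sums of the expected values, so the results may differ slightly from the table through rounding.

```python
import math


def one_vs_rest(cell, row_total, col_total, grand_total):
    """Collapse a larger table to 2 x 2: one cell against everything else."""
    return [
        [cell, row_total - cell],
        [col_total - cell, grand_total - row_total - col_total + cell],
    ]


def chi_square_cells(observed):
    """Per-cell chi-square, with expected = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[(observed[i][j] - row_totals[i] * col_totals[j] / total) ** 2
             / (row_totals[i] * col_totals[j] / total)
             for j in range(len(col_totals))]
            for i in range(len(row_totals))]


# Bronx row total, TM column total, and grand total implied by Table 5.9.
observed = one_vs_rest(cell=1850.2 + 1361.8, row_total=6561.0,
                       col_total=35986.0, grand_total=127610.0)
cells = chi_square_cells(observed)
bronx_tm = cells[0][0]
# Upper-tail probability for a chi-square value with 1 degree of freedom.
p_value = math.erfc(math.sqrt(bronx_tm / 2.0))
print(f"Bronx-TM chi-square = {bronx_tm:.1f}, p-value = {p_value:.3g}")
```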


The result is a set of chi-square values for the Bronx-TM combination, in a
table with 1 degree of freedom. The Bronx-TM score by itself is a good
approximation of the overall chi-square value for the 2 × 2 table (this assumes that the
original cells are roughly the same size). The calculation for the chi-square
value uses this value (1002.3) with 1 degree of freedom. Conveniently, the chi-
