Chi-Square Value

The deviation is a good tool for looking at values. However, it does not provide
information as to whether the deviation is expected or not expected.
Doing this requires some more tools from statistics, namely, the chi-square
distribution developed by the English statistician Karl Pearson in 1900.

The chi-square value for each cell is simply the calculation:

$$\text{Chi-square}(x) = \frac{(x - \text{expected}(x))^2}{\text{expected}(x)}$$

The chi-square value for the entire table is the sum of the chi-square values of

all the cells in the table. Notice that the chi-square value is always 0 or positive.

Also, when the values in the table match the expected value, then the overall

chi-square is 0. This is the best that we can do. As the deviations from the

expected value get larger in magnitude, the chi-square value also gets larger.

Unfortunately, chi-square values do not follow a normal distribution. This is
actually obvious, because the chi-square value is never negative, whereas the
normal distribution is symmetric and extends in both directions. The good news
is that chi-square values follow
another distribution, which is also well understood. However, the chi-square

152 Chapter 5

distribution depends not only on the value itself but also on the size of the table.

Figure 5.9 shows the density functions for several chi-square distributions.

What the chi-square distribution depends on is the degrees of freedom. Unlike many
ideas in probability and statistics, degrees of freedom is easier to calculate than
to explain. The number of degrees of freedom of a table is calculated by
subtracting one from the number of rows and one from the number of columns and
multiplying the two results together. The 2 × 2 table in the previous example has
1 degree of freedom. A 5 × 7 table would have 24 (4 * 6) degrees of freedom. The aside
“Degrees of Freedom” discusses this in a bit more detail.
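The rule is short enough to write down directly. The following is a minimal sketch in Python; the function name `degrees_of_freedom` is chosen here for illustration and does not come from the text.

```python
def degrees_of_freedom(rows: int, cols: int) -> int:
    """Degrees of freedom of a table: (rows - 1) * (columns - 1)."""
    if rows < 2 or cols < 2:
        raise ValueError("a table needs at least 2 rows and 2 columns")
    return (rows - 1) * (cols - 1)


print(degrees_of_freedom(2, 2))  # 1, the 2 x 2 table
print(degrees_of_freedom(5, 7))  # 24, that is 4 * 6
```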

WARNING The chi-square test does not work when the expected value
in any cell is less than 5 (and we prefer a slightly higher bound).
Although this is not an issue for large data mining problems, it can be an issue
when analyzing results from a small test.

The process for using the chi-square test is:

- Calculate the expected values.

- Calculate the deviations from expected.

- Calculate the chi-square (square the deviations and divide by the
  expected).

- Sum for an overall chi-square value for the table.

- Calculate the probability that the observed values are due to chance
  (in Excel, you can use the CHIDIST function).
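The steps above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the authors' implementation: expected values are taken to be row total times column total divided by the grand total, the helper `chi2_upper_tail` is a hand-written stand-in for the upper-tail probability that CHIDIST returns (valid for whole-number degrees of freedom), and the 2 × 2 counts at the bottom are made up for the example.

```python
import math


def chi2_upper_tail(x, dof):
    """Probability that a chi-square variable with `dof` degrees of freedom
    exceeds x. Starts from the closed form for 1 or 2 degrees of freedom
    and adds one term for every further 2 degrees of freedom."""
    if x <= 0:
        return 1.0
    if dof % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, k = math.exp(-x / 2), 2
    while k < dof:
        # term = (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1), computed in logs
        q += math.exp((k / 2) * math.log(x / 2) - x / 2 - math.lgamma(k / 2 + 1))
        k += 2
    return q


def chi_square_test(observed):
    """Return (chi-square value, degrees of freedom, probability)."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    # Step 1: expected values (assumed: row total * column total / total).
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    if min(min(row) for row in expected) < 5:
        raise ValueError("an expected value is below 5; the test is unreliable")
    # Steps 2 to 4: deviations, squared and divided by expected, then summed.
    chi2 = sum(
        (o - e) ** 2 / e
        for orow, erow in zip(observed, expected)
        for o, e in zip(orow, erow)
    )
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    # Step 5: probability of a value at least this large arising by chance.
    return chi2, dof, chi2_upper_tail(chi2, dof)


# Hypothetical counts, for illustration only.
chi2, dof, p = chi_square_test([[60, 40], [30, 70]])
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.6f}")
```

For production work a statistics library would normally replace the hand-written tail function; it is spelled out here only to keep the example self-contained.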

[Figure 5.9: density curves of the chi-square distribution for dof = 2, 3, 10, and 20. Horizontal axis: Chi-Square Value, 0 to 35. Vertical axis: Probability Density, 0% to 5%.]

Figure 5.9 The chi-square distribution depends on something called the degrees of
freedom. In general, though, it starts low, peaks early, and gradually descends.


The Lure of Statistics: Data Mining Using Familiar Tools 153

DEGREES OF FREEDOM

The idea behind the degrees of freedom is how many different variables are
needed to describe the table of expected values. This is a measure of how
constrained the data is in the table.

If the table has r rows and c columns, then there are r * c cells in the table.
With no constraints on the table, this is the number of variables that would be
needed. However, the calculation of the expected values has imposed some
constraints. In particular, the sum of the values in each row is the same for the
expected values as for the original table, because the sum of each row is fixed.
That is, if one value were missing, we could recalculate it by subtracting the
sum of the rest of the values in the row from the sum
for the whole row. This suggests that the degrees of freedom is r * c - r. The same
situation exists for the columns, yielding an estimate of r * c - r - c.

However, there is one additional constraint. The sum of all the row sums and
the sum of all the column sums must be the same. It turns out, we have
overcounted the constraints by one, so the degrees of freedom is really
r * c - r - c + 1. Another way of writing this is (r - 1) * (c - 1).
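One way to see the count concretely: once the row sums and column sums are fixed, only an (r - 1) by (c - 1) block of cells can be chosen freely, and every other cell follows from the constraints. The sketch below illustrates this in Python; the function name `complete_table` and the sample numbers are hypothetical.

```python
def complete_table(free_cells, row_sums, col_sums):
    """Rebuild a full r x c table from its top-left (r-1) x (c-1) block
    and the fixed row and column sums."""
    if sum(row_sums) != sum(col_sums):
        raise ValueError("row sums and column sums must have the same total")
    # The last cell of each of the first r-1 rows is forced by its row sum.
    rows = [list(r) + [rs - sum(r)] for r, rs in zip(free_cells, row_sums)]
    # Every cell of the last row is forced by its column sum.
    rows.append([cs - sum(col) for cs, col in zip(col_sums, zip(*rows))])
    return rows


# A 3 x 3 table has (3 - 1) * (3 - 1) = 4 free cells.
print(complete_table([[10, 20], [5, 15]], [60, 45, 6], [16, 37, 58]))
# [[10, 20, 30], [5, 15, 25], [1, 2, 3]]
```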

The result of the chi-square test is the probability that the distribution of
values in the table is due
to random fluctuations rather than some external criteria. As Occam's Razor
suggests, the simplest explanation is that there is no difference at all due to the
various factors; that observed differences from expected values are entirely
within the range of expectation.

Comparison of Chi-Square to Difference of Proportions

Chi-square and difference of proportions can be applied to the same problems.
Although the results are not exactly the same, the results are similar enough
for comfort. Earlier, in Table 5.4, we determined the likelihood of champion
and challenger results being the same using the difference of proportions
method for a range of champion response rates. Table 5.7 repeats this using
the chi-square calculation instead of the difference of proportions. The
results from the chi-square test are very similar to the results from the
difference of proportions, a remarkable result considering how different the two
methods are.

Table 5.7 Chi-Square Calculation for Difference of Proportions Example in Table 5.4

 CHALLENGER CHALLENGER   CHAMPION   CHAMPION            CHALLENGER CHALLENGER   CHAMPION   CHAMPION       CHAL       CHAL      CHAMP      CHAMP
                                                OVERALL        EXP        EXP        EXP        EXP CHI-SQUARE CHI-SQUARE CHI-SQUARE CHI-SQUARE CHI-SQUARE CHI-SQUARE  DIFF PROP
       RESP   NON RESP       RESP   NON RESP  RESP RATE       RESP   NON RESP       RESP   NON RESP       RESP   NON RESP       RESP   NON RESP      VALUE    P-VALUE    P-VALUE
      5,000     95,000     40,500    859,500      4.55%      4,550     95,450     40,950    859,050      44.51       2.12       4.95       0.24      51.81      0.00%      0.00%
      5,000     95,000     41,400    858,600      4.64%      4,640     95,360     41,760    858,240      27.93       1.36       3.10       0.15      32.54      0.00%      0.00%
      5,000     95,000     42,300    857,700      4.73%      4,730     95,270     42,570    857,430      15.41       0.77       1.71       0.09      17.97      0.00%      0.00%
      5,000     95,000     43,200    856,800      4.82%      4,820     95,180     43,380    856,620       6.72       0.34       0.75       0.04       7.85      0.51%      0.58%
      5,000     95,000     44,100    855,900      4.91%      4,910     95,090     44,190    855,810       1.65       0.09       0.18       0.01       1.93     16.50%     16.83%
      5,000     95,000     45,000    855,000      5.00%      5,000     95,000     45,000    855,000       0.00       0.00       0.00       0.00       0.00    100.00%    100.00%
      5,000     95,000     45,900    854,100      5.09%      5,090     94,910     45,810    854,190       1.59       0.09       0.18       0.01       1.86     17.23%     16.91%
      5,000     95,000     46,800    853,200      5.18%      5,180     94,820     46,620    853,380       6.25       0.34       0.69       0.04       7.33      0.68%      0.60%
      5,000     95,000     47,700    852,300      5.27%      5,270     94,730     47,430    852,570      13.83       0.77       1.54       0.09      16.23      0.01%      0.00%
      5,000     95,000     48,600    851,400      5.36%      5,360     94,640     48,240    851,760      24.18       1.37       2.69       0.15      28.39      0.00%      0.00%
      5,000     95,000     49,500    850,500      5.45%      5,450     94,550     49,050    850,950      37.16       2.14       4.13       0.24      43.66      0.00%      0.00%
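The chi-square value and chi-square p-value columns of Table 5.7 can be recomputed from the observed counts alone. The sketch below is one way to do so in Python, assuming expected values are row total times column total divided by the grand total (which matches the expected columns in the table) and using the fact that, with 1 degree of freedom, the upper-tail probability equals erfc(sqrt(chi-square / 2)). The function name `two_by_two` is hypothetical, and the difference of proportions column is not recomputed here.

```python
import math


def two_by_two(chal_resp, chal_non, champ_resp, champ_non):
    """Chi-square value and p-value (1 degree of freedom) for a 2 x 2 table."""
    observed = [[chal_resp, chal_non], [champ_resp, champ_non]]
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, r in enumerate(row_sums):
        for j, c in enumerate(col_sums):
            expected = r * c / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper-tail probability of the chi-square distribution with 1 dof.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Challenger: 5,000 responders out of 100,000.
# Champion: 900,000 in total, with responders varying as in Table 5.7.
for champ_resp in range(40_500, 49_501, 900):
    chi2, p = two_by_two(5_000, 95_000, champ_resp, 900_000 - champ_resp)
    print(f"{champ_resp:>7,}  chi-square = {chi2:6.2f}  p-value = {p:8.2%}")
```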


An Example: Chi-Square for Regions and Starts

A large consumer-oriented company has been running acquisition campaigns
in the New York City area. The purpose of this analysis is to look at their
acquisition channels to try to gain an understanding of different parts of the area.

For the purposes of this analysis, three channels are of interest:

Telemarketing. Customers who are acquired through outbound telemarketing
calls (note that this data was collected before the national do-not-call
list went into effect).

Direct mail. Customers who respond to direct mail pieces.

Other. Customers who come in through other means.

The area of interest consists of eight counties in New York State. Five of
these counties are the boroughs of New York City, two others (Nassau and
Suffolk counties) are on Long Island, and one (Westchester) lies just north of the
city. This data was shown earlier in Table 5.1. The purpose of this analysis is to
determine whether the breakdown of starts by channel and county is due to
chance or whether some other factors might be at work.

This problem is particularly suitable for chi-square because the data can be

laid out in rows and columns, with no customer being counted in more than

one cell. Table 5.8 shows the deviation, expected values, and chi-square values

for each combination in the table. Notice that the chi-square values are often

quite large in this example. The overall chi-square score for the table is 7,200,

which is very large; the probability that the overall score is due to chance is

basically 0. That is, the variation among starts by channel and by region is not

due to sample variation. There are other factors at work.

The next step is to determine which of the values are too high and too low
and with what probability. It is tempting to convert each chi-square value in
each cell into a probability, using the degrees of freedom for the table. The
table is 8 × 3, so it has 14 degrees of freedom. However, this is not an
appropriate thing to do. The chi-square result is for the entire table; inverting the
individual scores to get a probability does not produce valid results.
Chi-square scores are not additive.

An alternative approach proves more accurate. The idea is to compare each

cell to everything else. The result is a table that has two columns and two rows,

as shown in Table 5.9. One column is the column of the original cell; the other

column is everything else. One row is the row of the original cell; the other row

is everything else.
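Collapsing a larger table around a single cell is a matter of taking three totals. The following Python sketch shows the idea; the function name `collapse_to_2x2` and the sample counts are hypothetical and do not come from the counties and channels data.

```python
def collapse_to_2x2(table, row, col):
    """Collapse a table of counts to a 2 x 2 table around one cell:
    (this row, everything else) by (this column, everything else)."""
    total = sum(sum(r) for r in table)
    row_total = sum(table[row])
    col_total = sum(r[col] for r in table)
    cell = table[row][col]
    return [
        [cell, row_total - cell],
        [col_total - cell, total - row_total - col_total + cell],
    ]


# Hypothetical 3 x 3 table of counts; collapse around the top-left cell.
counts = [[10, 20, 30], [5, 15, 25], [1, 2, 3]]
print(collapse_to_2x2(counts, 0, 0))  # [[10, 50], [6, 45]]
```

The collapsed table always has 1 degree of freedom, whatever the size of the original, so each cell's 2 × 2 table can be tested in the same way.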

Table 5.8 Chi-Square Calculation for Counties and Channels Example

                EXPECTED   EXPECTED   EXPECTED  DEVIATION  DEVIATION  DEVIATION CHI-SQUARE CHI-SQUARE CHI-SQUARE
COUNTY                TM         DM      OTHER         TM         DM      OTHER         TM         DM      OTHER
BRONX            1,850.2      523.1    4,187.7      1,362       -110     -1,252    1,002.3       23.2      374.1
KINGS            6,257.9    1,769.4   14,163.7      3,515       -376     -3,139    1,974.5       80.1      695.6
NASSAU           4,251.1    1,202.0    9,621.8     -1,116        371        745      293.0      114.5       57.7
NEW YORK        11,005.3    3,111.7   24,908.9     -3,811       -245      4,056    1,319.9       19.2      660.5
QUEENS           5,245.2    1,483.1   11,871.7      1,021       -103       -918      198.7        7.2       70.9
RICHMOND           798.9      225.9    1,808.2        -15         51        -36        0.3       11.6        0.7
SUFFOLK          3,133.6      886.0    7,092.4       -223        156         67       15.8       27.5        0.6
WESTCHESTER      3,443.8      973.7    7,794.5       -733        256        477      155.9       67.4       29.1
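As a check on the overall score, the per-cell chi-square values in Table 5.8 can be summed and converted to a probability with 14 degrees of freedom. The Python sketch below does this with the standard library only; the helper `chi2_upper_tail` is a hand-written stand-in for an upper-tail function such as Excel's CHIDIST, valid for whole-number degrees of freedom, and is an assumption of this example rather than something the text specifies.

```python
import math

# Chi-square cells from Table 5.8, in the order TM, DM, OTHER.
chi_square_cells = {
    "BRONX": (1002.3, 23.2, 374.1),
    "KINGS": (1974.5, 80.1, 695.6),
    "NASSAU": (293.0, 114.5, 57.7),
    "NEW YORK": (1319.9, 19.2, 660.5),
    "QUEENS": (198.7, 7.2, 70.9),
    "RICHMOND": (0.3, 11.6, 0.7),
    "SUFFOLK": (15.8, 27.5, 0.6),
    "WESTCHESTER": (155.9, 67.4, 29.1),
}


def chi2_upper_tail(x, dof):
    """Probability that a chi-square variable with `dof` degrees of
    freedom exceeds x (whole-number dof only)."""
    if x <= 0:
        return 1.0
    if dof % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, k = math.exp(-x / 2), 2
    while k < dof:
        q += math.exp((k / 2) * math.log(x / 2) - x / 2 - math.lgamma(k / 2 + 1))
        k += 2
    return q


total = sum(sum(cells) for cells in chi_square_cells.values())
dof = (len(chi_square_cells) - 1) * (3 - 1)
p_value = chi2_upper_tail(total, dof)
print(f"overall chi-square = {total:,.1f}, dof = {dof}, p-value = {p_value}")
```

The sum comes to roughly 7,200, and the probability is so small that it underflows to zero in floating point, consistent with "basically 0" in the text.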


Table 5.9 Chi-Square Calculation for Bronx and TM

                EXPECTED   EXPECTED  DEVIATION  DEVIATION CHI-SQUARE CHI-SQUARE
COUNTY                TM     NOT_TM         TM     NOT_TM         TM     NOT_TM
BRONX            1,850.2    4,710.8    1,361.8   -1,361.8    1,002.3      393.7
NOT BRONX       34,135.8   86,913.2   -1,361.8    1,361.8       54.3       21.3

The result is a set of chi-square values for the Bronx-TM combination, in a
table with 1 degree of freedom. The Bronx-TM score by itself is a good
approximation of the overall chi-square value for the 2 × 2 table (this assumes that the

original cells are roughly the same size). The calculation for the chi-square

value uses this value (1002.3) with 1 degree of freedom. Conveniently, the chi-