
champion.

■ Use the 5 percent lowest and 5 percent highest value customers for the
  challenger, and everyone else for the champion.

■ Use the 10 percent most recent customers for the challenger, and everyone
  else for the champion.

■ Use the customers with telephone numbers for the telemarketing campaign;
  everyone else for the direct mail campaign.

All of these are biased ways of splitting the population into groups. The
previous results all assume that there is no such systematic bias. When there is
systematic bias, the formulas for the confidence intervals are not correct.
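By contrast, an unbiased split leaves the assignment to chance alone. The following is a minimal sketch, assuming customers are identified by a simple list of IDs; the function name `split_champion_challenger` and its parameters are hypothetical and chosen only for illustration.

```python
import random

def split_champion_challenger(customer_ids, challenger_fraction=0.1, seed=None):
    """Randomly assign customers to the champion and challenger groups.

    Hypothetical helper: every customer has the same chance of landing in
    the challenger group, regardless of value, recency, or contact details.
    """
    rng = random.Random(seed)          # seeded generator so the split is repeatable
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)              # random order removes any systematic ordering
    n_challenger = round(len(shuffled) * challenger_fraction)
    challenger = shuffled[:n_challenger]
    champion = shuffled[n_challenger:]
    return champion, challenger

champion, challenger = split_champion_challenger(range(1000), 0.1, seed=42)
print(len(champion), len(challenger))
```

Recording the seed alongside the campaign makes it possible to reproduce the assignment later when the results are analyzed.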

Using the formula for the confidence interval assumes that there is no
systematic bias in deciding whether a particular customer receives the champion
or the challenger message. For instance, perhaps there was a champion model
that predicts the likelihood of customers responding to the champion offer. If
this model were used, then the challenger sample would no longer be a random
sample. It would consist of the leftover customers from the champion
model. This introduces another form of bias.

Or, perhaps the challenger model is only available to customers in certain

markets or with certain products. This introduces other forms of bias. In such

a case, these customers should be compared to the set of customers receiving

the champion offer with the same constraints.

Another form of bias might come from the method of response. The challenger
may only accept responses via telephone, but the champion may accept
them by telephone or on the Web. In such a case, the challenger response may
be dampened because of the lack of a Web channel. Or, there might need to be
special training for the inbound telephone service reps to handle the
challenger offer. At certain times, this might mean that wait times are longer,
which is another form of bias.

The confidence interval is simply a statement about statistics and dispersion.
It does not address all the other forms of bias that might affect results,
and these forms of bias are often more important to results than sample
variation. The next section talks about setting up a test and control experiment
in marketing, diving into these issues in more detail.

The Lure of Statistics: Data Mining Using Familiar Tools 147

Size of Test and Control for an Experiment

The champion-challenger model is an example of a two-way test, where a new
method (the challenger) is compared to business-as-usual activity (the
champion). This section talks about ensuring that the test and control are large
enough for the purposes at hand. The previous section talked about determining
the confidence interval for the sample response rate. Here, we turn this
logic inside out. Instead of starting with the size of the groups, let's
consider sizes from the perspective of test design. This requires several items
of information:

■ Estimated response rate for one of the groups, which we call p

■ Difference in response rates that we want to consider significant (acuity
  of the test), which we call d

■ Confidence level (say 95 percent)

This provides enough information to determine the size of the samples
needed for the test and control. For instance, suppose that the business as
usual has a response rate of 5 percent and we want to measure with 95 percent
confidence a difference of 0.2 percent. This means that if the response of the
test group is greater than 5.2 percent, then the experiment can detect the
difference with a 95 percent confidence level.

For a problem of this type, the first step is to determine the value of
SEDP. That is, if we are willing to accept a difference of 0.2 percent with a
confidence of 95 percent, then what is the corresponding standard error? A
confidence of 95 percent means that we are 1.96 standard deviations from the mean,
so the answer is to divide the difference by 1.96, which yields 0.102 percent.
More generally, the process is to convert the confidence level (95 percent) to a
z-value (which can be done using the Excel function NORMSINV) and then divide the
desired difference by this value.
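The same conversion can be sketched in Python, where the standard library's `statistics.NormalDist` stands in for NORMSINV. This sketch assumes a two-sided interval, so half of the remaining 5 percent goes in each tail and the inverse is taken at 0.975 rather than 0.95; the variable names are illustrative only.

```python
from statistics import NormalDist

confidence = 0.95   # desired confidence level
d = 0.002           # difference to detect: 0.2 percent

# Two-sided assumption: split the remaining 5 percent across both tails.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# The standard error that corresponds to the desired difference.
target_se = d / z

print(f"z-value: {z:.2f}")
print(f"target standard error: {target_se:.3%}")
```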

The next step is to plug these values into the formula for SEDP. For this, let's
assume that the test and control are the same size:

$$\frac{0.2\%}{1.96} = \sqrt{\frac{p\,(1 - p)}{N} + \frac{(p + d)\,(1 - p - d)}{N}}$$

Plugging in the values just described (p is 5% and d is 0.2%) results in:

$$0.102\% = \sqrt{\frac{5\% \times 95\%}{N} + \frac{5.2\% \times 94.8\%}{N}} = \sqrt{\frac{0.0968}{N}}$$

$$N = \frac{0.0968}{(0.00102)^2} \approx 93{,}000$$

So, having equal-sized groups of about 93,000 makes it possible to measure a 0.2
percent difference in response rates with 95 percent confidence. Of course, this
does not guarantee that the results will differ by at least 0.2 percent. It merely
says that with control and test groups of at least this size, a difference in
response rates of 0.2 percent should be measurable and statistically significant.

148 Chapter 5
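As a rough sketch, the whole calculation can be wrapped in a small Python function that solves the SEDP formula for N, assuming equal-sized groups and a two-sided z-value. The function name `group_size` is hypothetical. Because the sketch does not round intermediate values, its result can differ slightly from a hand calculation.

```python
import math
from statistics import NormalDist

def group_size(p, d, confidence=0.95):
    """Size of each group (test and control) needed to detect a difference d.

    Hypothetical helper: solves  d / z = sqrt(p(1-p)/N + (p+d)(1-p-d)/N)  for N,
    assuming both groups have the same size N.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95 percent
    target_se = d / z                                     # required standard error
    variance_terms = p * (1 - p) + (p + d) * (1 - p - d)  # numerator under the root
    return math.ceil(variance_terms / target_se ** 2)

n = group_size(p=0.05, d=0.002)
print(f"each group needs about {n:,} customers")
```

Halving the difference to be detected roughly quadruples the required group size, since N grows with the inverse square of d.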

The size of the test and control groups affects how the results can be
interpreted. However, this effect can be determined in advance, before the test.
It is worthwhile determining the acuity of the test and control groups before
running the test, to be sure that the test can produce useful results.

TIP Before running a marketing test, determine the acuity of the test by
calculating the difference in response rates that can be measured with a high
confidence (such as 95 percent).
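One way to follow this tip is to run the sample-size logic in the other direction: fix the group size and solve for the difference. The sketch below assumes equal-sized groups and the same SEDP formula used earlier in this section. Since d appears on both sides of the equation, it uses a simple fixed-point iteration; the function name `acuity` is hypothetical.

```python
import math
from statistics import NormalDist

def acuity(p, n, confidence=0.95, iterations=20):
    """Smallest difference in response rate detectable with two groups of size n.

    Hypothetical helper: repeatedly re-evaluates d = z * SEDP(p, d, n),
    starting from d = 0, until the value settles.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    d = 0.0
    for _ in range(iterations):
        se = math.sqrt((p * (1 - p) + (p + d) * (1 - p - d)) / n)
        d = z * se
    return d

for n in (10_000, 50_000, 100_000):
    print(f"groups of {n:>7,}: acuity of about {acuity(0.05, n):.3%}")
```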

Multiple Comparisons

The discussion has so far used examples with only one comparison, such as
the difference between two presidential candidates or between a test and
control group. Often, we are running multiple tests at the same time. For instance,
we might try out three different challenger messages to determine if one of
these produces better results than the business-as-usual message. Because
handling multiple tests does affect the underlying statistics, it is important to
understand what happens.

The Confidence Level with Multiple Comparisons

Consider that there are two groups that have been tested, and you are told that
the difference between the responses in the two groups is 95 percent certain to
be due to factors other than sampling variation. A reasonable conclusion is that
there is a difference between the two groups. In a well-designed test, the most
likely reason would be the difference in message, offer, or treatment.

Occam's Razor says that we should take the simplest explanation, and not
add anything extra. The simplest hypothesis for the difference in response
rates is that the difference is not significant, that the response rates are really
approximations of the same number. If the difference is significant, then we
need to search for the reason why.

Now consider the same situation, except that you are now told that there

were actually 20 groups being tested, and you were shown only one pair. Now

you might reach a very different conclusion. If 20 groups are being tested, then

you should expect one of them to exceed the 95 percent confidence bound due

only to chance, since 95 percent means 19 times out of 20. You can no longer

conclude that the difference is due to the testing parameters. Instead, because

it is likely that the difference is due to sampling variation, this is the simplest

hypothesis.
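The arithmetic behind this intuition is short enough to sketch in Python. The sketch assumes that the 20 comparisons are independent of one another, which is an assumption added here for illustration rather than something stated above.

```python
confidence = 0.95
groups = 20

# Expected number of groups that exceed the bound purely by chance.
expected_false_alarms = groups * (1 - confidence)

# Chance that at least one group exceeds the bound by chance,
# assuming the comparisons are independent.
p_at_least_one = 1 - confidence ** groups

print(f"expected chance exceedances: {expected_false_alarms:.1f}")
print(f"probability of at least one: {p_at_least_one:.1%}")
```

Under the independence assumption, seeing at least one apparently significant result among 20 is more likely than not, even when nothing real is going on.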


The confidence level is based on only one comparison. When there are
multiple comparisons, that condition is not true, so the confidence as calculated
previously is not quite sufficient.

Bonferroni's Correction

Fortunately, there is a simple correction to fix this problem, developed by the

Italian mathematician Carlo Bonferroni. We have been looking at confidence

as saying that there is a 95 percent chance that some value is between A and B.

Consider the following situation:

■ X is between A and B with a probability of 95 percent.

■ Y is between C and D with a probability of 95 percent.

Bonferroni wanted to know the probability that both of these are true.
Another way to look at it is to determine the probability that one or the other
is false. This is easier to calculate. The probability that the first is false is
5 percent, as is the probability of the second being false. The probability that
either is false is the sum, 10 percent, minus the probability that both are false
at the same time (0.25 percent, assuming the two statements are independent). So,
the probability that both statements are true is about 90 percent.
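Written out, and assuming (as the 0.25 percent figure implies) that the two statements are independent, the arithmetic is:

$$
\begin{aligned}
P(\text{either is false}) &= 0.05 + 0.05 - (0.05 \times 0.05) = 0.0975 \\
P(\text{both are true}) &= 1 - 0.0975 = 0.9025 \approx 90\%
\end{aligned}
$$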

Looking at this from the p-value perspective says that the p-value of both
statements together (10 percent) is approximated by the sum of the p-values of
the two statements separately. This is not a coincidence. In fact, it is reasonable
to approximate the p-value of any number of statements as the sum of the
p-values of each one. If we had eight variables with a 95 percent confidence,
then we would expect all eight to be in their ranges 60 percent of the time
(because 8 * 5% is a p-value of 40%).
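To see how good the sum is as an approximation, the sketch below compares it with the exact figure that would hold if the eight statements were independent. Independence is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def summed_p_value(p_values):
    """Approximate combined p-value: the sum of the individual p-values."""
    return min(1.0, sum(p_values))

def combined_p_value_if_independent(p_values):
    """Exact chance that at least one statement is false, if all are independent."""
    return 1 - math.prod(1 - p for p in p_values)

p_values = [0.05] * 8
approx = summed_p_value(p_values)
exact = combined_p_value_if_independent(p_values)

print(f"sum of p-values:       {approx:.1%}")
print(f"exact, if independent: {exact:.1%}")
```

The sum never understates the combined p-value, so it errs on the cautious side.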

Bonferroni applied this observation in reverse. If there are eight tests and we
want an overall 95 percent confidence, then the bound for the p-value needs to
be 5% / 8 = 0.625%. That is, each observation needs to be at least 99.375 percent
confident. The Bonferroni correction is to divide the desired bound for the
p-value by the number of comparisons being made, in order to get a
confidence of 1 – p for all comparisons.
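The correction itself is a single division, as this minimal sketch shows. The function name `bonferroni_bound` is hypothetical, and the final check assumes independent comparisons purely to illustrate that the overall confidence target is met.

```python
def bonferroni_bound(overall_p, comparisons):
    """Per-comparison p-value bound that keeps the overall p-value at overall_p."""
    return overall_p / comparisons

comparisons = 8
per_test_p = bonferroni_bound(0.05, comparisons)
per_test_confidence = 1 - per_test_p

# Illustration only: overall confidence if the comparisons were independent.
overall_confidence = per_test_confidence ** comparisons

print(f"per-test p-value bound: {per_test_p:.3%}")
print(f"per-test confidence:    {per_test_confidence:.3%}")
print(f"overall confidence:     {overall_confidence:.2%}")
```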

Chi-Square Test

The difference of proportions method is a very powerful method for
estimating the effectiveness of campaigns and for other similar situations. However,
there is another statistical test that can be used. This test, the chi-square test, is
designed specifically for the situation when there are multiple tests and at least
two discrete outcomes (such as response and non-response).


The appeal of the chi-square test is that it readily adapts to multiple test

groups and multiple outcomes, so long as the different groups are distinct

from each other. This, in fact, is about the only important rule when using this

test. As described in the next chapter on decision trees, the chi-square test is

the basis for one of the earliest forms of decision trees.

Expected Values

The place to start with chi-square is to lay data out in a table, as in Table 5.5.

This is a simple 2 × 2 table, which represents a test group and a control group

in a test that has two outcomes, say response and nonresponse. This table also

shows the total values for each column and row; that is, the total number of

responders and nonresponders (each column) and the total number in the test

and control groups (each row). The response column is added for reference; it

is not part of the calculation.

What if the data were broken up between these groups in a completely
unbiased way? That is, what if there really were no differences between the
columns and rows in the table? This is a completely reasonable question. We
can calculate the expected values, assuming that the total numbers of responders
and non-responders are the same as in the original data, and that the sizes of
the champion and challenger groups are also unchanged. That is, we can calculate
the expected value in each cell, given that the row and column totals are the
same as in the original data.

One way of calculating the expected values is to calculate the overall proportion
of customers in each column, by computing each of the following two
quantities, as shown in Table 5.6:

■ Proportion of everyone who responds

■ Proportion of everyone who does not respond

These proportions are then multiplied by the count for each row to obtain
the expected value. This method for calculating the expected value also works
when the tabular data has more columns or more rows.

Table 5.5 The Champion-Challenger Data Laid out for the Chi-Square Test

             RESPONDERS   NON-RESPONDERS   TOTAL       RESPONSE
Champion     43,200       856,800          900,000     4.80%
Challenger   5,000        95,000           100,000     5.00%
TOTAL        48,200       951,800          1,000,000   4.82%


Table 5.6 Calculating the Expected Values and Deviations from Expected for the Data in Table 5.5

                     ACTUAL RESPONSE                  EXPECTED RESPONSE    DEVIATION
                     YES      NO        TOTAL         YES      NO          YES    NO
Champion             43,200   856,800   900,000       43,380   856,620     -180   180
Challenger           5,000    95,000    100,000       4,820    95,180      180    -180
TOTAL                48,200   951,800   1,000,000     48,200   951,800
OVERALL PROPORTION   4.82%    95.18%

The expected value is quite interesting, because it shows how the data
would break up if there were no other effects. Notice that the expected value is
measured in the same units as each cell, typically a customer count, so it
actually has a meaning. Also, the sum of the expected values is the same as the sum
of all the cells in the original table. The table also includes the deviation, which
is the difference between the observed value and the expected value. In this
case, the deviations all have the same value, but with different signs. This is
because the original data has two rows and two columns. Later in the chapter
there are examples using larger tables where the deviations are different.
However, the deviations in each row and each column always cancel out, so
the sum of the deviations in each row and in each column is always 0.
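The expected values and deviations in Table 5.6 can be reproduced with a short Python sketch. The counts come from Table 5.5; the dictionary layout and the function name `expected_counts` are hypothetical choices made for illustration, and the same approach would apply to tables with more rows or columns.

```python
# Observed counts from Table 5.5.
observed = {
    "Champion":   {"yes": 43_200, "no": 856_800},
    "Challenger": {"yes": 5_000,  "no": 95_000},
}

def expected_counts(table):
    """Expected cell counts if every row shared the overall column proportions."""
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    grand_total = sum(row_totals.values())
    columns = list(next(iter(table.values())))
    # Overall proportion of everyone in each column (4.82% and 95.18% here).
    col_props = {
        col: sum(table[row][col] for row in table) / grand_total
        for col in columns
    }
    # Expected value = row total multiplied by the overall column proportion.
    return {
        row: {col: row_totals[row] * col_props[col] for col in columns}
        for row in table
    }

expected = expected_counts(observed)
deviation = {
    row: {col: observed[row][col] - expected[row][col] for col in observed[row]}
    for row in observed
}

for row in observed:
    print(row,
          {col: round(val) for col, val in expected[row].items()},
          {col: round(val) for col, val in deviation[row].items()})
```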
