[Chart: monthly values, Jan through Jun.]

Figure 5.3 Standardized values make it possible to compare different groups on the same chart using the same scale; this shows overall stops and price increase related stops.

132 Chapter 5

A QUESTION OF TERMINOLOGY

One very important idea in statistics is the idea of a distribution. For a discrete variable, a distribution is a lot like a histogram: it tells how often a given value occurs, as a probability between 0 and 1. For instance, a uniform distribution says that all values are equally represented. An example of a uniform distribution would occur in a business where customers pay by credit card and the same number of customers pays with American Express, Visa, and MasterCard.

The normal distribution, which plays a very special role in statistics, is an example of a distribution for a continuous variable. The following figure shows the normal (sometimes called Gaussian or bell-shaped) distribution with a mean of 0 and a standard deviation of 1. The way to read this curve is to look at areas between two points. For a value that follows the normal distribution, the probability that the value falls between two values, for example between 0 and 1, is the area under the curve. For the values of 0 and 1, the probability is 34.1 percent; this means that 34.1 percent of the time a variable that follows a normal distribution will take on a value within one standard deviation above the mean. Because the curve is symmetric, there is an additional 34.1 percent probability of being within one standard deviation below the mean, and hence a 68.2 percent probability of being within one standard deviation of the mean.
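The areas quoted above can be checked with a short Python sketch using only the standard library; the helper name is mine, not something from the book:

```python
from math import erf, sqrt

def std_normal_cdf(z: float) -> float:
    """Cumulative probability P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Area under the curve between 0 and 1: about 34.1 percent.
area_0_to_1 = std_normal_cdf(1.0) - std_normal_cdf(0.0)

# Within one standard deviation of the mean (between -1 and 1): about 68.2 percent.
area_within_1 = std_normal_cdf(1.0) - std_normal_cdf(-1.0)

print(round(area_0_to_1, 3))    # 0.341
print(round(area_within_1, 3))  # 0.683
```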

[Chart: probability density (0% to 40%) against z-values from -5 to 5.]

The probability density function for the normal distribution looks like the familiar bell-shaped curve.


The Lure of Statistics: Data Mining Using Familiar Tools 133

A QUESTION OF TERMINOLOGY (continued)

The previous paragraph showed a picture of a bell-shaped curve and called it the normal distribution. Actually, the correct terminology is density function (or probability density function). Although this terminology derives from advanced mathematical probability theory, it makes sense. The density function gives a flavor for how “dense” a variable is. We use a density function by measuring the area under the curve between two points, rather than by reading the individual values themselves. In the case of the normal distribution, the values are densest around 0 and less dense as we move away.

The following figure shows the function that is properly called the normal distribution. This form, ranging from 0 to 1, is also called a cumulative distribution function. Mathematically, the distribution function for a value X is defined as the probability that the variable takes on a value less than or equal to X. Because of the “less than or equal to” characteristic, this function always starts near 0, climbs upward, and ends up close to 1. In general, the density function provides more visual clues about what is going on with a distribution. Because density functions provide more information, they are often referred to as distributions, although that is technically incorrect.
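The distinction can be made concrete in a few lines of Python; the function names are mine, and only the standard library is used:

```python
from math import erf, exp, pi, sqrt

def normal_density(z: float) -> float:
    """Height of the bell curve at z (not a probability by itself)."""
    return exp(-z * z / 2.0) / sqrt(2.0 * pi)

def normal_distribution(z: float) -> float:
    """Cumulative distribution: P(Z <= z), climbing from 0 to 1."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(round(normal_density(0.0), 4))        # 0.3989 -- the peak near 40%
print(round(normal_distribution(-5.0), 4))  # 0.0 (starts near 0 on the left)
print(round(normal_distribution(0.0), 4))   # 0.5
print(round(normal_distribution(5.0), 4))   # 1.0 (ends near 1 on the right)
```

Probabilities come from differences of the distribution function (areas under the density), never from reading the density values directly.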

[Chart: proportion less than Z (0% to 100%) against z-values from -5 to 5.]

The (cumulative) distribution function for the normal distribution has an S-shape and is antisymmetric around the Y-axis.

From Standardized Values to Probabilities

Assuming that the standardized value follows the normal distribution makes it possible to calculate the probability that the value would have occurred by chance. Actually, the approach is to calculate the probability that something further from the mean would have occurred: the p-value. The reason the exact value is not worth asking about is that any given z-value has an arbitrarily small probability. Probabilities are defined on ranges of z-values, as the area under the normal curve between two points.

Calculating something further from the mean might mean either of two things:

- The probability of being more than z standard deviations from the mean.
- The probability of being z standard deviations greater than the mean (or alternatively z standard deviations less than the mean).

The first is called a two-tailed distribution and the second is called a one-tailed distribution. The terminology is clear in Figure 5.4, because the tails of the distributions are being measured. The two-tailed probability is always twice as large as the one-tailed probability for z-values. Hence, the two-tailed p-value is more pessimistic than the one-tailed one; that is, the two-tailed is more likely to assume that the null hypothesis is true. If the one-tailed says the probability of the null hypothesis is 10 percent, then the two-tailed says it is 20 percent. As a default, it is better to use the two-tailed probability for calculations, to be on the safe side.
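A minimal Python sketch of the two p-values (the helper names are mine) shows the doubling relationship:

```python
from math import erf, sqrt

def normsdist(z: float) -> float:
    """Cumulative standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def one_tailed_p(z: float) -> float:
    """Probability of being z or more standard deviations above the mean."""
    return 1.0 - normsdist(abs(z))

def two_tailed_p(z: float) -> float:
    """Probability of being z or more standard deviations from the mean, on either side."""
    return 2.0 * one_tailed_p(z)

# The two-tailed p-value is always twice the one-tailed one.
print(round(one_tailed_p(2.0) * 100, 1))  # 2.3
print(round(two_tailed_p(2.0) * 100, 1))  # 4.6
```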

The two-tailed p-value can be calculated conveniently in Excel, because there is a function called NORMSDIST, which calculates the cumulative normal distribution. Using this function, the two-tailed p-value is 2 * NORMSDIST(-ABS(z)). For a value of 2, the result is 4.6 percent. This means that there is a 4.6 percent chance of observing a value more than two standard deviations from the average, that is, plus or minus two standard deviations from the average. Or, put another way, there is a 95.4 percent confidence that a value falling outside two standard deviations is due to something besides chance. For a precise 95 percent confidence, a bound of 1.96 can be used instead of 2. For 99 percent confidence, the limit is 2.58. The following shows the limits on the z-value for some common confidence levels:

- 90% confidence → z-value > 1.64
- 95% confidence → z-value > 1.96
- 99% confidence → z-value > 2.58
- 99.5% confidence → z-value > 2.81
- 99.9% confidence → z-value > 3.29
- 99.99% confidence → z-value > 3.89
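These limits can be verified with the same cumulative-normal calculation (a small sketch; the function name is mine):

```python
from math import erf, sqrt

def two_tailed_confidence(z: float) -> float:
    """Confidence = 1 - two-tailed p-value, using the cumulative standard normal."""
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return 1.0 - p

# Each tabulated z limit should reproduce (approximately) its confidence level.
for conf, z in [(0.90, 1.64), (0.95, 1.96), (0.99, 2.58),
                (0.995, 2.81), (0.999, 3.29), (0.9999, 3.89)]:
    print(f"z > {z}: confidence {two_tailed_confidence(z):.4%} (target {conf:.2%})")
```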

The confidence has the property that it is close to 100 percent when the value is unlikely to be due to chance and close to 0 when it is likely due to chance. The signed confidence adds information about whether the value is too low or too high. When the observed value is less than the average, the signed confidence is negative.
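The text does not give an explicit formula here, but a signed confidence consistent with this description can be sketched as the two-tailed confidence carrying the sign of the z-value (the function name and this exact formulation are my assumptions):

```python
from math import erf, sqrt

def signed_confidence(z: float) -> float:
    """Two-tailed confidence carrying the sign of the z-value:
    negative when the observed value is below the average.
    (Assumed formulation; the book describes this only in words.)"""
    p_two_tailed = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    confidence = 1.0 - p_two_tailed
    return confidence if z >= 0 else -confidence

print(round(signed_confidence(2.0) * 100, 1))   # 95.4
print(round(signed_confidence(-2.0) * 100, 1))  # -95.4
print(round(signed_confidence(0.5) * 100, 1))   # small: likely just chance
```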


[Chart: probability density (0% to 40%) against z-values from -5 to 5. The shaded area in the upper tail is the one-tailed probability of being two or more standard deviations above average; both shaded areas together are the two-tailed probability of being two or more standard deviations from average (greater or less than).]

Figure 5.4 The tail of the normal distribution answers the question: “What is the probability of getting a value of z or greater?”

Figure 5.5 shows the signed confidence for the data shown earlier in Figures 5.2 and 5.3, using the two-tailed probability. The shape of the signed confidence is different from the earlier shapes. The overall stops bounce around, usually remaining within reasonable bounds. The pricing-related stops, though, once again show a very distinct pattern, being too low for a long time, then peaking and descending. The signed confidence levels are bounded by 100 percent and -100 percent. In this chart, the extreme values are near 100 percent or -100 percent, and it is hard to tell the difference between 99.9 percent and 99.99999 percent. To distinguish values near the extremes, the z-values in Figure 5.3 are better than the signed confidence.

[Chart: signed confidence (Q-value) on the Y-axis, from -100% to 100%.]