








Figure 5.3 Standardized values make it possible to compare different groups on the same
chart using the same scale; this shows overall stops and price increase related stops.
132 Chapter 5


One very important idea in statistics is the idea of a distribution. For a discrete
variable, a distribution is a lot like a histogram: it tells how often a given value
occurs as a probability between 0 and 1. For instance, a uniform distribution
says that all values are equally represented. An example of a uniform
distribution would occur in a business where customers pay by credit card
and the same number of customers pays with each type of card, such as
American Express and Visa.
The normal distribution, which plays a very special role in statistics, is an
example of a distribution for a continuous variable. The following figure shows
the normal (sometimes called Gaussian or bell-shaped) distribution with a
mean of 0 and a standard deviation of 1. The way to read this curve is to
look at areas between two points. For a value that follows the normal
distribution, the probability that the value falls between two values (for
example, between 0 and 1) is the area under the curve between them. For the
values of 0 and 1, the probability is 34.1 percent; this means that 34.1 percent
of the time a variable that follows a normal distribution will take on a value
within one standard deviation above the mean. Because the curve is symmetric,
there is an additional 34.1 percent probability of being within one standard
deviation below the mean, and hence a 68.2 percent probability of being within
one standard deviation of the mean.
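
The area calculation described above can be checked numerically. The following is a minimal sketch, assuming Python's standard statistics module; the helper name probability_between is ours, chosen for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def probability_between(low, high, dist=standard_normal):
    """Area under the density curve between low and high (hypothetical helper)."""
    return dist.cdf(high) - dist.cdf(low)

above = probability_between(0, 1)    # within one standard deviation above the mean
within = probability_between(-1, 1)  # within one standard deviation of the mean

print(f"{above:.1%}")   # about 34.1%
print(f"{within:.1%}")  # about 68.3%
```

The second figure comes out nearer 68.3 percent than 68.2 percent, because doubling the rounded 34.1 percent loses a little precision.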



[Figure: the normal density curve; y-axis: Probability Density; x-axis: values from -5 to 5]


The probability density function for the normal distribution looks like the familiar
bell-shaped curve.

The Lure of Statistics: Data Mining Using Familiar Tools 133

The previous paragraph showed a picture of a bell-shaped curve and called it
the normal distribution. Actually, the correct terminology is density function (or
probability density function). Although this terminology derives from advanced
mathematical probability theory, it makes sense. The density function gives a
flavor for how “dense” a variable is. We use a density function by measuring
the area under the curve between two points, rather than by reading the
individual values themselves. In the case of the normal distribution, the values
are densest around 0 and become less dense farther away from it.
The following figure shows the function that is properly called the normal
distribution. This form, ranging from 0 to 1, is also called a cumulative
distribution function. Mathematically, the distribution function for a value X is
defined as the probability that the variable takes on a value less than or equal
to X. Because of the “less than or equal to” characteristic, this function always
starts near 0, climbs upward, and ends up close to 1. In general, the density
function provides more visual clues to the human about what is going on with
a distribution. Because density functions provide more information, they are
often referred to as distributions, although that is technically incorrect.
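
The relationship between the density function and the cumulative distribution function can be illustrated by adding up area under the density and comparing the running total with the cumulative value. This is only a sketch, assuming Python's standard statistics module; the helper area_under_density, its lower cutoff of -8, and the step count are our own illustrative choices.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def area_under_density(x, lower=-8.0, steps=4000):
    """Approximate the area under the density curve from `lower` up to x
    using the trapezoid rule; `lower` stands in for minus infinity."""
    width = (x - lower) / steps
    heights = [standard_normal.pdf(lower + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))

for x in (-2, 0, 1, 2):
    approx = area_under_density(x)   # accumulated area under the bell curve
    exact = standard_normal.cdf(x)   # cumulative distribution function
    print(f"x={x:+d}  density={standard_normal.pdf(x):.4f}  "
          f"area so far={approx:.4f}  cdf={exact:.4f}")
```

The density column rises and falls like the bell curve, while the last two columns agree and climb steadily from near 0 toward 1.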

[Figure: the cumulative normal distribution; y-axis: Proportion Less Than Z; x-axis: values from -5 to 5]


The (cumulative) distribution function for the normal distribution has an S-shape and
is antisymmetric about the point where it crosses the Y-axis (a proportion of 0.5 at 0).

From Standardized Values to Probabilities
Assuming that the standardized value follows the normal distribution makes
it possible to calculate the probability that the value would have occurred by
chance. Actually, the approach is to calculate the probability that something
further from the mean would have occurred: the p-value. The exact value is
not worth asking about, because any given z-value has an arbitrarily
small probability. Probabilities are defined on ranges of z-values as the area
under the normal curve between two points.
Calculating something further from the mean might mean either of two
things:
• The probability of being more than z standard deviations from the
  mean, in either direction.
• The probability of being z standard deviations greater than the mean
  (or alternatively z standard deviations less than the mean).
The first is called a two-tailed probability and the second is called a
one-tailed probability. The terminology is clear in Figure 5.4, because the tails
of the distribution are being measured. The two-tailed probability is always
twice as large as the one-tailed probability for z-values. Hence, the two-tailed
p-value is more pessimistic than the one-tailed one; that is, the two-tailed is
more likely to assume that the null hypothesis is true. If the one-tailed says the
probability of the null hypothesis is 10 percent, then the two-tailed says it is 20
percent. As a default, it is better to use the two-tailed probability for
calculations to be on the safe side.
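
The two kinds of tail probability can be sketched in a few lines. This is an illustration rather than a definitive implementation, assuming Python's standard statistics module; the function names one_tailed_p and two_tailed_p are ours.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def one_tailed_p(z):
    """Probability of a value at least |z| standard deviations
    away from the mean on one chosen side."""
    return standard_normal.cdf(-abs(z))

def two_tailed_p(z):
    """Probability of a value at least |z| standard deviations
    away from the mean in either direction."""
    return 2 * one_tailed_p(z)

print(f"one-tailed, z=2: {one_tailed_p(2):.1%}")  # about 2.3%
print(f"two-tailed, z=2: {two_tailed_p(2):.1%}")  # about 4.6%
```

Because the normal curve is symmetric, the sketch uses the lower tail for both positive and negative z-values; the two-tailed result is simply double the one-tailed result.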
The two-tailed p-value can be calculated conveniently in Excel, because
there is a function called NORMSDIST, which calculates the cumulative
normal distribution. Using this function, the two-tailed p-value is
2 * NORMSDIST(-ABS(z)). For a value of 2, the result is 4.6 percent. This means
that there is a 4.6 percent chance of observing a value more than two standard
deviations from the average (plus or minus two standard deviations from the
average). Or, put another way, there is a 95.4 percent confidence that a value
falling outside two standard deviations is due to something besides chance. For
a precise 95 percent confidence, a bound of 1.96 can be used instead of 2. For
99 percent confidence, the limit is 2.58. The following shows the limits on the
z-value for some common confidence levels:
90% confidence → z-value > 1.64
95% confidence → z-value > 1.96
99% confidence → z-value > 2.58
99.5% confidence → z-value > 2.81
99.9% confidence → z-value > 3.29
99.99% confidence → z-value > 3.89
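
The limits in this list can be reproduced by inverting the cumulative normal distribution. The following is a sketch, assuming Python's standard statistics module and two-tailed confidence; the function name z_limit is ours.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def z_limit(confidence):
    """Smallest |z| at which the two-tailed confidence reaches the
    given level; confidence is a fraction strictly between 0 and 1."""
    tail = (1 - confidence) / 2              # probability left in each tail
    return standard_normal.inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99, 0.995, 0.999, 0.9999):
    print(f"{level:.2%} confidence -> z-value > {z_limit(level):.2f}")
```

Rounded to two decimal places, the results agree with the limits listed above.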

The confidence has the property that it is close to 100 percent when the value
is unlikely to be due to chance and close to 0 when it is. The signed confidence
adds information about whether the value is too low or too high. When the
observed value is less than the average, the signed confidence is negative.
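
One way to compute the signed confidence is sketched here, assuming that confidence means 1 minus the two-tailed p-value and that the sign is taken from the z-value. It uses Python's standard library, and the function names are ours.

```python
import math
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def confidence(z):
    """Two-tailed confidence: 1 minus the two-tailed p-value (assumed definition)."""
    return 1 - 2 * standard_normal.cdf(-abs(z))

def signed_confidence(z):
    """Confidence carrying the sign of z: negative when the observed
    value is below the average, positive when it is above."""
    return math.copysign(confidence(z), z)

for z in (-3.29, -1.0, 0.0, 2.0):
    print(f"z={z:+.2f}  signed confidence={signed_confidence(z):+.1%}")
```

A z-value of 0 gives a confidence of 0, and the result approaches 100 percent or -100 percent as the z-value moves away from the mean in either direction.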

[Figure: normal density curve with shaded tails; y-axis: Probability Density; x-axis: z from -5 to 5.
Annotation 1: Shaded area is one-tailed probability of being two or more standard deviations
above average.
Annotation 2: Both shaded areas are two-tailed probability of being two or more standard
deviations from average (greater or less than).]


Figure 5.4 The tail of the normal distribution answers the question: “What is the
probability of getting a value of z or greater?”

Figure 5.5 shows the signed confidence for the data shown earlier in Figures
5.2 and 5.3, using the two-tailed probability. The shape of the signed
confidence is different from the earlier shapes. The overall stops bounce around,
usually remaining within reasonable bounds. The pricing-related stops,
though, once again show a very distinct pattern, being too low for a long time,
then peaking and descending. The signed confidence levels are bounded by
100 percent and -100 percent. In this chart, the extreme values are near 100
percent or -100 percent, and it is hard to tell the difference between 99.9 percent
and 99.99999 percent. To distinguish values near the extremes, the z-values in
Figure 5.3 are better than the signed confidence.



[Figure: chart with y-axis Signed Confidence]



