
data and can have either absolute quantities (204 times) or percentage (14.6
percent). Often, there are too many values to show in a single histogram such
as this case where there are over 30 additional codes grouped into the “other”
category.
In addition to the values for each category, this histogram also shows the
cumulative proportion of stops, whose scale is shown on the right-hand side.
Through the cumulative histogram, it is possible to see that the top three codes
account for about 50 percent of stops, and the top 10, almost 90 percent. As an
aesthetic note, the grid lines intersect both the left- and right-hand scales at
sensible points, making it easier to read values off the chart.


[Figure 5.1 chart: vertical bars of Number of Stops (axis from 0 to 12,500) by
Stop Reason Code (TI, NO, OT, VN, PE, CM, CP, NR, MV, EX, OTHER), with a line
showing Cumulative Proportion (axis from 0% to 100%). Bar labels as printed:
10,048; 5,944; 4,884; 3,851; 3,549; 3,311; 3,054; 1,491; 1,306; 1,226; 1,108.]


Figure 5.1 This example shows both a histogram (as a vertical bar chart) and cumulative
proportion (as a line) on the same chart for stop reasons associated with a particular
marketing effort.
128 Chapter 5


Time Series
Histograms are quite useful and easily made with Excel or any statistics
package. However, histograms describe a single moment. Data mining is often
concerned with what is happening over time. A key question is whether the
frequency of values is constant over time.
Time series analysis requires choosing an appropriate time frame for the
data; this includes not only the units of time, but also when we start counting
from. Some different time frames are the beginning of a customer relationship,
when a customer requests a stop, the actual stop date, and so on. Different
fields belong in different time frames. For example:
- Fields describing the beginning of a customer relationship (such as
  original product, original channel, or original market) should be
  looked at by the customer's original start date.
- Fields describing the end of a customer relationship (such as last
  product, stop reason, or stop channel) should be looked at by the
  customer's stop date or the customer's tenure at that point in time.
- Fields describing events during the customer relationship (such as
  product upgrade or downgrade, response to a promotion, or a late
  payment) should be looked at by the date of the event, the customer's
  tenure at that point in time, or the relative time since some other event.
The next step is to plot the time series as shown in Figure 5.2. This figure has
two series for stops by stop date. One shows a particular stop type over time
(price increase stops) and the other, the total number of stops. Notice that the
units for the time axis are in days. Although much business reporting is done
at the weekly and monthly level, we prefer to look at data by day in order to
see important patterns that might emerge at a fine level of granularity, patterns
that might be obscured by summarization. In this case, there is a clear up and
down wiggling pattern in both lines. This is due to a weekly cycle in stops. In
addition, the lighter line is for the price increase related stops. These clearly
show a marked increase starting in February, due to a change in pricing.

TIP When looking at field values over time, look at the data by day to get a
feel for the data at the most granular level.

A time series chart has a wealth of information. For example, fitting a line to
the data makes it possible to see and quantify long-term trends, as shown in
Figure 5.2. Be careful when doing this, because of seasonality: a partial year
might introduce an inadvertent trend, so include entire years when using a
best-fit line. The trend in this figure shows an increase in stops. This may be nothing
to worry about, especially since the number of customers is also increasing
over this period of time. This suggests that a better measure would be the stop
rate, rather than the raw number of stops.
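As a rough illustration of the point about trends and stop rates, the following Python sketch fits a best-fit line to a full year of daily stops and then to the daily stop rate. The data is simulated and entirely hypothetical (a steadily growing customer base, a constant underlying stop rate, and a weekly cycle), and the variable names are assumptions made for this example; NumPy's polyfit() is used here in place of the trend line feature in Excel.

```python
import numpy as np

# Hypothetical data: one full year of daily stops and active-customer counts.
days = np.arange(365)
rng = np.random.default_rng(0)
customers = 100_000 + 50 * days                         # customer base grows steadily
weekly_cycle = 1 + 0.2 * np.sin(2 * np.pi * days / 7)   # weekly up-and-down pattern
stops = rng.poisson(0.001 * customers * weekly_cycle)   # underlying stop rate is constant

# Least-squares best-fit line; for degree 1, polyfit returns [slope, intercept].
count_slope, count_intercept = np.polyfit(days, stops, 1)

# The stop rate removes the effect of the growing customer base.
stop_rate = stops / customers
rate_slope, rate_intercept = np.polyfit(days, stop_rate, 1)

print(f"trend in raw stops: {count_slope:+.4f} stops per day")
print(f"trend in stop rate: {rate_slope:+.2e} per day")
```

In this simulated case the raw counts trend upward only because there are more customers each day; the slope of the stop rate stays close to zero.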
The Lure of Statistics: Data Mining Using Familiar Tools 129




[Figure 5.2 chart: two daily time series running from May through June of the
following year. Annotations on the chart: "price complaint stops"; "overall
stops by day"; "best fit line shows increasing trend in overall stops".]


Figure 5.2 This chart shows two time series plotted with different scales. The dark line is
for overall stops; the light line for pricing related stops shows the impact of a change in
pricing strategy at the end of January.




Standardized Values
A time series chart provides useful information. However, it does not give an
idea as to whether the changes over time are expected or unexpected. For this,
we need some tools from statistics.
One way of looking at a time series is as a partition of all the data, with a little
bit on each day. The statistician now wants to ask a skeptical question: “Is it
possible that the differences seen on each day are strictly due to chance?” This is the
null hypothesis, which is tested by calculating the p-value, the probability
that the variation among values could be explained by chance alone.
Statisticians have been studying this fundamental question for over a
century. Fortunately, they have also devised some techniques for answering it.
This is a question about sample variation. Each day represents a sample of
stops from all the stops that occurred during the period. The variation in stops
observed on different days might simply be due to an expected variation in
taking random samples.
There is a basic theorem in statistics, called the Central Limit Theorem,
which says the following:
As more and more samples are taken from a population, the distribution of the
averages of the samples (or a similar statistic) follows the normal distribution.
The average (what statisticians call the mean) of the samples comes arbitrarily
close to the average of the entire population.
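The theorem is easy to try out with a small simulation. The sketch below is only an illustration under assumed values: the population is a hypothetical, deliberately skewed (exponential) set of numbers, and the sample size and number of samples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed population (exponential), clearly not normal.
population = rng.exponential(scale=10.0, size=100_000)

# Draw many samples (with replacement) and record the average of each one.
sample_size = 50
num_samples = 2_000
samples = rng.choice(population, size=(num_samples, sample_size))
sample_means = samples.mean(axis=1)

print(f"population mean:          {population.mean():.2f}")
print(f"mean of sample means:     {sample_means.mean():.2f}")
print(f"std dev of sample means:  {sample_means.std(ddof=1):.2f}")
print(f"population std / sqrt(n): {population.std() / np.sqrt(sample_size):.2f}")
```

Even though the population itself is far from normal, a histogram of sample_means should look bell-shaped, centered near the population average.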
The Central Limit Theorem is actually a very deep theorem and quite
interesting. More importantly, it is useful. In the case of discrete variables, such as
number of customers who stop on each day, the same idea holds. The statistic
used for this example is the count of the number of stops on each day, as
shown earlier in Figure 5.2. (Strictly speaking, it would be better to use a
proportion, such as the ratio of stops to the number of customers; this is
equivalent to the count for our purposes with the assumption that the number of
customers is constant over the period.)
The normal distribution is described by two parameters, the mean and the
standard deviation. The mean is the average count for each day. The standard
deviation is a measure of the extent to which values tend to cluster around the
mean and is explained more fully later in the chapter; for now, using a function
such as STDEV() in Excel or STDDEV() in SQL is sufficient. For the time series,
the standard deviation is the standard deviation of the daily counts. Assuming
that the values for each day were taken randomly from the stops for the entire
period, the set of counts should follow a normal distribution. If they don't
follow a normal distribution, then something besides chance is affecting the
values. Notice that this does not tell us what is affecting the values, only that
the simplest explanation, sample variation, is insufficient to explain them.
This is the motivation for standardizing time series values. This process
produces the number of standard deviations from the average:
- Calculate the average value for all days.
- Calculate the standard deviation for all days.
- For each value, subtract the average and divide by the standard deviation
  to get the number of standard deviations from the average.
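The three steps above can be sketched in a few lines of Python. The function name standardize() and the daily counts are hypothetical, made up for this illustration; statistics.stdev() computes the sample standard deviation, as STDEV() does in Excel.

```python
import statistics

def standardize(values):
    """Return how many standard deviations each value lies from the average."""
    mean = statistics.mean(values)      # step 1: average for all days
    stdev = statistics.stdev(values)    # step 2: sample standard deviation
    return [(v - mean) / stdev for v in values]   # step 3: standardize each value

# Hypothetical daily stop counts; the last day is unusually busy.
daily_stops = [100, 110, 90, 105, 95, 100, 170]
z_values = standardize(daily_stops)

for count, z in zip(daily_stops, z_values):
    print(f"{count:4d}  z = {z:+.2f}")
```

By construction, the standardized values have an average of 0 and a standard deviation of 1, whatever the scale of the original counts.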
The purpose of standardizing the values is to test the null hypothesis. When
true, the standardized values should follow the normal distribution (with an
average of 0 and a standard deviation of 1), exhibiting several useful
properties. First, the standardized value should take on negative values and positive
values with about equal frequency. Also, when standardized, about two-thirds
(68.3 percent) of the values should be between -1 and 1. A bit over
95 percent of the values should be between -2 and 2. And values over 3 or less
than -3 should be very, very rare, probably not visible in the data. Of course,
“should” here means that the values are following the normal distribution and
the null hypothesis holds (that is, all time related effects are explained by
sample variation). When the null hypothesis does not hold, it is often apparent
from the standardized values. The aside, “A Question of Terminology,” talks a
bit more about distributions, normal and otherwise.
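The proportions quoted above come directly from the normal distribution and can be checked with the error function in Python's standard library. The helper name share_within() is an assumption made for this sketch.

```python
import math

def share_within(k):
    """Share of a standard normal distribution lying between -k and +k."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

Comparing these expected shares with the shares actually observed in a set of standardized values is one quick way to see whether the null hypothesis looks plausible.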
Figure 5.3 shows the standardized values for the data in Figure 5.2. The first
thing to notice is that the shape of the standardized curve is very similar to the
shape of the original data; what has changed is the scale on the vertical
dimension. When comparing two curves, the scales for each change. In the previous
figure, overall stops were much larger than pricing stops, so the two were
shown using different scales. In this case, the standardized pricing stops are
towering over the standardized overall stops, even though both are on the
same scale.
The overall stops in Figure 5.3 look fairly normal, with the following
caveats. There is a large peak in December, which probably needs to be
explained because the value is over four standard deviations away from the
average. Also, there is a strong weekly cycle. It would be a good idea to repeat
this chart using weekly stops instead of daily stops, to see the variation at the
weekly level.
The lighter line showing the pricing related stops clearly does not follow the
normal distribution. Many more values are negative than positive. The peak is
at over 13, which is way, way too high.
Standardized values, or z-values as they are often called, are quite useful. This
example has used them for looking at values over time to see whether the
values look like they were taken randomly on each day; that is, whether the
variation in daily values could be explained by sampling variation. On days when
the z-value is relatively high or low, we are suspicious that something else
is at work, that there is some other factor affecting the stops. For instance, the
peak in pricing stops occurred because there was a change in pricing. The effect
is quite evident in the daily z-values.
The z-value is useful for other reasons as well. For instance, it is one way of
taking several variables and converting them to similar ranges. This can be
useful for several data mining techniques, such as clustering and neural
networks. Other uses of the z-value are covered in Chapter 17, which discusses
data transformations.
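As a small illustration of converting several variables to similar ranges, the sketch below standardizes each column of a table separately. The table, its columns (tenure in days, monthly bill, calls per month), and all of its values are hypothetical, invented for this example.

```python
import numpy as np

# Hypothetical customer table; columns: tenure in days, monthly bill, calls per month.
customers = np.array([
    [  30.0,  19.99,  2.0],
    [ 400.0,  49.50, 10.0],
    [ 900.0,  35.00,  4.0],
    [1500.0, 120.00, 25.0],
    [ 200.0,  22.75,  1.0],
])

# Standardize each column on its own: subtract its mean, divide by its standard deviation.
means = customers.mean(axis=0)
stdevs = customers.std(axis=0, ddof=1)   # sample standard deviation per column
z = (customers - means) / stdevs

print(np.round(z, 2))
```

After this step, tenure no longer dominates simply because it is measured in larger numbers; every column has an average of 0 and a standard deviation of 1.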


[Chart: vertical axis "Standard Deviations from Mean (Z-Value)", running from
-2 to 14; horizontal axis in months, beginning with May.]