percent). Often, there are too many values to show in a single histogram, as in
this case, where over 30 additional codes are grouped into the “other”
category.

In addition to the values for each category, this histogram also shows the

cumulative proportion of stops, whose scale is shown on the left-hand side.

Through the cumulative histogram, it is possible to see that the top three codes

account for about 50 percent of stops, and the top 10, almost 90 percent. As an

aesthetic note, the grid lines intersect both the left- and right-hand scales at

sensible points, making it easier to read values off the chart.

[Figure 5.1 chart: vertical bars give the Number of Stops (axis from 0 to
12,500) for each Stop Reason Code (TI, NO, OT, VN, PE, CM, CP, NR, MV, EX,
OTHER); a line gives the Cumulative Proportion (axis from 0% to 100%). Bar
labels: 10,048; 5,944; 4,884; 3,851; 3,549; 3,311; 3,054; 1,491; 1,306; 1,226;
1,108.]

Figure 5.1 This example shows both a histogram (as a vertical bar chart) and cumulative

proportion (as a line) on the same chart for stop reasons associated with a particular

marketing effort.
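The cumulative proportion plotted as a line in such a chart is simply a running
total of the sorted counts divided by the grand total. The following is a
minimal Python sketch of that calculation; the reason codes and counts are
hypothetical illustration values, not the data behind Figure 5.1.

```python
from itertools import accumulate

# Hypothetical stop counts by reason code (illustrative values only,
# not the data behind Figure 5.1).
stop_counts = {"AA": 500, "BB": 300, "CC": 120, "DD": 50, "EE": 30}

# Sort codes from most to least frequent, as in the bar chart.
ordered = sorted(stop_counts.items(), key=lambda item: item[1], reverse=True)
total = sum(count for _, count in ordered)

# Running total divided by the grand total gives the cumulative proportion.
running = accumulate(count for _, count in ordered)
cumulative = [
    (code, count, run / total)
    for (code, count), run in zip(ordered, running)
]

for code, count, proportion in cumulative:
    print(f"{code:>5} {count:>6,} {proportion:6.1%}")
```

Reading down the last column answers questions of the form "what share of
stops do the top few codes account for?"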

128 Chapter 5

Time Series

Histograms are quite useful and easily made with Excel or any statistics

package. However, histograms describe a single moment. Data mining is often

concerned with what is happening over time. A key question is whether the

frequency of values is constant over time.

Time series analysis requires choosing an appropriate time frame for the

data; this includes not only the units of time, but also when we start counting

from. Some different time frames are the beginning of a customer relationship,

when a customer requests a stop, the actual stop date, and so on. Different

fields belong in different time frames. For example:

- Fields describing the beginning of a customer relationship, such as
  original product, original channel, or original market, should be
  looked at by the customer's original start date.

- Fields describing the end of a customer relationship, such as last
  product, stop reason, or stop channel, should be looked at by the
  customer's stop date or the customer's tenure at that point in time.

- Fields describing events during the customer relationship, such as
  product upgrade or downgrade, response to a promotion, or a late
  payment, should be looked at by the date of the event, the customer's
  tenure at that point in time, or the relative time since some other event.

The next step is to plot the time series as shown in Figure 5.2. This figure has

two series for stops by stop date. One shows a particular stop type over time

(price increase stops) and the other, the total number of stops. Notice that the

units for the time axis are in days. Although much business reporting is done

at the weekly and monthly level, we prefer to look at data by day in order to

see important patterns that might emerge at a fine level of granularity, patterns

that might be obscured by summarization. In this case, both lines show a clear

up-and-down wiggle, which is due to a weekly cycle in stops. The lighter line,

for the price-increase-related stops, also shows a marked increase starting in

February, due to a change in pricing.

TIP When looking at field values over time, look at the data by day to get a

feel for the data at the most granular level.
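As a rough illustration of building a daily series, the Python sketch below
counts stops per calendar day, either overall or for one stop reason. The
record layout, the reason names, and the `daily_counts` helper are assumptions
made for the example.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical stop records as (stop_date, stop_reason) pairs.
stops = [
    (date(2003, 2, 3), "price"),
    (date(2003, 2, 3), "moved"),
    (date(2003, 2, 4), "price"),
    (date(2003, 2, 6), "price"),
]

def daily_counts(records, reason=None):
    """Count stops per calendar day, optionally for a single stop reason."""
    counts = Counter(d for d, r in records if reason is None or r == reason)
    first = min(d for d, _ in records)
    last = max(d for d, _ in records)
    days = (last - first).days + 1
    # Include days with no stops so that gaps appear as zeros in the series.
    return [
        (first + timedelta(n), counts.get(first + timedelta(n), 0))
        for n in range(days)
    ]

overall = daily_counts(stops)
price = daily_counts(stops, reason="price")
```

Filling in zero-count days matters when plotting: without them, a quiet day
silently disappears from the time axis.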

A time series chart has a wealth of information. For example, fitting a line to

the data makes it possible to see and quantify long term trends, as shown in

Figure 5.2. Be careful when doing this, because of seasonality. Partial years

might introduce inadvertent trends, so include entire years when using a best-

fit line. The trend in this figure shows an increase in stops. This may be nothing

to worry about, especially since the number of customers is also increasing

over this period of time. This suggests that a better measure would be the stop

rate, rather than the raw number of stops.
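The difference between a trend in raw stops and a trend in the stop rate can be
sketched in Python as follows. The daily figures are hypothetical and cover
only a few days for brevity; as noted above, a real fit should span entire
years. The `fit_line` helper is an ordinary least-squares fit written for this
example.

```python
from statistics import mean

def fit_line(values):
    """Least-squares slope and intercept for values seen on days 0, 1, 2, ..."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

# Hypothetical daily stops and active customers; both grow over time.
daily_stops = [100, 104, 108, 112, 116]
active_customers = [10000, 10400, 10800, 11200, 11600]

stops_slope, _ = fit_line(daily_stops)

# Stop rate: stops divided by the number of customers who could have stopped.
stop_rates = [s / c for s, c in zip(daily_stops, active_customers)]
rate_slope, _ = fit_line(stop_rates)
```

In this made-up data the raw count rises by four stops per day while the stop
rate stays flat, which is the situation the text warns about.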

The Lure of Statistics: Data Mining Using Familiar Tools 129

[Figure 5.2 chart: two daily time series plotted by month from May through
June of the following year, labeled “price complaint stops” and “overall stops
by day”, with the annotation “best fit line shows increasing trend in overall
stops”.]

Figure 5.2 This chart shows two time series plotted with different scales. The dark line is

for overall stops; the light line for pricing related stops shows the impact of a change in

pricing strategy at the end of January.

Standardized Values

A time series chart provides useful information. However, it does not give an

idea as to whether the changes over time are expected or unexpected. For this,

we need some tools from statistics.

One way of looking at a time series is as a partition of all the data, with a little

bit on each day. The statistician now wants to ask a skeptical question: “Is it

possible that the differences seen on each day are strictly due to chance?” This

is the null hypothesis, which is tested by calculating the p-value: the

probability of seeing variation at least this large if chance alone were at work.

Statisticians have been studying this fundamental question for over a

century. Fortunately, they have also devised some techniques for answering it.

This is a question about sample variation. Each day represents a sample of

stops from all the stops that occurred during the period. The variation in stops

observed on different days might simply be due to an expected variation in

taking random samples.

There is a basic theorem in statistics, called the Central Limit Theorem,

which says the following:

As more and more samples are taken from a population, the distribution of the

averages of the samples (or a similar statistic) follows the normal distribution.

The average (what statisticians call the mean) of the samples comes arbitrarily

close to the average of the entire population.
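A small simulation can make the theorem concrete. The Python sketch below
draws repeated samples from a deliberately non-normal population and looks at
the averages of those samples. The population, the sample size, and the number
of samples are arbitrary choices made for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

# A deliberately non-normal population: mostly small values with a long tail.
population = [random.expovariate(1.0) for _ in range(20_000)]

def sample_means(data, sample_size, n_samples):
    """Average of each of n_samples random samples drawn from data."""
    return [mean(random.sample(data, sample_size)) for _ in range(n_samples)]

means = sample_means(population, sample_size=100, n_samples=1_000)

print(f"population mean:        {mean(population):.3f}")
print(f"mean of sample means:   {mean(means):.3f}")
print(f"spread of sample means: {stdev(means):.3f}")
```

Although the population itself is heavily skewed, the sample averages cluster
tightly and roughly symmetrically around the population average.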


The Central Limit Theorem is actually a very deep theorem and quite
interesting. More importantly, it is useful. In the case of discrete variables,
such as the number of customers who stop on each day, the same idea holds. The
statistic used for this example is the count of the number of stops on each
day, as shown earlier in Figure 5.2. (Strictly speaking, it would be better to
use a proportion, such as the ratio of stops to the number of customers; this
is equivalent to the count for our purposes with the assumption that the number
of customers is constant over the period.)

The normal distribution is described by two parameters, the mean and the

standard deviation. The mean is the average count for each day. The standard

deviation is a measure of the extent to which values tend to cluster around the

mean and is explained more fully later in the chapter; for now, using a function

such as STDEV() in Excel or STDDEV() in SQL is sufficient. For the time series,

the standard deviation is the standard deviation of the daily counts. Assuming

that the values for each day were taken randomly from the stops for the entire

period, the set of counts should follow a normal distribution. If they don't

follow a normal distribution, then something besides chance is affecting the

values. Notice that this does not tell us what is affecting the values, only that

the simplest explanation, sample variation, is insufficient to explain them.

This is the motivation for standardizing time series values. This process
produces the number of standard deviations from the average:

- Calculate the average value for all days.

- Calculate the standard deviation for all days.

- For each value, subtract the average and divide by the standard deviation
  to get the number of standard deviations from the average.
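The three steps above can be sketched in a few lines of Python. The daily
counts are hypothetical, and the `standardize` helper is a name chosen for this
example. It uses the sample standard deviation, which is what Excel's STDEV()
computes.

```python
from statistics import mean, stdev

def standardize(values):
    """Return each value's distance from the average, in standard deviations."""
    average = mean(values)
    spread = stdev(values)  # sample standard deviation, like Excel's STDEV()
    return [(value - average) / spread for value in values]

daily_stops = [90, 100, 110, 100, 150, 95, 105]  # hypothetical daily counts
z_values = standardize(daily_stops)

for count, z in zip(daily_stops, z_values):
    print(f"{count:>4} {z:+.2f}")
```

By construction, the standardized values have an average of 0 and a standard
deviation of 1, so the unusually busy day stands out as the largest positive
value.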

The purpose of standardizing the values is to test the null hypothesis. When
true, the standardized values should follow the normal distribution (with an
average of 0 and a standard deviation of 1), exhibiting several useful
properties. First, the standardized value should take on negative values and
positive values with about equal frequency. Also, when standardized, about
two-thirds (68.3 percent) of the values should be between -1 and 1. A bit over
95 percent of the values should be between -2 and 2. And values over 3 or less
than -3 should be very, very rare, probably not visible in the data. Of course,
“should” here means that the values are following the normal distribution and
the null hypothesis holds (that is, all time-related effects are explained by
sample variation). When the null hypothesis does not hold, it is often apparent
from the standardized values. The aside, “A Question of Terminology,” talks a
bit more about distributions, normal and otherwise.
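These properties are easy to check numerically. The Python sketch below
computes the expected share of values within one, two, and three standard
deviations of the mean under the normal distribution, and compares them with
the observed shares in a set of z-values. The z-values here are simulated from
a normal distribution as a stand-in for real standardized daily counts, and
`share_within` is a helper name chosen for the example.

```python
import random
from statistics import NormalDist

def share_within(z_values, limit):
    """Fraction of standardized values lying between -limit and +limit."""
    return sum(1 for z in z_values if -limit <= z <= limit) / len(z_values)

# Shares expected under the standard normal distribution.
standard_normal = NormalDist()
expected = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}

# Simulated z-values standing in for standardized daily counts
# in the case where the null hypothesis holds.
random.seed(1)
z_values = [random.gauss(0, 1) for _ in range(10_000)]

for k in (1, 2, 3):
    observed = share_within(z_values, k)
    print(f"within ±{k}: observed {observed:.3f}, expected {expected[k]:.3f}")
```

With real data, observed shares that fall well short of the expected ones are
a sign that something other than sample variation is at work.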

Figure 5.3 shows the standardized values for the data in Figure 5.2. The first

thing to notice is that the shape of the standardized curve is very similar to the

shape of the original data; what has changed is the scale on the vertical

dimension. When comparing two curves, the scales for each change. In the previous


figure, overall stops were much larger than pricing stops, so the two were

shown using different scales. In this case, the standardized pricing stops are

towering over the standardized overall stops, even though both are on the

same scale.

The overall stops in Figure 5.3 are pretty typically normal, with the
following caveats. There is a large peak in December, which probably needs to
be explained because the value is over four standard deviations away from the
average. Also, there is a strong weekly cycle. It would be a good idea to
repeat this chart using weekly stops instead of daily stops, to see the
variation on the weekly level.
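Rolling daily counts up to weeks is a small grouping step. The Python sketch
below groups by ISO year and week number, which is one reasonable convention
among several; the daily data and the `weekly_totals` helper are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_totals(daily):
    """Sum (date, count) pairs by ISO year and ISO week number."""
    totals = defaultdict(int)
    for day, count in daily:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week)] += count
    return dict(sorted(totals.items()))

# Hypothetical daily counts with a repeating seven-day pattern,
# starting on a Monday so that each ISO week is complete.
start = date(2003, 6, 2)
daily = [(start + timedelta(n), 10 + (n % 7)) for n in range(14)]

weekly = weekly_totals(daily)
```

Because each total covers one full cycle, the within-week wiggle cancels out
and the weekly series can then be standardized in the same way as the daily
one.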

The lighter line showing the pricing related stops clearly does not follow the

normal distribution. Many more values are negative than positive. The peak is

at over 13, which is way, way too high.

Standardized values, or z-values as they are often called, are quite useful. This

example has used them for looking at values over time to see whether the values

look like they were taken randomly on each day; that is, whether the variation

in daily values could be explained by sampling variation. On days when

the z-value is relatively high or low, we are suspicious that something else

is at work, that there is some other factor affecting the stops. For instance, the

peak in pricing stops occurred because there was a change in pricing. The effect

is quite evident in the daily z-values.

The z-value is useful for other reasons as well. For instance, it is one way of

taking several variables and converting them to similar ranges. This can be

useful for several data mining techniques, such as clustering and neural

networks. Other uses of the z-value are covered in Chapter 17, which discusses

data transformations.
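As a brief illustration of putting several variables on similar ranges, the
Python sketch below standardizes two customer fields that are measured on very
different scales. The field names and values are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    average, spread = mean(values), stdev(values)
    return [(v - average) / spread for v in values]

# Hypothetical customer fields on very different scales.
customers = {
    "tenure_days": [30, 400, 1200, 75, 800],
    "monthly_spend": [19.99, 49.99, 24.99, 99.99, 39.99],
}

# After standardizing, both fields are expressed in standard deviations,
# so neither dominates a distance calculation simply because of its units.
standardized = {field: z_scores(values) for field, values in customers.items()}
```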

[Figure 5.3 chart: vertical axis “Standard Deviations from Mean (Z-Value)”,
ranging from -2 to 14; horizontal axis by month, starting in May.]