to summarize the data.

One way is to transform the irregular time series into regular time series and then to pivot the series. For instance, calculate the number of calls per month or the amount withdrawn from ATMs each month, and then pivot the sums by month. When working with transactions, these calculations can be more complex, such as the number of calls longer than 10 minutes or the number of withdrawals less than $50. These specialized summaries can be quite useful. More complicated examples that describe customer behavior are provided just after the next section.
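As a sketch of this kind of pivoting, the following uses pandas on a tiny hypothetical call log (the column names, thresholds, and data are all illustrative, not from the text) to produce one row per customer with monthly counts, including a specialized summary for calls longer than 10 minutes:

```python
# Sketch: turning an irregular transaction series into regular monthly
# summaries. Column names ("customer_id", "call_date", "duration_min")
# and the 10-minute threshold are hypothetical.
import pandas as pd

calls = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "call_date": pd.to_datetime(
        ["2004-01-05", "2004-01-20", "2004-02-03", "2004-01-11", "2004-02-25"]),
    "duration_min": [3.0, 12.5, 8.0, 15.0, 2.0],
})

# Regularize: assign each transaction to a calendar month.
calls["month"] = calls["call_date"].dt.to_period("M")

# Summarize per customer per month: number of calls, plus a
# specialized summary (number of calls longer than 10 minutes).
monthly = calls.groupby(["customer_id", "month"]).agg(
    n_calls=("duration_min", "size"),
    n_long_calls=("duration_min", lambda d: (d > 10).sum()),
)

# Pivot months into columns: one row per customer.
signature = monthly.unstack("month", fill_value=0)
print(signature)
```

The same pattern extends to sums (amount withdrawn per month) or other conditional counts (withdrawals under $50) by changing the aggregation.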

Another approach is to define a set of data transformations that are run on the transactional data as it is being collected. This is an approach taken in the telecommunications industry, where the volume of data is vast. Some variables may be as simple as minutes of use; others may be as complex as a score for whether the calling number is a business or a residence. This approach hard-codes the calculations, and such calculations are hard to change. Although such variables can be useful, a more flexible environment for summarizing transactional data is strategically more useful.

Summarizing Fields across the Model Set

The last method for deriving variables is summarizing values across fields in the customer signature itself. There are several examples of such fields:

–– Binning values into equal-sized bins requires calculating the breakpoints for the bins.

–– Standardizing a value (subtracting the mean and dividing by the standard deviation) requires calculating the mean and standard deviation for the field and then doing the calculation.

Preparing Data for Mining 575

–– Ranking a value (so the smallest value has a value of 1, the second smallest 2, and so on) requires sorting all the values to get the ranking.

Although these are complicated operations, they are performed directly on the model set. Data mining tools provide support for these operations, especially for binning numeric values, which is the most important of the three.
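The three operations can be sketched on a single numeric field in plain Python (the values are illustrative, and real tools handle ties and bin edges in various ways):

```python
# Sketch of the three model-set-wide summaries: equal-sized binning,
# standardization, and ranking, on one numeric field. Values are made up.
import statistics

values = [12.0, 45.0, 7.0, 30.0, 22.0, 45.0, 3.0, 18.0]

# Standardizing: subtract the mean, divide by the standard deviation.
mean = statistics.mean(values)
stdev = statistics.stdev(values)
standardized = [(v - mean) / stdev for v in values]

# Ranking: smallest value gets rank 1, next smallest rank 2, and so on.
order = sorted(range(len(values)), key=lambda i: values[i])
ranks = [0] * len(values)
for rank, i in enumerate(order, start=1):
    ranks[i] = rank

# Equal-sized bins: breakpoints chosen so each bin holds roughly the
# same number of records (here, quartile breakpoints from sorted order).
n_bins = 4
sorted_vals = sorted(values)
breaks = [sorted_vals[len(values) * k // n_bins] for k in range(1, n_bins)]
bins = [sum(v >= b for b in breaks) for v in values]  # bin index 0..3
```

Note that all three require a pass over the whole model set (to get the mean, the sort order, or the breakpoints) before the per-row value can be assigned.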

One type of binning that would be very useful is not readily available: binning codes based on frequency. That is, it would be useful to keep all codes that have at least, say, 1,000 instances in the model set and to place all other codes in a single "other" category. This is useful for working with outliers, such as the many old and unpopular handsets that show up in mobile telephone data although few customers use them. One way to handle this is to identify the handsets to keep and to add a new field, "handset for analysis," that keeps these handsets and places the rest into an "other" category. A more automated way is to create a lookup table to map the handsets. However, perhaps a better way is to replace the handset ID itself with information such as the date the handset was released, its weight, and the features it uses, information that is probably available in a lookup table already.
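Frequency-based binning of codes is easy to sketch; the handset IDs and the threshold below are made up for illustration:

```python
# Sketch: keep any code with at least `min_count` instances in the
# model set; map the rest to a single "other" category.
# Handset IDs and the threshold are illustrative, not from real data.
from collections import Counter

def bin_by_frequency(codes, min_count):
    counts = Counter(codes)
    keep = {code for code, n in counts.items() if n >= min_count}
    return [code if code in keep else "other" for code in codes]

handsets = ["N95"] * 5 + ["RAZR"] * 4 + ["OLD-1"] * 1 + ["OLD-2"] * 2
binned = bin_by_frequency(handsets, min_count=3)
```

In practice the threshold would be something like 1,000 instances, as suggested above, rather than 3.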

Examples of Behavior-Based Variables

The real power of derived variables comes from the ability to summarize customer behaviors along known dimensions. This section builds on the ideas already presented and gives three examples of useful behavior-based variables.

Frequency of Purchase

Once upon a time, catalogers devised a clever method for characterizing customer behavior using three dimensions: recency, frequency, and monetary value. RFM, which relies on these three variables, has been used at least since the 1970s. Of these three descriptions of customer behavior, recency is usually the most predictive, but frequency is the most interesting. Recency simply means the length of time since a customer made a purchase. Monetary value is traditionally the total amount purchased (although we have found the average purchase value more useful, since the total is highly correlated with frequency).

In traditional RFM analysis, frequency is just the number of purchases. However, a simple count does not do a good job of characterizing customer behavior. There are other approaches to determining frequency, and these can be applied to other areas not related to catalog purchasing: frequency of complaints, frequency of making international telephone calls, and so on. The important point is that customers may perform an action at irregular intervals, and we want to characterize this behavior pattern because it provides potentially useful information about customers.


One method of calculating frequency would be to take the length of time covered by the historical data and divide it by the number of times the customer made a purchase. So, if the catalog data goes back 6 years and a customer made a single purchase, then that frequency would be once every 6 years.

Although simple, this approach misses an important point. Consider two customers:

–– John made a purchase 6 years ago and has received every catalog since then.

–– Mary made a purchase last month when she first received the catalog.

Does it make sense that both these customers have the same frequency? No. John clearly has a frequency of no more than once every 6 years. Mary only had the opportunity to make one purchase in the past month, so her frequency would more accurately be described as once per month. The first point about frequency is that it should be measured from the first point in time that a customer had an opportunity to make a purchase.

There is another problem. What we really know about John and Mary is that their frequencies are no more than once every 6 years and no more than once per month, respectively. A single historical observation does not contain enough information to deduce a real frequency. This is really a time-to-event problem, such as those discussed in Chapter 12.

Our goal here is to characterize frequency as a derived variable, rather than to predict the next event (which is best approached using survival analysis). To do this, let's assume that there are two or more events, so the average time between events is the total span of time divided by the number of events minus one, as shown in Figure 17.14. This provides the average time between events for the period when the events occurred.

There is no perfect solution to the question of frequency, because customer events occur irregularly and we do not know what will happen in the future; the data is censored. Taking the time span from the first event to the most recent event runs into the problem that customers whose events all took place long ago may have a high frequency. The alternative is to take the time since the first event, in essence pretending that the present is an event. This is unsatisfying, because the next event is not known, and care must be taken when working with censored data. In practice, dividing the number of events by the total span of time since the first event could have happened (or by the span when the customer was active) is the best solution.
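Both measures can be sketched for a hypothetical customer with three purchases (the event times and the current time are made up for illustration; times are months since first contact):

```python
# Sketch of two frequency measures for irregular events.
# Times are months since first contact; the data is illustrative.
purchases = [2.0, 9.0, 14.0]   # months at which each purchase occurred
current = 20.0                  # the present; first contact is time 0

# Average time between events, using only the span when events occurred:
# total span divided by (number of events - 1).
avg_gap = (purchases[-1] - purchases[0]) / (len(purchases) - 1)

# The recommended rate: number of events divided by the total span of
# time since the first event could have happened (first contact to now).
rate = len(purchases) / current
```

Here `avg_gap` is 6.0 months between purchases, while `rate` is 0.15 purchases per month; the first ignores the recent quiet period, the second accounts for it.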


[Figure 17.14: a timeline running from first contact (time 0) to the current time, with purchases at times A, B, and C; D marks the current time. Four frequency estimates are shown:

–– Frequency is 2 / (C - A), but does not include time after C.
–– Frequency is 3 / C, but does not include time after C.
–– Frequency is 3 / (D - A), but data is censored.
–– Frequency is 3 / D, but data is censored.]

Figure 17.14 There is no perfect way to estimate frequency, but these four ways are all reasonable.

Declining Usage

In telecommunications, one significant predictor of churn is declining usage: customers who use services less and less over time are more likely to leave than other customers. Customers who have declining usage are likely to have many variables indicating this:

–– Billing measures, such as recent amounts spent, are quite small.

–– Usage measures, such as recent amounts used, are quite small or always at monthly minimums.

–– Optional services recently have no usage.

–– Ratios of recent measures to older measures are less than 1, often significantly less than 1, indicating that recent usage is smaller than historical usage.

The existence of so many different measures for the same underlying behavior suggests a situation where a derived variable might be useful to capture the behavior in a single variable. The goal is to incorporate as much information as possible into a "declining usage" indicator.


TIP: When many different variables all suggest a single customer behavior, then it is likely that a derived variable that incorporates this information will do a better job for data mining.

Fortunately, mathematics provides an elegant solution, in the form of the best-fit line, as shown in Figure 17.15. The goodness of fit is described by the R2 statistic, which varies from 0 to 1, with values near 0 indicating a poor fit and values near 1 a very good one. The slope of the line provides the average rate of increase or decrease in some variable over time. In statistics, this slope is the beta coefficient of a simple linear regression and is calculated according to the following formula:

slope = sum((x - avg(x)) * (y - avg(y))) / sum((x - avg(x))^2)

To give an example of how this might be used, consider the data for the customer shown in Figure 17.15. Table 17.4 walks through the calculation for a typical customer.

Table 17.4 Example of Calculating the Slope for a Time Series

MONTH       X - AVG(X)   (X - AVG(X))^2   Y (FROM    Y - AVG(Y)   (X - AVG(X)) *
(X VALUE)                                 CUST A)                 (Y - AVG(Y))
1              -5.5          30.25         53.47        3.19        -17.56
2              -4.5          20.25         46.61       -3.67         16.52
3              -3.5          12.25         47.18       -3.10         10.84
4              -2.5           6.25         49.54       -0.74          1.85
5              -1.5           2.25         48.71       -1.57          2.35
6              -0.5           0.25         52.04        1.76         -0.88
7               0.5           0.25         48.45       -1.83         -0.91
8               1.5           2.25         54.16        3.88          5.83
9               2.5           6.25         54.47        4.19         10.47
10              3.5          12.25         53.69        3.42         11.95
11              4.5          20.25         45.93       -4.35        -19.59
12              5.5          30.25         49.10       -1.18         -6.51
TOTAL                       143                                      14.36
SLOPE                                                                 0.1004
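The same calculation can be sketched in code using the monthly values from the table. Working with unrounded intermediate values reproduces the slope shown in Figure 17.15 (0.1007); the table shows 0.1004 because its columns are rounded to two decimal places:

```python
# Sketch of the slope ("beta") calculation from Table 17.4, using the
# twelve monthly values for the customer plotted in Figure 17.15.
y = [53.47, 46.61, 47.18, 49.54, 48.71, 52.04,
     48.45, 54.16, 54.47, 53.69, 45.93, 49.10]
x = list(range(1, 13))  # months 1 through 12

avg_x = sum(x) / len(x)
avg_y = sum(y) / len(y)

# slope = sum((x - avg(x)) * (y - avg(y))) / sum((x - avg(x))^2)
numerator = sum((xi - avg_x) * (yi - avg_y) for xi, yi in zip(x, y))
denominator = sum((xi - avg_x) ** 2 for xi in x)
slope = numerator / denominator
# slope is about 0.1007; denominator is 143, matching the table's total
```

Applied to a customer's last 12 months of usage, a clearly negative slope is the single "declining usage" indicator discussed above.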


[Figure 17.15: scatter plot of twelve monthly values (roughly 46 to 54) against months 1 through 12, with the best-fit line y = 0.1007x + 49.625 and R2 = 0.0135.]

Figure 17.15 The slope of the line of best fit provides a good measure of changes over time.

This example shows a very typical use for calculating the slope: finding the slope over the previous year's usage or billing patterns. The tabular format shows the calculation in a way most suitable for a spreadsheet. However, many data mining tools provide a function to calculate beta values directly from a set of variables in a single row. When such a function is not available, it is possible to express it using more basic arithmetic functions.

Although monthly data is often the most convenient for such calculations, remember that different months have different numbers of days. This issue is particularly significant for businesses that have strong weekly cycles. Some months have five full weekends, for instance, while others have only four. Different months have between 20 and 23 working days (not including holidays). These differences can account for up to 25 percent of the difference between