
Extracting Features from a Single Value
Computationally, parsing values is a very simple operation because all the
data needed is present in a single value. Even though it is so simple, it is quite
useful, as these examples show:
- Calculating the day of the week from a date
- Extracting the credit card issuer code from a credit card number
- Taking the SCF (first three digits) of a zip code
- Determining the vehicle manufacturer code from the VIN
- Adding a flag when a field is missing

These features generally require only rudimentary operations that data mining
tools should be able to handle. Unfortunately, many statistical tools focus more
on numeric data types than on the strings, dates, and times often encountered
in business data, so string operations and date arithmetic can be difficult. In
such cases, these variables may need to be added during a preprocessing phase
or as data is extracted from data sources.
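As a rough sketch of such a preprocessing step (the book does not prescribe a particular tool here, and the sample values below are invented for illustration), these single-value features might be derived in Python:

```python
from datetime import date

def day_of_week(d):
    """Calculate the day of the week from a date."""
    return d.strftime("%A")

def scf(zip_code):
    """Take the SCF: the first three digits of a zip code."""
    return zip_code[:3]

def wmi(vin):
    """Determine the manufacturer code (WMI): the first three characters of a VIN."""
    return vin[:3]

def missing_flag(value):
    """Add a 1/0 flag indicating that a field is missing."""
    return 1 if value is None or value == "" else 0
```

For example, `day_of_week(date(2004, 1, 1))` returns "Thursday" and `scf("10036")` returns "100". Each function reads exactly one input value, which is what makes this class of derived variable so cheap to compute.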


Combining Values within a Record
As with the extraction of features from a single value, combining values within
a record is computationally simple; instead of using one variable, there are
several. Most data mining tools support adding derived variables that
combine values from several fields, particularly numeric fields. This
can be very useful for adding ratios, sums, averages, and so on. Such derived
values are often more useful for modeling purposes than the raw data because
these variables start to capture underlying customer behavior. Date fields are
often combined. Taking the difference of two dates to calculate duration is an
especially common and useful example.
It is not usually necessary to combine string fields, unless the fields are some­
how related. For instance, it might be useful to combine a “credit card payment
flag” with a “credit card type,” so there is one field representing the payment type.
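To make these derivations concrete, here is a small hypothetical sketch in Python; the field names (total_spend, cc_payment_flag, and so on) and values are invented for illustration:

```python
from datetime import date

record = {
    "start_date": date(2003, 6, 15),
    "cutoff_date": date(2004, 1, 1),
    "total_spend": 450.0,
    "num_orders": 9,
    "cc_payment_flag": "Y",
    "cc_type": "Visa",
}

# A ratio often captures behavior better than either raw value alone.
record["avg_order_value"] = record["total_spend"] / record["num_orders"]

# The difference of two dates gives a duration (here, tenure in days).
record["tenure_days"] = (record["cutoff_date"] - record["start_date"]).days

# Two related string fields combined into a single payment-type field.
record["payment_type"] = (record["cc_type"]
                          if record["cc_payment_flag"] == "Y" else "OTHER")
```

All three derived fields still come from a single record; no joins or aggregations are needed.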


Looking Up Auxiliary Information
Looking up auxiliary information is a more complicated process than the
previous two calculations. A lookup is an example of joining two tables together
(to use relational database terminology), with the simplifying assumption that
one table is big and the other is relatively small.
570 Chapter 17


When the lookup table is small enough, such as Table 17.3, which describes
the mapping between initial digits of a credit card number and the credit card
type, then a simple formula can suffice for the lookup.
The more common situation is having a secondary table or file with the
information. This table might, for instance, contain:
- Populations and median household incomes of zip codes (usefully provided for downloading for the United States by the U.S. Census Bureau at www.census.gov)
- Hierarchies for product codes
- Store type information about retail locations
Unfortunately, data mining tools do not, as a rule, make it easy to do lookups
without programming. Tools that do provide this facility, such as I-Miner from
Insightful, usually require that both tables be sorted by the field or fields used
for the lookup; an example of this is shown in Figure 17.12. This is palatable for
one such field, but cumbersome when there are many different fields to be
looked up. In general, it is easier to do these lookups outside the tool, especially
when the lookup tables and original data are both coming from databases.




Figure 17.12 Insightful Miner enables users to use and create lookup tables from the
graphical user interface.
Preparing Data for Mining 571


Sometimes, the lookup tables already exist. Other times, they must be
created as needed. For instance, one useful predictor of customer attrition is the
historical attrition rate by zip code. To add this to a customer signature
requires calculating the historical attrition rate for each zip code and then
using the result as a lookup table.
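A minimal sketch of building such a derived lookup table in Python; the customer rows (zip codes and stop flags) are invented for illustration:

```python
from collections import defaultdict

# Hypothetical customer rows: (zip_code, stopped_flag)
customers = [
    ("10036", 1), ("10036", 0), ("10036", 0),
    ("60614", 1), ("60614", 1),
]

counts = defaultdict(lambda: [0, 0])   # zip -> [stops, total]
for zip_code, stopped in customers:
    counts[zip_code][0] += stopped
    counts[zip_code][1] += 1

# The derived lookup table: historical attrition rate by zip code.
attrition_by_zip = {z: stops / total for z, (stops, total) in counts.items()}

# Adding the field to a customer signature is then a simple lookup.
signature = {"customer_id": 1, "zip_code": "10036"}
signature["zip_attrition_rate"] = attrition_by_zip[signature["zip_code"]]
```

In practice this aggregation would run over the full customer base, and the rate would be computed as of the cutoff date to avoid leaking future information into the signature.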

WARNING When using database joins to look up values in a lookup table,
always use a left outer join to ensure that no customer rows are lost in the
process! An outer join in SQL looks like:

SELECT c.*, l.value
FROM (customer c left outer join lookup l on c.code = l.code)
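The effect of the left outer join can be checked with an in-memory SQLite database; the table and column names echo the fragment above, and the data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, code TEXT);
    INSERT INTO customer VALUES (1, 'A'), (2, 'B'), (3, 'Z');
    CREATE TABLE lookup (code TEXT, value TEXT);
    INSERT INTO lookup VALUES ('A', 'alpha'), ('B', 'beta');
""")

# Customer 3 has code 'Z', which is missing from the lookup table.
# The left outer join keeps all three customers; customer 3 gets NULL.
rows = conn.execute("""
    SELECT c.customer_id, l.value
    FROM customer c LEFT OUTER JOIN lookup l ON c.code = l.code
    ORDER BY c.customer_id
""").fetchall()
```

With an inner join instead, customer 3 would silently disappear from the signature, which is exactly the loss the warning is about.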



Table 17.3 Credit Card Prefixes

CARD TYPE          PREFIX   LENGTH
MasterCard         51       16
MasterCard         52       16
MasterCard         53       16
MasterCard         54       16
MasterCard         55       16
Visa               4        13
Visa               4        16
American Express   34       15
American Express   37       15
Diners Club        300      14
Diners Club        301      14
Diners Club        302      14
Diners Club        303      14
Diners Club        304      14
Diners Club        305      14
Discover           6011     16
enRoute            2014     15
enRoute            2149     15
JCB                3        16
JCB                2131     15
JCB                1800     15
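Because Table 17.3 is so small, the lookup can be coded directly as a formula. Here is a sketch in Python; the matching rule (try longer prefixes first, and require the expected total length) is an assumption about how the table is meant to be applied:

```python
PREFIXES = [
    # (prefix, length, card type), from Table 17.3,
    # ordered so longer prefixes are tried before shorter ones.
    ("6011", 16, "Discover"),
    ("2014", 15, "enRoute"), ("2149", 15, "enRoute"),
    ("2131", 15, "JCB"), ("1800", 15, "JCB"),
    ("300", 14, "Diners Club"), ("301", 14, "Diners Club"),
    ("302", 14, "Diners Club"), ("303", 14, "Diners Club"),
    ("304", 14, "Diners Club"), ("305", 14, "Diners Club"),
    ("51", 16, "MasterCard"), ("52", 16, "MasterCard"),
    ("53", 16, "MasterCard"), ("54", 16, "MasterCard"),
    ("55", 16, "MasterCard"),
    ("34", 15, "American Express"), ("37", 15, "American Express"),
    ("4", 13, "Visa"), ("4", 16, "Visa"),
    ("3", 16, "JCB"),
]

def card_type(number):
    """Return the issuer for a card number string, or None if no rule matches."""
    for prefix, length, issuer in PREFIXES:
        if len(number) == length and number.startswith(prefix):
            return issuer
    return None
```

Notice that the ordering matters: a 16-digit number starting with 51 must match MasterCard before falling through to JCB's single-digit prefix 3.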


Pivoting Regular Time Series
Data about customers is often stored at a monthly level, where each month has a
separate row of data. For instance, billing data is often stored this way, since most
subscription-based companies bill customers once a month. This data is an exam­
ple of a regular time series, because the data occurs at fixed, defined intervals.
Figure 17.13 illustrates the process needed to put this data into a customer signa­
ture. The data must be pivoted, so values that start out in rows end up in columns.
This is generally a cumbersome process, because neither data mining tools
nor SQL makes it easy to do pivoting. Data mining tools generally require pro­
gramming for pivoting. To accomplish this, the customer file needs to be sorted
by customer ID, and the billing file needs to be sorted by the customer ID and
the billing date. Then, special-purpose code is needed to calculate the
pivoted columns. In SAS, proc TRANSPOSE is used for this purpose. The
sidebar “Pivoting Data in SQL” shows how it is done in SQL.
Most businesses store customer data on a monthly basis, usually by calendar
month. Some industries, though, show strong weekly cyclical patterns,
because customers either do or do not do things over the weekend. For
instance, Web sites might be most active during weekdays, and newspaper
subscriptions generally start on Mondays or Sundays.

Such weekly cycles interfere with the monthly data, because some months are
longer than others. Consider a Web site where most activity is on weekdays. Some
months have 20 weekdays; others have up to 23 (not including holidays). The
difference between successive months could be 15 percent, due solely to the
difference in the number of weekdays. To take this into account, divide the monthly
activity by the number of weekdays during the month, to get an “activity per
weekday.” This only makes sense, though, when there are strong weekly cycles.
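A sketch of this adjustment in Python, using the standard calendar module (holidays are ignored, as in the text):

```python
import calendar
from datetime import date

def weekdays_in_month(year, month):
    """Count the Monday-through-Friday days in a calendar month."""
    num_days = calendar.monthrange(year, month)[1]
    return sum(1 for d in range(1, num_days + 1)
               if date(year, month, d).weekday() < 5)  # 0-4 = Mon-Fri

def activity_per_weekday(monthly_activity, year, month):
    """Normalize a monthly total by the number of weekdays in that month."""
    return monthly_activity / weekdays_in_month(year, month)
```

For example, February 2004 has 20 weekdays while March 2004 has 23, which is exactly the kind of gap the normalization removes.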


CUSTOMER   MON   AMOUNT
Cust 1     Jan   $38.43
Cust 1     Feb   $41.22
Cust 1     Mar   $21.09
Cust 1     Apr   $66.02
Cust 2     Mar   $14.36
Cust 2     Apr   $9.52

CUSTOMER   JAN AMOUNT   FEB AMOUNT   MAR AMOUNT   APR AMOUNT
Cust 1     $38.43       $41.22       $21.09       $66.02
Cust 2                               $14.36       $9.52

Figure 17.13 Pivoting a field takes values stored in one or more rows for each customer
and puts them into a single row for each customer, but in different columns.
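A minimal sketch of this pivot in pure Python, using the values from Figure 17.13:

```python
# Monthly billing rows: (customer, month, amount), one row per month.
rows = [
    ("Cust 1", "Jan", 38.43), ("Cust 1", "Feb", 41.22),
    ("Cust 1", "Mar", 21.09), ("Cust 1", "Apr", 66.02),
    ("Cust 2", "Mar", 14.36), ("Cust 2", "Apr", 9.52),
]
months = ["Jan", "Feb", "Mar", "Apr"]

pivoted = {}
for customer, month, amount in rows:
    # One row (dict) per customer; months absent from the data stay None,
    # like the empty cells for Cust 2 in the figure.
    pivoted.setdefault(customer, dict.fromkeys(months))[month] = amount
```

Each customer ends up with one entry per month column, which is the shape a customer signature needs.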



PIVOTING DATA IN SQL

SQL does not have great support for pivoting data (although some databases
may have nonstandard extensions with this capability). However, when using
standard SQL it is possible to pivot data.
Assume that the data consists of billing records and that each has a sequential
billing number assigned to it. The first billing record has a “1,” the second “2,”
and so on. The following SQL fragment shows how to pivot this data:
SELECT customer_id,
sum(case when bill_seq = 1 then bill_amt end) as bill_1,
sum(case when bill_seq = 2 then bill_amt end) as bill_2,
sum(case when bill_seq = 3 then bill_amt end) as bill_3,
...
FROM billing
GROUP BY customer_id
One problem with this fragment is that different customers have different
numbers of billing periods, but the query handles only a fixed number. When a
customer has fewer billing periods than the query expects, the later periods are
filled with NULLs.
Actually, this code fragment is not generally what is needed for customer
signatures because the signature wants the most recent billing periods, such as
the last 12 or 24. For customers who are active, this is the most recent period.
However, for customers who have stopped, this requires considering their stop
date instead. The following code fragment takes this into account:
SELECT b.customer_id,
sum(case when trunc(months_between(cutoff, bill_date)) = 1
then bill_amt else 0 end) as bill_1,
sum(case when trunc(months_between(cutoff, bill_date)) = 2
then bill_amt else 0 end) as bill_2,
...
FROM billing b,
(select customer_id,
(case when status = 'ACTIVE' then sysdate
else stop_date end) as cutoff
from customer) c
WHERE b.customer_id = c.customer_id
GROUP BY b.customer_id
This code fragment does use some extensions to SQL for the date
calculations (these are expressed as Oracle functions in this example).
However, most databases have similar functions.
The above code is an example of a killer query, because it is joining a big
table (the customer table) with an even bigger table (the customer billing table)
and then doing a grouping operation. Fortunately, modern databases can take
advantage of multiple processors and multiple disks to perform this query in a
reasonable amount of time.
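The CASE-inside-SUM pattern from the first fragment can be tried directly in an in-memory SQLite database; the billing rows here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE billing (customer_id INTEGER, bill_seq INTEGER, bill_amt REAL);
    INSERT INTO billing VALUES (1, 1, 38.43), (1, 2, 41.22), (2, 1, 14.36);
""")

# Pivot with one CASE-inside-SUM per billing period; customer 2 has no
# second bill, so bill_2 comes back NULL for that row.
rows = conn.execute("""
    SELECT customer_id,
           SUM(CASE WHEN bill_seq = 1 THEN bill_amt END) AS bill_1,
           SUM(CASE WHEN bill_seq = 2 THEN bill_amt END) AS bill_2
    FROM billing
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
```

The same shape scales to bill_3, bill_4, and so on, subject to the fixed-column limitation discussed in the sidebar.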


Summarizing Transactional Records
Transactional records are an example of an irregular time series; that is, the
records can occur at any point in time. Such records are generated by customer
interactions, as is the case with:
- Automated teller machine transactions
- Telephone calls
- Web site visits
- Retail purchases

There are several challenges when working with irregular time series. First,
the transaction volumes are very, very large. Working with such voluminous
data requires sophisticated tools and powerful computers. Second, there is no
standard way of working with them. The regular time series data has a natural
