<<

. 94
( 137 .)



>>

steps of identifying, acquiring, and cleansing the data. A well-designed corpo­
rate data warehouse is a valuable ally. Better yet, if the design of the data ware­
house includes support for data mining applications, the warehouse facilitates
and catalyzes data mining efforts. The two technologies work together to
deliver value. Data mining fulfills some of the promise of data warehousing by
converting an essentially inert source of clean and consistent data into action­
able information.
There is also a technological component to this relationship. Apart from the
ability of users to run multiple jobs at the same time, most software, including
data mining and statistical software, does not take advantage of the multiple
processors and multiple disks available on the fastest servers. Relational data­
base management systems (RDBMS), the heart of most data warehouses, are
parallel-enabled and can take advantage of all of a system™s resources for pro­
cessing a single query. Even more importantly, users do not need to be aware
of this fact, since the interface, some variant on SQL, remains the same. A data­
base running on a powerful server can be a powerful asset for processing large
amounts of data, as is the case when summarizing transactions at the customer
level.
As useful as data warehousing is, such systems are not prerequisite for data
mining and data analysis. Statisticians, actuaries, and analysts have been using
statistical packages for decades”and achieving good results with their analyses”
without the benefit of a well-designed centralized warehouse. This process can
continue to be useful. Because of the need for consistent, accurate, and timely data
to support business units, data warehousing has become increasingly important
for any kind of decision support or information analysis.
This chapter is focused on data warehousing as part of the virtuous cycle
of data mining, as a valuable and often critical component in supporting all
four phases of the cycle: identifying opportunities, analyzing data, applying
information, and measuring results. It is not a how-to guide for building a
warehouse”there are many books already devoted to that subject, and we
heartily recommend Ralph Kimball™s The Data Warehouse Toolkit (Wiley, 2002)
and Bill Inmon™s Building the Data Warehouse (Wiley, 2002).
Data Warehousing, OLAP, and Data Mining 475


The chapter starts with a discussion of the different types of data that are
available, and then discusses data warehousing requirements from the per­
spective of data mining. It then shows a typical data warehousing architecture
and variants on this theme. The chapter next turns to Online Analytic Process­
ing (OLAP), an alternative approach to the normalized data warehouse. The
final discussion covers the role of data mining in these environments. As with
much that has to do with data mining, however, the place to start is with data.


The Architecture of Data
There are many different flavors of information represented on computers.
Different levels of data represent different types of abstraction, as shown in
Figure 15.1.
Transaction data

––


Operational summary data

––


Decision-support summary data

––


Schema

––


Metadata

––


Business rules

––




What's been learned from
Business
the data
rules

Logical model and mappings to
Abstraction Level




Metadata physical layout and sources

Physical layout of the data,
Database schema
tables, fields, indexes, types
decision support
Summaries by who, what,
Summary data
where, when
operational


Operational data Who, what, where, and when

Data Size
Figure 15.1 A hierarchy of data and its descriptions helps users navigate around a data
warehouse. As data gets more abstract, it generally gets less voluminous.
476 Chapter 15


The level of abstraction is an important characteristic of data used in data
mining. In a well-designed system, it should be possible to drill down through
these levels of abstraction to obtain the base data that supports a summariza­
tion or a business rule. The lower levels of the pyramid are more voluminous
and tend to be the stuff of databases. The upper levels are smaller and tend to
be the stuff of computer programs. All these levels are important, because we
do not want to analyze the detailed data to merely produce what should
already be known.


Transaction Data, the Base Level
Every product purchased by a customer, every bank transaction, every Web
page visit, every credit card purchase, every flight segment, every package,
every telephone call is recorded in some operational system. Every time a new
customer opens an account or pays a bill, there should be a record of the trans­
action somewhere, providing information about who, what, where, when, and
how much. Such transaction-level data is the raw material for understanding
customer behavior. It is the eyes and ears of the enterprise.
Unfortunately, over time operational systems change because of changing
business needs. Fields may change their meaning over time. Important data is
simply rolled off and deleted. Change is constant, in response to the introduc­
tion of new products, expanding numbers of customers, acquisitions, reorga­
nizations, and new technology. The fact that operational data changes over
time has to be part of any robust data warehousing approach.

T I P Data warehouses need to store data so the information is compatible
over time, even when product lines change, when markets change, when

customer segments change, when business organizations change. Otherwise,

data mining is likely to pick up patterns that represent these changes, rather

than underlying customer behavior.


The amount of data gathered from transactional systems can be enormous.
A single fast food restaurant sells hundreds of thousands of meals over the
course of a year. A chain of supermarkets can have tens or hundreds of thou­
sands of transactions a day. A large bank processes millions of checks and
credit card purchases a day. Large Web sites have millions of hits each day (in
2003, Google was already handling over 250 million searches each day). A tele­
phone company has tens or even hundreds of millions of completed calls
every day. A large ad server on the Web keeps track of over a billion ad views
every day. Even with the price of disk space falling, storing all these transac­
tions requires a significant investment. For reference, it is worth remembering
that a day has 86,400 seconds, so a million transactions a day is really an aver­
age of about 12 transactions per second all day (and 250 million searches
Data Warehousing, OLAP, and Data Mining 477


amounts to close to 3,000 searches per second!)”with peaks several times
higher.
Because of the large data volumes, there is often a reluctance to store
transaction-level data in a data warehouse. From the perspective of data min­
ing, this is a shame, since the transactions best describe customer behavior.


Operational Summary Data
Operational summaries play the same role as transactions; the difference being
that operational summaries are derived from transactions. The most common
examples are billing systems, which summarize transactions, usually into
monthly or four-week bill cycles. These summaries are customer-facing and
often result in other transactions, such as bill payments. In some cases, opera­
tional summaries may include fields that are summarized to enhance the com-
pany™s understanding of its customers rather than for operational purposes.
For instance, Chapter 4 described how AT&T used call detail records to calcu­
late a “bizocity” score, indicating how businesslike a telephone number™s call­
ing pattern appears. The records of each call are discarded, but the score is
kept up to date.
There is a distinction between operational summary data and transaction
data, because summaries are for a period of time and transactions represent
events. Consider the amount paid by a subscription customer. In a billing sys­
tem, amount paid is a summary for the billing period. A payment history table
instead provides detail on every payment transaction. For most customers, the
monthly summary and payment transactions are very similar. However, two
payments might arrive during the same billing period. The more detailed pay­
ment information might be useful for insight into customer payment patterns.


Decision-Support Summary Data
Decision-support summary data is the data used for making decisions about
the business. The financial data used to run a company provides an example of
decision-support summary data; this is often considered to be the cleanest
data for decision making. Another example is the data warehouses and data
marts whose purpose is to provide a decision-support system of record at the
customer level. Maintaining decision-support summary data is the purpose of
the data warehouse.
Generally, it is a bad idea to use the same system for analytic and opera­
tional purposes, since operational purposes need to take precedence, resulting
in a system that is optimized for operations and not decision support. Finan­
cial systems are not generally designed for understanding customers, because
they are designed for accounting purposes. Making customer summaries bal­
ance exactly to the general ledger is highly complex and usually not worth the
478 Chapter 15


effort. One of the goals of data warehousing is to provide consistent defini­
tions and layouts so similar reports produce similar results, no matter which
business user is producing them or when they are produced. This chapter is
mostly concerned with this level of abstraction.
In one sense, summaries seem to destroy information as they aggregate
things. For this reason, different summaries are useful for different purposes.
Point-of-sale transactions may capture every can of sardines that goes over the
scanner, but only summaries begin to describe the shopper™s behavior in terms
of her habitual time of day to shop and the proportion of her dollars spent in
the canned food department. In this case, the customer summary seems to be
creating information.

WA R N I N G Do not expect customer-level data warehouse information to
balance exactly against financial systems (although the two systems should be
close). Although theoretically possible, such balancing can prove very difficult
and distract from the purpose of the data warehouse.




Database Schema
So far, the discussion has been on data. The structure of data is also important”
what data is stored, where it is stored, what is not stored, and so on. The side­
bar “What is a relational database?” explains the key ideas behind relational
databases, the most common systems for storing large amounts of data.
No matter how the data is stored, it is important to distinguish between two
ways of describing the storage. The physical schema describes the layout in the
technical detail needed by the underlying software. An example is the “CREATE
TABLE” statement in SQL. A logical schema, on the other hand, describes the data
in a way more accessible to end users. The two are not necessarily the same, nor
even similar, as shown in Figure 15.2.

WA R N I N G The existence of fields in a database does not mean that the data
is actually present. It is important to understand every field used for data

mining, and not to assume that a field populated correctly just because it

exists. Skepticism is your ally.


An analogy might help to understand the utility of the physical and logical
schemas. The logical schema describes things in a way that is familiar to busi­
ness users. This would be analogous to saying that a house is ranch style, with
four bedrooms, three baths, and a two-car garage. The physical schema goes
Data Warehousing, OLAP, and Data Mining 479


into more detail about how it is laid out. The foundation is reinforced concrete,
4 feet deep; the slab is 1,500 square feet; the walls are concrete block; and so on.
The details of construction, although useful and complete, may not help a fam­
ily find the right house.


Logical Model
This logical model has four entities, three
for customer-generated events and one for
COMPLAINT accounts.
ACCT_ID
COMPLAINT_CODE The logical model is intended to be
REFUND_AMOUNT understood by business users.
...


<<

. 94
( 137 .)



>>