. 101
( 137 .)


B New York/NJ 000005 N 000005 Sun

C Mid Atlantic
D North Central
E Southeast
Figure 15.9 A star schema looks more like this. Dimension tables are conceptually nested,
and there may be more than one dimension table for a given dimension.
Data Warehousing, OLAP, and Data Mining 507

In practice, star schemas may not be efficient for answering all users™ ques­
tions, because the central fact table is so large. In such cases, the OLAP systems
introduce summary tables at different levels to facilitate query response. Rela­
tional database vendors have been providing more and more support for star
schemas. With a typical architecture, any query on the central fact table would
require multiple joins back to the dimension tables. By applying standard
indexes, and creatively enhancing indexing technology, relational databases
can handle these queries quite well.

OLAP and Data Mining
Data mining is about the successful exploitation of data for decision-support
purposes. The virtuous cycle of data mining, described in Chapter 2, reminds
us that success depends on more than advanced pattern recognition algo­
rithms. The data mining process needs to provide feedback to people and
encourage using information gained from data mining to improve business
processes. The data mining process should enable people to provide input, in
the form of observations, hypotheses, and hunches about what results are
important and how to use those results.
In the larger context of data exploitation, OLAP clearly plays an important
role as a means of broadening the audience with access to data. Decisions once
made based on experience and educated guesses can now be based on data
and patterns in the data. Anomalies and outliers can be identified for further
investigation and further modeling, sometimes using the most sophisticated
data mining techniques. For instance, a user might discover that a particular
item sells better at a particular time during the week through the use of an
OLAP tool. This might lead to an investigation using market basket analysis to
find other items purchased with that item. Market basket analysis might sug­
gest an explanation for the observed behavior”more information and more
opportunities for exploiting the information.
There are other synergies between data mining and OLAP. One of the char­
acteristics of decision trees discussed in Chapter 6 is their ability to identify the
most informative features in the data relative to a particular outcome. That is,
if a decision tree is built in order to predict attrition, then the upper levels of
the tree will have the features that are the most important predictors for attri­
tion. Well, these predictors might be a good choice for dimensions using an
OLAP tool. Such analysis helps build better, more useful cubes. Another prob­
lem when building cubes is determining how to make continuous dimensions
discrete. The nodes of a decision tree can help determine the best breaking
point for a continuous value. This information can be fed into the OLAP tool
to improve the dimension.
508 Chapter 15

One of the problems with neural networks is the difficulty of understanding
the results. This is especially true when using them for undirected data min­
ing, as when using SOM networks to detect clusters. The SOM identifies clus­
ters, but cannot explain what the clusters mean.
OLAP to the rescue! The data can now be enhanced with a predicted clus­
ter, as well as with other information about customers, such as demographics,
purchase history, and so on. This is a good application for a cube. Using
OLAP”with information about the clusters included as a dimension”makes
it possible for end users to explore the clusters and to determine features that
distinguish them. The dimensions used for the OLAP cube should include the
inputs to the SOM neural network, along with the cluster identifier, and per­
haps other descriptive variables. There is a tricky data conversion problem
because the neural networks require continuous values scaled between “1 and
1, and OLAP tools prefer discrete values. For values that were originally dis­
crete, this is no problem. For continuous values, various binning techniques
solve the problem.
As these examples show, OLAP and data mining complement each other.
Data mining can help build better cubes by defining appropriate dimensions,
and further by determining how to break up continuous values on dimen­
sions. OLAP provides a powerful visualization capability to help users better
understand the results of data mining, such as clustering and neural networks.
Used together, OLAP and data mining reinforce each other™s strengths and
provide more opportunities for exploiting data.

Where Data Mining Fits in with Data Warehousing
Data mining plays an important role in the data warehouse environment. The
initial returns from a data warehouse come from automating existing
processes, such as putting reports online and giving existing applications a
clean source of data. The biggest returns are the improved access to data that
can spur innovation and creativity”and these come from new ways of look­
ing at and analyzing data. This is the role of data mining”to provide the tools
that improve understanding and inspire creativity based on observations in
the data.
A good data warehousing environment serves as a catalyst for data mining.
The two technologies work together as partners:
Data mining thrives on large amounts of data and the more detailed the

data, the better”data that comes from a data warehouse.
Data mining thrives on clean and consistent data”capitalizing on the

investment in data cleansing tools.
Data Warehousing, OLAP, and Data Mining 509

The data warehouse environment enables hypothesis testing and sim­

plifies efforts to measure the effects of actions taken”enabling the vir­
tuous cycle of data mining.
Scalable hardware and relational database software can offload the data

processing parts of data mining.
There is, however, a distinction between the way data mining looks at the
world and the way data warehousing does. Normalized data warehouses
can store data with time stamps, but it is very difficult to do time-related
manipulations”such as determining what event happened just before some
other event of interest. OLAP introduces a time dimension. Data mining
extends this even further by taking into account the notion of “before” and
“after.” Data mining learns from data (the “before”), with the purpose of
applying these findings to the future (the “after”). For this reason, data mining
often puts a heavy load on data warehouses. These are complementary tech­
nologies, supporting each other as discussed in the next few sections.

Lots of Data
The traditional approach to data analysis generally starts by reducing the size
of the data. There are three common ways of doing this: summarizing detailed
transactions, taking a subset of the data, and only looking at certain attributes.
The reason for reducing the size of the data was to make it possible to analyze
the data on the available hardware and software systems. When properly
done, the laws of statistics come into play, and it is possible to choose a sample
that behaves roughly like the rest of the data.
Data mining, on the other hand, is searching for trends in the data and for
valuable anomalies. It is often trying to answer different types of questions
from traditional statistical analysis, such as “what product is this customer
most likely to purchase next?” Even if it is possible to devise a model using a
subset of data, it is necessary to deploy the model and score all customers, a
process that can be very computationally intensive.
Fortunately, data mining algorithms are often able to take advantage of
large amounts of data. When looking for patterns that identify rare events”
such as having to write-off customers because they failed to pay”having large
amounts of data ensures that there is sufficient data for analysis. A subset of
the data might be statistically relevant in total, but when you try to decompose
it into other segments (by region, by product, by customer segment), there
may be too little data to produce statistically meaningful results.
Data mining algorithms are able to make use of lots of data. Decision trees,
for example, work very well, even when there are dozens or hundreds of fields
in each record. Link analysis requires a full complement of the data to create a
510 Chapter 15

graph. Neural networks can train on millions of records at a time. And, even
though the algorithms often work on summaries of the detailed transactions
(especially at the customer level), what gets summarized can change from one
run to the next. Prebuilding the summaries and discarding the transaction
data locks you into only one view of the business. Often the first result from
using such summaries is a request for some variation on them.

Consistent, Clean Data
Data mining algorithms are often applied to gigabytes of data combined from
several different sources. Much of the work in looking for actionable informa­
tion actually takes place when bringing the data together”often 80 percent or
more of the time allocated to a data mining project is spent bringing the data
together”especially when a data warehouse is not available. Subsequent
problems, such as matching account numbers, interpreting codes, and house-
holding, further delay the analysis. Finding interesting patterns is often an
iterative process that requires going back to the data to get additional data ele­
ments. Finally, when interesting patterns are found, it is often necessary to
repeat the process on the most recent data available.
A well-designed and well-built data warehouse can help solve these prob­
lems. Data is cleaned once, when it is loaded into the data warehouse. The
meaning of fields is well defined and available through the metadata. Incor­
porating new data into analyses is as easy as finding out what data is available
through the metadata and retrieving it from the warehouse. A particular
analysis can be reapplied on more recent data, since the warehouse is kept up
to date. The end result is that the data is cleaner and more available”and that
the analysts can spend more time applying powerful tools and insights instead
of moving data and pushing bytes.

Hypothesis Testing and Measurement
The data warehouse facilitates two other areas of data mining. Hypothesis
testing is the verification of educated guesses about patterns in the data. Do
tropical colors really sell better in Florida than elsewhere? Do people tend
to make long-distance calls after dinner? Are the users of credit cards at
restaurants really high-end customers? All of these questions can be expressed
rather easily as queries on the appropriate relational database. Having the data
available makes it possible to ask questions and find out quickly what the
answers are.

T I P The ability to test hypotheses and ideas is a very important aspect of data
mining. By bringing the data together in one place, data warehouses enable
answering in-depth, complicated questions. One caveat is that such queries can
be expensive to run, falling into the killer query category.
Data Warehousing, OLAP, and Data Mining 511

Measurement is the other area where data warehouses have proven to be
very valuable. Often when marketing efforts, product improvements, and so
forth take place, there is limited feedback on the degree of success achieved. A
data warehouse makes it possible to see the results and to find related effects.
Did sales of other products improve? Did customer attrition increase? Did
calls to customer service decrease? And so on. Having the data available
makes it possible to understand the effects of an action, whether the action was
spurred by data mining results or by something else.
Of particular value in terms of measurement is the effect of various market­
ing actions on the longer-term customer relationship. Often, marketing cam­
paigns are measured in terms of response. While response is clearly a
dimension of interest, it is only one. The longer term behavior of customers is
also of interest. Did an acquisition campaign bring in good customers or did
the newly acquired customers leave before they even paid? Did an upsell cam­
paign stick, or did customers return to their previous products? Measurement
enables an organization to learn from its mistakes and to build on its successes.

Scalable Hardware and RDBMS Support
The final synergy between data mining and data warehousing is on the sys­
tems level. The same scalable hardware and software that makes it possible to
store and query large databases provides a good system for analyzing data.
Chapter 17 talks about building the customer signature. Often, the best place
to build the signature is in the central repository or, failing that, in a data mart
with similar amounts of data.
There is also the question of running data mining algorithms in parallel, tak­
ing further advantage of the powerful machines. This is often not necessary,
because actually building models represents a small part of the time devoted
to data mining”preparing the data and understanding the results are much
more important. Databases, such as Oracle and Microsoft SQL Server, are
increasingly providing support for data mining algorithms, which enables
such algorithms to run in parallel.

Lessons Learned
Data warehousing is not a system but a process that can greatly benefit data
mining and data analysis efforts. From the perspective of data mining, the
most important functionality is the ability to recreate accurate snapshots of
history. Another very important facet is support for ad hoc reporting. In order
to learn from data, you need to know what really happened.
512 Chapter 15

A typical data warehousing system contains the following components:
The source systems provide the input into the data warehouse.

The extraction , transformation, and load tools clean the data and apply

business rules so that new data is compatible with historical data.
The central repository is a relational database specifically designed to

be a decision-support system of record.
The data marts provide the interface to different varieties of users with

different needs.
The metadata repository informs users and developers about what is

inside the data warehouse.

One of the challenges in data warehousing is the massive amount of data
that must be stored, particularly if the goal is to keep all customer interactions.


. 101
( 137 .)