

“their” data during the loading of the central repository.
This genre of tools is often so good at processing data that we are surprised that such tools remain embedded in IT departments and are not more generally used by data miners. Mastering Data Mining has a case study from 1998 on using one of these tools, from Ab Initio, to analyze hundreds of gigabytes of call detail records, a quantity of data that would still be challenging to analyze today.
488 Chapter 15

Central Repository
The central repository is the heart of the data warehouse. It is usually a relational database accessed through some variant of SQL.
One of the advantages of relational databases is their ability to run on powerful, scalable machines by taking advantage of multiple processors and multiple disks (see the sidebar “Background on Parallel Technology”). Most statistical and data mining packages, for instance, can run multiple processing threads at the same time. However, each thread represents one task, running on one processor. More hardware does not make any given task run faster (except when other tasks happen to be interfering with it). Relational databases, on the other hand, can take a single query and, in essence, create multiple threads all running at the same time for that one query. As a result, data-intensive applications on powerful computers often run more quickly with a relational database than with software that is not parallel-enabled, and data mining is a very data-intensive application.
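The difference between per-task threads and intra-query parallelism can be sketched in Python. This is a hypothetical illustration, not how any particular database engine is implemented: one aggregate query (think SELECT SUM(minutes)) is split into per-partition scans whose partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical call-detail records: (customer_id, call_minutes)
RECORDS = [(cid, cid % 7 + 1) for cid in range(100_000)]

def partial_sum(partition):
    # Each worker scans only its slice of the table.
    return sum(minutes for _, minutes in partition)

def parallel_total(records, workers=4):
    # Split one aggregate query into per-partition scans,
    # then combine the partial results -- the essence of
    # intra-query parallelism in a parallel database.
    size = -(-len(records) // workers)  # ceiling division
    partitions = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, partitions))

print(parallel_total(RECORDS))  # prints 399995
```

A real database engine would run the partition scans on separate processors against separate disks; the point here is only the split-scan-combine shape of the computation.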
A key component in the central repository is a logical data model, which describes the structure of the data inside a database in terms familiar to business users. The data model is often confused with the physical layout (or schema) of the database, but there is a critical difference between the two. The purpose of the physical layout is to maximize performance and to provide information to database administrators (DBAs). The purpose of the logical data model is to communicate the contents of the database to a wider, less technical audience. The business user must be able to understand the logical data model: entities, attributes, and relationships. The physical layout is an implementation of the logical data model, incorporating compromises and choices along the way to optimize performance.
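The contrast can be made concrete with a toy sketch (the entity and field names are invented for illustration): the logical model speaks in entities and relationships, while the physical layout flattens the same information into rows tuned for storage and retrieval.

```python
from dataclasses import dataclass
from typing import List

# Logical data model: entities and a relationship, in business terms.
@dataclass
class Product:
    product_code: str
    description: str

@dataclass
class Order:
    order_id: int
    products: List[Product]  # relationship: an order contains products

# Physical layout: the same information flattened for performance --
# one denormalized row per order line, as a DBA might actually store it.
def to_physical_rows(order: Order):
    return [(order.order_id, p.product_code, p.description)
            for p in order.products]

order = Order(1001, [Product("A-17", "widget"), Product("B-02", "gadget")])
print(to_physical_rows(order))
```

The business user reasons about Orders and Products; the DBA sees order-line rows. Both describe the same data.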
When embarking on a data warehousing project, many organizations feel
compelled to develop a comprehensive, enterprise-wide data model. These
efforts are often surprisingly unsuccessful. The logical data model for the data
warehouse does not have to be quite as uncompromising as an enterprise-
wide model. For instance, a conflict between product codes in the logical data
model for the data warehouse can be (but not necessarily should be) resolved
by including both product hierarchies”a decision that takes 10 minutes to
make. In an enterprise-wide effort, resolving conflicting product codes can
require months of investigations and meetings.
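Carrying both hierarchies is simple in practice. A minimal sketch, with hypothetical product codes and hierarchy names: each product records its classification in both source systems, and queries roll up along whichever hierarchy the user asks for.

```python
# Two source systems classify the same products differently.  Rather than
# force a single "correct" hierarchy, the warehouse model carries both.
PRODUCTS = {
    "A-17": {"finance_hierarchy": "durables/widgets",
             "marketing_hierarchy": "home/gadgets-and-widgets"},
    "B-02": {"finance_hierarchy": "durables/gadgets",
             "marketing_hierarchy": "home/gadgets-and-widgets"},
}

def rollup(products, hierarchy):
    # Group product codes by whichever hierarchy the user chooses.
    groups = {}
    for code, views in products.items():
        groups.setdefault(views[hierarchy], []).append(code)
    return groups

print(rollup(PRODUCTS, "marketing_hierarchy"))
```

Finance and marketing each see their own rollups over the same underlying products, and no committee had to pick a winner.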

T I P Data warehousing is a process. Be wary of any large database called a
data warehouse that does not have a process in place for updating the system
to meet end user needs. Such a data warehouse will eventually fade into
disuse, because end users' needs are likely to evolve, but the system will not.
Data Warehousing, OLAP, and Data Mining 489

Parallel technology is the key to scalable hardware, and it comes in two flavors:
symmetric multiprocessing systems (SMPs) and massively parallel processing
systems (MPPs), both of which are shown in the following figure. An SMP
machine is centered on a bus, a special network present in all computers that
connects processing units to memory and disk drives. The bus acts as a central
communication device, so SMP systems are sometimes called shared
everything. Every processing unit can access all the memory and all the disk
drives. This form of parallelism is quite popular because an SMP box supports the same applications as uniprocessor boxes, and some applications can take advantage of additional hardware with minimal changes to code. However,
SMP technology has its limitations because it places a heavy burden on the
central bus, which becomes saturated as the processing load increases.
Contention for the central bus is often what limits the performance of SMPs.
They tend to work well when they have fewer than 10 to 20 processing units.
MPPs, on the other hand, behave like separate computers connected by a
very high-speed network, sometimes called a switch. Each processing unit has
its own memory and its own disk storage. Some nodes may be specialized
for processing and have minimal disk storage, and others may be specialized
for storage and have lots of disk capacity. The bus connecting the processing
unit to memory and disk drives never gets saturated. However, one drawback is that some memory and some disk drives are now local and some are remote, a distinction that can make MPPs harder to program. Programs designed for one processor can always run on one processor in an MPP, but they require modifications to take advantage of all the hardware. MPPs are truly scalable so
long as the network connecting the processors can supply more bandwidth,
and faster networks are generally easier to design than faster buses. There are
MPP-based computers with thousands of nodes and thousands of disks.
Both SMPs and MPPs have their advantages. Recognizing this, the vendors of
these computers are making them more similar. SMP vendors are connecting
their SMP computers together in clusters that start to resemble MPP boxes. At
the same time, MPP vendors are replacing their single-processing units with
SMP units, creating a very similar architecture. However, regardless of how
powerful the hardware is, software needs to be designed to take advantage of
these machines. Fortunately, the largest database vendors have invested years
of research into enabling their products to do so.


[Figure: uniprocessor, SMP, and MPP architectures]

A simple computer follows the architecture laid out by Von Neumann. A processing unit communicates to memory and disk over a local bus. (Memory stores both data and the executable program.) The speed of the processor, bus, and memory limits performance and scalability.

The symmetric multiprocessor (SMP) has a shared-everything architecture. It expands the capabilities of the bus to support multiple processors, more memory, and a larger disk. The capacity of the bus limits performance and scalability. SMP architectures usually max out with fewer than 20 processing units.

The massively parallel processor (MPP) has a shared-nothing architecture. It introduces a high-speed network (also called a switch) that connects independent processor/memory/disk components. MPP architectures are very scalable, but fewer software packages can take advantage of all the hardware.

Parallel computers build on the basic Von Neumann uniprocessor architecture. SMP and MPP systems are scalable because more processing units, disk drives, and memory can be added to the system.

Data warehousing is a process for managing the decision-support system of
record. A process is something that can adjust to users' needs as they are clarified and change over time. A process can respond to changes in the business as
needs change over time. The central repository itself is going to be a brittle,
little-used system without the realization that as users learn about data and
about the business, they are going to want changes and enhancements on the

time scale of marketing (days and weeks) rather than on the time scale of IT.

Metadata Repository
We have already discussed metadata in the context of the data hierarchy. It can
also be considered a component of the data warehouse. As such, the metadata
repository is an often overlooked component of the data warehousing environment. The lowest level of metadata is the database schema, the physical
layout of the data. When used correctly, though, metadata is much more. It
answers questions posed by end users about the availability of data, gives
them tools for browsing through the contents of the data warehouse, and gives
everyone more confidence in the data. This confidence is the basis for new
applications and an expanded user base.
A good metadata system should include the following:
- The annotated logical data model. The annotations should explain the entities and attributes, including valid values.
- Mapping from the logical data model to the source systems.
- The physical schema.
- Mapping from the logical model to the physical schema.
- Common views and formulas for accessing the data. What is useful to one user may be useful to others.
- Load and update information.
- Security and access information.
- Interfaces for end users and developers, so they share the same description of the database.
In any data warehousing environment, each of these pieces of information is
available somewhere: in scripts written by the DBA, in email messages, in
documentation, in the system tables in the database, and so on. A metadata
repository makes this information available to the users, in a format they can
readily understand. The key is giving users access so they feel comfortable with
the data warehouse, with the data it contains, and with knowing how to use it.
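What "giving users access" means can be sketched with a toy metadata repository (the attribute names, source fields, and entries below are all invented for illustration): one record per warehouse attribute, answering the question an end user actually asks.

```python
# A toy metadata repository: one entry per warehouse attribute, recording
# its business meaning, source system, physical location, and load date.
METADATA = {
    "customer.start_date": {
        "description": "Date the customer first purchased",
        "source": "billing system, field CUST_OPEN_DT",
        "physical": "warehouse.customer.start_dt (DATE)",
        "last_loaded": "2004-01-15",
    },
}

def describe(attribute):
    # The kind of question an end user poses: "what is this field, really?"
    entry = METADATA[attribute]
    return f"{attribute}: {entry['description']} [from {entry['source']}]"

print(describe("customer.start_date"))
```

A real repository would hold thousands of such entries and expose them through browsing tools, but the content is exactly the list above: logical meaning, source mapping, physical schema, and load information, gathered in one place.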

Data Marts
Data warehouses do not actually do anything (except store and retrieve data
effectively). Applications are needed to realize value, and these often take the
form of data marts. A data mart is a specialized system that brings together
the data needed for a department or related applications. Data marts are often
used for reporting systems and slicing-and-dicing data. Such data marts
often use OLAP technology, which is discussed later in this chapter. Another

important type of data mart is an exploratory environment used for data
mining, which is discussed in the next chapter.
Not all the data in data marts needs to come from the central repository.
Often specific applications have an exclusive need for data. The real estate department, for instance, might be using geographic information in combination with data from the central repository. The marketing department might be combining zip code demographics with customer data from the central repository. The central repository only needs to contain data that is likely to be shared among different applications, so it is just one data source, usually the dominant one, for data marts.
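The marketing example can be sketched in a few lines (the field names and figures are invented): customer rows from the central repository are joined with purchased zip-level demographics, and the combined result lives only in the mart.

```python
# Customer rows pulled from the central repository...
CUSTOMERS = [
    {"customer_id": 1, "zip": "10001", "spend": 250.0},
    {"customer_id": 2, "zip": "60614", "spend": 90.0},
]

# ...and zip-level demographics bought separately by marketing,
# never loaded into the central repository.
ZIP_DEMOGRAPHICS = {
    "10001": {"median_income": 65000},
    "60614": {"median_income": 72000},
}

def build_mart(customers, demographics):
    # Join the two sources on zip code to build the mart's working table.
    return [{**c, **demographics.get(c["zip"], {})} for c in customers]

for row in build_mart(CUSTOMERS, ZIP_DEMOGRAPHICS):
    print(row)
```

The central repository supplies only the shared customer data; the demographic overlay is the mart's private enrichment.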

Operational Feedback
Operational feedback systems integrate data-driven decisions back into the operational systems. For instance, a large bank may develop cross-sell models to determine which product to offer a customer next. Such a model is a result of a data mining system. However, to be useful, this information needs to go back into the operational systems. This requires a connection back from the decision-support infrastructure into the operational infrastructure.
Operational feedback offers the capability to complete the virtuous cycle of data mining very quickly. Once a feedback system is set up, intervention is only needed for monitoring and improving it, letting computers do what they do best (repetitive tasks) and letting people do what they do best (spot interesting patterns and come up with ideas). One of the advantages of Web-based businesses is that they can, in theory, provide such feedback to their operational systems in a fully automated way.
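The cross-sell loop can be sketched end to end (the products, flags, and weights below are invented stand-ins for a real model): score each product for a customer, pick the best next offer, and write it back where the operational systems read it.

```python
# A toy cross-sell "model": weighted flags stand in for a real scoring model.
WEIGHTS = {
    "savings":  {"has_checking": 0.6, "high_balance": 0.2},
    "mortgage": {"has_checking": 0.1, "high_balance": 0.8},
}

def next_best_offer(customer_flags):
    # Score every product for this customer, then pick the highest.
    scores = {product: sum(w for flag, w in weights.items()
                           if customer_flags.get(flag))
              for product, weights in WEIGHTS.items()}
    return max(scores, key=scores.get)

# Stands in for a table the call center or Web site reads at contact time.
OPERATIONAL_TABLE = {}

def feed_back(customer_id, flags):
    # The feedback step: push the decision into the operational system.
    OPERATIONAL_TABLE[customer_id] = next_best_offer(flags)

feed_back(7, {"has_checking": True, "high_balance": True})
print(OPERATIONAL_TABLE)  # prints {7: 'mortgage'}
```

Once this loop runs on a schedule, people only monitor and refine the weights; the repetitive scoring and write-back is left to the computers.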

