. 98
( 137 .)


End Users and Desktop Tools
The end users are the final and most important component in any data ware­
house. A system that has no users is not worth building. These end users are
analysts looking for information, application developers, and business users
who act on the information.

Analysts want to access as much data as possible to discern patterns and cre­
ate ad hoc reports. They use special-purpose tools, such as statistics packages,
data mining tools, and spreadsheets. Often, analysts are considered to be the
primary audience for data warehouses.
Usually, though, there are just a few technically sophisticated people who
fall into this category. Although the work that they do is important, it is diffi­
cult to justify a large investment based on increases in their productivity. The
virtuous cycle of data mining comes into play here. A data warehouse brings

Data Warehousing, OLAP, and Data Mining 493

together data in a cleansed, meaningful format. The purpose, though, is to
spur creativity, a very hard concept to measure.
Analysts have very specific demands on a data warehouse:
The system has to be responsive. Too much of the work of analysis is in

the form of answering urgent questions in the form of ad hoc analysis
or ad hoc queries.
Data needs to be consistent across the database. That is, if a customer

started on a particular date, then the first occurrence of a product, chan­
nel, and so on should be exactly on that date.
Data needs to be consistent across time. A field that has a particular

meaning now should have the same meaning going back in time. At the
very least, differences should be well documented.
It must be possible to drill down to customer level and preferably to the

transaction level detail to verify values in the data warehouse and to
develop new summaries of customer behavior.
Analysts place a heavy load on data warehouses, and need access to consis­
tent information in a timely manner.

Application Developers
Data warehouses usually support a wide range of applications (in other
words, data marts come in many flavors). In order to develop stable and
robust applications, developers have some specific needs from the data ware­
First, the applications they are developing need to be shielded from changes
in the structure of the data warehouse. New tables, new fields, and reorganiz­
ing the structure of existing tables should have a minimal impact on existing
applications. Special application-specific views on the data help provide this
assurance. In addition, open communication and knowledge about what appli­
cations use which attributes and entities can prevent development gridlock.
Second, the developers need access to valid field values and to know what
the values mean. This is the purpose of the metadata repository, which pro­
vides documentation on the structure of the data. By setting up the application
to verify data values against expected values in the metadata, developers can
circumvent problems that often appear only after applications have rolled out.
The developers also need to provide feedback on the structure of the data
warehouse. This is one of the principle means of improving the warehouse, by
identifying new data that needs to be included in the warehouse and by fixing
problems with data already loaded. Since real business needs drive the devel­
opment of applications, understanding the needs of developers is important to
ensure that a data warehouse contains the data it needs to deliver business
494 Chapter 15

The data warehouse is going to change and applications are going to con­
tinue to use it. The key to delivering success is controlling and managing the
changes. The applications are for the end users. The data warehouse is there to
support their data needs”not vice versa.

Business Users
Business users are the ultimate devourers of information derived from the
corporate data warehouse. Their needs drive the development of applications,
the architecture of the warehouse, the data it contains, and the priorities for
Many business users only experience the warehouse through printed
reports, static online reports, or spreadsheets”basically the same way they
have been gathering information for a long time. Even these users will experi­
ence the power of having a data warehouse as reports become more accurate,
more consistent, and easier to produce.
More important, though, are the people who use the computers on their
desks and are willing to take advantage of direct access to the data warehous­
ing environment. Typically, these users access intermediate data marts to sat­
isfy the vast majority of their information needs using friendly, graphical tools
that run in their familiar desktop environment. These tools include off-the-
shelf query generators, custom applications, OLAP interfaces, and report gen­
eration tools. On occasion, business users may drill down into the central
repository to explore particularly interesting things they find in the data. More
often, they will contact an analyst and have him or her do the heavier analytic
Business users also have applications built for specific purposes. These
applications may even incorporate some of the data mining techniques dis­
cussed in previous chapters. For instance, a resource scheduling application
might include an engine that optimizes the schedule using genetic algorithms.
A sales forecasting application may have built-in survival analysis models.
When embedded in an application, the data mining algorithms are usually
quite well hidden from the end user, who cares more about the results than the
algorithms that produced them.

Where Does OLAP Fit In?
The business world has been generating automated reports to meet business
needs for many decades. Figure 15.4 shows a range of common reporting
Data Warehousing, OLAP, and Data Mining 495

capabilities. The oldest manual methods are the mainframe report-generation
tools whose output is traditionally printed on green bar paper or green
screens. These mainframe reports automate paper-based methods that pre­
ceded computers. Producing such reports is often the primary function of IS
departments. Even minor changes to the reports require modifying code that
sometimes dates back decades. The result is a lag between the time when a
user requests changes and the time when he or she sees the new information
that is measured in weeks and months. This is old technology that organiza­
tions are generally trying to move away from, except for the lowest-level
reports that summarize specific operational systems.

The source of the data is
usually legacy mainframe
systems used for operations,
but it could be a data

Using processes, often too
cumbersome to
understand and too old to
change, operational data is
extracted and summarized.

Paper-based reports from
mainframe systems are
part of the business
OLAP tools, based on multi
process. They are usually
dimensional cubes, give users
too late and too inflexible
flexible and fast access to
for decision support. Off-the-shelf query tools
data, both summarized and
provide users some access to
the data and the ability to form
their own queries.
Figure 15.4 Reporting requirements on operational systems are typically handled the
same way they have been for decades. Is this the best way?
496 Chapter 15

In the middle are off-the-shelf query generation packages that have become
popular for accessing data in the past decade. These generate queries in SQL
and can talk to local or remote data sources using a standard protocol, such as
the Open Database Connectivity (ODBC) standard. Such reports might be
embedded in a spreadsheet, accessed through the Web, or through some other
reporting interface. With a day or so of training, business analysts can usually
generate the reports that they need. Of course, the report itself is often running
as an SQL query on an already overburdened database, so response times are
measured in minutes or hours, when the queries are even allowed to run to
completion. These response times are much faster than the older report-
generation packages, but they still make it difficult to exploit the data. The
goal is to be able to ask a question and still remember the question when the
answer comes back.
OLAP is a significant improvement over ad hoc query systems, because
OLAP systems design the data structure with users in mind. This powerful
and efficient representation is called a cube, which is ideally suited for slicing
and dicing data. The cube itself is stored either in a relational database, typi­
cally using a star schema, or in a special multidimensional database that opti­
mizes OLAP operations. In addition, OLAP tools provide handy analysis
functions that are difficult or impossible to express in SQL. If OLAP tools have
one downside, it is that business users start to focus only on the dimensions of
data represented by the tool. Data mining, on the other hand, is particularly
valuable for creative thinking.
Setting up the cube requires analyzing the data and the needs of the end
users, which is generally done by specialists familiar with the data and the
tool, through a process called dimensional modeling. Although designing and
loading an OLAP system requires an initial investment, the result provides
informative and fast access to end users, generally much more helpful than the
results from a query-generation tool. Response times, once the cube has been
built, are almost always measured in seconds, allowing users to explore data
and drill down to understand interesting features that they encounter.
OLAP is a powerful enhancement to earlier reporting methods. Its power
rests on three key features:
First, a well-designed OLAP system has a set of relevant dimensions”

such as geography, product, and time”understandable to business
users. These dimensions often prove important for data mining
Second, a well-designed OLAP system has a set of useful measures rele­

vant to the business.
Third, OLAP systems allow users to slice and dice data, and sometimes

to drill down to the customer level.
Data Warehousing, OLAP, and Data Mining 497

T I P Quick response times are important for getting user acceptance of
reporting systems. When users have to wait, they may forget the question that
they asked. Interactive response times as experienced by end users should be
in the range of 3“5 seconds.

These capabilities are complementary to data mining, but not a substitute
for it. Nevertheless, OLAP is a very important (perhaps even the most impor­
tant) part of the data warehouse architecture because it has the largest number
of users.

What™s in a Cube?
A good way to approach OLAP is to think of data as a cube split into subcubes,
as shown in Figure 15.5. Although this example uses three dimensions, OLAP
can have many more; three dimensions are useful for illustrative purposes.
This example shows a typical retailing cube that has one dimension for time,
another for product, and a third for store. Each subcube contains various mea­
sures indicating what happened regarding that product in that store on that
date, such as:
Total number of items sold

Total value of the items

Total amount of discount on the items

Inventory cost of the items

The measures are called facts. As a rule of thumb, dimensions consist of cat­
egorical variables and facts are numeric. As users slice and dice the data, they
are aggregating facts from many different subcubes. The dimensions are used
to determine exactly which subcubes are used in the query.
Even a simple cube such as the one described above is very powerful.
Figure 15.6 shows an example of summarizing data in the cube to answer the
question “On how many days did a particular store not sell a particular prod­
uct?” Such a question requires using the store and product dimension to deter­
mine which subcubes are used for the query. This question only looks at one
fact, the number of items sold, and returns all the dates for which this value is
0. Here are some other questions that can be answered relatively easily:
What was the total number of items sold in the past year?

What were the year over year sales, by month, of stores in the Northeast?

What was the overall margin for each store in November? (Margin

being the price paid by the customer minus the inventory cost.)
498 Chapter 15


. 98
( 137 .)