<<

. 99
( 137 .)



>>




Shop




t
uc
od
Date




Pr
shop = Pinewood
product = 4
date = ˜7 Mar 2004™
count = 5 Dimension columns
value = $215
Aggregate columns
discount = $32
cost = $75

Figure 15.5 The cube used for OLAP is divided into subcubes. Each subcube contains the
key for that subcube and summary information for the data falls into that subcube.


Of course, the ease of getting a report that can answer one of these questions
depends on the particular implementation of the reporting interface. How-
ever, even for ad hoc reporting, accessing the cube structure can prove much
easier than accessing a normalized relational database.

Three Varieties of Cubes
The cube described in the previous section is an example of a summary data
cube. This is a very common example in OLAP. However, not all cubes are
summary cubes. And, a data warehouse may contain many different cubes for
different purposes.
Data Warehousing, OLAP, and Data Mining 499




These subcubes
correspond to the
purchase of the




Store
same product at
one store on all
days.




t
uc
od
Pr
Date




These are some of
store = X store = X store = X store = X
product = Y product = Y product = Y product = Y
the subcubes in more
date = date = date = date =
detail.
count = 1 count = 5 count = 0 count = 1
value = $44 value = 215 value = $0 value = $44



The answer to the question is the number of subcubes where count is not
equal to 0.
Figure 15.6 On how many days did store X not sell any product Y ?


Another type of cube represents individual events. These cubes contain the
most detailed data related to customer interactions, such as calls to customer
service, payments, individual bills, and so on. The summaries are made by
aggregating events across the cube. Such event cubes typically have a cus-
tomer dimension or something similar, such as an account, Web cookie, or
household, which ties the event back to the customer. A small number of
dimensions, such as the customer ID, date, and event type are often sufficient
for identifying each subcube. However, an event cube often has several other
dimensions, which provide more detailed information and are important for
aggregating data. The facts in such a table often contain dollar amounts and
counts.
Event cubes are very powerful. Their use is limited because they rapidly
become very big”the database tables representing them can have millions,
hundreds of millions, or even billions of rows. Even with the power of OLAP
and parallel computers, such cubes require a bit of processing time for routine
queries. Nonetheless, event cubes are particularly valuable because they make
it possible to “drill down” from other cubes”to find the exact set of events
used for calculating a particular value.
500 Chapter 15


The third type of cube is a variant on the event cube. This is the factless fact
table, whose purpose is to represent the evidence that something occurred. For
instance, there might be a factless fact table that specifies the prospects
included in a direct mail campaign. Such a fact table might have the following
dimensions:
Prospect ID (perhaps a household ID)
––


Source of the prospect name
––


Target date of the mailing
––


Type of message
––


Type of creative
––

Type of offer
––


This is a case where there may not be any numeric facts to store about
the individual name. Of course, there might be interesting attributes for the
dimensions”such as the promotional cost of the offers and the cost of the
names. However, this data is available through the dimensions and hence does
not need to be repeated at the individual prospect level.
Regardless of the type of fact table, there is one cardinal rule: any particular
item of information should fall into exactly one subcube. When this rule is vio­
lated, the cube cannot easily be used to report on the various dimensions. A
corollary of this rule is that when an OLAP cube is being loaded, it is very
important to keep track of any data that has unexpected dimensional values.
Every dimension should have an “other” category to guarantee that all data
makes it in.

T I P When choosing the dimensions for a cube, be sure that each record lands
in exactly one subcube. If you have redundant dimensions”such as one
dimension for date and another for day of the week”then the same record will
land in two or more subcubes. If this happens, then the summarizations based
on the subcubes will no longer be accurate.

Apart from the cardinal rule that each record inserted into the cube should
land in exactly one subcube, there are three other things to keep in mind when
designing effective cubes:
Determining the facts
––

Handling complex dimensions
––


Making dimensions conform across the data warehouse
––


These three issues arise when trying to develop cubes, and resolving them is
important to making the cubes useful for analytic purposes.
Data Warehousing, OLAP, and Data Mining 501


Facts
Facts are the measures in each subcube. The most useful facts are additive, so
they can be combined together across many different subcubes to provide
responses to queries at arbitrary levels of summarization. Additive facts make
it possible to summarize data along any dimension or along several dimen­
sions at one time”which is exactly the purpose of the cube.
Examples of additive facts are:
Counts
––


Counts of variables with a particular value
––


Total duration of time (such as spent on a web site)
––


Total monetary values
––


The total amount of money spent on a particular product on a particular day
is the sum of the amount spent on that product in each store. This is a good
example of an additive fact. However, not all facts are additive. Examples
include:
Averages
––


Unique counts
––


Counts of things shared across different cubes, such as transactions
––


Averages are not a very interesting example of a nonadditive fact, because
an average is a total divided by a count. Since each of these is additive, the
average can be derived after combining these facts.
The other examples are more interesting. One interesting question is how
many unique customers did some particular action. Although this number can
be stored in a subcube, it is not additive. Consider a retail cube with the date,
store, and product dimensions. A single customer may purchase items in more
than one store, or purchase more than one item in a store, or make purchases
on different days. A field containing the number of unique customers has
information about one customer in more than one subcube, violating the
cardinal rule of OLAP, so the cube is not going to be able to report on unique
customers.
A similar thing happens when trying to count numbers of transactions.
Since the information about the transaction may be stored in several different
subcubes (since a single transaction may involve more than one product),
counts of transactions also violate the cardinal rule. This type of information
cannot be gathered at the summary level.
Another note about facts is that not all numeric data is appropriate as a fact
in a cube. For instance, age in years is numeric, but it might be better treated as
a dimension rather than a fact. Another example is customer value. Discrete
502 Chapter 15


ranges of customer value are useful as dimensions, and in many circumstances
more useful than trying to include customer value as a fact.
When designing cubes, there is a temptation to mix facts and dimensions by
creating a count or total for a group of related values. For instance:
Count of active customers of less than 1-year tenure, between 1 and 2
––

years, and greater than 2 years
Amount credited on weekdays; amount credited on weekends
––


Total for each day of the week
––


Each of these suggests another dimension for the cube. The first should have
a customer tenure dimensions that takes at least three values. The second
appeared in a cube where the time dimension was by month. These facts sug­




Y
gest a need for daily summaries, or at least for separating weekdays and week­




FL
ends along a dimension. The third suggests a need for a date dimension at the
granularity of days. AM
Dimensions and Their Hierarchies
Sometimes, a single column seems appropriate for multiple dimensions. For
TE

<<

. 99
( 137 .)



>>