
events; these can require very complicated SQL. Another problem is that SQL
operations often eliminate rows inadvertently. When a field contains a missing
value (NULL), it automatically fails any comparison, even “not equals”.
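
A minimal sketch of the NULL pitfall, using a hypothetical customers table
and channel column:

```sql
-- Intended: all customers who did not come through the web channel.
SELECT customer_id
FROM customers
WHERE channel <> 'WEB';
-- Customers whose channel is NULL fail the comparison and silently vanish.

-- The NULL case must be handled explicitly to keep them:
SELECT customer_id
FROM customers
WHERE channel <> 'WEB' OR channel IS NULL;
```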







Also, the default join operation (called an inner join) eliminates rows that do
not match, which means that customers may inadvertently be left out of a data
pull.
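
A similar sketch of the join pitfall, again with hypothetical customers and
orders tables:

```sql
-- An inner join silently drops customers who have no orders at all.
SELECT c.customer_id, o.order_date
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;

-- A left outer join keeps every customer, with NULL order fields
-- where no match exists.
SELECT c.customer_id, o.order_date
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id;
```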
The set of operations in SQL is not particularly rich, especially for text
fields and dates. The result is that every database vendor extends standard SQL
to include slightly different sets of functionality.




The database schema can also illuminate unusual findings in the data. For
instance, we once worked with a file of call detail records in the United States
that had city and state fields for the destination of every call. The file contained
over two hundred state codes; that is a lot of states. What was happening? We
learned that the city and state fields were never used by operational systems,
so their contents were automatically suspicious: data that is not used is not
likely to be correct. Instead of the city and state, all location information was
derived from zip codes. These redundant fields were inaccurate because the
state field was written first and the city field, with 14 characters, was written
second. Longer city names overwrote the state field next to it. So, “WEST
PALM BEACH, FL” ended up putting the “H” in the state field, becoming
“WEST PALM BEAC, HL,” and “COLORADO SPRINGS, CO” became
“COLORADO SPRIN, GS.” Understanding the data layout helped us figure
out this interesting but admittedly uncommon problem.
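
Anomalies like this typically surface through simple profiling queries. A
sketch, assuming a hypothetical call_detail table and a reference table of
valid state codes:

```sql
-- Profile the state field: far more distinct values than expected
-- is a red flag.
SELECT state, COUNT(*) AS num_calls
FROM call_detail
GROUP BY state
ORDER BY num_calls DESC;

-- List the suspect codes by checking against the reference table.
-- (NOT IN behaves badly if state_code can be NULL; assume the
-- reference table has no NULLs.)
SELECT DISTINCT state
FROM call_detail
WHERE state NOT IN (SELECT state_code FROM valid_states);
```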


Metadata
Metadata goes beyond the database schema to let business users know what
types of information are stored in the database. This is, in essence,
documentation about the system, including information such as the following
(a sketch of how such documentation might be stored appears after the list):

–– The values legally allowed in each field

–– A description of the contents of each field (for instance, is the start date
   the date of the sale or the date of activation?)

–– The date when the data was loaded

–– An indication of how recently the data has been updated (when after
   the billing cycle does the billing data land in this system?)

–– Mappings to other systems (the status code in table A is the status code
   field in table B in such-and-such source system)
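
When no metadata product is in place, even a simple, manually maintained
catalog table helps. The following is only a sketch; all table and column
names are invented, and real metadata repositories are far more elaborate:

```sql
CREATE TABLE field_metadata (
    table_name      VARCHAR(64)  NOT NULL,
    field_name      VARCHAR(64)  NOT NULL,
    description     VARCHAR(255),   -- what the field actually means
    legal_values    VARCHAR(255),   -- allowed codes, ranges, or lookup table
    source_system   VARCHAR(64),    -- where the field originates
    source_field    VARCHAR(64),    -- corresponding field in the source
    last_load_date  DATE,           -- when the data was last loaded
    PRIMARY KEY (table_name, field_name)
);
```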
When available, metadata provides an invaluable service. When not available,
this type of information needs to be gleaned, usually from friendly database
administrators and analysts, a perhaps inefficient use of everyone's time.
For a data warehouse, metadata provides discipline, since changes to the
warehouse must be reflected in the metadata to be communicated to users.
Overall, a good metadata system helps ensure the success of a data warehouse
by making users more aware of and comfortable with the contents. For data
miners, metadata provides valuable assistance in tracking down and
understanding data.


Business Rules
The highest level of abstraction is business rules. These describe why
relationships exist and how they are applied. Some business rules are easy to
capture, because they represent the history of the business: what marketing
campaigns took place when, what products were available when, and so on. Other
types of rules are more difficult to capture and often lie buried deep inside
code fragments and old memos. No one may remember why the fraud detection
system ignores claims under $500. Presumably there was a good business
reason, but the reason, the business rule, is often lost once the rule is
embedded in computer code.
Business rules have a close relationship to data mining. Some data mining
techniques, such as market basket analysis and decision trees, produce explicit
rules. Often, these rules may already be known. For instance, learning that
conference calling is sold with call waiting may not be interesting, since this
feature is only sold as part of a bundle. Or a direct mail response model
that ends up targeting only wealthy areas may simply reflect the fact that the
historical data used to build the model was biased, because the model set only
had responders in those areas.
Discovering business rules in the data is both a success and a failure. Finding
these rules is a successful application of sophisticated algorithms. However,
in data mining, we want actionable patterns, and patterns that merely restate
known business rules are not actionable.


A General Architecture for Data Warehousing
The multitiered approach to data warehousing recognizes that data needs
come in many different forms. It provides a comprehensive system for
managing data for decision support. The major components of this architecture
(see Figure 15.3) are:

–– Source systems are where the data comes from.

–– Extraction, transformation, and load (ETL) tools move data between
   different data stores.

–– The central repository is the main store for the data warehouse.

–– The metadata repository describes what is available and where.

–– Data marts provide fast, specialized access for end users and applications.

–– Operational feedback integrates decision support back into the
   operational systems.

–– End users are the reason for developing the warehouse in the first place.



[Figure 15.3 appears here. Its callouts read: end users are the raison d'être of
the data warehouse, acting on the information and knowledge gained from the
data; networks using standard protocols like ODBC connect end users to the
data; departmental data warehouses and metadata support applications used by
end users; the central data store is a relational database with a logical data
model; extraction, transformation, and load tools move data between systems;
operational systems, usually mainframe or midrange systems, are where the data
comes from; and some data may be provided by external vendors.]

Figure 15.3 The multitiered approach to data warehousing includes a central repository,
data marts, end-user tools, and tools that connect all these pieces together.


One or more of these components exist in virtually every system called a
data warehouse. They are the building blocks of decision support throughout
an enterprise. The following discussion of these components follows a
data-flow approach. The data is like water: it originates in the source systems
and flows through the components of the data warehouse, ultimately delivering
information and value to end users. These components rest on a technological
foundation consisting of hardware, software, and networks; this infrastructure
must be sufficiently robust both to meet the needs of end users and to meet
growing data and processing requirements.


Source Systems
Data originates in the source systems, typically operational systems and
external data feeds. These are designed for operational efficiency, not for
decision support, and the data reflects this reality. For instance,
transactional data might be rolled off every few months to reduce storage
needs. The same information might be represented in different ways. For
example, one retail point-of-sale source system represented returned
merchandise using a “returned item” flag, except when the customer made a new
purchase at the same time; in that case, the return appeared as a negative
amount in the purchase field. Such anomalies abound in the real world.
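
A sketch of how a data pull might reconcile such an anomaly, with all table
and field names invented for the point-of-sale example above:

```sql
-- A transaction counts as a return if the flag is set, or if the purchase
-- field is negative (the case where a return accompanied a new purchase).
SELECT transaction_id,
       CASE
           WHEN returned_item_flag = 'Y' THEN 'RETURN'
           WHEN purchase_amount < 0      THEN 'RETURN'
           ELSE 'SALE'
       END AS transaction_type
FROM pos_transactions;
```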
Often, information of interest for customer relationship management is not
gathered as intended. Here, for instance, are six ways that business customers
might be distinguished from consumers in a telephone company:

–– Using a customer type indicator: “B” or “C,” for business versus
   consumer.

–– Using rate plans: Some are only sold to business customers; others to
   consumers.

–– Using acquisition channels: Some channels are reserved for business,
   others for consumers.

–– Using number of lines: 1 or 2 for consumer, more for business.

–– Using credit class: Businesses have a different set of credit classes
   from consumers.

–– Using a model score based on businesslike calling patterns.

(Needless to say, these definitions do not always agree.) One challenge in
data warehousing is arriving at a consistent definition that can be used across
the business. The key to achieving this is metadata that documents the precise
meaning of each field, so everyone using the data warehouse is speaking the
same language.
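
One way to enforce such a definition is to derive a single flag in the
warehouse, so every analyst inherits the same logic instead of choosing among
the six indicators. A sketch, with all names and precedence rules invented:

```sql
-- Derive one business/consumer flag from several (possibly conflicting)
-- indicators; the order of the CASE branches encodes the agreed precedence.
SELECT account_id,
       CASE
           WHEN customer_type = 'B'             THEN 'BUSINESS'
           WHEN rate_plan IN ('BIZ1', 'BIZ2')   THEN 'BUSINESS'
           WHEN num_lines > 2                   THEN 'BUSINESS'
           ELSE 'CONSUMER'
       END AS business_flag
FROM accounts;
```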


Gathering the data for decision support stresses operational systems, since
these systems were originally designed for transaction processing. Bringing
the data together in a consistent format is almost always the most expensive
part of implementing a data warehousing solution.
The source systems offer other challenges as well. They generally run on a
wide range of hardware, and much of the software is built in-house or highly
customized. These systems are commonly mainframe and midrange systems and
generally use complicated and proprietary file structures. Mainframe systems
were designed for holding and processing data, not for sharing it.
Although systems are becoming more open, getting access to the data is
always an issue, especially when different systems are supporting very
different parts of the organization. And systems may be geographically
dispersed, further contributing to the difficulty of bringing the data together.


Extraction, Transformation, and Load
Extraction, transformation, and load (ETL) tools solve the problem of
gathering data from disparate systems by providing the ability to map and move
data from source systems to other environments. Traditionally, data movement
and cleansing have been the responsibility of programmers, who wrote
special-purpose code as the need arose. Such application-specific code
becomes brittle as systems multiply and source systems change.
Although programming may still be necessary, there are now products that
solve the bulk of the ETL problems. These tools make it possible to specify
source systems and mappings between different tables and files. They provide
the ability to verify data and to spit out error reports when loads do not
succeed. The tools also support looking up values in tables (so only known
product codes, for instance, are loaded into the data warehouse). The goal of
these tools is to describe where data comes from and what happens to it, not
to write the step-by-step code for pulling data from one system and putting it
into another.
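
Even without a dedicated ETL tool, the lookup idea can be expressed
declaratively. A minimal sketch, assuming hypothetical staging, reference,
and error tables:

```sql
-- Load only rows whose product codes appear in the reference table.
INSERT INTO warehouse_sales
SELECT s.*
FROM staging_sales s
JOIN product_codes p ON p.product_code = s.product_code;

-- Route everything else to an error table for the load report.
INSERT INTO load_errors
SELECT s.*
FROM staging_sales s
LEFT JOIN product_codes p ON p.product_code = s.product_code
WHERE p.product_code IS NULL;
```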
Standard procedural languages, such as COBOL and RPG, focus on each step
instead of the bigger picture of what needs to be done. ETL tools often provide
a metadata interface, so end users can understand what is happening to
