<<

. 95
( 137 .)



>>

COMMENT ACCT
ACCT_ID FIRST_NAME
COMMENT_CODE LAST_NAME
COMMENT_TEXT ...
...
This symbol means a product
change has exactly one account
PRODUCT CHANGE
ACCT_ID
OLD_PROD This symbol means an account
might have 0 or more product
NEW_PROD
changes
...




Physical Model
TABLE: CONTACT Information from all four entities in the logical
ACCT_ID model is found in the contact table.
CONTACT_TYPE
CONTACT_DATE The different types of contact are differentiated
COMPLAINT_CODE using the CONTACT_TYPE field.
REFUND_AMOUNT
OLD_PROD The physical model also specifies exact types,
NEW_PROD partitioning, indexes, storage characteristics,
COMMENT_TYPE degrees of parallelism, constraints on values,
COMMENT_TEXT and may other things not of interest to the
... business user.
Figure 15.2 The physical and logical schema may not be related to each other.
480 Chapter 15



WHAT IS A RELATIONAL DATABASE?

One of the most common ways to store data is in a relational database
management system (RDBMS). The basis of relational databases starts with
research by E. F. Codd in the early 1970s on the properties of a special type of
set composed of tuples”what we would call rows in tables. From this, he
derived a relational algebra consisting of operations that form a relational
algebra, which are depicted in the following figure:


Before After
Filter
row col A col B col C col D col E colF row col A col B col C col D col E colF
001 001
002 002

Filtering removes rows based
003 003
004 004
005 005
on the values in one or more
006 006
007 007
columns. Each output row
008 008
009 009
010
either is or is not in the input 010
011 011
012 012
table.




Select
row col A col B col C col D col E colF row col A col B col C col D col E colF new
001 001
002 002
003
Selecting chooses the columns 003
004 004
005 005
for the output. Each column in
006 006
007 007
the output is in the input or is a
008 008
009 009
010
function of some of the input 010
011 011
012 012
columns.



Aggregation (or Group by)
row col A col B col C col D col E colF
001 key1
Aggregation groups columns
002 key1
003 key2
col A avg B max B sum D sum E sum F
together based on a common
004 key2
key1
005 key2
key2
006 key2
key. All the rows with the same key3
007 key3
key4
008 key3
key are summarized into a
009 key3
010 key4
single output row.
011 key4
012 key4




row col A col B col C
Join
001 key1 col A col B col C col G
002 key1 key1
003 key2 key1
row col A col G
004 key2 key3
001 key1
key2
005 key3
Join matches rows in two
002 key3
006 key2 key3
003 key4
007 key3 key4
004 key4
tables. For every pair of rows
008 key3 key4
009 key3 key4
whose keys match in the inputs,
010 key4 key4
011 key4 key4
012 key4 key4
a new row is created in the
output.

Relational databases have four major querying operations.
Data Warehousing, OLAP, and Data Mining 481



These operations are in addition to set operations, such as union and
intersection. In nonscientific terminology, these relational operations are:
Filter a given set of rows based on the values in the rows.
Select a given set of columns and perform basic operations on them.
Group rows together and aggregate values in the columns.
Join two tables together based on the values in the columns.
Interestingly, the relational operations do not include sorting (except for
output purposes). These operations specify what can be done with tuples, not
how it gets done. In fact, relational databases often use sorting for grouping
and joining operations; however, there are non-sort-based algorithms for these
operations as well.
SQL, developed by IBM in the 1980s, has become the standard language for
accessing relational databases and implements these basic operations.
Because SQL supports subqueries (that is, using the results of one query as a
table in another query), it is possible to express some very complex data
manipulations.
A common way of representing the database structure is to use an entity-
relationship (E-R) diagram. The following figure is a simple E-R diagram with
five entities and four relationships among them. In this case, each entity
corresponds to a separate table with columns corresponding to the attributes
of the entity. In addition, columns represent the relationships between tables
in the database; such columns are called keys (either foreign or primary keys).
Explicitly storing keys in the database tables using a consistent naming
convention facilitates finding one™s way around the database.
One nice feature of relational databases is the ability to design a database
so that any given data item appears in exactly one place”with no duplication.
Such a database is called a normalized database. Knowing exactly where each
data item is located is highly efficient in theory, since updating any field
requires modifying only one row in one table. When a normalized database is
well-designed and implemented, there is no redundant data, out-of-date data,
or invalid data.
An important idea behind normalization is creating reference tables. Each
reference table logically corresponds to an entity, and each has a key used for
looking up information about the entity. In a normalized database, the “join”
operation is used to lookup values in reference tables.
Relational databases are a powerful way of storing and accessing data.
However, much of their design is focused on updating the data and handling
large numbers of transactions. Data mining is interested in combining data
together to spot higher level patterns.. Typically, data mining uses many
queries, each of which requires several joins, several aggregations, and
subqueries”a veritable army of killer queries.
(continued)
482 Chapter 15



WHAT IS A RELATIONAL DATABASE? (continued)
A single transaction occurs at exactly one
vendor. But, each vendor may have multiple
transactions.
TRANSACTION TABLE
Transaction ID
VENDOR TABLE
Account ID
Vendor ID
Vendor ID
Vendor Name
Date
Vendor Type
Time
One account has multiple
Amount
transactions, but each
Authorization Code
transaction is associated
A customer may have one or more
with exactly one account.
accounts. But each account belongs to
exactly one customer. Likewise, one or
more customers may be in a household.
ACCOUNT TABLE
CUSTOMER TABLE
Account ID
Customer ID
Customer ID
Household ID
Account Type




Y
Customer Name
Interest Rate
Date of Birth
Credit Limit




FL
Gender
Minimum Payment
Amount Due FICO Score
Last Payment Amt
AM
HOUSEHOLD TABLE
Household ID
Number of Children
ZIP Code
TE

An E-R diagram can be used to show the tables and fields in a relational database.
Each box shows a single table and its columns. The lines between them show
relationships, such as 1-many, 1-1, and many-to-many. Because each table
corresponds to an entity, this is called a physical design.

Sometimes, the physical design of a database is very complicated. For instance,
the TRANSACTION TABLE might actually be split into a separate table for each
month of transactions. In this case, the above E-R diagram is still useful; it
represents the logical structure of the data, as business users would understand it.

An entity relationship diagram describes the layout of data for a simple credit card
database.


With respect to data mining, relational databases (and SQL) have some
limitations. First, they provide little support for time series. This makes it hard
to figure out from transaction data such things as the second product
purchased, the last three promos a customer responded to, or the ordering of

<<

. 95
( 137 .)



>>