

data needed as a model input resides in flat files on one system, and the
customer marketing database resides on another system but the two are accurate
as of different dates, this too can be a data processing challenge.

One Example of a Production Data Mining Architecture
Web retailing is an industry that has gone farther than most in routinely
incorporating data mining and scoring into the operational environment. Many
Web retailers update a customer's profile with every transaction and use
model scores to determine what to display and what to recommend. The
architecture described here is from Blue Martini, a company that supplies
software for mining-ready retail Web sites. The example it provides of how data
mining can be made an integral part of a company's operations is not restricted
to Web retailing. Many companies could benefit from a similar architecture.

Architectural Overview
The Blue Martini architecture is designed to support the differing needs of
marketers, merchandisers, and, not least, data miners. As shown in Figure
16.2, it has three modules for three different types of users. For merchandisers,
this architecture supports multiple product hierarchies and tools for
controlling collections and promotions. For marketers, there are tools for
making controlled experiments to track the effectiveness of various messages
and marketing rules. For data miners, there is integrated modeling software and
relief from having to create customer signatures by hand from dozens of
different Web server and application logs. The architecture is what Ralph
Kimball and Richard Merz would call a data Webhouse, made up of several
special-purpose data marts with different schemas, all using common field
definitions and shared metadata.

Customers at a Web store interact with pages generated as needed from a
database that includes product information and the page templates. The
contents of the page are driven by rules. Some of these rules are business rules
entered by managers. Others are generated automatically and then edited by
professional merchandisers.

Figure 16.2 Blue Martini provides a good example of an IT architecture for data
mining-driven Web retailing. [Diagram not reproduced. Labels: Business Data
Definition Module; Customer Interaction Module; Analysis Module; Product
Hierarchies; Business Rules; Model Scores; Web Server with logs; Application
Server with logs; OLTP Database for Customer Interaction; Customer Signatures
for Mining; OLAP Database for Reporting.]

Generating pages from a database has many advantages. First, it makes it
possible to enforce a consistent look and feel across the Web site. Such
standard interfaces help customers navigate through the site. Using a database
also makes it possible to make global changes quickly, such as updating prices
for a sale. Another feature is the ability to store templates in different
languages and currencies, so the site can be customized for users in different
countries. From the data mining perspective, a major advantage is that all
customer interactions are logged in the database.

User interactions are managed through a collection of data marts. Reporting
and mining are centered on a customer behavior data mart that includes
information derived from the user interaction, product, and business-rule data
marts. The complicated extract and transformation logic required to create
customer signatures from transaction data is part of the system, a great
simplification for anyone who has ever tried massaging Web logs to get
information about customers.

Customer Interaction Module
This architecture includes the databases and software needed to support
merchandising, customer interaction, reporting, and mining as well as
customer-centric marketing in the form of personalization. The Blue Martini
system has three major modules, each with its own data mart. These repositories
keep track of the following:
Business rules

Customer and visitor transactions

Customer behavior

The customer behavior data mart, shown in Figure 16.2 as part of the analysis
module, is fed by data from the customer interaction module, and it, in
turn, supplies rules to both the business data definition module and the
customer interaction module.

Merchandising information such as product hierarchies, assortments (families
of products that are grouped together for merchandising purposes), and
price lists is maintained in the business rules data mart, as is content
information such as Web page templates, images, sounds, and video clips.
Business rules include personalization rules for greeting named customers,
promotion rules, cross-sell rules, and so on. Much of the data mining effort
for a retail site goes into generating these rules.
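
As a rough illustration of how cross-sell rules stored as data might drive what a page recommends, the sketch below keeps rules in a list and looks them up against a basket. The rule fields, the product names, and the `recommend` function are all hypothetical; the text does not describe Blue Martini's actual rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossSellRule:
    """Hypothetical rule: if `trigger` is in the basket, suggest `suggestion`."""
    rule_id: str
    trigger: str
    suggestion: str
    priority: int  # lower number is shown first


# Illustrative rules only; in practice these would be read from a rules table.
RULES = [
    CrossSellRule("r1", "digital camera", "memory card", 1),
    CrossSellRule("r2", "digital camera", "camera bag", 2),
    CrossSellRule("r3", "red wine", "corkscrew", 1),
]


def recommend(basket, rules, limit=2):
    """Return (rule_id, suggestion) pairs for the rules the basket triggers."""
    in_basket = set(basket)
    fired = [
        r for r in rules
        if r.trigger in in_basket and r.suggestion not in in_basket
    ]
    # sorted() is stable, so rules with equal priority keep their listed order.
    fired = sorted(fired, key=lambda r: r.priority)
    return [(r.rule_id, r.suggestion) for r in fired[:limit]]


print(recommend(["digital camera", "red wine"], RULES))
```

Returning the rule identifier along with the suggestion is a deliberate choice: it lets the site log which rule fired, which is needed later to measure how well each rule performs.
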
The customer interaction module is the part of the system that touches
customers directly by processing all the customer transactions. The customer
interaction module is responsible for maintaining users' sessions and context.
This module implements the actual Web store and collects any data that may
be wanted for later analysis. The customer transaction data mart logs business
events such as the following:
Customer adds an item to the basket.

Customer initiates check-out process.

Customer completes check-out process.

Cross-sell rule is triggered, and recommendation is made.

Recommended link is followed.
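
Business events like those listed above could be recorded as simple typed records. The following is a minimal sketch, assuming hypothetical event names and a hypothetical `Event` record (the actual schema of the customer transaction data mart is not given in the text). It counts how many sessions reached each kind of event.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical business event as logged by the interaction module."""
    session_id: str
    event_type: str  # e.g. "add_to_basket", "checkout_start"
    detail: str = ""


def funnel(events):
    """Count distinct sessions that reached each event type."""
    seen = set()
    counts = Counter()
    for e in events:
        key = (e.session_id, e.event_type)
        if key not in seen:  # count each session once per event type
            seen.add(key)
            counts[e.event_type] += 1
    return counts


# Made-up log for illustration.
log = [
    Event("s1", "add_to_basket", "sku-1"),
    Event("s1", "checkout_start"),
    Event("s1", "checkout_complete"),
    Event("s2", "add_to_basket", "sku-2"),
    Event("s2", "add_to_basket", "sku-3"),
    Event("s2", "recommendation_shown", "rule r1"),
    Event("s3", "add_to_basket", "sku-1"),
    Event("s3", "checkout_start"),
]

counts = funnel(log)
# Sessions that put something in the basket but never completed check-out.
abandoned = counts["add_to_basket"] - counts["checkout_complete"]
print(dict(counts), abandoned)
```
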

The customer interaction module supports marketing experiments by
implementing control groups and keeping track of multiple rules. It has
detailed knowledge of the content it serves and can track many things that are
not tracked in the Web server logs. The customer interaction module collects
data that allows both products and customers to be tracked over time.
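
One common way to implement control groups, shown here only as a sketch and not as the method Blue Martini uses, is to hash the customer identifier together with the experiment name. The assignment is then repeatable without storing it, and different experiments split customers independently. The function name, the identifiers, and the 10 percent default are assumptions made for illustration.

```python
import hashlib


def assign_group(customer_id, experiment, control_fraction=0.1):
    """Deterministically place a customer in 'control' or 'treatment'.

    Hypothetical helper: the same (customer, experiment) pair always lands
    in the same group, so the assignment never needs to be stored.
    """
    key = f"{experiment}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "control" if bucket < control_fraction else "treatment"


groups = [assign_group(f"cust-{i}", "free-shipping-banner") for i in range(10000)]
print(groups.count("control") / len(groups))  # should be close to 0.1
```

Customers in the control group would not be shown the message or rule under test, so their behavior provides the baseline against which the treatment group is compared.
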

Analysis Module
The database that supports the customer interaction module, like most online
transaction processing systems, is a relational database designed to support
quick transaction processing. Data destined for the analysis module must be
extracted and transformed to support the structures suitable for mining and
reporting. Data mining requires flat signature tables with one row per customer
or item to be studied. This means transformations that flatten product
hierarchies so that, for example, the same transaction might generate one flag
indicating that the customer bought French wine, another that he or she bought a
wine from the Burgundy region, and a third indicating that the wine was from
the Beaujolais district in Burgundy. Other data must be rolled up from order
files, billing files, and session logs that contain multiple transactions per
customer. Typical values derived this way include total spending by category,
average order amount, difference between this customer's average order and
the mean average order, and the number of days since the customer last made
a purchase.
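
The transformations just described can be sketched in a few lines. The example below is illustrative only: the product hierarchy, the order rows, and the column names are invented, each row is treated as one order, and "the mean average order" is interpreted as the mean of the per-customer averages.

```python
from collections import defaultdict
from datetime import date

# Hypothetical hierarchy: each product maps to every level it belongs to.
HIERARCHY = {
    "beaujolais-villages": ["french_wine", "burgundy_wine", "beaujolais_wine"],
    "chablis": ["french_wine", "burgundy_wine"],
    "rioja": ["spanish_wine"],
}

# Made-up orders: (customer_id, order_date, product, category, amount)
ORDERS = [
    ("c1", date(2004, 1, 5), "beaujolais-villages", "wine", 20.0),
    ("c1", date(2004, 2, 10), "rioja", "wine", 40.0),
    ("c2", date(2004, 3, 1), "chablis", "wine", 90.0),
]


def build_signatures(orders, hierarchy, as_of):
    """Roll transactions up to one flat row (a dict) per customer."""
    by_customer = defaultdict(list)
    for row in orders:
        by_customer[row[0]].append(row)

    # Every row gets the same columns, so collect them all up front.
    all_flags = sorted({f for flags in hierarchy.values() for f in flags})
    all_categories = sorted({row[3] for row in orders})

    averages = {
        cust: sum(r[4] for r in rows) / len(rows)
        for cust, rows in by_customer.items()
    }
    mean_average = sum(averages.values()) / len(averages)

    signatures = {}
    for cust, rows in by_customer.items():
        sig = {"customer_id": cust}
        sig.update({f"bought_{flag}": 0 for flag in all_flags})
        sig.update({f"spend_{cat}": 0.0 for cat in all_categories})
        for _, _, product, category, amount in rows:
            sig[f"spend_{category}"] += amount
            # Flatten the hierarchy: one flag per level the product sits in.
            for flag in hierarchy.get(product, []):
                sig[f"bought_{flag}"] = 1
        sig["avg_order"] = averages[cust]
        sig["avg_order_vs_mean"] = averages[cust] - mean_average
        sig["days_since_last"] = (as_of - max(r[1] for r in rows)).days
        signatures[cust] = sig
    return signatures


for row in build_signatures(ORDERS, HIERARCHY, date(2004, 3, 31)).values():
    print(row)
```

Passing the `as_of` date explicitly, rather than reading the system clock, keeps the signature reproducible and makes it clear as of which date the row is accurate.
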

Reporting is done from a multidimensional database that allows retrospective
queries at various levels. Data mining and OLAP are both part of the
analysis module, although they answer different kinds of questions. OLAP
queries are used to answer questions such as these:
What are the top-selling products?

What are the worst-selling products?

What are the top pages viewed?

What are conversion rates by brand name?

What are the top referring sites by visit count?

What are the top referring sites by dollar sales?

How many customers abandoned market baskets?
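
Questions of this kind are retrospective aggregations, so they map naturally onto grouped queries. The sketch below uses Python's standard `sqlite3` module and an invented `visits` table as a stand-in for the reporting database; the table, its columns, and the referrer names are assumptions. It shows that the same data can rank referring sites differently by visit count and by dollar sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (
    visit_id INTEGER PRIMARY KEY,
    referrer TEXT,
    sales    REAL   -- dollars spent during the visit, 0 if none
);
INSERT INTO visits (referrer, sales) VALUES
    ('search.example', 0), ('search.example', 0), ('search.example', 25.0),
    ('partner.example', 120.0), ('partner.example', 0);
""")


def top_referrers(conn, order_by):
    """Rank referring sites by 'visits' or by 'dollars'."""
    if order_by not in ("visits", "dollars"):
        raise ValueError("order_by must be 'visits' or 'dollars'")
    # order_by is checked against a whitelist above, so it is safe to embed.
    sql = f"""
        SELECT referrer, COUNT(*) AS visits, SUM(sales) AS dollars
        FROM visits
        GROUP BY referrer
        ORDER BY {order_by} DESC
    """
    return conn.execute(sql).fetchall()


print(top_referrers(conn, "visits"))
print(top_referrers(conn, "dollars"))
```
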


Data mining is used to answer more complicated questions such as these:
What are the characteristics of heavy spenders? Does this user fit the

What promotion should be offered to this customer?

What is the likelihood that this customer will return within 1 month?

What customers should we worry about because they haven't visited the site
recently?

Which products are associated with customers who spend the most

Which products are driving sales of which other products?

In Figure 16.2, the arrow labeled “build data warehouse” connects the
customer interaction module to the analysis module and represents all the
transformations that must occur before either data mining or reporting can be
done properly. Two more arrows, labeled “deploy results,” show the output of the
analysis module being shipped back to the business data definition and
customer interaction modules. Yet another arrow, labeled “stage data,” shows
how the business rules embedded in the business data definition module feed into
the customer interaction module.

What is appealing about this architecture is the way that it facilitates the
virtuous cycle of data mining by allowing new knowledge discovered through
data mining to be fed directly to the systems that interact with customers.

Data Mining Software
One of the ways that the data mining world has changed most since the first
edition of this book came out is the maturity of data mining software products.
Robustness, usability, and scalability have all improved significantly. The one
thing that may have decreased is the number of data mining software vendors
as tiny boutique software firms have been pushed aside by larger, more
established companies. As stated in the first edition, it is not reasonable to
compare the merits of particular products in a book intended to remain useful
beyond the shelf-life of the current versions of these products. Although the
products are changing, and hopefully improving, over time, the criteria for
evaluating them have not changed: Price, availability, scalability, support,
vendor relationships, compatibility, and ease of integration all factor into the
selection

Range of Techniques
As must be clear by now, there is no single data mining technique that is

