

first book. The methodology introduced here is designed to build on the
successful engagements we have been involved in. Chapter 4, which has no
counterpart in the first edition, is about applications of data mining in marketing
and customer relationship management, the fields where most of our own
work has been done.
The second part consists of the technical chapters about the data mining
techniques themselves. All of the techniques described in the first edition are
still here although they are presented in a different order. The descriptions
have been rewritten to make them clearer and more accurate while still
retaining nontechnical language wherever possible.
In addition to the seven techniques covered in the first edition (decision
trees, neural networks, memory-based reasoning, association rules, cluster
detection, link analysis, and genetic algorithms), there is now a chapter on
data mining using basic statistical techniques and another new chapter on
survival analysis. Survival analysis is a technique that has been adapted from the
small samples and continuous time measurements of the medical world to the
large samples and discrete time measurements found in marketing data. The
chapter on memory-based reasoning now also includes a discussion of
collaborative filtering, another technique based on nearest neighbors that has
become popular with Web retailers as a way of generating recommendations.
The third part of the book talks about applying the techniques in a business
context, including a chapter on finding customers in data, one on the
relationship of data mining and data warehousing, another on the data mining
environment (both corporate and technical), and a final chapter on putting data
mining to work in an organization. A new chapter in this part covers
preparing data for data mining, an extremely important topic since most data miners
report that transforming data takes up the majority of time in a typical data
mining project.
Like the first edition, this book is aimed at current and future data mining
practitioners. It is not meant for software developers looking for detailed
instructions on how to implement the various data mining algorithms nor for
researchers trying to improve upon those algorithms. Ideas are presented in
nontechnical language with minimal use of mathematical formulas and arcane
jargon. Each data mining technique is shown in a real business context with
examples of its use taken from real data mining engagements. In short, we
have tried to write the book that we would have liked to read when we began
our own data mining careers.
- Michael J. A. Berry, October 2003

Acknowledgments xix
About the Authors xxi
Introduction xxiii
Chapter 1 Why and What Is Data Mining? 1
Analytic Customer Relationship Management 2
The Role of Transaction Processing Systems 3
The Role of Data Warehousing 4
The Role of Data Mining 5
The Role of the Customer Relationship Management Strategy 6
What Is Data Mining? 7
What Tasks Can Be Performed with Data Mining? 8
Classification 8
Estimation 9
Prediction 10
Affinity Grouping or Association Rules 11
Clustering 11
Profiling 12
Why Now? 12
Data Is Being Produced 12
Data Is Being Warehoused 13
Computing Power Is Affordable 13
Interest in Customer Relationship Management Is Strong 13
Every Business Is a Service Business 14
Information Is a Product 14
Commercial Data Mining Software Products Have Become Available 15

How Data Mining Is Being Used Today 15

A Supermarket Becomes an Information Broker 15

A Recommendation-Based Business 16

Cross-Selling 17

Holding on to Good Customers 17

Weeding out Bad Customers 18

Revolutionizing an Industry 18

And Just about Anything Else 19

Lessons Learned 19

Chapter 2 The Virtuous Cycle of Data Mining 21

A Case Study in Business Data Mining 22

Identifying the Business Challenge 23

Applying Data Mining 24

Acting on the Results 25

Measuring the Effects 25

What Is the Virtuous Cycle? 26

Identify the Business Opportunity 27

Mining Data 28

Take Action 30

Measuring Results 30

Data Mining in the Context of the Virtuous Cycle 32

A Wireless Communications Company Makes the Right Connections 34

The Opportunity 34

How Data Mining Was Applied 35

Defining the Inputs 37

Derived Inputs 37

The Actions 38

Completing the Cycle 39

Neural Networks and Decision Trees Drive SUV Sales 39

The Initial Challenge 39

How Data Mining Was Applied 40

The Data 40

Down the Mine Shaft 40

The Resulting Actions 41

Completing the Cycle 42

Lessons Learned 42

Chapter 3 Data Mining Methodology and Best Practices 43

Why Have a Methodology? 44

Learning Things That Aren't True 44

Patterns May Not Represent Any Underlying Rule 45

The Model Set May Not Reflect the Relevant Population 46

Data May Be at the Wrong Level of Detail 47

Learning Things That Are True, but Not Useful

Learning Things That Are Already Known

Learning Things That Can't Be Used

Hypothesis Testing

Generating Hypotheses

Testing Hypotheses

Models, Profiling, and Prediction

Profiling 53


The Methodology 54

Step One: Translate the Business Problem into a Data Mining Problem 56

What Does a Data Mining Problem Look Like? 56

How Will the Results Be Used? 57

How Will the Results Be Delivered? 58

The Role of Business Users and Information Technology 58

Step Two: Select Appropriate Data 60

What Is Available? 61

How Much Data Is Enough? 62

How Much History Is Required? 63

How Many Variables? 63

What Must the Data Contain? 64

Step Three: Get to Know the Data 64

Examine Distributions 65

Compare Values with Descriptions 66

Validate Assumptions 67

Ask Lots of Questions 67

Step Four: Create a Model Set 68

Assembling Customer Signatures 68

Creating a Balanced Sample 68

Including Multiple Timeframes 70

Creating a Model Set for Prediction 70

Partitioning the Model Set 71

Step Five: Fix Problems with the Data 72

Categorical Variables with Too Many Values 73

Numeric Variables with Skewed Distributions and Outliers 73

Missing Values 73

Values with Meanings That Change over Time 74

Inconsistent Data Encoding 74

Step Six: Transform Data to Bring Information to the Surface 74

Capture Trends 75

Create Ratios and Other Combinations of Variables 75

Convert Counts to Proportions 75

Step Seven: Build Models 77

Step Eight: Assess Models 78

Assessing Descriptive Models 78

