
discussed, 74

cost, 548
information technology and user roles, 58–60

derived variables, 542

discussed, 542
problems, identifying, 56–57

identification, 548
ratios, 75

ignored, 547
results, deliverables, 58

input, 547
results, how to use, 57–58

with one value, 544–546
summarization, 44

target, 547
virtuous cycle, 28–30

with unique values, 546–547
dirty, 592–593

weight, 548
dumping, flat files, 594

comparisons, 83
enterprise-wide, 33

for customer signatures, cataloging, 559–560
ETL (extraction, transformation, and load) tools, 487

data correction
gigabytes, 5

categorical variables, 73
as graphs, 337

encoding, inconsistent, 74
historical

missing values, 73–74
customer behaviors, 5

numeric variables, 73
MBR (memory-based reasoning), 262–263

outliers, 73

overview, 72
neural networks, 219

skewed distributions, 73
prediction tasks, 10

values with meaning, 74
house-hold level, 96

data exploration
imperfections in, 34

assumptions, validating, 67
inconsistent, 593–594

descriptions, comparing values with, 65
as information, 22
metadata repository, 484, 491

624 Index


data (continued)
outsourcing, 522–524

missing data
platforms, 527

data correction, 73–74
scalability, 533–534

NULL values, 590
scoring platforms, 527–528

splits, decision trees, 174–175
staffing, 525“526

operational feedback, 485, 492
typical operational systems versus, 33

patterns

meaningful discoveries, 56
undirected

prediction, 45
affinity grouping, 57

untruthful learning sources, 45–46
clustering, 57

point-of-sale
discussed, 7

association rules, 288
Data Preparation for Data Mining (Dorian Pyle), 75
scanners, 3

as useful data source, 60
The Data Warehouse Toolkit (Ralph Kimball), 474

preparation

automatic cluster detection, 363–365
data warehousing
customer patterns, 5

categorical values, neural networks, 239–240
for decision support, 13
discussed, 4

continuous values, neural networks, 235–237
database administrators (DBAs), 488
databases

quality, association rules, 308
call detail, 37

representation, genetic algorithms, 432–433
demographic, 37
KDD (knowledge discovery in databases), 8

scarce, 62

source systems, 484, 486–487
server platforms, affordability, 13

SQL, time series analysis, 572–573
datasets, balanced, model sets, 68

terabytes, 5
dates and times, interval variables, 551

truncated, 162

useful data sources, 60–61
DBAs (database administrators), 488

visualization tools, 65
deaths, house-hold level data, 96

wrong level of detail, untruthful learning sources, 47
debt, nonrepayment of, credit risks, 114

data mining
decision support

architecture, 528–532
data warehousing for, 13

as creative process, 33
hypothesis testing, 50–51

directed
summary data, OLAP, 477–478

classification, 57
decision trees

discussed, 7
alphas, 188

estimation, 57
alternate representations for, 199–202

prediction, 57
applying to sequential events, 205

documentation, 536–537
branching nodes, 176

goals of, 7
building models, 8

insourcing, 524–525
case-study, 206, 208


deep intimacy, customer relationships, 449, 451
for catalog response models, 175

classification, 9, 166–168

default classes, records, 194

cost considerations, 195

default risks, proof-of-concept projects, 599

effectiveness of, measuring, 176

estimation, 170

degrees of freedom values, chi-square tests, 152–153

as exploration tool, 203–204

fields, multiple, 195“197

democracy approach, memory-based reasoning, 279–281

neural networks, 199

profiling tasks, 12

demographic databases, 37

projective visualization, 207–208

demographic profiles, customers, 31

pruning

density

C5 algorithm, 190–191

data selection, 62–63
CART algorithm, 185, 188–189

density function, statistics, 133

discussed, 184
