
cost, 548

information technology and user

derived variables, 542

roles, 58–60

discussed, 542

problems, identifying, 56–57

identification, 548

ratios, 75

ignored, 547

results, deliverables, 58

input, 547

results, how to use, 57–58

with one value, 544–546

summarization, 44

target, 547

virtuous cycle, 28–30

with unique values, 546–547

dirty, 592–593

weight, 548

dumping, flat files, 594

comparisons, 83

enterprise-wide, 33

for customer signatures, cataloging,

ETL (extraction, transformation, and

559–560

load) tools, 487

data correction

gigabytes, 5

categorical variables, 73

as graphs, 337

encoding, inconsistent, 74

historical

missing values, 73–74

customer behaviors, 5

numeric variables, 73

MBR (memory-based reasoning),

outliers, 73

262–263

overview, 72

neural networks, 219

skewed distributions, 73

prediction tasks, 10

values with meaning, 74

household level, 96

data exploration

imperfections in, 34

assumptions, validating, 67

inconsistent, 593–594

descriptions, comparing values

as information, 22

with, 65

metadata repository, 484, 491

624 Index

data (continued)

outsourcing, 522–524

missing data

platforms, 527

data correction, 73–74

scalability, 533–534

NULL values, 590

scoring platforms, 527–528

splits, decision trees, 174–175

staffing, 525â€“526

operational feedback, 485, 492

typical operational systems

patterns

versus, 33

meaningful discoveries, 56

undirected

prediction, 45

affinity grouping, 57

untruthful learning sources, 45–46

clustering, 57

point-of-sale

discussed, 7

association rules, 288

Data Preparation for Data Mining

scanners, 3

(Dorian Pyle), 75

as useful data source, 60

The Data Warehouse Toolkit (Ralph

preparation

Kimball), 474

automatic cluster detection,

data warehousing

363–365

customer patterns, 5

categorical values, neural networks,

for decision support, 13

239–240

discussed, 4

continuous values, neural

database administrators (DBAs), 488

networks, 235–237

databases

quality, association rules, 308

call detail, 37

representation, genetic algorithms,

demographic, 37

432–433

KDD (knowledge discovery in

scarce, 62

databases), 8

source systems, 484, 486–487

server platforms, affordability, 13

SQL, time series analysis, 572–573

datasets, balanced, model sets, 68

terabytes, 5

dates and times, interval variables,

truncated, 162

551

useful data sources, 60–61

DBAs (database administrators), 488

visualization tools, 65

deaths, household level data, 96

wrong level of detail, untruthful

debt, nonrepayment of, credit

learning sources, 47

risks, 114

data mining

decision support

architecture, 528–532

data warehousing for, 13

as creative process, 33

hypothesis testing, 50–51

directed

summary data, OLAP, 477–478

classification, 57

decision trees

discussed, 7

alphas, 188

estimation, 57

alternate representations for, 199–202

prediction, 57

applying to sequential events, 205

documentation, 536–537

branching nodes, 176

goals of, 7

building models, 8

insourcing, 524–525

case-study, 206, 208


deep intimacy, customer relationships,

for catalog response models, 175

449, 451

classification, 9, 166–168

default classes, records, 194

cost considerations, 195

default risks, proof-of-concept

effectiveness of, measuring, 176

projects, 599

estimation, 170

degrees of freedom values, chi-square

as exploration tool, 203–204

tests, 152–153

fields, multiple, 195–197

democracy approach, memory-based

neural networks, 199

reasoning, 279–281

profiling tasks, 12

demographic databases, 37

projective visualization, 207–208

demographic profiles, customers, 31

pruning

density

C5 algorithm, 190–191

data selection, 62–63

CART algorithm, 185, 188–189

density function, statistics, 133

discussed, 184
