

applicable in all situations. Neural networks, decision trees, market basket
analysis, statistics, survival analysis, genetic algorithms, memory-based rea­
soning, link analysis, and automatic cluster detection all have a place. As
shown in the case studies, it is not uncommon for two or more of these tech­
niques to be applied in combination to achieve results beyond the reach of any
single method.
Be sure that the software selected is powerful enough to support the data
and goals needed for the organization. It is a good idea to have software a bit
more advanced than the analysts' abilities, so people can try out new things
that they might not otherwise think of trying. Having multiple techniques
available in a single set of tools is useful, because it makes it easier to combine
and compare different techniques. At the same time, having several different
products makes sense for a larger group, since different products have different
strengths, even when they support the same underlying functionality.
Some are better at presenting results; some are better at developing scores;
some are more intuitive for novice users.
Assess the range of data mining tasks to be addressed and decide which
data mining techniques will be most valuable. If you have a single application
in mind, or a family of closely related applications, then it is likely that you

Building the Data Mining Environment 533


The following list of questions is designed to help select the right data mining
software for your company. We present the questions as an unordered list. The
first thing you should do is order the list according to your own priorities. These
priorities will necessarily be different from case to case, which is why we have
not attempted to rank them for you. In some environments, for example, there
is an established standard hardware supplier and platform independence is not
an issue; in others it is of paramount concern, either so that different
divisions can use the package or in anticipation of a future change in hardware.
— What is the range of data mining techniques offered by the vendor?
— How scalable is the product in terms of the size of the data, the number
of users, the number of fields in the data, and its use of the hardware?
— Does the product provide transparent access to databases and files?
— Does the product provide multiple levels of user interfaces?
— Does the product generate comprehensible explanations of the models it
creates?
— Does the product support graphics, visualization, and reporting tools?
— Does the product interact well with other software in the environment,
such as reporting packages, databases, and so on?
— Can the product handle diverse data types?
— Is the product well documented and easy to use?
— What is the availability of support, training, and consulting?
— How well will the product fit into the existing computing environment?
— Does the vendor have credible references?

Once you have determined which of these questions are most important
to your organization, use them to assess candidate software packages by
interviewing the software vendors or by enlisting the aid of an independent
data mining consultant.

will be able to select a single technique and stick with it. If you are setting up a
data mining lab environment to handle a wide range of data mining
applications, you will want to look for a coordinated suite of tools.

Data mining provides the greatest benefit when the data to be mined is large
and complex. But data mining software is likely to be demonstrated on small
sample datasets. Be sure that the data mining software being considered can
handle the anticipated data volume, and then perhaps a bit more to take into

account future growth (data does not grow smaller over time). The scalability
aspect of data mining is important in three ways:
— Transforming the data into customer signatures requires a lot of I/O and
computing power.
— Building models is a repetitive and computationally expensive process.
— Scoring models requires complex data transformations.

For exploring and transforming data, the most readily available scalable
software is the relational database. Relational databases have been designed
to take advantage of multiple processors and multiple disks when handling a
single database query. Another class of software, the extraction,
transformation, and load (ETL) tools used to create databases, may also be
scalable and useful for data mining. However, most programming languages do
not scale; they support only single processors and single disks for handling
a single task. When there is a lot of data that needs to be combined, the
most scalable solution is often found in the database.
Building models and exploring data require software that runs fast enough
and on large enough quantities of data. Some data mining tools only work on
data in memory, so the volume of data is limited by available memory. This has
the advantage that algorithms run faster. On the other hand, there are limits. In
practice, this was a problem when available memory was measured in
megabytes; the gigabytes of memory available even on a typical workstation
ameliorate the problem. Often, the data mining environment puts multiuser
data mining software on a powerful server close to the data. This is a good
solution. As workstations become more powerful, building the models locally is
also a viable solution. In either case, the goal is to run the models on hundreds
of thousands or millions of rows in a reasonable amount of time. A data
mining environment should encourage users to understand and explore the data,
rather than expending effort sampling it down to make it fit in memory.
The scoring environment is often the most complex, because it requires
transforming the data and running the models at the same time, preferably with a
minimal amount of user interaction. Perhaps the best solution is when data
mining software can both read and write to relational databases, making it
possible to use the database for scalable data manipulation and the data
mining tool for efficient model building.
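The read-and-write-back pattern described above can be sketched as follows. The table layout and the toy scoring function are invented for illustration; the function stands in for a model produced by a real data mining tool:

```python
import sqlite3

# Read signatures from the database, apply the model, write scores back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signatures "
             "(customer_id INTEGER, tenure_months INTEGER, balance REAL)")
conn.executemany("INSERT INTO signatures VALUES (?, ?, ?)",
                 [(1, 36, 1200.0), (2, 3, 150.0)])
conn.execute("CREATE TABLE scores (customer_id INTEGER, score REAL)")

def score(tenure_months, balance):
    # Stand-in for a real model: longer tenure and a higher balance
    # yield a higher score, capped at 1.0.
    return min(1.0, 0.01 * tenure_months + balance / 10000.0)

# The database handles the scalable data manipulation; the loop applies
# the model and writes each score back for later campaign selection.
for cust_id, tenure, balance in conn.execute("SELECT * FROM signatures"):
    conn.execute("INSERT INTO scores VALUES (?, ?)",
                 (cust_id, score(tenure, balance)))

print(list(conn.execute("SELECT * FROM scores ORDER BY customer_id")))
```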

Support for Scoring
The ability to write to as well as read from a database is desirable when data
mining is used to develop models used for scoring. The models may be devel­
oped using samples extracted from the master database, but once developed,
the models will score every record in the database.

The value of a response model decreases with time. Ideally, the results of
one campaign should be analyzed in time to affect the next one. But, in many
organizations there is a long lag between the time a model is developed and
the time it can be used to append scores to a database; sometimes the time is
measured in weeks or months. The delay is caused by the difficulty of moving
the scoring model, which is often developed on a different computer from the
database server, into a form that can be applied to the database. This might
involve interpreting the output of a data mining tool and writing a computer
program that embodies the rules that make up the model.
The problem is even worse when the database is actually stored at a third
facility, such as that of a list processor. The list processor is unlikely to accept a
neural network model in the form of C source code as input to a list selection
request. Building a unified model development and scoring framework
requires significant integration effort, but if scoring large databases is an
important application for your business, the effort will be repaid.
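One common way to bridge that gap is to translate the finished model into something the database or list processor can already run, such as SQL. The sketch below hand-writes a tiny decision tree (it is not the output of any actual tool) and renders it as a nested CASE expression:

```python
# Each node is either ('leaf', score) or
# ('split', column, threshold, left_subtree, right_subtree).
tree = ('split', 'tenure_months', 12,
        ('leaf', 0.9),                      # short tenure: likely responder
        ('split', 'balance', 500.0,
         ('leaf', 0.4), ('leaf', 0.1)))

def tree_to_sql(node):
    """Recursively render a decision tree as a nested SQL CASE expression."""
    if node[0] == 'leaf':
        return str(node[1])
    _, col, threshold, left, right = node
    return ("CASE WHEN {} <= {} THEN {} ELSE {} END"
            .format(col, threshold, tree_to_sql(left), tree_to_sql(right)))

print("SELECT customer_id, " + tree_to_sql(tree) + " AS score FROM signatures")
```

The generated expression can be shipped as part of an ordinary SELECT, so the scoring runs wherever the data lives, with no model file or source code to install.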

Multiple Levels of User Interfaces
In many organizations, several different communities of users use the data
mining software. In order to accommodate their differing needs, the tool
should provide several different user interfaces:
— A graphical user interface (GUI) for the casual user that has reasonable
default values for data mining parameters.
— Advanced options for more skilled users.
— An ability to build models in batch mode (which could be provided by a
command line interface).
— An applications program interface (API) so that predictive modeling can be
built into applications.
The GUI for a data mining tool should not only make it easy for users to
build models, it should be designed to encourage best practices such as ensur­
ing that model assessment is performed on a hold-out set and that the target
variables for predictive models come from a later timeframe than the inputs.
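Those two practices can be sketched in a few lines of code; the field names, the month cutoff, and the 70/30 split below are illustrative choices, not prescriptions:

```python
import random

# Sketch of two best practices: assess models on a hold-out set, and draw
# the target variable from a later timeframe than the inputs.
random.seed(42)
rows = [{'customer': c, 'month': m, 'spend': random.random()}
        for c in range(100) for m in range(1, 13)]

inputs = {c: 0.0 for c in range(100)}   # built only from months 1-9
later = {c: 0.0 for c in range(100)}    # months 10-12, strictly later
for r in rows:
    if r['month'] <= 9:
        inputs[r['customer']] += r['spend']
    else:
        later[r['customer']] += r['spend']

# Target: did the customer spend heavily in the later window?
targets = {c: int(later[c] > 1.5) for c in range(100)}

# Hold out 30% of customers; they are never used for model building,
# only for assessing the finished model.
customers = list(range(100))
random.shuffle(customers)
train_ids, holdout_ids = customers[:70], customers[70:]
```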
The user interface should include a help system, with context-sensitive help.
The user interface should provide reasonable default values for such things
as the minimum number of records needed to support a split in a decision
tree or the number of nodes in the hidden layer of a neural network to improve
the chance of success for casual users. On the other hand, the interface should
make it easy for more knowledgeable users to change the defaults. Advanced
users should be able to control every aspect of the underlying data mining
algorithms.
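At the API level, the same layering might look like the following sketch: sensible defaults for casual users, keyword overrides for advanced users, and a plain function that batch scripts and applications can call. The names and parameters are invented, not any real product's API:

```python
def train_decision_tree(data, target,
                        min_records_per_split=50,  # reasonable default
                        max_depth=6):              # advanced users override
    """Stand-in trainer: returns a description of the run, not a real model."""
    return {
        'rows': len(data),
        'target': target,
        'min_records_per_split': min_records_per_split,
        'max_depth': max_depth,
    }

# Casual user: the defaults just work.
model = train_decision_tree([{'age': 30, 'buyer': 1}] * 100, target='buyer')

# Advanced user: tune the same parameters the GUI would expose.
deep_model = train_decision_tree([{'age': 30, 'buyer': 1}] * 100,
                                 target='buyer',
                                 min_records_per_split=10, max_depth=12)
```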

Comprehensible Output
Tools vary greatly in the extent to which they explain themselves. Rule gener­
ators, tree visualizers, Web diagrams, and association tables can all help.
Some vendors place great emphasis on the visual representation of both
data and rules, providing three-dimensional data terrain maps, geographic
information systems (GIS), and cluster diagrams to help make sense of com­
plex relationships. The final destination of much data mining work is reports
for management, and the power of graphics should not be underestimated for
convincing non-technical users of data mining results. A data mining tool
should make it easy to export results to commonly available reporting and
analysis packages such as Excel and PowerPoint.

Ability to Handle Diverse Data Types
Many data mining software packages place restrictions on the kinds of data
that can be analyzed. Before investing in a data mining software package, find
out how it deals with the various data types you want to work with.
Some tools have difficulty using categorical variables (such as model, type,
gender) as input variables and require the user to convert these into a series of
yes/no variables, one for each possible class. Others can deal with categorical
variables that take on a small number of values, but break down when faced
with too many. On the target field side, some tools can handle a binary classi­
fication task (good/bad), but have difficulty predicting the value of a categor­
ical variable that can take on several values.
Some data mining packages on the market require that continuous variables
(income, mileage, balance) be split into ranges by the user. This is especially
likely to be true of tools that generate association rules, since these require a
certain number of occurrences of the same combination of values in order to
recognize a rule.
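Both workarounds are mechanical enough to sketch directly. The category list and the income cut points below are invented for illustration; real tools automate both steps:

```python
def one_hot(value, categories):
    """Expand one categorical value into a list of yes/no (0/1) indicators."""
    return [1 if value == c else 0 for c in categories]

def bin_value(value, edges):
    """Assign a continuous value to the index of the first range it fits."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)

genders = ['F', 'M', 'U']
print(one_hot('M', genders))           # -> [0, 1, 0]

income_edges = [25000, 50000, 100000]  # illustrative cut points
print(bin_value(62000, income_edges))  # -> 2 (the 50,000-100,000 range)
```

Note that one-hot encoding multiplies the field count by the number of categories, which is exactly why tools that "break down when faced with too many" values are a practical concern.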
Most data mining tools cannot deal with text, although such support is start­
ing to appear. If the text strings in the data are standardized codes (state, part
number), this is not really a problem, since character codes can easily be con­
verted to numeric or categorical ones. If the application requires the ability to
analyze free text, some of the more advanced data mining tool sets are starting
to provide support for this capability.

Documentation and Ease of Use
A well-designed user interface should make it possible to start mining right
away, even if mastery of the tool requires time and study. As with any complex
software, good documentation can spell the difference between success and
frustration. Before deciding on a tool, ask to look over the manual. It is very

important that the product documentation fully describes the algorithms
used, not just the operation of the tool. Your organization should not be basing
decisions on techniques that are not understood. A data mining tool that relies
on any sort of proprietary and undisclosed “secret sauce” is a poor choice.

Availability of Training for Both Novice and Advanced
Users, Consulting, and Support
It is not easy to introduce unfamiliar data mining techniques into an organiza­
tion. Before committing to a tool, find out the availability of user training and
applications consulting from the tool vendor or third parties.
If the vendor is small and geographically remote from your data mining loca­
tions, customer support may be problematic. The Internet has shrunk the planet
so that every supplier is just a few keystrokes away, but it has not altered the
human tendency to sleep at night and work in the day; time zones still matter.

Vendor Credibility
Unless you are already familiar with the vendor, it is a good idea to learn
something about its track record and future prospects. Ask to speak to
references who have used the vendor's software and can substantiate the claims
made in product brochures.
We are not saying that you should not buy software from a company just
because it is new, small, or far away. Data mining is still at the leading edge of
commercial decision-support technology. It is often small, start-up companies
that first understand the importance of new techniques and successfully bring
them to market. And paradoxically, smaller companies often provide better,
more enthusiastic support, since the people answering questions are likely to
be the people who designed and built the product.

Lessons Learned
The ideal data mining environment consists of a customer-centric corporate
culture and all the resources to support it. Those resources include data, data
miners, data mining infrastructure, and data mining software. In this ideal
data mining environment, the need for good information is ingrained in the
corporate culture, operational procedures are designed with the need to gather
good data in mind, and the requirements for data mining shape the design of
the corporate data warehouse.
Building the ideal environment is not easy. The hardest part of building a
customer-centric organization is changing the culture, and how to accomplish
that is beyond the scope of this book. From a purely data perspective, the first

step is to create a single customer view that encompasses all the relationships
the company has with a customer across all channels. The next step is to create

