<<

. 88
( 137 .)



>>

f“
ai




d
em




to


to


to


to


to
ph




”a
an
th

un


un


un


un


un




od
ng
by


by




K”
Co


Co


Co


Co


Co




“fo
“JF
Le
Is


Is




1 0 104 0 1 1 0 1 ... 1 0 ...
as
bi




...
0 0 0.1 0.3 0 0 0.1 0.2 0.9

Figure 13.6 The genome has a weight for each field in the comment signature, plus
an additional weight called a bias.
444 Chapter 13




Figure 13.7 The Genalytics system shows the progress of the training and how
the fitness function improves with each generation.




Beyond the Simple Algorithm
Researchers have been pushing the bounds on genetic algorithms in several
directions. Some of these enhancements are refinements to the basic algorithm;
others modify the algorithm to provide a better model of genetic activity in the
natural world. This work is often performed under the rubric of machine learn­
ing, an area of current artificial intelligence research that aims to enable comput­
ers to learn in ways analogous to humans. Companies such as Genalytics are
starting to apply these evolutionary techniques to marketing.
The simple genetic algorithm previously described has room for improve­
ment in several areas. One of the inefficiencies in the algorithm is the fact that
entire populations are replaced from one generation to the next. This is a gross
oversimplification of what happens in nature. Instead of replacing an entire
population, some researchers have worked with overlapping populations that
can grow in size. They have introduced the notion of crowding to determine
which existing members should be targeted for replacement. When applied
naively, this tends to result in very fast convergence, often to suboptimal solu­
tions, because all the less fit genomes are replaced before they have an oppor­
tunity to reproduce”and the less fit genomes sometimes have something to
offer. To get around this, the targets for replacement often come from subsets
of the population that exhibit high degrees of similarity.
The issue of overly fast convergence is actually a problem for the simple
genetic algorithm because the goal of finding the globally optimal solution is
Genetic Algorithms 445


easily confused by locally optimal solutions. Overly fast convergence often
suggests that the search is being limited. To get around this, the various prob­
abilities for crossover and mutation are often set high initially, then slowly
decreased from one generation to the next. Or, an initial population is allowed
to grow before shrinking in size as the fitness becomes more consistent across
the population.
The genomes discussed so far consist of only a single strand of genes. Didn™t
we learn back in high school that DNA consists of two intertwining strands in
a helix structure? And what happened to those other concepts buried back in
this high-school past, such as recessive and dominant genes? The genetics used
so far is based on the simplest chromosomes found in nature, single-stranded, or
haploid chromosomes. These tend to be found in uncomplicated, single-cell
organisms. In more complex organisms, the chromosomes are two-stranded,
or diploid, as in our own DNA.
The algorithmic characteristics of diploid chromosomes are much the same
as haploid chromosomes, since diploid chromosomes can be treated as two
chromosomes tied together. The actual algorithm proceeds in much the same
way: Selection, crossover, and mutation are the same. The difference is that
now there are two alleles for each gene (two possible values) instead of one.
When they match, there is no problem. When they do not, which does the
fitness function use? In the language of genetics, this is asking which of the
alleles is expressed. For instance, when an allele for blue eyes pairs up with an
allele for brown eyes, the brown eyes “win”; that is, they are expressed instead
of the blue eyes. (Actually, eye color is a bit more complicated than this sim­
ple example, but this is useful for explanatory purposes.) Researchers have
solved this problem by including information about dominance in the alleles
themselves. The details of this mechanism are beyond the scope of this book.
The interested reader is referred to Goldberg™s classic book Genetic Algorithms
in Search, Optimization, and Machine Learning (1989, Addison-Wesley, ISBN
0201157675)
Why we should care about diploid structures? Geneticists have long won­
dered why two-stranded chromosomes predominate in nature, when single-
stranded ones are simpler. They believe that the two-stranded structure allows
an organism to “remember” a gene that was useful in another environment,
but has become less useful in the current environment. In terms of GA,
this suggests that these are useful in cases where the environment”or fitness
function”is changing over time. In the real world, this may prove to be quite
useful. An example of a changing fitness function would be a function that
tried to determine the price of bonds over time. The goodness of a given bond
price depends on factors not under control of the algorithm, such as the rate of
inflation. The “fitness” function can take this into account by changing over
time to incorporate estimates of inflation.
446 Chapter 13


Lessons Learned

Genetic algorithms are a very powerful optimization technique. Optimization
is not at the heart of data mining, but it can solve interesting and important
problems. In fact, some data mining algorithms, such as neural networks,
depend on optimization “under the hood.”
The key to the power of genetic algorithms is that they depend on only two
things. First is the genome and the second the fitness function. The fitness
function makes sense of the genome, by producing a value from what looks
like a random set of bits. The genome encodes the problem; often it consists of
a set of weights on some equation. Genetic algorithms work on a wide variety
of fitness functions, making it possible to encode many different types of prob­
lems that are not easily handled by other means.
The process of evolution starts with a random population and then applies
three transformation steps. The first is selection, which means that more fit
genomes survive from one generation to another. This corresponds to natural
selection. The second is crossover, where two genomes swap pieces and is
analogous to a similar natural process. The third is mutation, where some val­
ues are changed randomly. Mutations are usually quite rare both in nature and
in genetic algorithms.
The application of these three processes produces a new generation, whose
average fitness should be greater than the original. As more and more genera­
tions are created, the population moves to an optimal solution. These processes
have a theoretical foundation, based on schemata. This theory explains how
genetic algorithms move toward a solution.
Genetic algorithms have been applied to practical problems, often to resource
optimization problems. However, they can even be used for predictive model­
ing and classification, as explained in the case study on classifying comments
made to an airline.
CHAPTER

14
Data Mining throughout the
Customer Life Cycle




The purpose of data mining is to help businesses realize value from their most
important asset: customers. Earlier chapters have talked about the algorithms
and methodology for making data mining successful. This chapter turns from
the specific technology to customers. The next three chapters continue this
theme, stepping away from technical algorithms to talk about the data and the
systems environment needed to exploit data mining.
For almost any business, customers are the critical asset. Yet, they are elu­
sive, because of the wide variety of different relationships that change over
time. Different industries have different definitions of customers. Within an
industry, different competitors have different approaches to managing these
relationships. Some focus on quality of service, some on convenience, some on
price, and some on other aspects of the relationship. No two businesses have
exactly the same definition of a customer, nor treat customers the same way
throughout the relationship.
The purpose of data mining is to complement other customer service initia­
tives, not to replace them. Customer interactions take place through many
channels”through direct mail pieces, through call centers, face-to-face, via
advertising. Now that the “click and mortar” way of doing business is becom­
ing standard, most businesses provide an online interface to their customers.
The Web, with its new capabilities for interacting with customers, has the
potential to provide a wealth of customer behavior data that can be turned into
a new window on the customer relationship. It is ironic that a technology that

447
448 Chapter 14


has largely replaced human-to-human interactions is allowing companies to
treat their customers more personally.
This brings us back to the customer and to the customer life cycle. This chap­
ter strives to put data mining into focus with the customer at the center. It
starts with an overview of different types of customer relationships, then goes
into the details of the customer life cycle as it relates to data mining. The chap­
ter provides examples of how customers are defined in various industries and
some of the issues in deciding when the customer relationship begins and
when it ends. The focal point is the customer and the ongoing relationship that
customers have with companies.


Levels of the Customer Relationship
One of the major goals of data mining is to understand customers and the rela­
tionships that customers have with an organization. A good place to start
understanding them better is by using the different levels of customer rela­
tionships and what customers are telling us through their behavior.
Customers generate a wealth of behavioral information. Every payment
made, every call to customer service, every click on the Web, every transaction
provides information about what each customer does, and when, and which
interventions are working and which are not. The Web is a particularly rich
source of information. CNN does not know who is viewing or paying attention

<<

. 88
( 137 .)



>>