



Genetic Algorithms

Like memory-based reasoning and neural networks, genetic algorithms are
based on an analogy to biological processes. Evolution and natural selection
have, over the course of millions of years, resulted in adaptable, specialized
species that are highly suited to their environments. Evolution optimizes the
fitness of individuals over succeeding generations by propagating the genetic
material in the fittest individuals of one generation to the next generation.
Genetic algorithms apply the same idea to problems where the solution
can be expressed as an optimal “individual” and the goal is to maximize the
“fitness” of individuals. Many problems can be described in this way; the chal­
lenge is encoding the problem in the right fashion. For instance, one application
of genetic algorithms is to the training of neural networks. An individual is then
a set of values for the weights inside the network; the fitness of an individual
is the accuracy of the neural network having those weights on the training set.
The training proceeds in an evolutionary way, by having more fit individuals
propagate their weights to succeeding generations. Less fit individuals, and
their genetic material, do not survive. Although chance plays a significant
role in the survival of any particular individual, over a larger population there
are enough examples of different types of individuals for natural selection to
propagate the genetic material that produces the fittest individuals.
Genetic algorithms, which are also called evolutionary algorithms, have been
applied to optimization problems in various industries, including complex
scheduling problems, resource optimization in large factories, and classification

422 Chapter 13

problems involving complex data types. They have also been used in combi­
nation with other data mining algorithms, including determining the best
topology for neural networks, determining the scoring function for memory-
based reasoning, and, as already mentioned, optimizing weights in neural net­
works. However, genetic algorithms are not commonly found in general data
mining packages.

Optimization problems have three features:
— A set of parameters (which for GAs are called genomes or chromosomes)
— A function that combines the parameters into a single value (the fitness
function)
— A set of constraints on the parameters (for GAs these are incorporated
into the fitness function)
The goal is to find the parameters that maximize or minimize the fitness
function, subject to the constraints. Searching through all combinations of
parameters that meet the constraints is too cumbersome for even the most
advanced computers; even for a small number of parameters, the number of
combinations is too large to search.
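A quick calculation shows how fast the search space grows; the parameter counts below are purely illustrative:

```python
# Exhaustive search must examine v**k candidates for k parameters,
# each taking one of v values -- exponential in k.
def search_space_size(num_params, values_per_param):
    return values_per_param ** num_params

# Even modest problems are intractable: 50 binary parameters
# already yield more than 10**15 combinations.
print(search_space_size(5, 2))    # 32: tiny, searchable
print(search_space_size(50, 2))   # 1125899906842624: hopeless
```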

Genetic algorithms are one approach to solving such problems, but not the
only one. When the fitness function satisfies some specific mathematical
conditions, then differential calculus can be used to find the optimal solution.
Although few functions in practice are differentiable, calculus also includes
ideas for estimating solutions in other cases. The conjugate-gradient method
for training neural networks is based on such ideas, as is the “solver” capability
in Excel.
Another approach arises with linear programming problems. These are
problems where the fitness function is linear and all the constraints are also
linear. These constraints are often met in resource allocation problems, such as:
A company produces widgets in a set of factories. Each factory
has a capacity, a cost of production, and a cost for transporting
widgets to customers. How many widgets should each factory
produce to satisfy customer demand at minimal cost?
The standard method for solving such problems is called the Simplex
method, and it is computationally efficient. Such problems have been solved
with thousands of variables. Further information on linear programming
problems is available in the linear programming FAQ at www-unix.mcs.anl.
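For a toy instance of the widget problem, the search space is small enough to enumerate directly; a real solver would use the Simplex method instead. All the capacities, costs, and demand figures below are invented for illustration:

```python
from itertools import product

# Hypothetical data: two factories, each with a capacity and a
# combined production-plus-transport cost per widget.
capacity = [40, 60]
unit_cost = [7, 9]
demand = 50

# Enumerate every feasible integer production plan and keep the cheapest.
# (Exponential in the number of factories -- fine for a toy, not for real use.)
best_plan, best_cost = None, float("inf")
for plan in product(range(capacity[0] + 1), range(capacity[1] + 1)):
    if sum(plan) == demand:
        cost = sum(q * c for q, c in zip(plan, unit_cost))
        if cost < best_cost:
            best_plan, best_cost = plan, cost

print(best_plan, best_cost)  # the cheaper factory runs at full capacity
```

As expected, the solution saturates the cheap factory first, which is exactly the structure the Simplex method exploits without enumeration.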
Another approach is called simulated annealing. This uses an analogy to a
physical process: certain liquids form crystalline patterns as they cool
slowly. The crystal minimizes certain types of energy, and it does so across
the entire crystal. Scientists studying physical properties are the most
common users of simulated annealing.
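As a sketch, simulated annealing can be applied to the simple parabola example used later in this chapter; the ±1 neighborhood and the cooling schedule here are arbitrary choices, not part of any standard recipe:

```python
import math
import random

def fitness(p):
    return 31 * p - p * p  # maximum of 240 at p = 15 and p = 16

random.seed(0)
p = random.randint(0, 31)
temperature = 10.0
while temperature > 0.01:
    # Propose a random neighboring solution.
    candidate = min(31, max(0, p + random.choice([-1, 1])))
    delta = fitness(candidate) - fitness(p)
    # Always accept improvements; accept worse moves with a
    # probability that shrinks as the "liquid" cools.
    if delta >= 0 or random.random() < math.exp(delta / temperature):
        p = candidate
    temperature *= 0.95  # cooling schedule

print(p, fitness(p))
```

Early on, the high temperature lets the search escape poor regions; as the temperature drops, the walk settles near a peak.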


The first work on genetic algorithms dates back to the late 1950s, when biol­
ogists and computer scientists worked together to model the mechanisms of
evolution on early computers. A bit later, in the early 1960s, Professor John
Holland and his colleagues at the University of Michigan applied this work on
computerized genetics (chromosomes, genes, alleles, and fitness functions)
to optimization problems. In 1967, one of Holland's students, J. D. Bagley,
coined the term genetic algorithms in his graduate thesis to describe the opti­
mization technique. At the time, many researchers were uncomfortable with
genetic algorithms because of their dependence on random choices during the
process of evolving a solution; these choices seemed arbitrary and unpre­
dictable. In the 1970s, Prof. Holland developed a theoretical foundation for the
technique. His theory of schemata gives insight into why genetic algorithms
work, and intriguingly suggests why genetics itself creates successful,
adaptable creatures such as ourselves.
In the world of data mining and data analysis, the use of genetic algorithms
has not been as widespread as the use of other techniques. Data mining
focuses on tasks such as classification and prediction, rather than on optimiza­
tion. Although many data mining problems can be framed as optimization
problems, this is not the usual description. For instance, a typical data mining
problem might be to predict the level of inventory needed for a given item in a
catalog based on the first week of sales, characteristics about the items in the
catalog, and its recipients. Rephrasing this as an optimization problem turns it
into something like “what function best fits the inventory curve for predictive
purposes.” Applying statistical regression techniques is one way to find the
function. Feeding the data into a neural network is another way of estimating
it. Using genetic algorithms offers another possibility. The aside “Optimiza­
tion” discusses other methods designed specifically for this purpose.
This chapter covers the background of genetics on computers, and intro­
duces the schema mechanism invented by John Holland to explain why
genetic algorithms work. It talks about two case studies, one in the area of
resource optimization and the other in the area of classifying email messages.
Although few commercial data mining products include genetic algorithms
(GAs), more specialized packages do support the algorithms. They are an
important and active area of research and may become more widely used in
the future.

How They Work
The power of genetic algorithms comes from their foundation in biology, where
evolution has proven capable of adapting life to a multitude of environments
(see sidebar “Simple Overview of Genetics”). The success of mapping the

template for the human genome (all the common DNA shared by human
individuals) is just the beginning. The human genome has been providing
knowledge for advances in many fields, such as medical research, biochem­
istry, genetics, and even anthropology. Although interesting, human genomics
is fortunately beyond the scope of knowledge needed to understand genetic
algorithms. The language used to describe the computer technique borrows
heavily from the biological model, as discussed in the following section.

Genetics on Computers
A simple example helps illustrate how genetic algorithms work: trying to find
the maximum value of a simple function with a single integer parameter p. The
function in this example is the parabola (which looks like an upside-down
“U”) defined by 31p − p² where p varies between 0 and 31 (see Figure 13.1). The
parameter p is expressed as a string of 5 bits to represent the numbers from 0
to 31; this bit string is the genetic material, called a genome. The fitness function
peaks at the values 15 and 16, represented as 01111 and 10000, respectively.
This example shows that genetic algorithms are applicable even when there
are multiple, dissimilar peaks.
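In code, the genome encoding and fitness function for this example might be sketched as:

```python
def decode(genome):
    """Convert a 5-bit genome string such as '01111' to its integer value."""
    return int(genome, 2)

def fitness(genome):
    p = decode(genome)
    return 31 * p - p * p  # the parabola, maximized at p = 15 and p = 16

print(fitness("01111"), fitness("10000"))  # both peaks score 240
```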



[Figure: a parabola showing the fitness value on the vertical axis against
the parameter p, from 0 to 31, on the horizontal axis.]

Figure 13.1 Finding the maximum of this simple function helps illustrate
genetic algorithms.

GAs work by evolving successive generations of genomes that get progressively
more fit; that is, they provide better solutions to the problem.
In nature, fitness is simply the ability of an organism to survive and reproduce.
On a computer, evolution is simulated with the following steps:
1. Identify the genome and fitness function.
2. Create an initial generation of genomes.
3. Modify the initial population by applying the operators of genetic
evolution: selection, crossover, and mutation.
4. Repeat Step 3 until the fitness of the population no longer improves.
The first step is setting up the problem. In this simple example, the genome
consists of a single, 5-bit gene for the parameter p. The fitness function is the
parabola. Over the course of generations, the fitness function is going to be
maximized.
For this example, shown in Table 13.1, the initial generation consists of
four genomes, randomly produced. A real problem would typically have a
population of hundreds or thousands of genomes, but that is impractical for
illustrative purposes. Notice that in this population, the average fitness is
122.5. That is pretty good, since the actual maximum is 240, but evolution can
improve it.
The basic algorithm modifies the initial population using three operators
(selection, then crossover, then mutation) as illustrated in Figure 13.2.
These operators are explained in the next three sections.

Table 13.1 Four Randomly Generated Genomes

GENOME   VALUE OF p   FITNESS (31p − p²)
10110    22           198
00011    3            84
00010    2            58
11001    25           150
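The values in Table 13.1 can be checked directly from the fitness function:

```python
def fitness(p):
    return 31 * p - p * p

# The four randomly generated genomes from Table 13.1.
population = ["10110", "00011", "00010", "11001"]
values = [int(g, 2) for g in population]   # [22, 3, 2, 25]
scores = [fitness(p) for p in values]      # [198, 84, 58, 150]
print(sum(scores) / len(scores))           # average fitness: 122.5
```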

[Figure: the three genetic operators applied to a population, from
generation n to generation n + 1.
Selection keeps the size of the population constant but increases the fitness
of the next generation: genomes with a higher fitness proliferate, while less
fit genomes die off.
Crossover is a way of combining two genomes. A crossover position determines
where the genomes “break” and are recombined.
Mutation makes an occasional random change to a random position in a genome.
This allows features to appear that may not have been in the original
population.]

Figure 13.2 The basic operators in genetic algorithms are selection, crossover,
and mutation.

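Putting the three operators together, the whole algorithm can be sketched as follows. The fitness-proportional (roulette-wheel) selection, the pairing scheme, the mutation rate, and the tiny population are illustrative choices, not prescriptions:

```python
import random

GENOME_LEN = 5

def fitness(genome):
    p = int(genome, 2)
    return 31 * p - p * p

def select(population):
    # Fitness-proportional selection: fitter genomes are more likely
    # to survive into the next generation; population size is constant.
    weights = [fitness(g) + 1 for g in population]  # +1 keeps weights positive
    return random.choices(population, weights=weights, k=len(population))

def crossover(a, b):
    # Break both genomes at a random position and recombine the pieces.
    point = random.randint(1, GENOME_LEN - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(genome, rate=0.05):
    # Occasionally flip a random bit, reintroducing lost features.
    return ''.join(('1' if b == '0' else '0') if random.random() < rate else b
                   for b in genome)

random.seed(1)
population = ["10110", "00011", "00010", "11001"]  # Table 13.1
for generation in range(20):
    population = select(population)
    pairs = [crossover(population[0], population[1]),
             crossover(population[2], population[3])]
    population = [mutate(g) for pair in pairs for g in pair]

print(max(fitness(g) for g in population))
```

With a realistic population of hundreds or thousands of genomes, the same loop applies unchanged; only the population list grows.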
