13

Genetic Algorithms

Like memory-based reasoning and neural networks, genetic algorithms are based on an analogy to biological processes. Evolution and natural selection have, over the course of millions of years, resulted in adaptable, specialized species that are highly suited to their environments. Evolution optimizes the fitness of individuals over succeeding generations by propagating the genetic material in the fittest individuals of one generation to the next generation.

Genetic algorithms apply the same idea to problems where the solution can be expressed as an optimal “individual” and the goal is to maximize the “fitness” of individuals. Many problems can be described in this way; the challenge is encoding the problem in the right fashion. For instance, one application of genetic algorithms is to the training of neural networks. An individual is then a set of values for the weights inside the network; the fitness of an individual is the accuracy of the neural network having those weights on the training set. The training proceeds in an evolutionary way, by having more fit individuals propagate their weights to succeeding generations. Less fit individuals, and their genetic material, do not survive. Although chance plays a significant role in the survival of any particular individual, over a larger population there are enough examples of different types of individuals for natural selection to propagate the genetic material that produces the fittest individuals.
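The neural network application described above can be sketched in miniature. The snippet below is purely illustrative (the helper names, the tiny "network" of a single neuron, and the AND-gate training data are all assumptions, not from the text): the genome is a list of three floats (two weights and a bias), and the fitness is classification accuracy on the training set.

```python
import random

# Toy sketch: evolving the weights of a single artificial neuron so
# that it computes the logical AND function. Genome = [w1, w2, bias];
# fitness = accuracy on the four training cases.

TRAIN = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def fitness(genome):
    w1, w2, b = genome
    correct = 0
    for (x1, x2), target in TRAIN:
        output = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
        if output == target:
            correct += 1
    return correct / len(TRAIN)   # accuracy on the training set

def evolve(generations=50, pop_size=20, rng=None):
    rng = rng or random.Random(0)
    # Random initial population of weight vectors.
    population = [[rng.uniform(-1, 1) for _ in range(3)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half survives to the next generation.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # Mutation: children are perturbed copies of random survivors.
        children = [[w + rng.gauss(0, 0.3) for w in rng.choice(survivors)]
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
```

Because the fittest individuals are carried over unchanged (elitism), the best fitness in the population never decreases from one generation to the next.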

Genetic algorithms, which are also called evolutionary algorithms, have been applied to optimization problems in various industries, including complex scheduling problems, resource optimization in large factories, and classification problems involving complex data types. They have also been used in combination with other data mining algorithms, including determining the best topology for neural networks, determining the scoring function for memory-based reasoning, and, as already mentioned, optimizing weights in neural networks. However, genetic algorithms are not commonly found in general data mining packages.

OPTIMIZATION

Optimization problems have three features:

— A set of parameters (which for GAs are called genomes or chromosomes)

— A function that combines the parameters into a single value (the fitness function)

— A set of constraints on the parameters (for GAs these are incorporated into the fitness function)

The goal is to find the parameters that maximize or minimize the fitness function, subject to the constraints. Searching through all combinations of parameters that meet the constraints is too cumbersome for even the most advanced computers; even for a small number of parameters, the number of combinations is too large to search.

Genetic algorithms are one approach to solving such problems, but not the only one. When the fitness function satisfies some specific mathematical conditions, then differential calculus can be used to find the optimal solution. Although few functions in practice are differentiable, calculus also includes ideas for estimating solutions in other cases. The conjugate-gradient method for training neural networks is based on such ideas, as is the “solver” capability in Excel.

Another approach arises with linear programming problems. These are problems where the fitness function is linear and all the constraints are also linear. These constraints are often met in resource allocation problems, such as:

A company produces widgets in a set of factories. Each factory has a capacity, a cost of production, and a cost for transporting widgets to customers. How many widgets should each factory produce to satisfy customer demand at minimal cost?

The standard method for solving such problems is called the Simplex method, and it is computationally efficient. Such problems have been solved with thousands of variables. Further information on linear programming type problems is available on the linear programming FAQ at www-unix.mcs.anl.gov/otc/Guide/faq/linear-programming-faq.html.
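A tiny instance of the widget problem makes the structure concrete. The sketch below solves it by brute-force enumeration rather than the Simplex method (which is what makes real instances with thousands of variables tractable); the factory names, capacities, and costs are invented for illustration, and the per-widget cost here folds production and transport together.

```python
# Two factories, one pool of customers: choose production levels that
# meet demand at minimal cost, respecting each factory's capacity.

CAPACITY = {"A": 40, "B": 30}      # widgets each factory can make
UNIT_COST = {"A": 2.0, "B": 3.0}   # cost per widget, incl. transport
DEMAND = 50

def best_allocation():
    best_plan, best_cost = None, float("inf")
    # Enumerate every feasible split of demand between the factories.
    for a in range(CAPACITY["A"] + 1):
        b = DEMAND - a
        if 0 <= b <= CAPACITY["B"]:
            cost = a * UNIT_COST["A"] + b * UNIT_COST["B"]
            if cost < best_cost:
                best_plan, best_cost = {"A": a, "B": b}, cost
    return best_plan, best_cost

plan, cost = best_allocation()
# The cheaper factory A runs at full capacity: {"A": 40, "B": 10}, cost 110.0
```

Because the objective and constraints are all linear, the Simplex method would find the same answer without enumerating anything.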

Another approach is called simulated annealing. This uses an analogy to a physical process: as some liquids cool, they form crystalline patterns, and the crystal minimizes certain types of energy across its entire extent. Scientists studying physical properties are the most common users of simulated annealing.
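The physical analogy translates directly into a search procedure. In the sketch below (illustrative only; the "energy" function and cooling schedule are assumptions), worse moves are accepted with probability exp(-delta / T), and the temperature T is gradually lowered so the search settles into a low-energy state:

```python
import math
import random

# Minimal simulated annealing: minimize energy(x) = (x - 3)^2 over
# the integers 0..31 by randomly stepping left or right.

def energy(x):
    return (x - 3) ** 2

def anneal(steps=2000, rng=None):
    rng = rng or random.Random(1)
    x = rng.randrange(32)
    t = 10.0                          # starting temperature
    for _ in range(steps):
        neighbor = min(31, max(0, x + rng.choice((-1, 1))))
        delta = energy(neighbor) - energy(x)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = neighbor
        t = max(0.01, t * 0.995)      # cooling schedule
    return x

result = anneal()
```

Early on, the high temperature lets the search escape local minima; by the end, it behaves like a greedy descent.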


The first work on genetic algorithms dates back to the late 1950s, when biologists and computer scientists worked together to model the mechanisms of evolution on early computers. A bit later, in the early 1960s, Professor John Holland and his colleagues at the University of Michigan applied this work on computerized genetics (chromosomes, genes, alleles, and fitness functions) to optimization problems. In 1967, one of Holland's students, J. D. Bagley, coined the term genetic algorithms in his graduate thesis to describe the optimization technique. At the time, many researchers were uncomfortable with genetic algorithms because of their dependence on random choices during the process of evolving a solution; these choices seemed arbitrary and unpredictable. In the 1970s, Prof. Holland developed a theoretical foundation for the technique. His theory of schemata gives insight into why genetic algorithms work, and intriguingly suggests why genetics itself creates successful, adaptable creatures such as ourselves.

In the world of data mining and data analysis, the use of genetic algorithms has not been as widespread as the use of other techniques. Data mining focuses on tasks such as classification and prediction, rather than on optimization. Although many data mining problems can be framed as optimization problems, this is not the usual description. For instance, a typical data mining problem might be to predict the level of inventory needed for a given item in a catalog based on the first week of sales, characteristics about the items in the catalog, and its recipients. Rephrasing this as an optimization problem turns it into something like “what function best fits the inventory curve for predictive purposes.” Applying statistical regression techniques is one way to find the function. Feeding the data into a neural network is another way of estimating it. Using genetic algorithms offers another possibility. The aside “Optimization” discusses other methods designed specifically for this purpose.

This chapter covers the background of genetics on computers, and introduces the schema mechanism invented by John Holland to explain why genetic algorithms work. It talks about two case studies, one in the area of resource optimization and the other in the area of classifying email messages. Although few commercial data mining products include genetic algorithms (GAs), more specialized packages do support the algorithms. They are an important and active area of research and may become more widely used in the future.

How They Work

The power of genetic algorithms comes from their foundation in biology, where evolution has proven capable of adapting life to a multitude of environments (see sidebar “Simple Overview of Genetics”). The success of mapping the template for the human genome, all the common DNA shared by human individuals, is just the beginning. The human genome has been providing knowledge for advances in many fields, such as medical research, biochemistry, genetics, and even anthropology. Although interesting, human genomics is fortunately beyond the scope of knowledge needed to understand genetic algorithms. The language used to describe the computer technique borrows heavily from the biological model, as discussed in the following section.

Genetics on Computers

A simple example helps illustrate how genetic algorithms work: trying to find the maximum value of a simple function with a single integer parameter p. The function in this example is the parabola (which looks like an upside-down “U”) defined by 31p − p² where p varies between 0 and 31 (see Figure 13.1). The parameter p is expressed as a string of 5 bits to represent the numbers from 0 to 31; this bit string is the genetic material, called a genome. The fitness function peaks at the values 15 and 16, represented as 01111 and 10000, respectively. This example shows that genetic algorithms are applicable even when the maximum is achieved by multiple, dissimilar genomes.

[Figure 13.1 plots the fitness value (0 to 300) against the parameter p (0 to 30), showing the upside-down “U” of the parabola.]

Figure 13.1 Finding the maximum of this simple function helps illustrate genetic algorithms.

GAs work by evolving successive generations of genomes that get progressively more and more fit; that is, they provide better solutions to the problem. In nature, fitness is simply the ability of an organism to survive and reproduce. On a computer, evolution is simulated with the following steps:

1. Identify the genome and fitness function.

2. Create an initial generation of genomes.

3. Modify the initial population by applying the operators of genetic algorithms.

4. Repeat Step 3 until the fitness of the population no longer improves.
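These steps can be sketched in a few lines of Python, using the parabola example. This is a toy illustration under simplifying assumptions (helper names are invented, Step 4 is reduced to a fixed number of generations, and the selection, crossover, and mutation operators shown here are the simplest possible versions of the ones described below):

```python
import random

def fitness(genome):
    p = int(genome, 2)            # decode the 5-bit string to an integer
    return 31 * p - p * p         # Step 1: the parabola fitness function

def mutate(genome, rng, rate=0.1):
    # Occasionally flip one random bit.
    if rng.random() < rate:
        i = rng.randrange(len(genome))
        genome = genome[:i] + ("1" if genome[i] == "0" else "0") + genome[i + 1:]
    return genome

def evolve(generations=30, pop_size=4, rng=None):
    rng = rng or random.Random(0)
    # Step 2: create a random initial generation of 5-bit genomes.
    population = ["".join(rng.choice("01") for _ in range(5))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Step 3: selection -- the fitter half survives...
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # ...then crossover recombines two parents at a random cut,
        # and mutation tweaks the children.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, 5)
            children.append(mutate(a[:cut] + b[cut:], rng))
        population = parents + children
    # Step 4 is simplified to a fixed number of generations.
    return max(population, key=fitness)

best = evolve()
```

Because the fittest genomes survive unchanged into the next generation, the best fitness in the population never decreases.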

The first step is setting up the problem. In this simple example, the genome consists of a single, 5-bit gene for the parameter p. The fitness function is the parabola. Over the course of generations, the fitness function is going to be maximized.

For this example, shown in Table 13.1, the initial generation consists of four genomes, randomly produced. A real problem would typically have a population of hundreds or thousands of genomes, but that is impractical for illustrative purposes. Notice that in this population, the average fitness is 122.5, which is pretty good, since the actual maximum is 240, but evolution can improve it.

The basic algorithm modifies the initial population using three operators (selection, then crossover, then mutation), as illustrated in Figure 13.2. These operators are explained in the next three sections.

Table 13.1 Four Randomly Generated Genomes

GENOME   P    FITNESS
10110    22   198
00011    3    84
00010    2    58
11001    25   150
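The fitness column in Table 13.1 follows directly from the encoding, as this small check shows (the variable names are illustrative): decode each 5-bit genome to its integer value p, then apply 31p − p².

```python
def fitness(genome):
    p = int(genome, 2)        # e.g. "10110" -> 22
    return 31 * p - p * p     # e.g. 31 * 22 - 484 = 198

# The initial generation from Table 13.1.
generation = ["10110", "00011", "00010", "11001"]
values = {g: fitness(g) for g in generation}
average = sum(values.values()) / len(values)   # 122.5, as in the text
```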

[Figure 13.2 depicts generation n evolving into generation n + 1, with individual genomes surviving, dying off, or multiplying. Its annotations read: Selection keeps the size of the population constant but increases the fitness of the next generation; genomes with a higher fitness (darker shading) proliferate and genomes with lighter shading die off. Crossover is a way of combining two genomes; a crossover position determines where the genomes “break” and are recombined. Mutation makes an occasional random change to a random position in a genome; this allows features to appear that may not have been in the original population.]

Figure 13.2 The basic operators in genetic algorithms are selection, crossover, and mutation.
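The crossover and mutation operators in Figure 13.2 are mechanical enough to state in code. The sketch below is illustrative (the function names are invented, and the crossover position and mutated bit, chosen at random in a real GA, are passed in explicitly here to keep the example deterministic):

```python
def crossover(a, b, position):
    # Break both genomes at the same position and swap their tails.
    return a[:position] + b[position:], b[:position] + a[position:]

def point_mutation(genome, position):
    # Flip the single bit at the given position.
    flipped = "1" if genome[position] == "0" else "0"
    return genome[:position] + flipped + genome[position + 1:]

# Crossing the two fittest genomes of Table 13.1 after bit 2 swaps
# their last three bits; mutating the first bit of "10110" flips it.
children = crossover("10110", "00011", 2)   # ("10011", "00110")
mutant = point_mutation("10110", 0)         # "00110"
```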