<<

. 84
( 137 .)



>>

Genetic Algorithms 427


SIMPLE OVERVIEW OF GENETICS

Life depends on proteins, which consist of sequences of 20 basic units called
amino acids. The chromosomes in the nucleus of a cell are strands of DNA
(deoxyribonucleic acid) that carry the blueprints for the proteins needed by the
cell. The 23 chromosomes in each human cell together are the genome for that
individual. In general, the genomes of different individuals in a species are very
similar to each other; however, there are some individual differences.
The DNA in the genome encodes these blueprints for the amino acids
sequences using strands of nucleotides. These nucleotides constitute the four
letters of the genetic alphabet:
— A, adenine
— C, cytosine
— G, guanine
— T, thymine
Triplets of nucleotides represent the 20 amino acids. For instance, the amino
acid called methionine corresponds to the triplet ATG. Another amino acid,
lysine, has two “spellings”: AAA and AAG. So, if a strand of DNA contains the
following letters:
ATGAAGATGCGA
then it decodes into a protein containing four amino acids: methionine, ATG;
lysine, AAG; methionine, ATG; followed by arginine, CGA (see figure). This
description intentionally glosses over the details of the actual biochemical
mechanism that turns the blueprints into proteins, but it provides a high-level
outline of the mapping from genetic information in DNA to the building blocks
of proteins.
A biological example of encoding is the mapping from nucleotides in DNA to
amino acids in protein.
In this simplified model, the process of evolution works as follows. The
proteins produced by the representations in the DNA express themselves as
features of the living organism, such as blue eyes, five fingers, the structure of
the brain, a long trunk, and so on. Genes can express themselves in damaging
ways, causing the resulting organism to die. Healthy organisms survive to
produce offspring and pass their DNA to the next generation. In higher-level
animals, the DNA is actually combined with the DNA from another survivor
during sexual replication, using a technique called crossover. Sometimes,
mistakes are made in passing genes from one generation to the next”these are
mutations. The combination of all these processes over the course of many
generations results in organisms highly adapted to their environment: the
process of evolution.
(continued)
428 Chapter 13



SIMPLE OVERVIEW OF GENETICS (continued)

DNA with Amino Acids
Nucleotides for Proteins


Adenine

Methionine
Thymine

Guanine

Adenine

Lysine
Adenine

Guanine

Adenine

Methionine
Thymine

Guanine

Cytosine

Arginine
Guanine

Adenine
Genetic Algorithms 429


Selection
The selection step is analogous to the process of natural selection where only
the fittest individuals in the population survive to pass their genetic material
on to the next generation. Unlike nature, though, the size of the population
remains constant from one generation to the next, so there is no chance of the
population becoming extinct (which would clearly not be the optimum solu­
tion!). The chance of a genome surviving to the next generation is proportional
to its fitness value”the better the fitness value relative to other genomes, the
more copies that survive to the next generation. Table 13.2 shows the ratio of
the fitness of the four genomes to the population fitness. This ratio determines
the number of copies of each genome expected in the next generation.
The expected number of copies is a fraction, but the number of genomes in
the population is never fractional. Survival is based on choosing the genomes
in a random way proportional to their fitness. A random number is generated
between 0 and 1, and this random number is used to determine whether a
copy of a genome survives or not. Using the example from Table 13.2, if the
first random number is less than 0.404, then genome 10110 would be chosen; if
it is between 0.404 and 0.576 (40.4% + 17.1%), the genome 00011 would be cho­
sen, and so on. More random numbers are generated until the next generation
has the right number of genomes. Using a random number generator converts
the fractional probabilities to whole number approximations, and it also
allows some genomes with low fitness to survive.
Applying selection to the original four genomes yields the survivors shown
in Table 13.3. Notice that in general this procedure produces more copies of the
fitter genomes and fewer of the less fit. One of the less fit, 00011, has not sur­
vived this round of selection, but there are two copies of 10110, the fittest. And,
the average fitness of the population has increased from 122.5 to 151.0.

Table 13.2 Using Fitness for Selection

% OF TOTAL
POPULATION EXPECTED
GENOME FITNESS FITNESS COPIES

10110 198 40.4% 1.62

00011 84 17.1% 0.69

00010 58 11.8% 0.47

11001 150 30.6% 1.22
430 Chapter 13


Table 13.3 The Population after Selection

GENOME P FITNESS

10110 22 198

11001 25 150

00010 2 58

10110 22 198




Crossover
The next operator applied to the surviving genomes is crossover. Crossover,
which occurs in nature, creates two new genomes from two existing ones by
gluing together pieces of each one. As shown in Figure 13.2, crossover starts
with two genomes and a random position. The first part of one genome swaps
places with the first part of the second. For instance, starting with the two
genomes 10110 and 00010 and using a crossover position between the second
and third position works as follows:
10¦110
00¦010
The result of crossover is (the genes from the second genome are underlined):
10¦010
00¦110
The resulting genomes, called children, each have a piece of their chromo­
somes inherited from each of their parents. Applying crossover to the popula­
tion proceeds by selecting pairs of genomes and flipping a coin to determine
whether they split and swap. This probability is the crossover probability,
denoted by pc. If they do cross over, then a random position is chosen and the
children of the original genomes replace them in the next generation. A value
of 0.5 for the crossover probability (corresponding to a coin toss) generally
produces good results. In the example, the two genomes 10110 and 00010 are
chosen for crossover, and the position is between the second and third genes
(Table 13.4). Notice that after selection and crossover, the average fitness of the
population has gone from 122.5 to 183.0. This is a significant improvement
after only one generation.
Genetic Algorithms 431


Table 13.4 The Population after Selection and Crossover

GENOME P FITNESS

10010 18 234

11001 25 150

00110 6 150

10110 22 198




Mutation
The final operation is mutation. Mutation rarely occurs in nature and is the
result of miscoded genetic material being passed from a parent to a child. The
resulting change in the gene may represent a significant improvement in fit­
ness over the existing population, although more often than not, the results are
harmful. Selection and crossover do a good job of searching the space of pos­
sible genomes, but they depend on initial conditions and randomness that
might conspire to prevent certain valuable combinations from being consid­
ered in succeeding generations. Mutation provides the additional input. The
mutation rate is quite small in nature and is usually kept quite low for genetic
algorithms”no more than one mutation or so per generation is a reasonable
bound. For the example at hand, when a mutation occurs, the bit changes from
a 0 to a 1 or from a 1 to a 0.
Assume that there is one mutation in this generation, occurring in the sec­
ond genome at position 3. Table 13.5 shows the population of genomes after
such a mutation. Notice that this mutation, like many mutations, is destruc­
tive: The fitness of the genome affected by the mutation decreased from 150 to
58, the average fitness of the population decreased from 183.0 to 160.0, and the
resulting genome is unlikely to survive to the next generation. This is not
unusual. The primary modus operandi of genetic algorithms is selection and
crossover. Mutation is very much a second-order effect that helps avoid pre­
mature convergence to a local optimum. When the initial population provides
good coverage of the space of possible combinations, succeeding generations
move quickly toward the optimal solution by means of selection and
crossover. Changes introduced by mutation are likely to be destructive and do
not last for more than a generation or two. Yet, despite the harmful mutation
in this example, the second generation is a considerable improvement over the
original population.
432 Chapter 13


Table 13.5 The Population after Selection, Crossover, and Mutation

GENOME P FITNESS

10010 18 234

11101 29 58

00110 6 150

10110 22 198


The basis of genetic algorithms is the continual improvement of the fitness of
the population by means of selection, crossover, and mutation as genes are
passed from one generation to the next. After a certain number of generations”




Y
typically several dozen or hundred”the population evolves to a near-optimal
solution. Genetic algorithms do not always produce the exact optimal solution,




FL
but they do a very good job of getting close to the best solution. In data mining,
where exact solutions may not be feasible, being close to the best solution still
AM
yields actionable results.


Representing Data
TE

The previous example illustrated the basic mechanisms of applying genetic
algorithms to the optimization of a simple function, 31p “ p2. Since the example
was trying to maximize a particular function, the function itself served as the
fitness function. The genomes were quite easy to create, because the function
had one parameter, a 5-bit integer that varied between 0 and 31. This genome
contained a single gene, representing the parameter, and consisting of a
sequence of 5 binary bits. The choice of representation using binary sequences
is not accidental. As explained later in the section on schemata, genetic algo­
rithms work best on binary representations of data”a highly convenient cir­
cumstance, since computers themselves work most efficiently on binary data.
Genetic algorithms are different from other data mining and optimization
techniques in that they manipulate the patterns of bits in the genomes and do
not care at all about the values represented by the bits”only the fitness func­
tion knows what the patterns really mean. One requirement for the fitness
function is the ability to transform any genome into a fitness value. This
requirement does not seem particularly onerous, because computers are used

<<

. 84
( 137 .)



>>