Soft Computing New
Soft computing differs from conventional or hard computing in that, unlike hard computing, it is
tolerant of imprecision, uncertainty, partial truth, and approximate reasoning. Conventional or
hard computing requires a precisely stated analytical model and a lot of mathematical and logical
computation.
Components of soft computing include:
Neural networks (NN),
Fuzzy logic (FL),
Evolutionary computation (EC), (which includes Evolutionary algorithms, Genetic algorithms,
Differential evolution),
Meta-heuristics and Swarm Intelligence, (which includes Ant colony optimization, Particle
swarm optimization, Firefly algorithm, Cuckoo search),
Support Vector Machines (SVM),
Probabilistic reasoning, (which includes Bayesian network), and Chaos theory.
NOTE:
SC is evolving rapidly
Generally speaking, soft computing techniques resemble biological processes more closely than
traditional techniques, which are largely based on formal logical systems, such as sentential logic
and predicate logic, or rely heavily on computer-aided numerical analysis (as in finite element
analysis). Soft computing techniques are intended to complement each other.
Unlike hard computing schemes, which strive for exactness and full truth, soft computing
techniques exploit the given tolerance of imprecision, partial truth, and uncertainty for a
particular problem. Another common contrast comes from the observation that inductive
reasoning plays a larger role in soft computing than in hard computing.
Genetic algorithms (GAs) are search methods based on principles of natural selection and
genetics. GAs were first described by John Holland in the 1960s and further developed by
Holland and his students and colleagues at the University of Michigan in the 1960s and 1970s.
Holland's goal was to understand the phenomenon of "adaptation" as it occurs in nature and to
develop ways in which the mechanisms of natural adaptation might be imported into computer
systems. Holland's 1975 book Adaptation in Natural and Artificial Systems (Holland,
1975/1992) presented the GA as an abstraction of biological evolution and gave a theoretical
framework for adaptation under the GA.
The population size, which is usually a user-specified parameter, is one of the important
factors affecting the scalability and performance of genetic algorithms. For example, small
population sizes might lead to premature convergence and yield substandard solutions. On the
other hand, large population sizes lead to unnecessary expenditure of valuable computational
time. Once the problem is encoded in a chromosomal manner and a fitness measure for
discriminating good solutions from bad ones has been chosen, we can start to evolve solutions to
the search problem using the following steps:
1. Initialization: The initial population of candidate solutions is usually generated randomly
across the search space. However, domain-specific knowledge or other information can be easily
incorporated.
2. Evaluation: Once the population is initialized or an offspring population is created, the fitness
values of the candidate solutions are evaluated.
3. Selection: Selection allocates more copies of those solutions with higher fitness values and
thus imposes the survival-of-the-fittest mechanism on the candidate solutions. The main idea of
selection is to prefer better solutions to worse ones, and many selection procedures have been
proposed to accomplish this idea, including roulette-wheel selection, stochastic universal
selection, ranking selection and tournament selection, some of which are described in the next
section.
4. Recombination: Recombination combines parts of two or more parental solutions to create
new, possibly better solutions (i.e. offspring). There are many ways of accomplishing this (some
of which are discussed in the next section), and competent performance depends on a properly
designed recombination mechanism. The offspring under recombination will not be identical to
any particular parent and will instead combine parental traits in a novel manner
5. Mutation: While recombination operates on two or more parental chromosomes, mutation
locally but randomly modifies a solution. Again, there are many variations of mutation, but it
usually involves one or more changes being made to an individual's trait or traits. In other words,
mutation performs a random walk in the vicinity of a candidate solution.
6. Replacement. The offspring population created by selection, recombination, and mutation
replaces the original parental population. Many replacement techniques such as elitist
replacement, generation-wise replacement and steady-state replacement methods are used in
GAs.
7. Repeat steps 2–6 until a terminating condition is met. Goldberg has likened GAs to
mechanistic versions of certain modes of human innovation and has shown that these operators,
when analyzed individually, are ineffective, but when combined together they can work well.
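To make steps 1–7 concrete, here is a minimal sketch of the generational loop in Python. The bit-string encoding, the illustrative onemax fitness function, tournament selection, one-point crossover, bit-flip mutation, and every parameter value are assumptions chosen for illustration, not prescriptions from the text.

```python
import random

def onemax(bits):                      # fitness (step 2): count of 1 bits
    return sum(bits)

def tournament(pop, fits, k=2):        # selection (step 3)
    picks = random.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: fits[i])]

def one_point_crossover(p1, p2):       # recombination (step 4)
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:]

def mutate(bits, pm=0.01):             # mutation (step 5)
    return [1 - b if random.random() < pm else b for b in bits]

def genetic_algorithm(n_bits=20, pop_size=30, generations=50):
    # Step 1: random initialization across the search space
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):               # step 7: repeat until termination
        fits = [onemax(ind) for ind in pop]    # step 2: evaluation
        offspring = []
        for _ in range(pop_size):
            p1, p2 = tournament(pop, fits), tournament(pop, fits)
            offspring.append(mutate(one_point_crossover(p1, p2)))
        pop = offspring                        # step 6: generation-wise replacement
    return max(pop, key=onemax)

print(genetic_algorithm())
```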
Recall Taylor's theorem from one-variable calculus. Given a one-variable function f(x), you can fit
it with a polynomial around x = a:
f(x) ≈ f(a) + f′(a)(x − a).
This linear approximation fits f(x) with a line through x = a that matches the slope of f at a.
We can add additional, higher-order terms to approximate f(x) better near a. The best quadratic
approximation is
f(x) ≈ f(a) + f′(a)(x − a) + (1/2)f′′(a)(x − a)²,
and adding a cubic term gives
f(x) ≈ f(a) + f′(a)(x − a) + (1/2)f′′(a)(x − a)² + (1/6)f′′′(a)(x − a)³ + ⋯.
The important point is that this Taylor polynomial approximates f(x) well for x near a.
Now consider a scalar-valued function of several variables,
f(x) = f(x1, x2, …, xn).
Its linear approximation near x = a is
f(x) ≈ f(a) + Df(a)(x − a),
where Df(a) is the matrix of partial derivatives. The linear approximation is the first-order Taylor
polynomial. To improve it, we want a term analogous to the single-variable quadratic term
(1/2)f′′(a)(x − a)².
For a function of multiple variables f(x), what is analogous to the second derivative?
Since f(x) is scalar, the first derivative is Df(x), a 1×n matrix, which we can view as an n-
dimensional vector-valued function of the n-dimensional vector x. For the second derivative of
f(x), we can take the matrix of partial derivatives of the function Df(x). We could write it as
DDf(x) for the moment. This second derivative matrix is an n×n matrix called the Hessian
matrix of f. We'll denote it by Hf(x),
Hf(x)=DDf(x).
When f is a function of multiple variables, the second derivative term in the Taylor series will
use the Hessian Hf(a). For the single-variable case, we could rewrite the quadratic term as
(1/2)(x − a) f′′(a) (x − a).
The analogous expression for the multivariable case is
(1/2)(x − a)ᵀ Hf(a) (x − a).
We can add this expression to our first-order Taylor polynomial to obtain the second-order
Taylor polynomial for functions of multiple variables:
f(x) ≈ f(a) + Df(a)(x − a) + (1/2)(x − a)ᵀ Hf(a) (x − a).
The second-order Taylor polynomial is a better approximation of f(x) near x=a than is the linear
approximation (which is the same as the first-order Taylor polynomial). We'll be able to use it
for things such as finding a local minimum or local maximum of the function f(x).
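As a quick numerical check of the second-order formula, the sketch below compares f(x) with its second-order Taylor polynomial built from the gradient Df(a) and the Hessian Hf(a); the test function, expansion point and displacement are arbitrary choices.

```python
import numpy as np

# Illustrative test function (an arbitrary choice) and its exact derivatives.
f = lambda x: x[0] ** 2 * x[1] + np.sin(x[1])
grad = lambda x: np.array([2 * x[0] * x[1], x[0] ** 2 + np.cos(x[1])])
hess = lambda x: np.array([[2 * x[1], 2 * x[0]],
                           [2 * x[0], -np.sin(x[1])]])

a = np.array([1.0, 0.5])          # expansion point
x = a + np.array([0.1, -0.05])    # a nearby point
d = x - a

# Second-order Taylor polynomial: f(a) + Df(a)(x-a) + 1/2 (x-a)^T Hf(a) (x-a)
taylor2 = f(a) + grad(a) @ d + 0.5 * d @ hess(a) @ d

print(f(x), taylor2)   # the two values agree closely for x near a
```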
Fuzzy Set: It is a set without crisp or sharp boundary. Here the transition from “belonging to a
set” to “not belonging to a set” is gradual. This is done by assigning a membership function to
the elements or members of the set. The members of a fuzzy set get a membership value within 0
and 1. Here it is obvious that the membership value 0 indicates that the element is not a member
of the fuzzy set.
Definition: If X is a collection of objects denoted generically by x, then fuzzy set Af in X is
defined as a set of ordered pairs:
Af = { ( x, µA(x) ) | x ∈ X }
where µA(x) is called the membership function for the fuzzy set Af.
The membership function (MF)maps each element of X to a membership grade or membership
value between 0 and 1.
Subset: A fuzzy set A is a subset of a fuzzy set B (A ⊆ B) if and only if μA(x) ≤ μB(x) for all x ∈ X.
MF Formulation
Generalized bell MF: gbellmf(x; a, b, c) = 1 / (1 + |(x − c)/a|^(2b))
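A possible implementation of the generalized bell MF (a sketch; the parameter values in the example call are arbitrary):

```python
import numpy as np

def gbellmf(x, a, b, c):
    """Generalized bell membership function: 1 / (1 + |(x - c)/a|^(2b))."""
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

x = np.linspace(0, 10, 11)
print(gbellmf(x, a=2.0, b=4.0, c=5.0))   # grades near 1 around x = c, falling toward 0
```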
Cylindrical Extension:
Fuzzy Complement:
Fuzzy Intersection (T-norm):
Basic requirements for a T-norm T(a, b):
• Boundary: T(0, 0) = 0, T(a, 1) = T(1, a) = a
• Monotonicity: T(a, b) ≤ T(c, d) if a ≤ c and b ≤ d
• Commutativity: T(a, b) = T(b, a)
• Associativity: T(a, T(b, c)) = T(T(a, b), c)
Four examples:
• Minimum: Tmin(a, b) = min(a, b)
• Algebraic product: Ta(a, b) = a·b
• Bounded product: Tb(a, b) = max(0, a + b − 1)
• Drastic product: Td(a, b) = a if b = 1; b if a = 1; 0 otherwise
T-norms and T-conorms are duals which support the generalization of De Morgan's law:
• T(a, b) = N(S(N(a), N(b)))
• S(a, b) = N(T(N(a), N(b)))
Extension Principle:
A is a fuzzy set on X: A = μA(x1)/x1 + μA(x2)/x2 + ⋯ + μA(xn)/xn
Fuzzy Relations:
A fuzzy relation R is a 2D MF: R = {((x, y), μR(x, y)) | (x, y) ∈ X × Y}
Examples:
• x is close to y (x and y are numbers)
• x depends on y (x and y are events)
• x and y look alike (x and y are persons or objects)
• If x is large, then y is small (x is an observed instrument reading and y is a
corresponding control action which is inversely proportional to x.)
The relation "x is close to y" can be plotted as a membership surface μ(x, y) over the x–y plane, with membership values ranging from 0 to 1.
Composition of fuzzy relations has the following properties:
• Associativity: R ∘ (S ∘ T) = (R ∘ S) ∘ T
• Monotonicity: S ⊆ T ⇒ (R ∘ S) ⊆ (R ∘ T)
Max-Star Composition:
Max-product composition: μR1∘R2(x, z) = maxy [ μR1(x, y) · μR2(y, z) ]
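The sketch below illustrates max-product composition of two fuzzy relations stored as membership matrices; the numerical grades are made up for illustration.

```python
import numpy as np

def max_product(R1, R2):
    """Max-product composition: mu(x, z) = max_y [ mu_R1(x, y) * mu_R2(y, z) ]."""
    return np.max(R1[:, :, None] * R2[None, :, :], axis=1)

# Illustrative relations on X x Y and Y x Z (values are arbitrary).
R1 = np.array([[0.1, 0.4, 0.8],
               [0.9, 0.2, 0.3]])
R2 = np.array([[0.5, 0.7],
               [1.0, 0.2],
               [0.6, 0.9]])

print(max_product(R1, R2))   # 2x2 matrix of composed membership grades
```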
Fuzzy If-Then Rules – Example:
if (profession is athlete) then (fitness is high)
Coupling: Athletes, and only athletes, have high fitness.
The “if” statement (antecedent) is a necessary and sufficient condition.
Entailing: Athletes have high fitness, and non-athletes may or may not have high fitness.
The “if” statement (antecedent) is a sufficient but not necessary condition.
The input can be raw input data or the output of other perceptrons. The output can be the final
result (e.g. 1 means yes, 0 means no) or it can be inputs to other perceptrons.
The network:
Each ANN is composed of a collection of perceptrons grouped in layers. A typical
structure is shown in Fig.3.
Note the three layers of the shown ANN. They are input, intermediate (called the hidden
layer) and output layers.
Several hidden layers can be placed between the input and output layers.
Perceptrons:
A perceptron takes a vector of real-valued inputs, calculates a linear combination of these
inputs, then outputs
a 1 if the result is greater than some threshold
–1 otherwise.
Given real-valued inputs x1 through xn, the output o(x1, …, xn) computed by the perceptron is
o(x1, …, xn) = 1 if w0 + w1x1 + w2x2 + ⋯ + wnxn > 0, and −1 otherwise,
where each wi is a real-valued weight and −w0 plays the role of the threshold.
Some sets of positive and negative examples cannot be separated by any hyperplane.
Those that can be separated are called linearly separable sets of examples.
Fig.5: (a) Linearly separable classes (b) Not linearly separable classes
AND function: The decision hyperplane −0.8 + 0.5 x1 + 0.5 x2 = 0 (i.e. w0 = −0.8, w1 = 0.5, w2 = 0.5)
can represent an AND function: the output is +1 only when both x1 and x2 are 1.
OR function:
Similarly a Decision hyper-plane w0 + w1 x1 + w2 x2 = 0 with w0 = -0.3, w1 =0.5 and w2 = 0.5,
that is, the hyper-plane -0.3 + 0.5 x1 + 0.5 x2 = 0 can represent an OR function.
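These decision hyperplanes can be checked directly. The sketch below evaluates a threshold unit with the OR weights (w0 = −0.3, w1 = w2 = 0.5) and the AND weights (w0 = −0.8, w1 = w2 = 0.5) on all four binary input combinations.

```python
def perceptron(x1, x2, w0, w1, w2):
    """Threshold unit: output 1 if w0 + w1*x1 + w2*x2 > 0, else -1."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              perceptron(x1, x2, -0.3, 0.5, 0.5),   # OR:  +1 whenever either input is 1
              perceptron(x1, x2, -0.8, 0.5, 0.5))   # AND: +1 only when both inputs are 1
```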
XOR function: In contrast, the XOR function is not linearly separable; no single hyperplane
w0 + w1 x1 + w2 x2 = 0 can separate its positive and negative examples, so a single perceptron cannot represent it.
The perceptron training rule updates each weight according to
wi ← wi + Δwi
where Δwi = η (t − o) xi.
Here t is the target output for the current training example, o is the perceptron output, and η
is a small positive constant (e.g., 0.1) called the learning rate.
The role of the learning rate is to moderate the degree to which weights are changed at each
step. It is usually set to some small value (e.g. 0.1) and is sometimes made to decrease as the
number of weight-tuning iterations increases. It can be proved that the algorithm converges if the
training data are linearly separable and η is sufficiently small.
If the data is not linearly separable, convergence is not assured.
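A minimal sketch of the perceptron training rule wi ← wi + η(t − o)xi; the OR training set, the learning rate, and the epoch count are illustrative choices.

```python
import random

def train_perceptron(examples, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i (x_0 = 1 for the bias)."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
            w[0] += eta * (t - o)                       # bias weight update
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi          # weight update for input i
    return w

# Illustrative linearly separable training set: the OR function with +/-1 targets.
or_examples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(or_examples))
```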
Gradient Descent and the Delta Rule:
Although the perceptron rule finds a successful weight vector when the training examples are
linearly separable, it can fail to converge if the examples are not linearly separable. A second
training rule, called the delta rule, is designed to overcome this difficulty. The key idea of the
delta rule is to use gradient descent to search the space of possible weight vectors for the weights
that minimize the training error
E(w) = (1/2) Σd∈D (td − od)²    (2)
where D is the set of training examples, td is the target output for training example d, and od is
the output of the linear unit for training example d. Here we characterize E as a function of the
weight vector because the linear unit output od depends on this weight vector.
Hypothesis Space: To understand the gradient descent algorithm, it is helpful to visualize the
entire space of possible weight vectors and their associated E values, as illustrated in Figure 5.
Here the axes w0 and w1 represent possible values for the two weights of a simple linear unit. The
w0, w1 plane represents the entire hypothesis space. The vertical axis indicates the error E relative
to some fixed set of training examples. The error surface shown in the figure summarizes the
desirability of every weight vector in the hypothesis space. For linear units, this error surface
must be parabolic with a single global minimum. And we desire a weight vector with this
minimum.
The direction of steepest descent along the error surface can be found from the derivative of E
with respect to each component of the weight vector w; this vector derivative is called the
gradient of E, written
∇E(w) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]    (3)
Notice that ∇E(w) is itself a vector, whose components are the partial derivatives of E with respect to
each of the wi. When interpreted as a vector in weight space, the gradient specifies the direction
that produces the steepest increase in E. The negative of this vector therefore gives the direction
of steepest decrease. Since the gradient specifies the direction of steepest increase of E, the
training rule for gradient descent is
w ← w + Δw
where
Δw = −η ∇E(w)    (4)
Here η is a positive constant called the learning rate, which determines the step size in the
gradient descent search. The negative sign is present because we want to move the weight vector
in the direction that decreases E. This training rule can also be written in its component form
wi ← wi + Δwi
where
Δwi = −η ∂E/∂wi    (5)
which makes it clear that steepest descent is achieved by altering each component wi of the weight
vector in proportion to ∂E/∂wi. The vector of ∂E/∂wi derivatives that forms the gradient can be
obtained by differentiating E from Equation (2), as
∂E/∂wi = Σd∈D (td − od)(−xid)    (6)
where xid denotes the single input component xi for the training example d. We now have an
equation that gives ∂E/∂wi in terms of the linear unit inputs xid, the output od and the target value td
associated with the training example. Substituting Equation (6) into Equation (5) yields the
weight update rule for gradient descent:
Δwi = η Σd∈D (td − od) xid    (7)
The gradient descent algorithm for training linear units is as follows: pick an initial random
weight vector, apply the linear unit to all training examples, then compute Δwi for each weight
according to Equation (7). Update each weight wi by adding Δwi, and then repeat the process. The
algorithm is given in Figure 6.
Because the error surface contains only a single global minimum, this algorithm will converge
to a weight vector with minimum error, regardless of whether the training examples are linearly
separable, provided a sufficiently small η is used. If η is too large, the gradient descent search runs
the risk of overstepping the minimum in the error surface rather than settling into it. For this
reason, one common modification to the algorithm is to gradually reduce the value of η as the
number of gradient descent steps grows.
In standard gradient descent, the error is summed over all examples before
updating the weights, whereas in stochastic gradient descent weights are updated
upon examining each training example.
The modified training rule for stochastic gradient descent updates each weight immediately after
each individual training example is presented, according to
Δwi = η (t − o) xi
where t, o and xi are the target value, unit output, and i-th input for that training example.
Summing over multiple examples in standard gradient descent requires more computation per
weight update step. On the other hand, because it uses the true gradient, standard gradient
descent is often used with a larger step size per weight update than stochastic gradient descent.
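To contrast the two update schedules, the sketch below trains a linear unit with the delta rule in both modes on a toy data set; the data, learning rate and iteration counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.uniform(-1, 1, size=(50, 2))]          # inputs with a bias column
t = X @ np.array([0.5, 2.0, -1.0]) + 0.01 * rng.normal(size=50)   # toy linear targets

def batch_gd(X, t, eta=0.01, steps=200):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        o = X @ w
        w += eta * X.T @ (t - o)          # standard GD: error summed over all examples
    return w

def stochastic_gd(X, t, eta=0.01, steps=200):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        for x, td in zip(X, t):
            w += eta * (td - x @ w) * x   # stochastic GD: update after each example
    return w

print(batch_gd(X, t))        # both recover weights close to [0.5, 2.0, -1.0]
print(stochastic_gd(X, t))
```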
Single perceptrons can only express linear decision surfaces. In contrast, the kind of multilayer
networks learned by the backpropagation (BP) algorithm are capable of expressing a rich variety of
nonlinear decision surfaces. This section discusses how to learn such multilayer networks using
a gradient descent algorithm similar to that discussed in the previous section.
Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then
applies a threshold to the result. In the case of the sigmoid unit, however, the threshold output is a
continuous function of its input. The sigmoid function σ(x) = 1 / (1 + e^(−x)) is also called the
logistic function.
Interesting property: the output ranges between 0 and 1, increasing monotonically with its input,
and its derivative is easily expressed in terms of its output, dσ(x)/dx = σ(x)(1 − σ(x)). We can
derive gradient descent rules to train a single sigmoid unit and, more generally, multilayer
networks of sigmoid units.
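A small sketch of the logistic (sigmoid) function and its convenient derivative form:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """d(sigma)/dx = sigma(x) * (1 - sigma(x)), a convenient form for backpropagation."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))            # ~0, 0.5, ~1
print(sigmoid_derivative(np.array([-5.0, 0.0, 5.0])))
```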
The BP algorithm learns the weights for a multilayer network, given a network with a fixed set
of units and interconnections. It employs a gradient descent to attempt to minimize the squared
error between the network output values and the target values for these outputs. Because we are
considering networks with multiple output units rather than single units as before, we begin by
redefining E to sum the errors over all of the network output units:
E(w) = (1/2) Σd∈D Σk∈outputs (tkd − okd)²
where outputs is the set of output units in the network, and tkd and okd are the target and output
values associated with the k-th output unit and training example d.
xij denotes the input from node i to unit j, and wij denotes the corresponding weight.
δn denotes the error term associated with unit n. It plays a role analogous to the quantity (t – o)
in our earlier discussion of the delta training rule.
1. Input the training example to the network and compute the network output.
2. For each output unit k, compute its error term δk = ok(1 − ok)(tk − ok).
3. For each hidden unit h, compute its error term δh = oh(1 − oh) Σk∈outputs whk δk.
4. Update each network weight wij ← wij + Δwij, where Δwij = η δj xij.
In the BP algorithm, step 1 propagates the input forward through the network, and steps 2, 3
and 4 propagate the errors backward through the network. The main loop of BP repeatedly
iterates over the training examples. For each training example, it applies the ANN to the
example, calculates the error of the network output for this example, computes the gradient with
respect to the error on the example, and then updates all weights in the network. This gradient
descent step is iterated many times over the training examples.
One may choose to halt after a fixed number of iterations through the loop, or once the error on
the training examples falls below some threshold, or once the error on a separate validation set of
examples meets some criteria.
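The sketch below implements this loop for a single-hidden-layer network of sigmoid units with stochastic (per-example) updates. The XOR task, the network size, the learning rate and the iteration count are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative task and architecture: 2 inputs, 3 hidden sigmoid units, 1 output unit (XOR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1 = rng.normal(0, 0.5, (2, 3)); b1 = np.zeros(3)   # input -> hidden weights
W2 = rng.normal(0, 0.5, (3, 1)); b2 = np.zeros(1)   # hidden -> output weights
eta = 0.5                                           # illustrative learning rate

for _ in range(10000):
    for x, t in zip(X, T):
        # Step 1: forward pass
        h = sigmoid(x @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        # Step 2: error term for each output unit k: delta_k = o_k(1 - o_k)(t_k - o_k)
        delta_o = o * (1 - o) * (t - o)
        # Step 3: error term for each hidden unit h: delta_h = o_h(1 - o_h) * sum_k w_hk delta_k
        delta_h = h * (1 - h) * (W2 @ delta_o)
        # Step 4: update every weight: w_ij <- w_ij + eta * delta_j * x_ij
        W2 += eta * np.outer(h, delta_o); b2 += eta * delta_o
        W1 += eta * np.outer(x, delta_h); b1 += eta * delta_h

print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))   # typically close to [0, 1, 1, 0]
```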
Adding Momentum
Because BP is a widely used algorithm, many variations have been developed. The most
common is to alter the weight-update rule in Step 4 in the algorithm by making the weight
update on the nth iteration depend partially on the update that occurred during the (n -1)th
iteration, as follows:
Δwij(n) = η δj xij + α Δwij(n − 1)
Here Δwij(n) is the weight update performed during the n-th iteration through the main loop of
the algorithm, and 0 ≤ α < 1 is a constant called the momentum.
The momentum term tends to:
- keep the ball rolling through small local minima in the error surface, and
- gradually increase the step size of the search in regions where the gradient is unchanging,
thereby speeding convergence.
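In code, momentum keeps each weight's previous update and adds a fraction α of it to the new one. A self-contained sketch (the gradient term and all numbers are arbitrary):

```python
import numpy as np

def momentum_update(w, grad_term, prev_delta, eta=0.05, alpha=0.9):
    """Weight update with momentum: delta_w(n) = eta*grad_term + alpha*delta_w(n-1)."""
    delta = eta * grad_term + alpha * prev_delta
    return w + delta, delta

# Toy illustration: repeatedly applying the same gradient direction shows the step
# size growing while the gradient is unchanging (all numbers are arbitrary).
w = np.zeros(2)
prev = np.zeros(2)
for _ in range(5):
    w, prev = momentum_update(w, grad_term=np.array([1.0, -1.0]), prev_delta=prev)
    print(w)
```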
Boolean functions: Every boolean function can be represented by a network with two
layers of units, although the number of hidden units required may grow exponentially with the
number of inputs.
After 5000 training iterations, the three hidden unit values encode the eight
distinct inputs using the encoding shown on the right.
Most of the interesting weight changes occurred during the first 2500 iterations.
Figure 10.a The plot shows the sum of squared errors for each of the eight output units as the
number of iterations increases. The sum of square errors for each output decreases as the
procedure proceeds, more quickly for some output units and less quickly for others.
Figure 10.b Learning the 8 × 3 × 8 network. The plot shows the evolving hidden layer
representation for the input string "01000000". The network passes through a number of
different encodings before converging to the final encoding.
Termination condition
Overfitting problem
To see the danger of minimizing the error over the training data, consider how the
error E varies with the number of weight-tuning iterations.
Weight decay: Decrease each weight by some small factor during each iteration. The motivation
for this approach is to keep weight values small.
Cross-validation: use a set of validation data in addition to the training data. The algorithm monitors
the error w.r.t. this validation data while using the training set to drive the gradient descent
search.
How many weight-tuning iterations should the algorithm perform? It should use the number of
iterations that produces the lowest error over the validation set. Two copies of the weights are
kept: one copy for training and a separate copy of the best weights thus far measured by their
error over the validation set. Once the trained weights reach a higher error over the validation set
than the stored weights, training is terminated and the stored weights are returned.
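A sketch of this two-copies-of-the-weights idea; `model`, `train_step` and `val_error` are placeholder names assumed to be supplied by the caller, not APIs from the text.

```python
import copy

def train_with_early_stopping(model, train_step, val_error, max_iters=1000):
    """Keep two copies of the weights: the ones being trained and the best so far as
    measured on the validation set; stop when the validation error exceeds the best."""
    best = copy.deepcopy(model)
    best_err = val_error(model)
    for _ in range(max_iters):
        train_step(model)                 # one pass of gradient descent on the training set
        err = val_error(model)
        if err < best_err:                # new best weights: store a copy
            best, best_err = copy.deepcopy(model), err
        elif err > best_err:              # validation error has risen: stop, return stored weights
            break
    return best
```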
Step 2: (Training and testing data separation) Training data must be identified, and a plan must
be made for testing the performance of the network. The available data are divided into training
and testing data sets. For a moderately sized data set, 80% of the data are typically selected at
random for training, 10% for testing, and 10% for secondary testing.
Step 3: (Network architecture) A network architecture and a learning method are selected.
Important considerations are the exact number of perceptrons and the number of layers.
Step 4: (Parameter tuning and weight initialization) There are parameters for tuning the network
to the desired learning performance level. Part of this step is initialization of the network weights
and parameters, followed by modification of the parameters as training performance feedback is
received. Often, the initial values are important in determining the effectiveness and length of
training.
Step 5: (Data transformation) The application data are transformed into the type and format required
by the ANN.
Step 6: (Training) Training is conducted iteratively by presenting input and desired or known
output data to the ANN. The ANN computes the outputs and adjusts the weights until the
computed outputs are within an acceptable tolerance of the known outputs for the input cases.
Step 7: (Testing) Once the training has been completed, it is necessary to test the network. The
testing examines the performance of the network using the derived weights by measuring the
ability of the network to classify the testing data correctly. Black-box testing (comparing test
results to historical results) is the primary approach for verifying that inputs produce the
appropriate outputs.
Then the network can reproduce the desired output given inputs like those in the training set. The
network is ready to use as a stand-alone system or as part of another software system where new
input data will be presented to it and its output will be a recommended decision.
Benefits of ANNs
-Robustness. ANNs tend to be more robust than their conventional counterparts. They have the
ability to cope with incomplete or fuzzy data. ANNs can be very tolerant of faults if properly
implemented.
-Fast processing speed. Because they consist of a large number of massively interconnected
processing units, all operating in parallel on the same problem, ANNs can potentially operate at
considerable speed (when implemented on parallel processors).
- Flexibility and ease of maintenance. ANNs are very flexible in adapting their behavior to new
and changing environments. They are also easier to maintain, with some having the ability to
learn from experience to improve their own performance.
Limitations of ANNs
ANNs do not produce an explicit model, even though new cases can be fed into them and new
results obtained.
Providing some human characteristics to problem solving that are difficult to simulate
using the logical, analytical techniques of expert systems and standard software
technologies. (e.g. financial applications).
Recent improvements
Computational devices have been created in CMOS, for both biophysical simulation and
neuromorphic computing. More recent efforts show promise for creating nano-devices for very
large scale principal components analyses and convolution. If successful, these efforts could
usher in a new era of neural computing that is a step beyond digital computing, because it
depends on learning rather than programming and because it is fundamentally analog rather than
digital even though the first instantiations may in fact be with CMOS digital devices.
Between 2009 and 2012, the recurrent neural networks and deep feedforward neural networks
developed in the research group of Jürgen Schmidhuber at the Swiss AI Lab IDSIA have won
eight international competitions in pattern recognition and machine learning. For example, multi-
dimensional long short term memory (LSTM) won three competitions in connected handwriting
recognition at the 2009 International Conference on Document Analysis and Recognition
(ICDAR), without any prior knowledge about the three different languages to be learned.
Variants of the back-propagation algorithm as well as unsupervised methods by Geoff Hinton
and colleagues at the University of Toronto can be used to train deep, highly nonlinear neural
architectures similar to the 1980 Neocognitron by Kunihiko Fukushima, [17] and the "standard
architecture of vision",[18] inspired by the simple and complex cells identified by David H. Hubel
and Torsten Wiesel in the primary visual cortex.
Deep learning feedforward networks, such as convolutional neural networks, alternate
convolutional layers and max-pooling layers, topped by several pure classification layers. Fast
GPU-based implementations of this approach have won several pattern recognition contests,
including the IJCNN 2011 Traffic Sign Recognition Competition and the ISBI 2012
Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge.
Controversies
Training issues
A common criticism of neural networks, particularly in robotics, is that they require a large
diversity of training for real-world operation. This is not surprising, since any learning machine
needs sufficient representative examples in order to capture the underlying structure that allows
it to generalize to new cases. Dean Pomerleau, in his research presented in the paper
"Knowledge-based Training of Artificial Neural Networks for Autonomous Robot Driving," uses
a neural network to train a robotic vehicle to drive on multiple types of roads (single lane, multi-
lane, dirt, etc.). A large amount of his research is devoted to (1) extrapolating multiple training
scenarios from a single training experience, and (2) preserving past training diversity so that the
system does not become over-trained (if, for example, it is presented with a series of right turns –
it should not learn to always turn right). These issues are common in neural networks that must
decide from amongst a wide variety of responses, but can be dealt with in several ways, for
example by randomly shuffling the training examples, or by using a numerical optimization
algorithm that does not take too large steps when changing the network connections following
an example.
GENETIC ALGORITHM
BASIC CONCEPT:
ENCODING:
As for any search and learning method, the way in which candidate solutions are encoded is a
central, if not the central, factor in the success of a genetic algorithm. Most GA applications use
fixed-length, fixed-order bit strings to encode candidate solutions. However, in recent years,
there have been many experiments with other kinds of encodings. Common approaches used are:
Binary Encoding: Every chromosome is a string of 0s and 1s. Suppose we have a knapsack
of capacity C and N items; then we can encode this problem as follows: a chromosome is a string
of N bits, where bit i represents item i of the problem and is 1 if and only if item i has been
selected, 0 otherwise. The set of all such chromosomes, {0, 1}^N, is the solution space of the problem.
The example shown above has 24 items (and therefore 24 bits), with item 1 selected in both
chromosomes 1 and 2, whereas item 2 is selected in chromosome 2 but not in chromosome 1.
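A small sketch of the binary knapsack encoding together with a fitness function; the item weights, values and capacity are made-up numbers.

```python
import random

# Illustrative knapsack instance (all numbers are made up).
weights = [12, 7, 11, 8, 9]
values = [24, 13, 23, 15, 16]
capacity = 26
N = len(weights)

def random_chromosome():
    """A chromosome is a string of N bits; bit i is 1 iff item i is selected."""
    return [random.randint(0, 1) for _ in range(N)]

def fitness(chromosome):
    """Total value of the selected items; infeasible selections get fitness 0."""
    w = sum(wi for wi, bit in zip(weights, chromosome) if bit)
    v = sum(vi for vi, bit in zip(values, chromosome) if bit)
    return v if w <= capacity else 0

c = random_chromosome()
print(c, fitness(c))
```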
Permutation Encoding (Travelling Salesman Problem): Every chromosome is a string
of numbers that represents a position in a sequence. Problem description: a number of cities and the
distances between them are given. The travelling salesman has to visit all of them, but he doesn't want to
travel more than necessary. Find a sequence of cities with a minimal travelled distance.
Chromosome A: 1 5 3 2 6 4 7 9 8
Chromosome B: 8 5 6 7 2 3 1 4 9
Encoding: Here, encoded chromosomes describe the order of cities the salesman visits. For
example, in chromosome A, the salesman visits city-1 followed by city-5 followed by city-3 and
so on.
Tree Encoding: (Genetic Programming) In tree encoding, every chromosome is a tree of
some objects, such as functions or commands in a programming language. Tree encoding is useful
for evolving programs or any other structures that can be encoded as trees.
Value Encoding: Every chromosome is a sequence of some values (real numbers, characters
or objects). Direct value encoding can be used in problems where complicated values, such
as real numbers, are needed; using binary encoding for this type of problem would be very
difficult. In value encoding, every chromosome is a string of values, which can be
anything connected to the problem, from numbers and characters to more complicated
objects.
Chromosome A 1.2324 5.3243 0.4556 2.3293 2.4545
Chromosome B ABDJEIFJDHDIERJFDLDFLFEGT
Chromosome C (back), (back), (right), (forward), (left)
Two main classes of fitness functions exist: one where the fitness function does not change, as in
optimizing a fixed function or testing with a fixed set of test cases; and one where the fitness
function is mutable, as in niche differentiation or co-evolving the set of test cases. Another way
of looking at fitness functions is in terms of a fitness landscape, which shows the fitness for each
possible chromosome. Definition of the fitness function is not straightforward in many cases and
often is performed iteratively if the fittest solutions produced by GA are not what is desired. In
some cases, it is very hard or impossible to come up even with a guess of what fitness function
definition might be. Interactive genetic algorithms address this difficulty by outsourcing
evaluation to external agents (normally humans). GAs are naturally suited to solving
maximization problems; minimization problems are usually transformed into maximization
problems by a suitable transformation. In general, a fitness function F(i) is first derived from the
objective function and used in successive genetic operations. Fitness in biological sense is a
quality value which is a measure of the reproductive efficiency of chromosomes. In a genetic
algorithm, fitness is used to allocate reproductive opportunities to the individuals in the population and
thus acts as a measure of goodness to be maximized. This means that individuals with a higher
fitness value will have a higher probability of being selected as candidates for further examination.
Certain genetic operators require that the fitness function be non-negative, although certain
operators need not have this requirement. For maximization problems, the fitness function can be
considered to be the same as the objective function or F(i)=O(i). For minimization problems, to
generate non-negative values in all the cases and to reflect the relative fitness of individual
string, it is necessary to map the underlying natural objective function to fitness function form. A
number of such transformations is possible. Two commonly adopted fitness mappings are
presented below. The first maps the objective function value O(i) to the fitness
F(i) = 1 / (1 + O(i)).
This transformation does not alter the location of the minimum, but converts a minimization
problem to an equivalent maximization problem. An alternative function to transform the objective
function into a fitness value is
F(i) = V − O(i),
where, O(i) is the objective function value of i th individual, P is the population size and V is a
large value to ensure non-negative fitness values. The value of V adopted in this work is the
maximum value of the second term of equation so that the fitness value corresponding to
maximum value of the objective function is zero. This transformation also does not alter the
location of the solution, but converts a minimization problem to an equivalent maximization
problem. The fitness function value of a string is known as the string fitness.
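The two minimization-to-maximization mappings can be written directly; the sketch assumes the reconstructed forms F(i) = 1/(1 + O(i)) and F(i) = V − O(i) given above, and the objective values in the example are arbitrary.

```python
def fitness_reciprocal(obj_values):
    """F(i) = 1 / (1 + O(i)): larger objective -> smaller fitness, always non-negative."""
    return [1.0 / (1.0 + o) for o in obj_values]

def fitness_shift(obj_values):
    """F(i) = V - O(i) with V = max_j O(j): the worst individual gets fitness 0."""
    V = max(obj_values)
    return [V - o for o in obj_values]

objective = [3.2, 0.5, 7.1, 1.8]        # arbitrary objective (minimization) values
print(fitness_reciprocal(objective))
print(fitness_shift(objective))
```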
SELECTION:
Chromosomes are selected from the population to be parents to crossover. The problem is how to
select these chromosomes. According to Darwin's evolution theory the best ones should survive
and create new offspring. In principle, a population of individuals selected from the search space,
often in a random manner, serves as candidate solutions to the optimization problem. The
individuals in this population are evaluated through a fitness (adaptation) function. A selection
mechanism is then used to choose individuals to be used as parents of the next generation.
These individuals are then crossed and mutated to form the new offspring. The next
generation is finally formed by a replacement mechanism between parents and their offspring [4].
This process is repeated until a certain termination condition is satisfied. There are many methods
of selecting the best chromosomes, for example roulette wheel selection, Boltzmann selection,
tournament selection, rank selection, steady-state selection and some others.
Roulette Wheel Selection:
Parents are selected according to their fitness: the better the chromosomes are, the more chances
they have of being selected. Imagine a roulette wheel on which all the chromosomes in the
population are placed, each occupying a slice sized in proportion to its fitness. Each individual i
is then selected with probability
p(i) = f(i) / Σj=1..n f(j),
where n denotes the population size in terms of the number of individuals and f(i) is the fitness
of individual i. A well-known
drawback of this technique is the risk of premature convergence of the GA to a local optimum,
due to the possible presence of a dominant individual that always wins the competition and is
selected as a parent.
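A sketch of roulette-wheel selection using the probability p(i) = f(i)/Σ f(j) above; the fitness values in the example are arbitrary.

```python
import random

def roulette_wheel_select(population, fitnesses):
    """Select one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    r = random.uniform(0, total)          # spin the wheel
    cumulative = 0.0
    for individual, f in zip(population, fitnesses):
        cumulative += f
        if r <= cumulative:
            return individual
    return population[-1]                 # guard against floating-point round-off

pop = ["A", "B", "C", "D"]
fits = [1.0, 3.0, 5.0, 1.0]               # arbitrary fitness values
print([roulette_wheel_select(pop, fits) for _ in range(10)])
```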
Linear Rank Selection (LRS):
LRS is also a variant of RWS that tries to overcome the drawback of premature convergence of
the GA to a local optimum. It is based on the rank of individuals rather than on their fitness. The
rank n is accorded to the best individual whilst the worst individual gets the rank 1. Thus, based
on its rank, each individual i has a probability of being selected that is proportional to its rank
(for example, p(i) = 2·rank(i) / (n(n + 1))).
Steady-State Selection:
This is not a particular method of selecting parents. The main idea of this selection is that a large
part of the population should survive to the next generation. The GA then works in the following way:
in every generation a few (good, high-fitness) chromosomes are selected for creating new
offspring. Then some (bad, low-fitness) chromosomes are removed and the new offspring
are placed in their place. The rest of the population survives to the new generation.
Elitism
The idea of elitism has already been introduced. When creating a new population by crossover and
mutation, there is a big chance that we will lose the best chromosome. Elitism is the name of the
method which first copies the best chromosome (or a few best chromosomes) to the new population;
the rest of the population is created in the classical way. Elitism can rapidly increase the performance
of a GA, because it prevents losing the best solution found so far.
REPRODUCTION:
After selection, individuals from the mating pool are recombined (or crossed over) to
create new, hopefully better, offspring. In the GA literature, many crossover methods have been
designed and some of them are described in this section. In most recombination operators, two
individuals are randomly selected and are recombined with a probability pc, called the crossover
probability. That is, a uniform random number, r, is generated and if r ≤ pc, the two randomly
selected individuals undergo recombination. Otherwise, that is, if r > pc, the two offspring are
simply copies of their parents. The value of pc can either be set experimentally, or can be set
based on schema-theorem principles.
k-point Crossover: One-point and two-point crossovers are the simplest and most widely
applied crossover methods. In one-point crossover, illustrated in the figure, a crossover site is
selected at random along the string length, and the alleles on one side of the site are exchanged
between the individuals. In two-point crossover, two crossover sites are randomly selected, and the
alleles between the two sites are exchanged between the two randomly paired individuals. Two-point
crossover is also illustrated in the figure. The concept of one-point crossover can be extended
to k-point crossover, where k crossover points are used, rather than just one or two.
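A sketch of one-point and two-point crossover on bit-string parents; the crossover sites are chosen at random as described above, and the example parents are arbitrary.

```python
import random

def one_point_crossover(p1, p2):
    """Exchange the alleles on one side of a randomly chosen site."""
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def two_point_crossover(p1, p2):
    """Exchange the alleles between two randomly chosen sites."""
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return (p1[:a] + p2[a:b] + p1[b:],
            p2[:a] + p1[a:b] + p2[b:])

p1 = [1, 1, 1, 1, 1, 1, 1, 1]
p2 = [0, 0, 0, 0, 0, 0, 0, 0]
print(one_point_crossover(p1, p2))
print(two_point_crossover(p1, p2))
```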
Order-Based Crossover The order-based crossover operator (Davis, 1985) is a variation of the
uniform order-based crossover in which two parents are randomly selected and two random
crossover sites are generated. The genes between the cut points are copied to the children.
Starting from the second crossover site copy the genes that are not already present in the
offspring from the alternative parent (the parent other than the one whose genes are copied by the
offspring in the initial phase) in the order they appear. For example, as shown in Figure, for
offspring C1, since alleles C, D, and E are copied from the parent P1, we get alleles B, G, F, and
A from the parent P2. Starting from the second crossover site, which is the sixth gene, we copy
alleles B and G as the sixth and seventh genes respectively. We then wrap around and copy
alleles F and A as the first and second genes.
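A sketch of the order-based crossover just described for permutation chromosomes, under one common reading of the procedure: the segment between the cut points is copied from the first parent, and the remaining genes are taken from the second parent in the order they appear there, filling positions after the second cut point and wrapping around.

```python
import random

def order_based_crossover(p1, p2):
    """Create one child: copy p1's segment between two random cut points, then fill the
    remaining positions with p2's genes (in p2's order), starting after the second cut."""
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b + 1] = p1[a:b + 1]                       # segment copied from parent 1
    remaining = [g for g in p2 if g not in child]      # parent-2 genes not yet present
    positions = [(b + 1 + i) % n for i in range(n - (b - a + 1))]  # wrap around
    for pos, gene in zip(positions, remaining):
        child[pos] = gene
    return child

P1 = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
P2 = ['C', 'G', 'E', 'A', 'F', 'B', 'D']
print(order_based_crossover(P1, P2))
```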
The following brief list, based on Goldberg (1989), summarizes the essential differences between GAs and
other forms of optimization:
Genetic algorithms work with a coded form of the function values (the parameter set), rather than
with the actual values themselves. So, for example, if we want to find the minimum of the function
f(x) = x³ + x² + 5, the GA would not deal directly with x or y values, but with strings that
encode these values. For this case, strings representing the binary encoding of x values would be
used.
Genetic algorithms use a set, or population, of points to conduct a search, not just a single
point on the problem space. This gives GAs the power to search noisy spaces littered with
local optima. Instead of relying on a single point to search through the space, the
GA looks at many different areas of the problem space at once, and uses all of this
information to guide its search.
Genetic algorithms use only payoff information to guide themselves through the problem
space. Many search techniques need a variety of information to guide themselves. Hill
climbing methods require derivatives, for example. The only information a GA needs is
some measure of fitness about a point in the space (sometimes known as an objective
function value). Once the GA knows the current measure of "goodness" about a point, it
can use this to continue searching for the optimum.
GAs are probabilistic in nature, not deterministic. This is a direct result of the
randomization techniques used by GAs.
GAs are inherently parallel. Here lies one of the most powerful features of genetic
algorithms. GAs, by their nature, are very parallel, dealing with a large number of points
(strings) simultaneously. Holland has estimated that while a GA processes n strings at each
generation, it in effect processes on the order of n³ useful substrings (implicit parallelism).
* GAs work with a string coding of the variables instead of the variables themselves, so the
coding discretizes the search space even when the underlying function is continuous.
* A GA does not require any auxiliary information except the objective function values.
* GAs use probabilities in their operators.
This narrowing of the search space as the search progresses is adaptive and is a
unique characteristic of genetic algorithms.
Optimization: GAs have been used in a wide variety of optimization tasks, including
numerical optimization as well as combinatorial optimization problems such as circuit
layout and job-shop scheduling.
Automatic Programming: GAs have been used to evolve computer programs for specific
tasks, and to design other computational structures, such as cellular automata and sorting
networks.
Machine learning: GAs have been used for many machine-learning applications,
including classification and prediction tasks such as the prediction of weather or protein
structure. GAs have also been used to evolve aspects of particular machine-learning
systems, such as weights for neural networks, rules for learning classifier systems or
symbolic production systems, and sensors for robots.
Economic models: GAs have been used to model processes of innovation, the
development of bidding strategies, and the emergence of economic markets.
Immune system models: GAs have been used to model various aspects of the natural
immune system including somatic mutation during an individual's lifetime and the
discovery of multi-gene families during evolutionary time.
Ecological models: GAs have been used to model ecological phenomena such as
biological arms races, host-parasite co-evolution, symbiosis, and resource flow in
ecologies.
Genetic Algorithm Application Areas: