AI&ML BM4251 Unit 1-5 Notes
1.1 Machine Learning
Machine Learning is a branch of Artificial Intelligence that allows machines to learn and improve from experience automatically. It is defined as the field of study that gives computers the capability to learn without being explicitly programmed, and it is quite different from traditional programming.
physically distant locations over a computer network. Most data acquisition devices are digital
now and record reliable data.
Think, for example, of a supermarket chain that has hundreds of stores all over a country
selling thousands of goods to millions of customers. The point of sale terminals record the
details of each transaction: date, customer identification code, goods bought and their amount,
total money spent, and so forth. This typically amounts to gigabytes of data every day. What
the supermarket chain wants is to be able to predict who are the likely customers for a product.
Again, the algorithm for this is not evident; it changes over time
and by geographic location. The stored data becomes useful only when it is analyzed and turned
into information that we can make use of, for example, to make predictions.
We do not know exactly which people are likely to buy this ice cream flavor, or the
next book of this author, or see this new movie, or visit this city, or click this link. If we knew,
we would not need any analysis of the data; we would just go ahead and write down the code.
But because we do not, we can only collect data and hope to extract the answers to these and
similar questions from data.
1.2 Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four
components, namely, data storage, abstraction, generalization and evaluation. Figure 1.1
illustrates the various components and the steps involved in the learning process.
(Figure 1.1: Data → Concepts → Inferences)
2. Abstraction
The second component of the learning process is known as abstraction. Abstraction is
the process of extracting knowledge about stored data. This involves creating general concepts
about the data as a whole. The creation of knowledge involves application of known models
and creation of new models. The process of fitting a model to a dataset is known as training.
When the model has been trained, the data is transformed into an abstract form that summarizes
the original information.
3. Generalization
The third component of the learning process is known as generalization. The term
generalization describes the process of turning the knowledge about stored data into a form
that can be utilized for future action. These actions are to be carried out on tasks that are similar,
but not identical, to those that have been seen before. In generalization, the goal is to discover
those properties of the data that will be most relevant to future tasks.
4. Evaluation
Evaluation is the last component of the learning process. It is the process of giving
feedback to the user to measure the utility of the learned knowledge. This feedback is then
utilized to effect improvements in the whole learning process.
1.3 TYPES OF MACHINE LEARNING
In general, machine learning algorithms can be classified into three types.
Supervised learning
Unsupervised learning
Reinforcement learning
In supervised learning, each example in the training set is a pair consisting of an input
object (typically a vector) and an output value. A supervised learning algorithm analyzes the
training data and produces a function, which can be used for mapping new examples. In the
optimal case, the function will correctly determine the class labels for unseen instances. Both
classification and regression problems are supervised learning problems. A wide range of
supervised learning algorithms are available, each with its strengths and weaknesses. There is
no single learning algorithm that works best on all supervised learning problems.
Remarks
Supervised learning is so called because the process of an algorithm learning from
the training dataset can be thought of as a teacher supervising the learning process. We know
the correct answers (that is, the correct outputs), the algorithm iteratively makes predictions on
the training data and is corrected by the teacher. Learning stops when the algorithm achieves
an acceptable level of performance.
Example
Consider the following data regarding patients entering a clinic. The data consists of the gender
and age of the patients and each patient is labeled as “healthy” or “sick”.
Gender Age Label
M 67 Sick
F 53 Healthy
M 49 Healthy
F 34 Sick
M 21 Healthy
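As a concrete illustration, the following sketch fits a classifier to this toy dataset; scikit-learn and the 0/1 encoding of gender are assumptions for demonstration, not part of the notes:

# A minimal supervised-learning sketch for the clinic data above,
# assuming scikit-learn; the gender encoding (M=0, F=1) is ours.
from sklearn.tree import DecisionTreeClassifier

# Features: [gender, age]; labels from the table above
X = [[0, 67], [1, 53], [0, 49], [1, 34], [0, 21]]
y = ["Sick", "Healthy", "Healthy", "Sick", "Healthy"]

model = DecisionTreeClassifier().fit(X, y)

# Predict the label for a new, unseen patient (a 60-year-old male)
print(model.predict([[0, 60]]))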
Unsupervised learning
Correct responses are not provided, but instead the algorithm tries to identify
similarities between the inputs so that inputs that have something in common are categorised
together. The statistical approach to unsupervised learning is known as density estimation.
Gender Age
M 48
M 67
F 53
M 49
F 34
M 21
Based on this data, can we infer anything regarding the patients entering the clinic?
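A minimal unsupervised sketch for this data, assuming scikit-learn; clustering the patients by age alone is our own choice, not from the notes:

# No labels are provided, only the inputs; the algorithm groups
# similar patients together.
from sklearn.cluster import KMeans

ages = [[48], [67], [53], [49], [34], [21]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ages)
print(kmeans.labels_)  # e.g., an "older" group and a "younger" group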
Reinforcement learning
This is somewhere between supervised and unsupervised learning. The algorithm gets
told when the answer is wrong, but does not get told how to correct it. It has to explore and try
out different possibilities until it works out how to get the answer right. Reinforcement learning
is sometimes called learning with a critic because of this monitor that scores the answer, but
does not suggest improvements.
Reinforcement learning is the problem of getting an agent to act in the world so as to
maximize its rewards. A learner (the program) is not told what actions to take as in most forms
of machine learning, but instead must discover which actions yield the most reward by trying
them. In the most interesting and challenging cases, actions may affect not only the immediate
reward but also the next situations and, through that, all subsequent rewards.
Example
Consider teaching a dog a new trick: we cannot tell it what to do, but we can
reward/punish it if it does the right/wrong thing. It has to find out what it did that made it get
the reward/punishment. We can use a similar method to train computers to do many tasks, such
as playing backgammon or chess, scheduling jobs, and controlling robot limbs. Reinforcement
learning is different from supervised learning. Supervised learning is learning from examples
provided by a knowledgeable expert.
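A toy sketch of this trial-and-error idea, using an epsilon-greedy agent on a two-action problem; the reward probabilities and all constants are hypothetical:

# The agent is never told which action is correct; it discovers,
# by reward alone, that action 1 pays off more often.
import random

true_reward_prob = [0.3, 0.7]   # hidden from the agent
value = [0.0, 0.0]              # the agent's running reward estimates
counts = [0, 0]

for step in range(1000):
    # Explore occasionally; otherwise exploit the current best action
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = 0 if value[0] >= value[1] else 1
    reward = 1.0 if random.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]

print(value)  # the estimate for action 1 ends up higher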
Learning Associations
In basket analysis, we may want to make a distinction among customers and, toward this,
estimate P(Y | X, D), where D is the set of customer attributes, for example, gender, age,
marital status, and so on, assuming that we have access to this information. If this is a bookseller instead of a
supermarket, products can be books or authors. In the case of a Web portal, items correspond
to links to Web pages, and we can estimate the links a user is likely to click and use this
information to download such pages in advance for faster access.
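A minimal sketch of estimating such an association P(Y | X) from transaction records; the toy transactions are hypothetical:

# P(Y | X) is estimated as the fraction of transactions containing X
# that also contain Y.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

def conditional_prob(transactions, x, y):
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

# P(butter | bread) = 2/3: customers who buy bread tend to buy butter
print(conditional_prob(transactions, "bread", "butter"))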
Classification
A credit is an amount of money loaned by a financial institution, for example, a bank,
to be paid back with interest, generally in installments. It is important for the bank to be able to
predict in advance the risk associated with a loan, which is the probability that the customer
will default and not pay the whole amount back. This is both to make sure that the bank will
make a profit and also to not inconvenience a customer with a loan over his or her financial
capacity.
In credit scoring (Hand 1998), the bank calculates the risk given the amount of credit
and the information about the customer. The information about the customer includes data we
have access to and is relevant in calculating his or her financial capacity—namely, income,
savings, collaterals, profession, age, past financial history, and so forth. The bank has a record
of past loans containing such customer data and whether the loan was paid back or not. From
this data of particular applications, the aim is to infer a general rule coding the association
between a customer’s
attributes and his risk. That is, the machine learning system fits a model to the past data to be
able to calculate the risk for a new application and then decides to accept or refuse it
accordingly.
IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

for suitable values of θ1 and θ2 (see figure 1.1). This is an example of a discriminant:
it is a function that separates the examples of different classes. Having a rule like
this, the main application is prediction: once we have a rule that fits the past data, if
the future is similar to the past, then we can make correct predictions for novel instances. Given
a new application with a certain income and savings, we can easily decide whether it is low-risk
or high-risk. In some cases, instead of making a 0/1 (low-risk/high-risk) type decision, we
may want to calculate a probability, namely, P(Y | X), where X are the customer attributes and
Y is 0 or 1 for low-risk and high-risk, respectively.
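A minimal sketch of this threshold discriminant; the values chosen for θ1 and θ2 are hypothetical:

# The discriminant separates low-risk from high-risk applicants using
# two thresholds, one on income and one on savings.
def credit_risk(income, savings, theta1=30000, theta2=5000):
    """Return 'low-risk' if both income and savings clear their thresholds."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

print(credit_risk(income=45000, savings=8000))   # low-risk
print(credit_risk(income=25000, savings=12000))  # high-risk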
The following is a list of some of the typical applications of machine learning.
1. In retail business, machine learning is used to study consumer behaviour.
2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.
3. In manufacturing, learning models are used for optimization, control, and troubleshooting.
4. In medicine, learning programs are used for medical diagnosis.
5. In telecommunications, call patterns are analyzed for network optimization and maximizing
the quality of service.
6. In science, large amounts of data in physics, astronomy, and biology can only be analyzed
fast enough by computers. The World Wide Web is huge; it is constantly growing, and searching
for relevant information cannot be done manually.
7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.
8. It is used to find solutions to many problems in vision, speech recognition, and robotics.
9. Machine learning methods are applied in the design of computer-controlled vehicles to steer
correctly when driving on a variety of roads.
10. Machine learning methods have been used to develop programmes for playing games such
as chess, backgammon and Go.
1.6 Linear Models for Regression
Linear Regression
Linear regression is a type of supervised machine learning algorithm that computes the linear
relationship between the dependent variable and one or more independent features by fitting
a linear equation to observed data.
When there is only one independent feature, it is known as Simple Linear Regression, and
when there is more than one feature, it is known as Multiple Linear Regression.
Similarly, when there is only one dependent variable, it is considered Univariate Linear
Regression, while when there is more than one dependent variable, it is known
as Multivariate Regression.
Types of Linear Regression
There are two main types of linear regression: simple linear regression and multiple linear
regression, as defined above.
The goal of the algorithm is to find the best-fit line equation that can predict the values
based on the independent variables.
In regression, a set of records is present with X and Y values, and these values are used to learn
a function; if you want to predict Y from an unknown X, this learned function can be used.
In regression we have to find the value of Y, so a function is required that predicts
continuous Y in the case of regression, given X as independent features.
Linear Regression
Here Y is called a dependent or target variable and X is called an independent variable also
known as the predictor of Y. There are many types of functions or modules that can be used
for regression. A linear function is the simplest type of function. Here, X may be a single
feature or multiple features representing the problem.
Linear regression performs the task of predicting a dependent variable value (y) based on a given
independent variable (x); hence the name Linear Regression. In the figure above, X
(input) is the work experience and Y (output) is the salary of a person. The regression line is
the best-fit line for our model.
Since different values for the weights (the coefficients of the line) result in different regression
lines, we utilize a cost function to compute the best values and obtain the best-fit line.
Hypothesis function in Linear Regression
As we assumed earlier, our independent feature is the experience X, and the
respective salary Y is the dependent variable. Let's assume there is a linear relationship
between X and Y; then the salary can be predicted using:

$$\hat{Y} = \theta_1 + \theta_2 X$$

or, written for the individual data points,

$$\hat{y}_i = \theta_1 + \theta_2 x_i$$

Here,
$y_i \in Y$ $(i = 1, 2, \dots, n)$ are the labels of the data (supervised learning),
$x_i \in X$ $(i = 1, 2, \dots, n)$ are the input independent training data (univariate: one input variable),
$\hat{y}_i \in \hat{Y}$ $(i = 1, 2, \dots, n)$ are the predicted values.
The model gets the best regression fit line by finding the best $\theta_1$ and $\theta_2$ values:
$\theta_1$: intercept
$\theta_2$: coefficient of x
Once we find the best θ1 and θ2 values, we get the best-fit line. So when we are finally using
our model for prediction, it will predict the value of y for the input value of x.
The cost function, or loss function, is nothing but the error or difference between
the predicted value $\hat{Y}$ and the true value $Y$.
In Linear Regression, the Mean Squared Error (MSE) cost function is employed, which
calculates the average of the squared errors between the predicted values $\hat{y}_i$ and the actual
values $y_i$:

$$J(\theta_1, \theta_2) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

The purpose is to determine the optimal values for the intercept $\theta_1$ and the
coefficient of the input feature $\theta_2$ providing the best-fit line for the given data points. The
linear equation expressing this relationship is $\hat{y}_i = \theta_1 + \theta_2 x_i$.
Utilizing the MSE function, the iterative process of gradient descent is applied to
update the values of $\theta_1$ and $\theta_2$. This ensures that the MSE value converges to the global
minimum, signifying the most accurate fit of the linear regression line to the dataset.
This process involves continuously adjusting the parameters $\theta_1$ and $\theta_2$
based on the gradients calculated from the MSE. The final result is a linear regression line
that minimizes the overall squared differences between the predicted and actual values,
providing an optimal representation of the underlying relationship in the data.
A linear regression model can be trained using the optimization algorithm gradient
descent by iteratively modifying the model’s parameters to reduce the mean squared error
(MSE) of the model on a training dataset. To update θ1 and θ2 values in order to reduce the
Cost function (minimizing RMSE value) and achieve the best-fit line the model uses Gradient
Descent. The idea is to start with random θ1 and θ2 values and then iteratively update the
values, reaching minimum cost.
A gradient is nothing but a derivative that describes the effect on a function's output of a
small variation in its inputs.
Let's differentiate the cost function J with respect to $\theta_1$ and $\theta_2$:

$$\frac{\partial J}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i), \qquad \frac{\partial J}{\partial \theta_2} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_i$$
Finding the coefficients of a linear equation that best fits the training data is the objective of
linear regression. By moving in the direction of the negative gradient of the Mean Squared Error
with respect to the coefficients, the coefficients can be updated. If $\alpha$ is the learning rate,
the respective intercept and coefficient of X are updated as:

$$\theta_1 \leftarrow \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}, \qquad \theta_2 \leftarrow \theta_2 - \alpha \frac{\partial J}{\partial \theta_2}$$
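A minimal NumPy sketch of this gradient-descent loop; the training data, learning rate, and iteration count are hypothetical:

# Gradient descent for simple linear regression, following the
# update rules above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g., years of experience
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])  # e.g., salary (some unit)

theta1, theta2 = 0.0, 0.0   # start from arbitrary values
alpha = 0.01                # learning rate
n = len(x)

for _ in range(5000):
    y_hat = theta1 + theta2 * x
    # Gradients of the MSE cost with respect to theta1 and theta2
    grad1 = (2.0 / n) * np.sum(y_hat - y)
    grad2 = (2.0 / n) * np.sum((y_hat - y) * x)
    theta1 -= alpha * grad1
    theta2 -= alpha * grad2

print(theta1, theta2)  # approaches the best-fit intercept and slope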
Assumptions of Simple Linear Regression
1. Linearity: The relationship between the independent variable(s) and the dependent
variable is linear: a straight-line relationship is assumed. If the true relationship is not
linear, then linear regression will not be an accurate model.
2. Independence: The observations in the dataset are independent of each other. This means
that the value of the dependent variable for one observation does not depend on the value
of the dependent variable for another observation. If the observations are not independent,
then linear regression will not be an accurate model.
3. Homoscedasticity: Across all levels of the independent variable(s), the variance of the
errors is constant. This indicates that the amount of the independent variable(s) has no
impact on the variance of the errors. If the variance of the residuals is not constant, then
linear regression will not be an accurate model.
4. Normality: The residuals should be normally distributed. This means that the residuals
should follow a bell-shaped curve. If the residuals are not normally distributed, then linear
regression will not be an accurate model.
Assumptions of Multiple Linear Regression
For Multiple Linear Regression, all four of the assumptions from Simple Linear Regression
apply. In addition to this, below are a few more:
1. No multicollinearity: There is no high correlation between the independent variables.
This indicates that there is little or no correlation between the independent variables.
Multicollinearity occurs when two or more independent variables are highly correlated
with each other, which can make it difficult to determine the individual effect of each
variable on the dependent variable. If there is multicollinearity, then multiple linear
regression will not be an accurate model.
2. Additivity: The model assumes that the effect of changes in a predictor variable on the
response variable is consistent regardless of the values of the other variables. This
assumption implies that there is no interaction between variables in their effects on the
dependent variable.
3. Feature Selection: In multiple linear regression, it is essential to carefully select the
independent variables that will be included in the model. Including irrelevant or redundant
variables may lead to overfitting and complicate the interpretation of the model.
4. Overfitting: Overfitting occurs when the model fits the training data too closely,
capturing noise or random fluctuations that do not represent the true underlying
relationship between variables. This can lead to poor generalization performance on new,
unseen data.
Multicollinearity
Multicollinearity is a statistical phenomenon that occurs when two or more independent
variables in a multiple regression model are highly correlated, making it difficult to assess
the individual effects of each variable on the dependent variable.
Detecting Multicollinearity includes two techniques:
Correlation Matrix: Examining the correlation matrix among the independent variables
is a common way to detect multicollinearity. High correlations (close to 1 or -1) indicate
potential multicollinearity.
VIF (Variance Inflation Factor): VIF is a measure that quantifies how much the variance
of an estimated regression coefficient increases if your predictors are correlated. A high
VIF (typically above 10) suggests multicollinearity.
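A minimal sketch of both detection techniques, assuming pandas and statsmodels are available; the data frame of predictors is hypothetical:

# Detecting multicollinearity with a correlation matrix and VIF.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "income":  [40.0, 55.0, 60.0, 75.0, 90.0, 120.0],
    "savings": [5.0, 8.0, 9.0, 12.0, 15.0, 21.0],   # strongly tied to income
    "age":     [25.0, 32.0, 41.0, 38.0, 50.0, 47.0],
})

# Technique 1: correlation matrix (values near +1 or -1 flag trouble)
print(df.corr())

# Technique 2: VIF for each predictor (values above ~10 are suspect)
for i, col in enumerate(df.columns):
    print(col, variance_inflation_factor(df.values, i))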
Evaluation Metrics for Linear Regression
A variety of evaluation measures can be used to determine the strength of any linear
regression model. These assessment metrics often give an indication of how well the model
is producing the observed outputs.
The most common measurements are:
Mean Square Error (MSE)
Mean Squared Error (MSE) is an evaluation metric that calculates the average of the squared
differences between the actual and predicted values for all the data points. The difference is
squared to ensure that negative and positive differences don’t cancel each other out.
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here,
n is the number of data points,
$y_i$ is the actual or observed value for the ith data point,
$\hat{y}_i$ is the predicted value for the ith data point.
MSE is a way to quantify the accuracy of a model's predictions. MSE is sensitive to outliers,
as large errors contribute significantly to the overall score.
Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy of a regression
model. MAE measures the average absolute difference between the predicted values and
actual values.
Mathematically, MAE is expressed as:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|$$

Here,
n is the number of observations,
$Y_i$ represents the actual values,
$\hat{Y}_i$ represents the predicted values.
A lower MAE value indicates better model performance. It is not sensitive to outliers, as
we consider absolute differences.
Root Mean Squared Error (RMSE)
The Root Mean Squared Error is the square root of the mean of the squared residuals. It describes
how well the observed data points match the predicted values, that is, the model's absolute fit to
the data:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Coefficient of Determination (R-squared)
The R-squared metric is a measure of the proportion of variance in the dependent variable that is
explained by the independent variables in the model. It is computed from the residual sum of
squares (RSS) and the total sum of squares (TSS):

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
Adjusted R-Squared Error
Adjusted R2 measures the proportion of variance in the dependent variable that is explained
by independent variables in a regression model. Adjusted R-square accounts for the number of
predictors in the model and penalizes the model for including irrelevant predictors that don't
contribute significantly to explaining the variance in the dependent variable.
Mathematically, adjusted R² is expressed as:

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

Here,
n is the number of observations,
k is the number of predictors in the model,
$R^2$ is the coefficient of determination.
Adjusted R-square helps to prevent overfitting. It penalizes the model with additional
predictors that do not contribute significantly to explain the variance in the dependent
variable.
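A minimal NumPy sketch computing all of the above metrics on hypothetical actual and predicted values:

# MSE, MAE, RMSE, R-squared and adjusted R-squared from scratch.
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.5])
n, k = len(y_true), 1  # k = number of predictors (assumed 1 here)

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(mse)
rss = np.sum((y_true - y_pred) ** 2)
tss = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mse, mae, rmse, r2, adj_r2)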
Given a training set of N input samples {x_n}, where n = 1, …, N, together with the corresponding
target values {t_n}, the goal is to predict the value of t for a new value of x. The set of input data
together with the corresponding target values t is known as the training data set.
One way to handle this is by constructing a function y(x) that maps x to t, such that y(x) = t
for new inputs. More generally, from a probabilistic standpoint, we aim to model the predictive
distribution p(t | x), which expresses our uncertainty about the value of t for each value of x.
This is what is generally known as linear regression. The simplest linear model is a linear
combination of the input variables:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D$$

The key attribute of this function is that it is a linear function of the parameters w0, w1, …, wD. It
is also a linear function of the input variables, and this limits the usefulness of the function,
because most relationships encountered in practice do not necessarily follow a linear pattern.
To solve this problem, consider modifying the model to be a combination of fixed non-linear
functions of the input variable.
If we assume that the non-linear functions of the input variable are φj(x), then we can rewrite the
original function as:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$$

The total number of parameters in this function is then M, with the summation of terms running
from j = 1 to M − 1.
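A minimal NumPy sketch of fitting such a model with polynomial basis functions; the synthetic sine data and the choice of M = 4 parameters are assumptions:

# Fit a model that is linear in its parameters but non-linear in x,
# using polynomial basis functions phi_j(x) = x**j.
import numpy as np

x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + np.random.normal(0, 0.1, size=20)

M = 4
Phi = np.column_stack([x**j for j in range(M)])  # includes phi_0(x) = 1

# Solve for the weights w minimizing ||Phi w - t||^2
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)  # one weight per basis function, including the bias w0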
The parameter w0 is known as the bias parameter, which allows for a fixed offset in the data.
In a different sense, the bias of a machine learning model is the difference between the model's
predictions and the correct values. High bias causes substantial inaccuracy on both training and
testing data. To prevent the problem of underfitting, it is advisable that an algorithm be
low-biased. Predictions made with high bias follow an overly simple, straight-line form that does
not fit the data in the dataset adequately. This type of fitting is called underfitting, and it
occurs when the hypothesis is overly simplistic or linear in form.
The variance of the model is the variability of model prediction for a particular data point,
which tells us about the dispersion of the data. The model with high variance has a very
complicated fit to the training data and so is unable to fit correctly on new data.
As a result, while such models perform well on training data, they have large error rates on test
data. When a model has a large variance, this is referred to as Overfitting of Data. Variability
should be reduced to a minimum while training a data model.
Bias and variance are inversely related, therefore it is essentially impossible to have an ML
model with both a low bias and a low variance. When we alter the ML method to better match
a specific data set, it results in reduced bias but increases variance. In this manner, the model
will fit the data set while increasing the likelihood of incorrect predictions.
The same is true when developing a low variance model with a bigger bias. The model will not
fully fit the data set, even though it will lower the probability of erroneous predictions. As a
result, there is a delicate balance between biases and variance.
Since bias and variance are connected to underfitting and overfitting, decomposing the loss
into bias and variance helps us understand learning algorithms. Let’s understand certain
attributes.
Low Bias: Suggests fewer assumptions about the target function's shape.
High Bias: Suggests additional assumptions about the target function's shape.
Low Variance: Suggests minor changes to the target function estimate when the
training dataset changes.
High Variance: Suggests that changes to the training dataset cause considerable
variations in the target function estimate.
Theoretically, a model should have low bias and low variance but this is impossible to achieve.
So, an optimal bias and variance are acceptable. Linear models have low variance but high bias
and non-linear models have low bias but high variance.
Working of bias
The total error of a machine learning algorithm has three components: bias, variance, and noise.
Decomposition is the process of deriving this total error; here we take the Mean Squared Error
(MSE).
Suppose we have a regression problem where we take in vectors and try to predict a single value.
Suppose for the moment that we know the absolute true answer up to an independent random
noise. The noise should be independent of any randomness inherent in the vector and should have
a mean of zero, so that the noise-free function is the best possible guess.
The risk function R(h) of a hypothesis h is the expected loss; when the loss is the squared error,
it can be written as

$$R(h) = \mathbb{E}\left[(h(x) - y)^2\right]$$

The expectation E in this equation averages over the random variables, that is, over the
probability distribution from which the hypothesis h and the data arise.
The data x and y are drawn from the probability distribution on which the learner is trained.
Since the weights are selected based on the training data, the weights that define h are also
obtained from a probability distribution. It can be difficult to determine this distribution, but
it does exist. The expectation consolidates the losses over all potential weight values.
Carrying out the full mathematical derivation, the total error decomposes into the three
components: bias, variance, and irreducible error (noise):

$$\mathbb{E}\left[(h(x) - y)^2\right] = \underbrace{\left(\mathbb{E}[h(x)] - \bar{y}(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\left(h(x) - \mathbb{E}[h(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\mathbb{E}\left[(y - \bar{y}(x))^2\right]}_{\text{noise}}$$

where $\bar{y}(x)$ denotes the noise-free target.
In this example, we're attempting to match a sine wave with straight lines, which obviously
cannot fit it exactly. On the left, we produced 50 distinct lines. The red line in the top right corner
represents the anticipated hypothesis which is an average of infinitely many possibilities. The
black curve depicts test locations along with the true function.
Because lines do not match sine waves well, we notice that most test points have a substantial
bias. Here the bias is the squared difference between the black and red curves.
Some of the test locations, however, exhibit a slight bias, where the sine wave crosses the red
line. The variance in the middle represents the predicted squared difference between a random
black line and the red line. The irreducible error is the predicted squared difference between a
random test point and the sine wave.
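A minimal NumPy simulation of this experiment, with arbitrary sample sizes, that estimates the bias and variance of straight-line fits to a sine wave:

# Fit straight lines to many random samples from a noisy sine wave
# and estimate bias and variance at the test points.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 2 * np.pi, 50)
predictions = []

for _ in range(50):  # 50 distinct lines, as in the example above
    x = rng.uniform(0, 2 * np.pi, 20)
    y = np.sin(x) + rng.normal(0, 0.1, 20)
    slope, intercept = np.polyfit(x, y, deg=1)  # fit a straight line
    predictions.append(intercept + slope * x_test)

predictions = np.array(predictions)
avg_hypothesis = predictions.mean(axis=0)          # the "red line"
bias_sq = np.mean((avg_hypothesis - np.sin(x_test)) ** 2)
variance = np.mean(predictions.var(axis=0))

print(bias_sq, variance)  # lines mismatch sine waves: bias dominates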
1.9 Bayesian Linear Regression
In the Bayesian viewpoint, we formulate linear regression using probability distributions rather
than point estimates. The response, y, is not estimated as a single value, but is assumed to be
drawn from a probability distribution. The model for Bayesian Linear Regression with the
response sampled from a normal distribution is:

$$y \sim \mathcal{N}(\beta^{T}X,\; \sigma^{2}I)$$
The output, y is generated from a normal (Gaussian) Distribution characterized by a mean and
variance. The mean for linear regression is the transpose of the weight matrix multiplied by the
predictor matrix. The variance is the square of the standard deviation σ (multiplied by the
Identity matrix because this is a multi-dimensional formulation of the model).
The aim of Bayesian Linear Regression is not to find the single “best” value of the model
parameters, but rather to determine the posterior distribution for the model parameters. Not only
is the response generated from a probability distribution, but the model parameters are assumed
to come from a distribution as well. The posterior probability of the model parameters is
conditional upon the training inputs and outputs:

$$P(\beta \mid y, X) = \frac{P(y \mid \beta, X)\, P(\beta \mid X)}{P(y \mid X)}$$
Here, P(β|y, X) is the posterior probability distribution of the model parameters given the inputs
and outputs. This is equal to the likelihood of the data, P(y|β, X), multiplied by the prior
probability of the parameters and divided by a normalization constant. This is a simple
expression of Bayes' Theorem, the fundamental underpinning of Bayesian inference:

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Normalization}}$$
Let’s stop and think about what this means. In contrast to OLS, we have a
posterior distribution for the model parameters that is proportional to the likelihood of the data
multiplied by the prior probability of the parameters. Here we can observe the two primary
benefits of Bayesian Linear Regression.
1. Priors: If we have domain knowledge, or a guess for what the model parameters should be,
we can include them in our model, unlike in the frequentist approach which assumes
everything there is to know about the parameters comes from the data. If we don’t have any
estimates ahead of time, we can use non-informative priors for the parameters such as a
normal distribution.
2. Posterior: The result of performing Bayesian Linear Regression is a distribution of possible
model parameters based on the data and the prior, which lets us quantify our uncertainty about
the model.
As the amount of data points increases, the likelihood washes out the prior, and in the case of
infinite data, the outputs for the parameters converge to the values obtained from OLS.
In practice, evaluating the posterior distribution for the model parameters is intractable for
continuous variables, so we use sampling methods to draw samples from the posterior in order
to approximate the posterior. The technique of drawing random samples from a distribution to
approximate the distribution is one application of Monte Carlo methods. There are a number of
algorithms for Monte Carlo sampling, with the most common being variants of Markov Chain
Monte Carlo (see this post for an application in Python).
I’ll skip the code for this post (see the notebook for the implementation in PyMC3) but the basic
procedure for implementing Bayesian Linear Regression is: specify priors for the model
parameters (I used normal distributions in this example), create a model mapping the training
inputs to the training outputs, and then have a Markov Chain Monte Carlo (MCMC) algorithm
draw samples from the posterior distribution for the model parameters. The end result will be
posterior distributions for the parameters. We can inspect these distributions to get a sense of
what is occurring.
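Since the author skips the code, the following is only a rough PyMC3 sketch of the described procedure; the synthetic duration/calories data, prior widths, and sample count are assumptions:

# Specify normal priors, map inputs to outputs, and let MCMC draw
# samples from the posterior for the parameters.
import numpy as np
import pymc3 as pm

duration = np.random.uniform(5, 30, size=200)           # minutes
calories = 7.0 * duration - 21.0 + np.random.normal(0, 5, size=200)

with pm.Model() as model:
    intercept = pm.Normal("intercept", mu=0, sigma=20)   # prior
    slope = pm.Normal("slope", mu=0, sigma=10)           # prior
    sigma = pm.HalfNormal("sigma", sigma=10)             # noise scale
    mu = intercept + slope * duration                    # linear map
    pm.Normal("obs", mu=mu, sigma=sigma, observed=calories)
    trace = pm.sample(1000, return_inferencedata=False)  # MCMC draws

print(np.mean(trace["intercept"]), np.mean(trace["slope"]))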
The first plots show the approximations of the posterior distributions of model parameters.
These are the result of 1000 steps of MCMC, meaning the algorithm drew 1000 steps from the
posterior distribution.
If we compare the mean values for the slope and intercept to those obtained from OLS (the
intercept from OLS was -21.83 and the slope was 7.17), we see that they are very similar.
However, while we can use the mean as a single point estimate, we also have a range of possible
values for the model parameters. As the number of data points increases, this range will shrink
and converge on a single value, representing greater confidence in the model parameters. (In
Bayesian inference a range for a variable is called a credible interval, which has a slightly
different interpretation from a confidence interval in frequentist inference.)
When we want to show the linear fit from a Bayesian model, instead of showing only one
estimate, we can draw a range of lines, with each one representing a different estimate of the model
parameters. As the number of datapoints increases, the lines begin to overlap because there is
less uncertainty in the model parameters.
In order to demonstrate the effect of the number of datapoints in the model, I used two models,
the first, with the resulting fits shown on the left, used 500 datapoints and the one on the right
used 15000 datapoints. Each graph shows 100 possible models drawn from the model parameter
posteriors.
Bayesian Linear Regression Model Results with 500 (left) and 15000 observations (right)
There is much more variation in the fits when using fewer data points, which represents a greater
uncertainty in the model. With all of the data points, the OLS and Bayesian Fits are nearly
identical because the priors are washed out by the likelihoods from the data. When predicting
the output for a single datapoint using our Bayesian Linear Model, we also do not get a single
value but a distribution. Following is the probability density plot for the number of calories
burned exercising for 15.5 minutes. The red vertical line indicates the point estimate from OLS.
Posterior Probability Density of Calories Burned from Bayesian Model
We see that the probability of the number of calories burned peaks around 89.3, but the full
estimate is a range of possible values.
Dimensionality reduction can help to mitigate these problems by reducing the complexity of
the model and improving its generalization performance. There are two main approaches to
dimensionality reduction: feature selection and feature extraction.
Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to
the problem at hand. The goal is to reduce the dimensionality of the dataset while retaining
the most important features. There are several methods for feature selection, including filter
methods, wrapper methods, and embedded methods. Filter methods rank the features based
on their relevance to the target variable, wrapper methods use the model performance as the
criteria for selecting features, and embedded methods combine feature selection with the
model training process.
Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original
features. The goal is to create a set of features that captures the essence of the original data
in a lower-dimensional space. There are several methods for feature extraction, including
principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed
stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original
features onto a lower-dimensional space while preserving as much of the variance as
possible.
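A minimal scikit-learn sketch of feature extraction with PCA on hypothetical data:

# Project 4-dimensional data onto 2 dimensions while preserving as
# much of the variance as possible.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)          # 100 samples, 4 original features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)    # new 2-D representation

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured per component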
Why is Dimensionality Reduction important in Machine Learning and Predictive
Modeling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these
features may overlap. In another condition, a classification problem that relies on both
humidity and rainfall can be collapsed into just one underlying feature, since both of the
aforementioned are correlated to a high degree. Hence, we can reduce the number of features
in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one
can be mapped to a simple 2-dimensional space, and a 1-D problem to a simple line. The
below figure illustrates this concept, where a 3-D feature space is split into two 2-D feature
spaces, and later, if found to be correlated, the number of features can be reduced even
further.
Components of Dimensionality Reduction
PCA fails in cases where mean and covariance are not enough to define the dataset.
We may not know how many principal components to keep; in practice, some rules of thumb
are applied.
Interpretability: The reduced dimensions may not be easily interpretable, and it may be
difficult to understand the relationship between the original features and the reduced
dimensions.
Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially
when the number of components is chosen based on the training data.
Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to outliers,
which can result in a biased representation of the data.
Computational complexity: Some dimensionality reduction techniques, such as manifold
learning, can be computationally intensive, especially when dealing with large datasets.
UNIT 2
NEURAL NETWORKS
2.1 BIOLOGICAL NEURONS AND THEIR ARTIFICIAL NEURONS
Biological Neurons
Neurons are the basic functional units of the nervous system; they generate electrical
signals called action potentials, which allow them to transmit information quickly over long
distances.
Almost all the neurons have three basic functions essential for the normal functioning of all the
cells in the body.
These are to:
1. Receive signals (or information) from outside.
2. Process the incoming signals and determine whether or not the information should be passed
along.
3. Communicate signals to target cells which might be other neurons or muscles or glands.
Now let us understand the basic parts of a neuron to get a deeper insight into how they actually
work…
A biological neuron is mainly composed of three main parts and an external part called the synapse:
1. Dendrite
Dendrites are responsible for getting incoming signals from outside
2. Soma
Soma is the cell body responsible for the processing of input signals and deciding whether a
neuron should fire an output signal
3. Axon
Axon is responsible for getting processed signals from neuron to relevant cells
4. Synapse
Synapse is the connection between an axon and other neuron dendrites
Working of the parts
The task of receiving the incoming information is done by dendrites, and processing generally
takes place in the cell body. Incoming signals can be either excitatory — which means they
tend to make the neuron fire (generate an electrical impulse) — or inhibitory — which means
that they tend to keep the neuron from firing.
Most neurons receive many input signals throughout their dendritic trees. A single neuron may
have more than one set of dendrites and may receive many thousands of input signals. Whether
or not a neuron is excited into firing an impulse depends on the sum of all of the excitatory and
inhibitory signals it receives. The processing of this information happens in soma which is
neuron cell body. If the neuron does end up firing, the nerve impulse, or action potential, is
conducted down the axon.
Towards its end, the axon splits up into many branches and develops bulbous swellings known
as axon terminals (or nerve terminals). These axon terminals make connections on target
cells.
Artificial Neurons
An artificial neuron, also known as a perceptron, is the basic unit of a neural network. In simple
terms, it is a mathematical function based on a model of biological neurons. It can also be seen
as a simple logic gate with binary outputs.
Each artificial neuron has the following main functions:
1. Takes inputs from the input layer
2. Weighs them separately and sums them up
3. Passes this sum through a nonlinear activation function to produce the output.
3. Activation Function
Activation Function decides whether or not a neuron is fired. It decides which of the two
output values should be generated by the neuron.
4. Output Layer
Output layer gives the final output of a neuron which can then be passed to other neurons
in the network or taken as the final output value.
Now, all the above concepts might seem like too much theoretical knowledge without any
practical insights, so let’s understand the working of an artificial neuron with an example.
Consider a neuron with two inputs (x1,x2) as shown below:
Calculation of the `C` value
The neuron computes the weighted sum of its inputs plus the bias, C = w1x1 + w2x2 + b. This
combination C can then be fed to the activation function. Let us first understand the logic of the
Rectified Linear Unit (ReLU) activation function, which we are currently using in our example:
it passes positive values through unchanged and outputs zero for negative values, i.e.,
ReLU(C) = max(0, C).
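A minimal sketch of this two-input neuron with ReLU activation; the weight and bias values are hypothetical:

# One artificial neuron: weighted sum plus bias, then ReLU.
def relu(c):
    """Rectified linear activation: pass positives, clamp negatives to 0."""
    return max(0.0, c)

def neuron(x1, x2, w1=0.5, w2=-0.25, b=0.25):
    c = w1 * x1 + w2 * x2 + b   # weighted sum plus bias
    return relu(c)

print(neuron(1.0, 2.0))   # C = 0.5 - 0.5 + 0.25 = 0.25 -> ReLU gives 0.25
print(neuron(0.0, 2.0))   # C = -0.5 + 0.25 = -0.25 -> ReLU gives 0.0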
INTRODUCTION
Neural network learning methods provide a robust approach to approximating real-
valued, discrete-valued, and vector-valued target functions. For certain types of problems, such
as learning to interpret complex real-world sensor data, artificial neural networks are among
the most effective learning methods currently known. For example, the
BACKPROPAGATION algorithm described in this chapter has proven surprisingly successful
in many practical problems such as learning to recognize handwritten characters (LeCun et al.
1989), learning to recognize spoken words (Lang et al. 1990), and learning to recognize faces
(Cottrell 1990). One survey of practical applications is provided by Rumelhart et al. (1994).
Biological Motivation
The study of artificial neural networks (ANNs) has been inspired in part by the
observation that biological learning systems are built of very complex webs of interconnected
neurons. In rough analogy, artificial neural networks are built out of a densely interconnected
set of simple units, where each unit takes a number of real-valued inputs (possibly the outputs
of other units) and produces a single real-valued output (which may become the input to many
other units).
To develop a feel for this analogy, let us consider a few facts from neurobiology. The
human brain, for example, is estimated to contain a densely interconnected network of
approximately $10^{11}$ neurons, each connected, on average, to $10^4$ others. Neuron activity is
typically excited or inhibited through connections to other neurons. The fastest neuron
switching times are known to be on the order of $10^{-3}$ seconds, quite slow compared to
computer switching speeds of $10^{-10}$ seconds. Yet humans are able to make surprisingly
complex decisions, surprisingly quickly. For example, it requires approximately $10^{-1}$ seconds to
visually recognize your mother. Notice the sequence of neuron firings that can take place
during this $10^{-1}$-second interval cannot possibly be longer than a few hundred steps, given the
switching speed of single neurons.
While ANNs are loosely motivated by biological neural systems, there are many
complexities to biological neural systems that are not modeled by ANNs, and many features of
the ANNs we discuss here are known to be inconsistent with biological systems. For example,
we consider here ANNs whose individual units output a single constant value, whereas
biological neurons output a complex time series of spikes.
Historically, two groups of researchers have worked with artificial neural networks.
One group has been motivated by the goal of using ANNs to study and model biological
learning processes. A second group has been motivated by the goal of obtaining highly
effective machine learning algorithms, independent of whether these algorithms mirror
biological processes. Within this book our interest fits the latter group, and therefore we will
not dwell further on biological modeling. For more information on attempts to model biological
systems using ANNs, see, for example, Churchland and Sejnowski (1992); Zornetzer et al.
(1994); Gabriel and Moore (1990).
and for distances of 90 miles on public highways (driving in the left lane of a divided public
highway, with other vehicles present).
The neural network representation used in one version of the ALVINN system illustrates
the kind of representation typical of many ANN systems. The network is shown on
the left side of the figure, with the input camera image depicted below it. Each node (i.e., circle)
in the network diagram corresponds to the output of a single network unit, and the lines entering
the node from below are its inputs. As can be seen, there are four units that receive inputs
directly from all of the 30 x 32 pixels in the image. These are called "hidden" units because
their output is available only within the network and is not available as part of the global
network output. Each of these four hidden units computes a single real-valued output based on
a weighted combination of its 960 inputs. These hidden unit outputs are then used as inputs to
a second layer of 30 "output" units. Each output unit corresponds to a particular steering
direction, and the output values of these units determine which steering direction is
recommended most strongly.
The diagrams on the right side of the figure depict the learned weight values associated
with one of the four hidden units in this ANN. The large matrix of black and white boxes on
the lower right depicts the weights from the 30 x 32 pixel inputs into the hidden unit. Here, a
white box indicates a positive weight, a black box a negative weight, and the size of the box
indicates the weight magnitude. The smaller rectangular diagram directly above the large
matrix shows the weights from this hidden unit to each of the 30 output units.
The network structure of ALVINN is typical of many ANNs. Here the individual units
are interconnected in layers that form a directed acyclic graph. In general, ANNs can be graphs
with many types of structures: acyclic or cyclic, directed or undirected. This chapter will focus
on the most common and practical ANN approaches, which are based on the
BACKPROPAGATION algorithm. The BACKPROPAGATION algorithm assumes the
network is a fixed structure that corresponds to a directed graph, possibly containing cycles.
Learning corresponds to choosing a weight value for each edge in the graph. Although certain
types of cycles are allowed, the vast majority of practical applications involve acyclic feed-
forward networks, similar to the network structure used by ALVINN.
Neural network learning to steer an autonomous vehicle. The ALVINN system uses
BACKPROPAGATION to learn to steer an autonomous vehicle (photo at top) driving at
speeds up to 70 miles per hour. The diagram on the left shows how the image of a forward-
mounted camera is mapped to 960 neural network inputs, which are fed forward to 4 hidden
units, connected to 30 output units. Network outputs encode the commanded steering direction.
The figure on the right shows weight values for one of the hidden units in this network. The 30
x 32 weights into the hidden unit are displayed in the large matrix, with white blocks indicating
positive and black indicating negative weights. The weights from this hidden unit to the 30
output units are depicted by the smaller rectangular block directly above the large block. As
can be seen from these output weights, activation of this particular hidden unit encourages a
turn toward the left.
It is also applicable to problems for which more symbolic representations are often
used, such as the decision tree learning tasks discussed in Chapter 3. In these cases ANN and
decision tree learning often produce results of comparable accuracy. See Shavlik et al. (1991)
and Weiss and Kapouleas (1989) for experimental comparisons of decision tree and ANN
learning. The BACKPROPAGATION algorithm is the most commonly used ANN learning
technique. It is appropriate for problems with the following characteristics:
Instances are represented by many attribute-value pairs. The target function to be learned is
defined over instances that can be described by a vector of predefined features, such as the
pixel values in the ALVINN example. These input attributes may be highly correlated or
independent of one another. Input values can be any real values.
The training examples may contain errors. ANN learning methods are quite robust to noise
in the training data.
Long training times are acceptable. Network training algorithms typically require longer
training times than, say, decision tree learning algorithms. Training times can range from a few
seconds to many hours, depending on factors such as the number of weights in the network,
the number of training examples considered, and the settings of various learning algorithm
parameters.
Fast evaluation of the learned target function may be required. Although ANN learning times
are relatively long, evaluating the learned network, in order to apply it to a subsequent instance,
is typically very fast. For example, ALVINN applies its neural network several times per
second to continually update its steering command as the vehicle drives forward.
The ability of humans to understand the learned target function is not important. The
weights learned by neural networks are often difficult for humans to interpret. Learned neural
networks are less easily communicated to humans than learned rules.
The rest of this chapter is organized as follows: We first consider several alternative
designs for the primitive units that make up artificial neural networks (perceptrons, linear units,
and sigmoid units), along with learning algorithms for training single units. We then present
the BACKPROPAGATION algorithm for training multilayer networks of such units and
consider several general issues such as the representational capabilities of ANNs, nature of the
hypothesis space search, overfitting problems, and alternatives to the BACKPROPAGATION
algorithm.
A learning rule enhances an Artificial Neural Network's performance when applied over the
network. A learning rule updates the weights and bias levels of a network when certain
conditions are met in the training process. It is a crucial part of the development of a Neural
Network.
Types Of Learning Rules in ANN
1. Hebbian Learning Rule
The Hebbian rule is unsupervised in nature: when two connected neurons activate
simultaneously, the weight between them is increased, so the weight change is δw = α xi y,
where α is the learning rate, xi the input, and y the output.
2. Perceptron Learning Rule
The perceptron learning rule is supervised in nature and uses the difference between the
desired and actual output. With
y = actual output
wo = initial weight
wnew = new weight
δw = change in weight
α = learning rate
the actual output is y = ∑ wi xi, the learning signal is ej = ti − y (the difference between the
desired and actual output), and the weight update is
δw = α xi ej
wnew = wo + δw
Now, the output can be calculated on the basis of the net input and the activation function
applied over that net input:
y = 1, if net input ≥ θ
y = 0, if net input < θ
3. Delta Learning Rule (Widrow-Hoff Rule)
It was developed by Bernard Widrow and Marcian Hoff, depends on supervised learning, and
has a continuous activation function. It is also known as the Least Mean Square method, and it
minimizes error over all the training patterns.
It is based on a gradient descent approach. It states that the modification in the weight of a
node is equal to the product of the error and the input, where the error is the difference
between the desired and actual output.
It is computed as follows:
Assume (x1, x2, x3, …, xn) is the set of input vectors and (w1, w2, w3, …, wn) is the set of
weights, with
y = actual output
wo = initial weight
wnew = new weight
δw = change in weight
Error = ti − y
Learning signal (ej) = (ti − y) y′
y = f(net input) = f(∑ wi xi)
δw = α xi ej = α xi (ti − y) y′
wnew = wo + δw
The updating of weights can only be done if there is a difference between the target and
actual output (i.e., an error):
case I: when t=y
then there is no change in weight
case II: else
wnew=wo+δw
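A minimal sketch of the delta rule for a single linear unit; the training examples and learning rate are hypothetical (for a linear unit the derivative y′ is 1):

# Delta (Widrow-Hoff) rule for one linear unit: y = sum(w_i * x_i).
examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
w = [0.0, 0.0]
alpha = 0.1

for _ in range(50):
    for x, t in examples:
        y = sum(wi * xi for wi, xi in zip(w, x))
        error = t - y                      # ti - y
        for i in range(len(w)):
            w[i] += alpha * x[i] * error   # delta-w = alpha * xi * (ti - y)

print(w)  # weights converge toward values minimizing squared error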
4. Correlation Learning Rule
The correlation learning rule follows the same principle as the Hebbian learning rule, i.e., if
two neighboring neurons are operating in the same phase at the same period of time, then the
weight between these neurons should be more positive. For neurons operating in the opposite
phase, the weight between them should be more negative. But unlike the Hebbian rule, the
correlation rule is supervised in nature: here, the targeted response is used for the calculation
of the change in weight.
In Mathematical form:
δw = α xi tj
where δw = change in weight, α = learning rate, xi = the input vector, and tj = the target value.
5. Out Star Learning Rule
Out Star Learning Rule is implemented when nodes in a network are arranged in a layer.
Here the weights linked to a particular node should be equal to the targeted outputs for the
nodes connected through those same weights. The weight change is thus calculated as
δw = α(t − y),
where α = learning rate, y = actual output, and t = desired output for the n layer nodes.
6. Competitive Learning Rule
It is also known as the Winner-Takes-All rule and is unsupervised in nature. Here all the
output nodes try to compete with each other to represent the input pattern; the winner is
declared as the node with the highest output and is given the output 1, while the rest are
given 0.
There are a set of neurons with arbitrarily distributed weights and the activation function is
applied to a subset of neurons. Only one neuron is active at a time. Only the winner has
updated weights, the rest remain unchanged
The perceptron computes its output o as a function of its inputs:

$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \dots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

where each wi is a real-valued constant, or weight, that determines the contribution of input xi
to the perceptron output. Notice the quantity (−w0) is a threshold that
the weighted combination of inputs w1x1 + … + wnxn must surpass in order for
the perceptron to output a 1.
PERCEPTRON CLASSIFIER
One of the earliest and most basic machine learning methods used for binary classification is the
perceptron. Frank Rosenblatt created it in the late 1950s, and it is a key component of more
intricate neural network topologies.
1. Input Features (x): Predictions are based on the characteristics or qualities of the input
data, or input features (x). A number value is used to represent each feature. The two
classes in binary classification are commonly represented by the numbers 0 (negative
class) and 1 (positive class).
2. Input Weights (w): Each input feature has a weight (w), which establishes its
significance when formulating predictions. The weights are numerical values as well
and are initialized either to zeros or to small random values.
3. Weighted Sum (z): To calculate the weighted sum, take the dot product of the input
features (x) with their associated weights (w) and add the bias. Mathematically, it is
written as z = w · x + b.
2. Prediction: The Perceptron calculates the weighted sum (z) of the input features
and weights in order to provide a prediction for a particular input.
3. Activation Function: Following the computation of the weighted sum (z), an
activation function is applied. The perceptron outputs 1 (positive class) if z is greater
than or equal to a specific threshold; otherwise, it outputs 0 (negative class), because the
activation function is a step function.
4. Updating Weight: Weights are updated if a misclassification, or an inaccurate prediction,
is made by the perceptron. The weight update is carried out to reduce prediction
inaccuracy in the future. Typically, the update rule involves shifting the weights in a way
that lowers the error. The perceptron learning rule, which is based on the discrepancy
between the expected and actual class labels, is the most widely used rule.
5. Repeat: Steps 2 through 4 are repeated for each input data point in the training dataset.
This procedure continues until the model converges and accurately categorizes the
training data, which may take a number of iterations.
This perceptron class serves as a simple binary classifier, but it's limited to linearly separable
problems. For more complex tasks, more advanced neural network architectures are typically
used.
PROGRAM
import tensorflow as tf

class Perceptron(tf.Module):
    def __init__(self, num_inputs):
        super(Perceptron, self).__init__()
        # Weights start as small random values; the bias starts at zero
        self.weights = tf.Variable(tf.random.normal(shape=(num_inputs, 1)))
        self.bias = tf.Variable(tf.zeros(shape=(1,)))

    def __call__(self, x):
        # Weighted sum z = xW + b, followed by a step activation
        z = tf.matmul(x, self.weights) + self.bias
        return tf.cast(z >= 0, tf.float32)

# Example usage
num_inputs = 2
perceptron = Perceptron(num_inputs)

# Sample input
input_data = tf.constant([[0., 0.], [0., 1.], [1., 0.], [1., 1.]], dtype=tf.float32)
print(perceptron(input_data))
OUTPUT (will vary with the random initial weights):
[[0.]
[0.]
[0.]
[0.]]
The BACKPROPAGATION algorithm learns the weights for a multilayer network, given a
network with a fixed set of units and interconnections. It employs gradient descent to attempt
to minimize the squared error between the network output values and the target values for these
outputs. This section presents the BACKPROPAGATION algorithm, and the following section
gives the derivation for the gradient descent weight update rule used by
BACKPROPAGATION.
Because we are considering networks with multiple output units rather than single units
as before, we begin by redefining E to sum the errors over all of the network output units:

$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$$

where outputs is the set of output units in the network, and t_kd and o_kd are the target and
output values associated with the kth output unit and training example d.
3. For each hidden unit h, calculate its error term δ_h:
    δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_kh δ_k
4. Update each network weight w_ji:
    w_ji ← w_ji + ∆w_ji
where
    ∆w_ji = η δ_j x_ji
TABLE 4.2 The stochastic gradient descent version of the BACKPROPAGATION algorithm
for feedforward networks containing two layers of sigmoid units.
One major difference in the case of multilayer networks is that the error surface can have
multiple local minima, in contrast to the single-minimum parabolic error surface shown in
Figure 4.4. Unfortunately, this means that gradient descent is guaranteed only to converge
toward some local minimum, and not necessarily the global minimum error. Despite this
obstacle, in practice BACKPROPAGATION has been found to produce excellent results in
many real-world applications. The BACKPROPAGATION algorithm is presented in Table
4.2. The algorithm as described here applies to layered feedforward networks containing two
layers of sigmoid units, with units at each layer connected to all units from the preceding layer.
This is the incremental, or stochastic, gradient descent version of BACKPROPAGATION. The
notation used here is the same as that used in earlier sections, with the following extensions:
algorithm). However, since training examples provide target values t_k only for network
outputs, no target values are directly available to indicate the error of hidden units' values.
Instead, the error term for hidden unit h is calculated by summing the error terms δ_k for each
output unit influenced by h, weighting each of the δ_k's by w_kh, the weight from hidden unit h
to output unit k. This weight characterizes the degree to which hidden unit h is "responsible
for" the error in output unit k.
The algorithm in Table 4.2 updates weights incrementally, following the
presentation of each training example. This corresponds to a stochastic approximation to
gradient descent. To obtain the true gradient of E one would sum the δ_j x_ji values over all
training examples before altering weight values.
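As an illustration, the following is a minimal sketch of one stochastic gradient step of the
two-layer BACKPROPAGATION algorithm of Table 4.2, written in Python with NumPy.
The layer sizes, learning rate, and initialization scale are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 inputs, 4 hidden sigmoid units, 2 output sigmoid units
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.05, size=(4, 3))   # weights w_ji into hidden units
W_output = rng.normal(scale=0.05, size=(2, 4))   # weights w_kh into output units
eta = 0.1                                        # learning rate

def backprop_step(x, t):
    global W_hidden, W_output
    # Forward pass
    o_h = sigmoid(W_hidden @ x)            # hidden unit outputs
    o_k = sigmoid(W_output @ o_h)          # network outputs
    # Error term for each output unit k: delta_k = o_k(1 - o_k)(t_k - o_k)
    delta_k = o_k * (1 - o_k) * (t - o_k)
    # Error term for each hidden unit h: delta_h = o_h(1 - o_h) * sum_k w_kh delta_k
    delta_h = o_h * (1 - o_h) * (W_output.T @ delta_k)
    # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji
    W_output += eta * np.outer(delta_k, o_h)
    W_hidden += eta * np.outer(delta_h, x)

backprop_step(np.array([1.0, 0.0, 1.0]), np.array([1.0, 0.0]))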
ADDING MOMENTUM
One common variation on the basic algorithm is to make the weight update on the nth
iteration depend partially on the update that occurred during the (n − 1)th iteration:

∆w_ji(n) = η δ_j x_ji + α ∆w_ji(n − 1)

Here ∆w_ji(n) is the weight update performed during the nth iteration through the main
loop of the algorithm, and 0 ≤ α < 1 is a constant called the momentum. Notice the first term
on the right of this equation is just the weight-update rule of Equation (T4.5) in the
BACKPROPAGATION algorithm. The second term on the right is new and is called the
momentum term. To see the effect of this momentum term, consider that the gradient descent
search trajectory is analogous to that of a (momentumless) ball rolling down the error surface.
The effect of α is to add momentum that tends to keep the ball rolling in the same direction
from one iteration to the next. This can sometimes have the effect of keeping the ball rolling
through small local minima in the error surface, or along flat regions in the surface where the
ball would stop if there were no momentum. It also has the effect of gradually increasing the
step size of the search in regions where the gradient is unchanging, thereby speeding
convergence.
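A minimal sketch of the momentum-augmented update, shown for a single scalar weight;
the values of η and α are illustrative.

eta, alpha = 0.1, 0.9       # learning rate and momentum constant
delta_w_prev = 0.0          # Delta w(n-1), initially zero

def momentum_update(delta_j, x_ji):
    # Delta w(n) = eta * delta_j * x_ji + alpha * Delta w(n-1)
    global delta_w_prev
    delta_w = eta * delta_j * x_ji + alpha * delta_w_prev
    delta_w_prev = delta_w
    return delta_w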
In general, the δ_r value for a unit r in layer m is computed from the δ values at the
next deeper layer m + 1 according to

δ_r = o_r (1 − o_r) Σ_{s∈layer m+1} w_sr δ_s

Notice this is identical to Step 3 in the algorithm of Table 4.2, so all we are really saying here
is that this step may be repeated for any number of hidden layers in the network.
It is equally straightforward to generalize the algorithm to any directed acyclic graph,
regardless of whether the network units are arranged in uniform layers as we have assumed up
to now. In the case that they are not, the rule for calculating δ for any internal unit (i.e., any
unit that is not an output) is

δ_r = o_r (1 − o_r) Σ_{s∈Downstream(r)} w_sr δ_s

where Downstream(r) is the set of units immediately downstream from unit r in the network:
that is, all units whose inputs include the output of unit r. It is this general form of the weight-
update rule that we derive below.
The derivatives ∂(t_k − o_k)²/∂net_j will be zero for all output units k except when k = j. We therefore
drop the summation over output units and simply set k = j:

∂E_d/∂o_j = −(t_j − o_j)    (4.24)

Next consider the second term in Equation (4.23). Since o_j = σ(net_j), the
derivative ∂o_j/∂net_j is just the derivative of the sigmoid function, which we have
already noted is equal to σ(net_j)(1 − σ(net_j)). Therefore,

∂o_j/∂net_j = o_j (1 − o_j)    (4.25)

Substituting expressions (4.24) and (4.25) into (4.23), we obtain

∂E_d/∂net_j = −(t_j − o_j) o_j (1 − o_j)

and combining this with Equations (4.21) and (4.22), we have the stochastic gradient descent
rule for output units:

∆w_ji = η (t_j − o_j) o_j (1 − o_j) x_ji

Note this training rule is exactly the weight update rule implemented by Equations (T4.3) and
(T4.5) in the algorithm of Table 4.2. Furthermore, we can see
now that δ_k in Equation (T4.3) is equal to the quantity −∂E_d/∂net_k. In the remainder
of this section we will use δ_i to denote the quantity −∂E_d/∂net_i for an arbitrary unit i.
Case 2: Training Rule for Hidden Unit Weights. In the case where j is an internal, or hidden,
unit in the network, the derivation of the training rule for w_ji must take into account the indirect
ways in which w_ji can influence the network outputs and hence E_d. For this reason, we will
find it useful to refer to the set of all units immediately downstream of unit j in the network
(i.e., all units whose direct inputs include the output of unit j). We denote this set of units by
Downstream(j). Notice that net_j can influence the network outputs (and therefore E_d) only
through the units in Downstream(j). Therefore, we can write

∂E_d/∂net_j = Σ_{k∈Downstream(j)} (∂E_d/∂net_k)(∂net_k/∂net_j)
            = Σ_{k∈Downstream(j)} −δ_k w_kj o_j (1 − o_j)

Rearranging terms and using δ_j to denote −∂E_d/∂net_j, we have

δ_j = o_j (1 − o_j) Σ_{k∈Downstream(j)} δ_k w_kj

and

∆w_ji = η δ_j x_ji

which is precisely the general rule from Equation (4.20) for updating internal unit weights in
arbitrary acyclic directed graphs. Notice Equation (T4.4) from Table 4.2 is just a special case
of this rule, in which Downstream(j) = outputs.
The delta rule in an artificial neural network is a specific kind of backpropagation that helps
refine the network, forming associations between inputs and outputs across layers of artificial
neurons. The delta rule is also called the delta learning rule.
Generally, backpropagation recalculates the input weights of artificial neurons using a
gradient technique. Delta learning does this by using the difference between a target
activation and the obtained activation. With a linear activation function, the network
connections are balanced. Another way to describe the delta rule is that it uses an error
function to perform gradient descent learning.
The delta rule compares the actual output with a target output, measures the mismatch, and
makes changes accordingly. The exact implementation of the delta rule varies with the
network and its composition; still, by applying a linear activation function, the delta rule can
be useful in refining certain kinds of neural networks with specific kinds of backpropagation.
The delta rule was introduced by Widrow and Hoff; it is one of the most significant learning
rules that depends on supervised learning.
This rule states that the change in the weight of a node is proportional to the product of the
error and the input.
Mathematical equation:
The delta learning rule is given by:

∆w = µ·x·z, where z = (t − y), i.e., ∆w = µ(t − y)x

Here,
∆w = weight change,
µ = learning rate,
z = (t − y) is the difference between the desired output t and the actual output y.
The above rule applies to a single output unit.
The weight update can be determined with respect to these two cases:
Case 1 (the error is nonzero): w(new) = w(old) + ∆w
Case 2 (the error is zero): no change in weight
For a given input vector, the output vector is compared with the target (correct) answer. If
the difference is zero, no learning takes place; otherwise, the weights are adjusted to reduce
the difference. If the set of input patterns is taken from an independent set, arbitrary
associations can be learned using the delta rule. It has been examined for networks with a
linear activation function and no hidden units. The graph of squared error versus the weights
is a paraboloid in n-space. Since the proportionality constant is negative, the graph of such a
function is concave upward with a minimum value. The vertex of the paraboloid represents
the point of minimum error, and the weight vector at this point is the ideal weight vector. The
delta learning rule can be used with both a single output unit and several output units.
Applying the delta rule reveals an error as the difference between the actual and the expected
output, which it then works to diminish.
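A minimal sketch of the delta (Widrow–Hoff) rule for a single linear unit in Python; the
training data and the learning rate µ are illustrative.

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5]])   # illustrative inputs
t = np.array([1.0, 0.5, 0.2])                        # desired outputs
w = np.zeros(2)
mu = 0.05                                            # learning rate

for epoch in range(100):
    for x_i, t_i in zip(X, t):
        y = np.dot(w, x_i)           # linear activation
        # Delta w = mu * (t - y) * x : change proportional to error times input
        w += mu * (t_i - y) * x_i

print(w)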
2.6 ASSOCIATIVE MEMORY
An associative memory network refers to a content-addressable memory structure that associates
a relationship between the set of input patterns and output patterns. A content-addressable
memory structure is a kind of memory structure that enables the recollection of data based on
the degree of similarity between the input pattern and the patterns stored in the memory.
The figure given below illustrates a memory containing the names of various people. If the
given memory is content-addressable, a partially incorrect key string such as "Albert Einstien" is
sufficient to recover the correct name "Albert Einstein."
This type of memory is therefore robust and fault-tolerant, because the memory model has
some form of error-correction capability.
Note: An associative memory is accessed by its content, as opposed to an explicit address in the
traditional computer memory system. The memory enables the recollection of information
based on incomplete knowledge of its contents.
There are two types of associative memory: auto-associative memory and hetero-associative
memory.
Auto-associative memory:
An auto-associative memory recovers a previously stored pattern that most closely relates to
the current pattern. It is also known as an auto-associative correlator.
Let x[1], x[2], x[3], ….. x[M] be the stored pattern vectors, and let x[m] be an element of
these vectors, representing characteristics obtained from the patterns. The auto-associative
memory will return the pattern vector x[m] when presented with a noisy or incomplete
version of x[m].
Hetero-associative memory:
In a hetero-associative memory, the recovered pattern is generally different from the input pattern,
not only in type and format but also in content. It is also known as a hetero-
associative correlator.
Suppose we have a number of key-response pairs {a(1), x(1)}, {a(2), x(2)}, ….., {a(M), x(M)}.
The hetero-associative memory will return a pattern vector x(m) when a noisy or incomplete
version of a(m) is given.
Neural networks are usually used to implement these associative memory models, called neural
associative memory (NAM). The linear associator is the simplest artificial neural associative
memory.
These models follow distinct neural network architectures to memorize data.
An associative memory is a repository of associated patterns in some encoded form. If the
repository is triggered with a pattern, the associated pattern pair appears at the output. The
input could be an exact or partial representation of a stored pattern. If the memory is
presented with an input pattern, say α, the associated pattern ω is recovered automatically.
These are the terms related to the associative memory network:
Encoding or memorization: Encoding means constructing the weight matrix W = [w_ij] from
the pattern pairs so that presenting an input pattern retrieves its associated stored pattern. A
common (Hebbian, outer-product) choice is

w_ij = Σ_{k=1}^{M} x_i(k) y_j(k)

where i = 1, 2, …, m and j = 1, 2, …, n.
The input pattern may contain errors and noise, or may be an incomplete version of some
previously encoded pattern. If a corrupted input pattern is presented, the network will recover
the stored pattern that is closest to the actual input pattern. The presence of noise or errors
results only in a moderate decrease, rather than total degradation, in the performance of the
network. Thus, associative memories are robust and fault-tolerant because many processing
units perform highly parallel and distributed computations.
Performance Measures:
The measures taken for the associative memory performance to correct recovery are memory
capacity and content addressability. Memory capacity can be defined as the maximum
number of associated pattern pairs that can be stored and correctly recovered. Content-
addressability refers to the ability of the network to recover the correct stored pattern.
If input patterns are mutually orthogonal, perfect recovery is possible. If stored input patterns
are not mutually orthogonal, non-perfect recovery can happen due to intersection among the
patterns.
The linear associator is the simplest and most widely used associative memory model. It is a
collection of simple processing units which together have a quite complex collective computational
capability and behavior. The Hopfield model computes its output recursively in time until the
system becomes stable. Hopfield networks are constructed from bipolar units and a learning
procedure. The Hopfield model is an auto-associative memory proposed by John
Hopfield in 1982. Bidirectional Associative Memory (BAM) and the Hopfield model are
other popular artificial neural network models used as associative memories.
The neural associative memory models use various neural network architectures to
memorize data. The network comprises either a single layer or two layers. The linear associator
model is a feed-forward network comprising two layers of processing units: the first layer
serves as the input layer while the other serves as the output layer. The Hopfield model is a
single layer of processing elements where each unit is connected to every other unit in the
network. The bidirectional associative memory (BAM) model is similar to the linear
associator, but the associations are bidirectional.
The neural network architectures of these models and the structure of the
corresponding association weight matrix W of the associative memory are depicted.
The linear associator model is a feed-forward network whose output is produced in a
single feed-forward computation. The model comprises two layers of processing units, one
serving as an input layer and the other as an output layer. The inputs are directly connected to
the outputs through a series of weights, with a weighted connection linking each input to
every output. Each neuron node computes the sum of the products of the weights and the
inputs. The architecture of the linear associator is given below.
All p input units are connected to all q output units via the weight matrix
W = [w_ij]_{p×q}, where w_ij describes the strength of the unidirectional connection from the
ith input unit to the jth output unit.
The connection weight matrix stores the z different associated pattern pairs {(X_k, Y_k);
k = 1, 2, 3, …, z}. Constructing an associative memory means building the connection weight
matrix W such that when an input pattern is presented, the stored pattern associated with that
input pattern is recovered.
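The following is a minimal sketch of a linear associator encoded with the outer-product
(Hebbian) rule in Python; the bipolar pattern pairs are illustrative, and recall is perfect here
because the key patterns are mutually orthogonal.

import numpy as np

X = np.array([[1, -1, 1, -1],        # two input (key) patterns, p = 4
              [1,  1, -1, -1]])
Y = np.array([[1, -1, -1],           # associated output patterns, q = 3
              [-1, 1, -1]])

# Encoding: W = sum_k x_k^T y_k, so w_ij links input unit i to output unit j
W = X.T @ Y

# Recall: present a key pattern and read the output through W (sign threshold)
recalled = np.sign(X[0] @ W)
print(recalled)    # recovers Y[0] because the stored keys are orthogonal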
The Adaptive Resonance Theory (ART) was developed as a theory of human cognitive
information processing. The theory has led to neural models for pattern recognition and
unsupervised learning. ART systems have been used to explain various types of cognitive
and brain data.
ART began as a formal analysis of how to overcome the learning instability exhibited by
competitive learning models, which led to the introduction of an expanded theory, called
adaptive resonance theory (ART). This formal investigation indicated that a specific type of
top-down learned feedback and matching mechanism could significantly overcome the
instability issue. It was realized that top-down attentional mechanisms, which had earlier
been found through an investigation of connections between cognitive and reinforcement
mechanisms, had characteristics similar to these code-stabilizing mechanisms. In other
words, once it was understood how to solve the instability issue formally, it also became
clear that one did not need to develop any qualitatively new mechanism to do so; one only
needed to incorporate previously discovered attentional mechanisms. These additional
mechanisms enable code learning to self-stabilize in response to an essentially arbitrary input
stream. Grossberg presented the basic principles of the adaptive resonance theory. A category
of ART called ART1 has been described as an arrangement of ordinary differential equations
by Carpenter and Grossberg. Their theorems can predict both the order of search as a function
of the learning history of the system and the input patterns.
Gain control enables L1 and L2 to distinguish the current stage of the running cycle. The
STM reset wave inhibits active L2 cells when mismatches between bottom-up and top-down
signals occur at L1. The comparison layer receives the binary external input and passes it to
the recognition layer, which is responsible for matching it to a classification category. This
outcome is passed back to the comparison layer to determine whether the category matches
the input vector.
If there is a match, then a new input vector is read, and the cycle begins once again. If
there is a mismatch, then the orienting system is in charge of preventing the previous category
from getting a new category match in the recognition layer. The two gains control the
activity of the recognition and the comparison layer, respectively. The reset wave specifically
and enduringly inhibits the active L2 cell until the current input is removed. The offset of the
input pattern ends its processing at L1 and triggers the offset of Gain 2. The Gain 2 offset
causes a gradual decay of STM at L2, thereby preparing L2 to encode the next input pattern
without bias.
ART1 Implementation process:
ART1 is a self-organizing neural network whose input and output neurons are mutually
coupled using bottom-up and top-down adaptive weights that perform recognition. To start,
the system is first trained as per the adaptive resonance theory by inputting reference pattern
data, in the form of a 5×5 matrix, into the neurons for clustering within the output neurons.
Next, the maximum number of nodes in L2 is defined, followed by the vigilance parameter.
The input pattern registers itself as short-term memory (STM) activity over a field of nodes
L1. Converging and diverging pathways from L1 to the coding field L2, each weighted by an
adaptive long-term memory trace, transform it into a net signal vector T. Internal competitive
dynamics at L2 further transform T, creating a compressed code or content-addressable
memory. With strong competition, activation is concentrated at the L2 node that receives the
maximal L1 → L2 signal. The primary objective of this work is divided into four phases:
comparison, recognition, search, and learning.
Advantages of ART:
It can be coordinated and utilized with different techniques to give more precise outcomes.
It can be used in different fields such as face recognition, embedded systems, robotics, target
recognition, medical diagnosis, signature verification, etc.
It shows stability and is not disturbed by a wide range of inputs provided to it.
It has benefits over competitive learning: competitive learning cannot add new clusters when
necessary.
Application of ART:
ART stands for Adaptive Resonance Theory. ART neural networks, used for fast, stable
learning and prediction, have been applied in many areas. The applications include target
recognition, face recognition, medical diagnosis, signature verification, and mobile robot
control.
Target recognition:
A fuzzy ARTMAP neural network can be used for automatic classification of targets
based on their radar range profiles. Tests on synthetic data show that fuzzy ARTMAP can
yield substantial savings in memory requirements compared to k-nearest-neighbor (kNN)
classifiers. The use of multiwavelength profiles mainly improves the performance of
both kinds of classifiers.
Medical diagnosis:
Signature verification:
Automatic signature verification is a well-known and active area of research with
various applications such as bank check confirmation, ATM access, etc. The training of the
network is done using ART1, which uses global features as the input vector; the verification
and recognition phase uses a two-step process. In the first step, the input vector is matched
with the stored reference vector, which was used as a training set, and in the second step,
cluster formation takes place.
Mobile robot control:
Nowadays, we see a wide range of robotic devices. Their programming, called artificial
intelligence, is still a field of research. The human brain is an interesting subject as a model
for such an intelligent system. Inspired by the structure of the human brain, artificial neural
networks emerged. Similar to the brain, an artificial neural network contains numerous
simple computational units, neurons, that are interconnected to allow the transfer of signals
from neuron to neuron. Artificial neural networks are used to solve different problems with
good outcomes compared to other decision algorithms.
Limitations of ART:
Some ART networks are inconsistent in that their results depend on the order of the training
data, or on the learning rate.
UNIT -3
The term fuzzy refers to things that are not clear or are vague. In the real world we often
encounter situations in which we cannot determine whether a state is true or false; there,
fuzzy logic provides very valuable flexibility for reasoning. In this way, we can account for
the inaccuracies and uncertainties of any situation.
Fuzzy Logic is a form of many-valued logic in which the truth values of variables
may be any real number between 0 and 1, instead of just the traditional values of true or false.
It is used to deal with imprecise or uncertain information and is a mathematical method for
representing vagueness and uncertainty in decision-making.
Fuzzy Logic is based on the idea that in many cases, the concept of true or false is too
restrictive, and that there are many shades of gray in between. It allows for partial truths,
where a statement can be partially true or false, rather than fully true or false.
Fuzzy Logic is used in a wide range of applications, such as control systems, image
processing, natural language processing, medical diagnosis, and artificial intelligence.
The fundamental concept of Fuzzy Logic is the membership function, which defines the
degree of membership of an input value to a certain set or category. The membership function
is a mapping from an input value to a membership degree between 0 and 1, where 0 represents
non-membership and 1 represents full membership.
Fuzzy Logic is implemented using Fuzzy Rules, which are if-then statements that
express the relationship between input variables and output variables in a fuzzy way. The
output of a Fuzzy Logic system is a fuzzy set, which is a set of membership degrees for each
possible output value.
In a Boolean system, the truth value 1.0 represents absolute truth and 0.0 represents absolute
falsity, and there is nothing in between. In a fuzzy system, by contrast, there are also
intermediate values, which are partially true and partially false.
ARCHITECTURE
RULE BASE: It contains the set of rules and the IF-THEN conditions provided by the
experts to govern the decision-making system, on the basis of linguistic information.
Recent developments in fuzzy theory offer several effective methods for the design and
tuning of fuzzy controllers. Most of these developments reduce the number of fuzzy rules.
FUZZIFICATION: It is used to convert inputs i.e. crisp numbers into fuzzy sets. Crisp
inputs are basically the exact inputs measured by sensors and passed into the control
system for processing, such as temperature, pressure, rpm’s, etc.
INFERENCE ENGINE: It determines the matching degree of the current fuzzy input with
respect to each rule and decides which rules are to be fired according to the input field.
Next, the fired rules are combined to form the control actions.
DEFUZZIFICATION: It is used to convert the fuzzy sets obtained by the inference engine
into a crisp value.
Membership function
Definition: A graph that defines how each point in the input space is mapped to a membership
value between 0 and 1. The input space is often referred to as the universe of discourse or
universal set (U), which contains all the possible elements of concern in each particular
application.
Common fuzzifiers include:
Singleton fuzzifier
Gaussian fuzzifier
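For illustration, here is a minimal sketch of two common membership functions in Python;
the centers and widths are illustrative choices.

import numpy as np

def gaussian_mf(x, c, sigma):
    # Gaussian membership: peaks at the center c, width controlled by sigma
    return np.exp(-0.5 * ((x - c) / sigma) ** 2)

def triangular_mf(x, a, b, c):
    # Triangular membership rising from a to a peak at b, falling to c
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Degree to which a temperature of 18 degrees belongs to the fuzzy set "warm"
print(gaussian_mf(18.0, c=22.0, sigma=4.0))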
Advantages of Fuzzy Logic:
It may not be designed to give accurate reasoning, but it is designed to give acceptable
reasoning.
It can emulate human deductive thinking, that is, the process people use to infer
conclusions from what they know.
Any uncertainties can be easily dealt with the help of fuzzy logic.
This system can work with any type of inputs whether it is imprecise, distorted or noisy
input information.
Fuzzy logic comes with mathematical concepts of set theory and the reasoning of that is
quite simple.
It provides a very efficient solution to complex problems in all fields of life as it resembles
human reasoning and decision-making.
The algorithms can be described with little data, so little memory is required.
Disadvantages of Fuzzy Logic:
Many researchers have proposed different ways to solve a given problem through fuzzy
logic, which leads to ambiguity; there is no systematic approach to solving a given problem
through fuzzy logic.
Proof of its characteristics is difficult or impossible in most cases, because we do not always
obtain a mathematical description of our approach.
As fuzzy logic works on precise as well as imprecise data so most of the time accuracy
is compromised.
Application
It is used in the aerospace field for altitude control of spacecraft and satellites.
It has been used in the automotive system for speed control, traffic control.
It is used for decision-making support systems and personal evaluation in the large
company business.
It has application in the chemical industry for controlling the pH, drying, chemical
distillation process.
Fuzzy logic is used in Natural language processing and various intensive applications in
Artificial Intelligence.
Fuzzy logic is extensively used in modern control systems such as expert systems.
Fuzzy Logic is used with neural networks as it mimics how a person would make
decisions, only much faster. This is done by aggregating data and transforming it into more
meaningful information in the form of fuzzy sets representing partial truths.
Classical set
1. A classical set is a collection of distinct objects: for example, the set of students with
passing grades.
3. The classical set is defined in such a way that the universe of discourse is split into two
groups: members and non-members. Hence, in the case of classical sets, no partial
membership exists.
4. Let A be a given set. The membership (characteristic) function used to define the set A is
given by:

µ_A(x) = 1 if x ∈ A, and µ_A(x) = 0 if x ∉ A
Union: A ∪ B = {x | x ∈ A or x ∈ B}
Intersection: A ∩ B = {x | x ∈ A and x ∈ B}
Complement: Ā = {x | x ∉ A, x ∈ U}
Difference: A − B = {x | x ∈ A and x ∉ B}
Commutativity: A ∪ B = B ∪ A; A ∩ B = B ∩ A
Associativity: (A ∪ B) ∪ C = A ∪ (B ∪ C); (A ∩ B) ∩ C = A ∩ (B ∩ C)
Distributivity: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C); A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
Idempotency: A ∪ A = A; A ∩ A = A
Identity: A ∪ ∅ = A; A ∩ U = A
Transitivity: if A ⊆ B and B ⊆ C, then A ⊆ C
Fuzzy set:
1. A fuzzy set is a set having degrees of membership between 1 and 0. Fuzzy sets are
represented with a tilde character (~). For example, the number of cars following traffic
signals at a particular time, out of all cars present, will have a membership value in [0, 1].
2. Partial membership exists when a member of one fuzzy set can also be a part of other
fuzzy sets in the same universe.
3. The degree of membership or truth is not the same as probability; fuzzy truth represents
membership in vaguely defined sets.
4. A fuzzy set Ã in the universe of discourse U can be defined as a set of ordered pairs, and
it is given by

Ã = {(x, µ_Ã(x)) | x ∈ U}

When the universe of discourse U is discrete and finite, the fuzzy set Ã is given by

Ã = Σᵢ µ_Ã(xᵢ)/xᵢ
Algebraic sum: µ_{Ã+B̃}(x) = µ_Ã(x) + µ_B̃(x) − µ_Ã(x)·µ_B̃(x)
Algebraic product: µ_{Ã·B̃}(x) = µ_Ã(x)·µ_B̃(x)
Bounded sum: µ(x) = min[1, µ_Ã(x) + µ_B̃(x)]
Bounded difference: µ(x) = max[0, µ_Ã(x) − µ_B̃(x)]
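A minimal sketch of these fuzzy set operations applied element-wise to membership vectors
in Python; the membership values are illustrative.

import numpy as np

mu_A = np.array([0.2, 0.5, 0.8, 1.0])
mu_B = np.array([0.6, 0.4, 0.9, 0.3])

union        = np.maximum(mu_A, mu_B)            # max operator
intersection = np.minimum(mu_A, mu_B)            # min operator
complement_A = 1.0 - mu_A
alg_sum      = mu_A + mu_B - mu_A * mu_B         # algebraic sum
alg_product  = mu_A * mu_B                       # algebraic product
bounded_sum  = np.minimum(1.0, mu_A + mu_B)      # bounded sum
bounded_diff = np.maximum(0.0, mu_A - mu_B)      # bounded difference
print("union:", union)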
Crisp Set: A crisp set is a collection of countable, well-defined objects. X is a crisp set
defined as a group of elements over the universal set U. A given random element either
belongs to X or does not; only these two possibilities exist for defining membership in the set.
Fuzzy Set: A collection of elements having varying degrees of membership in the set is
called a fuzzy set. The word "fuzzy" indicates vagueness; the gradation among the various
degrees of membership expresses the vagueness and ambiguity of the fuzzy set. Hence, the
membership of the elements from the universe in the set is measured against a function that
captures the uncertainty and ambiguity.
1. A crisp set defines the membership value as either 0 or 1; a fuzzy set defines the
membership value as anything between 0 and 1, including both 0 and 1.
Example: "Rahul is 1.6 m tall" (crisp) vs. "Rahul is about 1.6 m tall" (fuzzy).
7. In a crisp set, full membership means totally true/false, yes/no, 0/1; in a fuzzy set,
partial membership means true to false, yes to no, 0 to 1.
Fuzzy sets follow the same properties as crisp sets. Because of this fact, and because the
membership values of a crisp set are a subset of the interval [0, 1], classical (crisp) sets can
be viewed as a special case of fuzzy sets.
3.3 BASIC SET OF OPERATIONS
The union of two sets is a set containing all elements that are in A or in B (possibly both).
For example, {1,2} ∪ {2,3} = {1,2,3}. Thus, we can write x ∈ (A ∪ B) if and only if
(x ∈ A) or (x ∈ B). Note that A ∪ B = B ∪ A. In Figure 1.4, the union of sets A and B is
shown by the shaded area in the Venn diagram.
More generally, the union of n sets A₁, A₂, ⋯, Aₙ is written ⋃_{i=1}^{n} Aᵢ.
The intersection of two sets A and B, denoted by A ∩ B, consists of all elements that
are in both A and B. For example, {1,2} ∩ {2,3} = {2}. In Figure 1.5, the intersection of
sets A and B is shown by the shaded area using a Venn diagram.
More generally, for sets A₁, A₂, A₃, ⋯, their intersection ⋂ᵢ Aᵢ is defined as the set
consisting of the elements that are in all Aᵢ's. Figure 1.6 shows the intersection of three sets.
The complement of a set A, denoted by Aᶜ or Ā, is the set of all elements that are in the
universal set S but are not in A. In Figure 1.7, Ā is shown by the shaded area using a Venn
diagram.
3.6 T-NORM
Definition
A t-norm is a function T: [0, 1] × [0, 1] → [0, 1] that satisfies the following properties:
Commutativity: T(a, b) = T(b, a)
Monotonicity: T(a, b) ≤ T(c, d) if a ≤ c and b ≤ d
Associativity: T(a, T(b, c)) = T(T(a, b), c)
Identity element: T(a, 1) = a
Since a t-norm is a binary algebraic operation on the interval [0, 1], infix algebraic notation
is also common, with the t-norm usually denoted by ∗.
The defining conditions of the t-norm are exactly those of a partially ordered abelian
monoid on the real unit interval [0, 1]. (Cf. ordered group.) The monoidal operation of any
partially ordered abelian monoid L is therefore by some authors called a triangular norm on L.
Classification of t-norms
A t-norm is called nilpotent if it is continuous and each x in the open interval (0, 1) is nilpotent,
that is, there is a natural number n such that x ∗ ⋯ ∗ x (n times) equals 0.
A t-norm is called Archimedean if it has the Archimedean property, that is, if for
each x, y in the open interval (0, 1) there is a natural number n such that x ∗ ⋯ ∗ x (n times)
is less than or equal to y.
As functions, pointwise larger t-norms are sometimes called stronger than those pointwise
smaller. In the semantics of fuzzy logic, however, the larger a t-norm, the weaker (in terms
of logical strength) conjunction it represents.
Properties of t-norms
The drastic t-norm is the pointwise smallest t-norm and the minimum is the pointwise largest
t-norm: T_drastic(a, b) ≤ T(a, b) ≤ min(a, b) for any t-norm T and all a, b in [0, 1].
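A minimal sketch of four standard t-norms in Python, illustrating the ordering from the
drastic t-norm up to the minimum; the point (0.7, 0.6) is an arbitrary example.

def t_min(a, b):          # minimum (Godel) t-norm - pointwise largest
    return min(a, b)

def t_product(a, b):      # product t-norm (strict)
    return a * b

def t_lukasiewicz(a, b):  # Lukasiewicz t-norm (nilpotent)
    return max(0.0, a + b - 1.0)

def t_drastic(a, b):      # drastic t-norm - pointwise smallest
    return min(a, b) if max(a, b) == 1.0 else 0.0

a, b = 0.7, 0.6
for T in (t_drastic, t_lukasiewicz, t_product, t_min):
    print(T.__name__, T(a, b))   # values increase from drastic up to min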
For every t-norm T, the number 0 acts as null element: T(a, 0) = 0 for all a in [0, 1].
A t-norm T has zero divisors if and only if it has nilpotent elements; each nilpotent element
of T is also a zero divisor of T. The set of all nilpotent elements is an interval [0, a] or
[0, a), for some a in [0, 1].
Although real functions of two variables can be continuous in each variable without being
continuous on [0, 1]2, this is not the case with t-norms: a t-norm T is continuous if and only if
it is continuous in one variable, i.e., if and only if the functions fy(x) = T(x, y) are continuous
for each y in [0, 1]. Analogous theorems hold for left- and right-continuity of a t-norm.
A continuous t-norm is Archimedean if and only if 0 and 1 are its only idempotents.
Thus, with a continuous Archimedean t-norm T, either all or none of the elements of
(0, 1) are nilpotent. If it is the case that all elements in (0, 1) are nilpotent, then the t-norm
is isomorphic to the Łukasiewicz t-norm; i.e., there is a strictly increasing bijection of [0, 1]
onto itself that transforms the t-norm into the Łukasiewicz t-norm.
If on the other hand it is the case that there are no nilpotent elements of T, the t-norm is
isomorphic to the product t-norm. In other words, all nilpotent t-norms are isomorphic, the
Łukasiewicz t-norm being their prototypical representative; and all strict t-norms are
isomorphic, with the product t-norm as their prototypical example. The Łukasiewicz t-
norm is itself isomorphic to the product t-norm undercut at 0.25, i.e., to the function p(x, y)
= max(0.25, x ⋅ y) on [0.25, 1]2.
For each continuous t-norm, the set of its idempotents is a closed subset of [0, 1]. Its
complement—the set of all elements that are not idempotent—is therefore a union of
countably many non-overlapping open intervals. The restriction of the t-norm to any of
these intervals (including its endpoints) is Archimedean, and thus isomorphic either to the
Łukasiewicz t-norm or the product t-norm. For such x, y that do not fall into the same open
interval of non-idempotents, the t-norm evaluates to the minimum of x and y. These
conditions actually give a characterization of continuous t-norms, called the Mostert–
Shields theorem, since every continuous t-norm can in this way be decomposed, and the
described construction always yields a continuous t-norm. The theorem can also be
formulated as follows:
A t-norm is continuous if and only if it is isomorphic to an ordinal sum of the
minimum, Łukasiewicz, and product t-norms. A similar characterization theorem for
non-continuous t-norms is not known (not even for left-continuous ones); only some
non-exhaustive methods for the construction of t-norms have been found.
Fuzzy relations also map elements of one universe, say X, to those of another universe, say Y,
through the Cartesian product of the two universes. However, the ‘‘strength’’ of the relation
between ordered pairs of the two universes is not measured with the characteristic function,
but rather with a membership function expressing various ‘‘degrees’’ of strength of the
relation on the unit interval [0, 1]. Hence, a fuzzy relation R̃ is a mapping from the Cartesian
space X × Y to the interval [0, 1], where the strength of the mapping is expressed by the
membership function of the relation for ordered pairs from the two universes, µ_R̃(x, y).
Since the cardinality of fuzzy sets on any universe is infinity, the cardinality of a fuzzy
relation between two or more universes is also infinity.
Let R̃ and S̃ be fuzzy relations on the Cartesian space X × Y. Then the following operations
apply for the membership values for the various set operations:

Union: µ_{R̃∪S̃}(x, y) = max(µ_R̃(x, y), µ_S̃(x, y))
Intersection: µ_{R̃∩S̃}(x, y) = min(µ_R̃(x, y), µ_S̃(x, y))
Complement: µ_{R̄}(x, y) = 1 − µ_R̃(x, y)
As seen in the foregoing expressions, the excluded middle axioms for relations do not result,
in general, in the null relation, O, or the complete relation, E.
Because fuzzy relations in general are fuzzy sets, we can define the Cartesian product to be
a relation between two or more fuzzy sets. Let Ã be a fuzzy set on universe X and B̃ be a
fuzzy set on universe Y; then the Cartesian product between fuzzy sets Ã and B̃ will result
in a fuzzy relation R̃, which is contained within the full Cartesian product space, i.e.,
Ã × B̃ = R̃ ⊂ X × Y, where the fuzzy relation R̃ has membership function

µ_R̃(x, y) = µ_{Ã×B̃}(x, y) = min(µ_Ã(x), µ_B̃(y))

The Cartesian product is not the same operation as the arithmetic product; the former employs
the idea of pairing of elements among sets, whereas the latter uses actual arithmetic products
between elements of sets. Each of the fuzzy sets could be thought of as a vector of membership
values; each value is associated with a particular element in each set. For example, for a fuzzy
set (vector) Ã that has four elements, hence a column vector of size 4 × 1, and for a fuzzy set
(vector) B̃ that has five elements, hence a row vector of size 1 × 5, the resulting fuzzy relation
R̃ will be represented by a matrix of size 4 × 5, i.e., it will have four rows and five columns.
This result is illustrated in the following example.
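A minimal sketch of the fuzzy Cartesian product using the min operator in Python,
reproducing the 4 × 5 case described above; the membership values are illustrative.

import numpy as np

mu_A = np.array([0.2, 0.5, 1.0, 0.7])          # column vector, size 4
mu_B = np.array([0.1, 0.4, 0.6, 0.9, 1.0])     # row vector, size 5

# mu_R(x, y) = min(mu_A(x), mu_B(y)) for every ordered pair (x, y)
R = np.minimum.outer(mu_A, mu_B)
print(R.shape)    # (4, 5): four rows and five columns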
Fuzzy if-then rules are a fundamental component of fuzzy logic systems. These rules are used
to model relationships between input variables and output variables in a fuzzy inference system.
Each rule consists of two main parts: the antecedent (the "if" part) and the consequent (the
"then" part).
1. Antecedent (If Part): This part of the rule specifies the conditions under which the rule
should be applied. It typically involves one or more fuzzy sets defined over the input
variables. These fuzzy sets are often characterized by membership functions that
describe the degree to which each input variable satisfies the conditions.
2. Consequent (Then Part): This part of the rule specifies the action or conclusion to be
taken if the conditions specified in the antecedent are met. It typically involves one or
more fuzzy sets defined over the output variables, along with corresponding
membership functions that describe the degree to which the conclusion should be
applied.
For example, consider a simple fuzzy if-then rule in a temperature control system:

IF Temperature is Cold THEN Heater Power is High

In this rule:
Antecedent: "Temperature is Cold" is a fuzzy set defined over the variable "Temperature."
Consequent: "Heater Power is High" is a fuzzy set defined over the variable "Heater
Power."
The membership functions associated with these fuzzy sets determine the degree to which the
conditions are satisfied and the conclusions are applied. So, if the temperature is moderately
cold, the membership value for "Temperature is Cold" might be, say, 0.6, and this would
influence the degree to which the heater power should be set to high.
In a fuzzy inference system, multiple fuzzy if-then rules are combined and applied to input data
using fuzzy reasoning techniques to determine appropriate output values.
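A minimal sketch of evaluating a single rule in Python. The membership value 0.6 matches
the example above; the Mamdani-style min clipping used here is one common inference
choice, not the only one.

# Mamdani-style evaluation of: IF Temperature is Cold THEN Heater Power is High
mu_cold = 0.6                       # degree to which "Temperature is Cold" holds
mu_high = [0.0, 0.3, 0.7, 1.0]      # membership of "High" over heater power levels

# Clip the consequent by the antecedent's firing strength (min implication)
clipped = [min(mu_cold, m) for m in mu_high]
print(clipped)   # [0.0, 0.3, 0.6, 0.6]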
3.10 FUZZY REASONING
Fuzzy reasoning is a method of computing that deals with reasoning that is approximate
rather than fixed and exact. It's particularly useful in situations where traditional binary
true/false logic isn't adequate because of the presence of uncertainty or ambiguity.
1. Fuzzy Logic: Unlike classical logic, which operates on the principle of binary
true/false, fuzzy logic allows for degrees of truth. Instead of "true" or "false," statements
can be "partially true" or "mostly true" on a scale from 0 to 1.
2. Fuzzy Sets: Fuzzy logic uses fuzzy sets, which allow elements to have partial
membership in a set. For example, in classical set theory, an element is either in the set
or not. In fuzzy set theory, membership is a matter of degree.
3. Fuzzy Rules: If-then rules, as described above, that express relationships between fuzzy
input and output variables.
4. Fuzzy Inference Systems: These are systems that use fuzzy logic to represent and
process data. They consist of fuzzy input variables, fuzzy inference rules, a fuzzy
inference engine, and fuzzy output variables.
In essence, fuzzy reasoning provides a framework for dealing with problems that are too
complex or uncertain to be adequately addressed by traditional logic.
3.11 Neuro-Fuzzy Modeling- ANFIS
Its inference system corresponds to a set of fuzzy IF–THEN rules that have learning capability
to approximate nonlinear functions. Hence, ANFIS is considered to be a universal estimator.
To use ANFIS in a more efficient and optimal way, one can use the best parameters obtained
by a genetic algorithm. It has uses in intelligent situational-aware energy management
systems.
It is possible to identify two parts in the network structure, namely the premise and
consequence parts. In more detail, the architecture is composed of five layers. The first layer
takes the input values and determines the membership functions belonging to them; it is
commonly called the fuzzification layer. The membership degrees of each function are
computed using the premise parameter set, namely {a, b, c}. The second layer is responsible
for generating the firing strengths for the rules; due to its task, it is denoted the "rule layer".
The role of the third layer is to normalize the computed firing strengths, dividing each value
by the total firing strength. The fourth layer takes as input the normalized values and the
consequence parameter set {p, q, r}. The values returned by this layer are the defuzzified
ones, and those values are passed to the last layer to return the final output.
Fuzzification layer
The first layer of an ANFIS network describes the difference from a vanilla neural
network. Neural networks in general operate with a data pre-processing step, in which
the features are converted into normalized values between 0 and 1. An ANFIS neural network
doesn't need a sigmoid function; instead, it does its preprocessing step by converting numeric
values into fuzzy values.
Here is an example: suppose the network gets as input the distance between two points in
2D space. The distance is measured in pixels and can have values from 0 up to 500 pixels.
Converting the numerical values into fuzzy numbers is done with the membership function,
which consists of semantic descriptions like near, middle, and far. Each possible linguistic
value is handled by an individual neuron. The neuron "near" fires with a value from 0 to 1 if
the distance falls within the category "near", while the neuron "middle" fires if the distance
falls in that category. The input value "distance in pixels" is thus split into three different
neurons for near, middle, and far.
APPLICATIONS:
The Adaptive Neuro-Fuzzy Inference System (ANFIS) has found applications across
various domains due to its ability to model complex systems, handle uncertainties, and adapt
to changing environments. Here are some notable applications:
2. Financial Forecasting: ANFIS has been applied in financial time series analysis for
tasks such as stock price prediction, foreign exchange rate forecasting, and credit risk
assessment. Its ability to capture nonlinear relationships and adapt to changing market
conditions makes it valuable in financial modeling.
5. Power Systems: ANFIS is utilized in power systems engineering for load forecasting,
fault detection, and energy management. It can handle the nonlinear and uncertain
nature of power system dynamics and assist in optimizing system performance.
6. Environmental Modeling: ANFIS is applied in environmental modeling for tasks such
as air quality prediction, water quality assessment, and ecological modeling. It can
analyze complex environmental data and aid in decision-making for environmental
management and planning.
7. Process Optimization: ANFIS is used in industrial process optimization for tasks such
as parameter tuning, product quality control, and energy efficiency improvement. It can
learn from historical process data and suggest optimal operating conditions.
10. Agricultural Systems: ANFIS is applied in precision agriculture for tasks such as crop
yield prediction, pest detection, and irrigation scheduling. It can analyze agricultural
data and provide recommendations for optimal crop management practices.
These are just a few examples of the diverse range of applications where ANFIS has been
successfully employed. Its versatility and effectiveness make it a valuable tool in addressing
complex problems across various domains.
In neuro-fuzzy modeling, a hybrid learning algorithm combines the advantages of both neural
networks and fuzzy logic to create robust and adaptive models. Here's an overview of a typical
hybrid learning algorithm used in neuro-fuzzy modeling:
1. Initialization:
o Initialize the parameters of the neural network, such as weights and biases.
o Utilize the available data to identify the structure and parameters of the fuzzy
logic system.
o Determine the number of fuzzy rules, shape of membership functions, and rule
bases using techniques like clustering or expert knowledge.
3. Neural Network Training:
o The input to the neural network can be the outputs of the fuzzy logic system or
a combination of original inputs and fuzzy logic outputs.
o Combine the outputs of the fuzzy logic system and the neural network to
generate the final output.
o Refine the parameters of both the fuzzy logic system and the neural network
based on feedback from the model's performance.
o Evaluate the performance of the hybrid model on unseen data to ensure its
generalization capability.
7. Iterative Improvement:
By combining the strengths of fuzzy logic and neural networks, the hybrid learning algorithm
in neuro-fuzzy modeling can capture complex relationships in data while maintaining
interpretability and adaptability. This approach is particularly useful for problems with
nonlinearities, uncertainties, and incomplete information.
ANFIS architecture typically consists of five layers:
1. Input Layer: This layer receives the input variables/features of the system. Each node
in this layer represents one input variable.
2. Fuzzy Inference Layer: In this layer, each node computes the degree of membership
of the input variables to each fuzzy set. This layer performs fuzzification, converting
crisp inputs into fuzzy values based on predefined membership functions.
3. Rule Layer: Here, the firing strength of each rule is calculated by combining the
degrees of membership from the fuzzy inference layer. Each node represents a rule in
the fuzzy rule base.
4. Consequent Layer: This layer computes the output of each rule by multiplying the
firing strength of the rule (from the rule layer) with the consequent parameters (typically
linear coefficients).
5. Output Layer: The output layer aggregates the outputs of all rules to produce the final
output of the system.
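A minimal sketch of the five-layer forward pass for a first-order Sugeno ANFIS with two
inputs and two Gaussian membership functions per input (hence four rules); all parameter
values below are illustrative, and the premise {c, s} and consequent {p, q, r} sets would
normally be learned.

import numpy as np

def gaussian_mf(x, c, s):
    # Layer 1 (fuzzification): membership degree of x in a Gaussian fuzzy set
    return np.exp(-((x - c) ** 2) / (2.0 * s ** 2))

def anfis_forward(x1, x2, centers, widths, pqr):
    # Layer 1: membership degrees, two membership functions per input
    mu1 = [gaussian_mf(x1, centers[0][i], widths[0][i]) for i in range(2)]
    mu2 = [gaussian_mf(x2, centers[1][j], widths[1][j]) for j in range(2)]
    # Layer 2 (rule layer): firing strength of each of the 4 rules (product t-norm)
    w = np.array([mu1[i] * mu2[j] for i in range(2) for j in range(2)])
    # Layer 3: normalize the firing strengths
    w_norm = w / np.sum(w)
    # Layer 4 (consequent layer): first-order Sugeno rule outputs f = p*x1 + q*x2 + r
    f = np.array([p * x1 + q * x2 + r for (p, q, r) in pqr])
    # Layer 5 (output layer): weighted sum of the rule outputs
    return float(np.dot(w_norm, f))

# Illustrative parameters: premise sets {c, s} and consequent sets {p, q, r}
centers = [[0.0, 1.0], [0.0, 1.0]]
widths = [[0.5, 0.5], [0.5, 0.5]]
pqr = [(1.0, 1.0, 0.0), (0.5, -0.5, 0.2), (-1.0, 0.3, 0.1), (0.0, 0.0, 1.0)]
print(anfis_forward(0.3, 0.8, centers, widths, pqr))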
The parameters of the membership functions and the linear coefficients in ANFIS are adapted
through a hybrid learning algorithm, which typically involves a combination of gradient-based
methods and least squares optimization. This hybrid learning algorithm aims to minimize the
error between the actual and predicted outputs of the system.
2. Forward Pass: Compute the output of the ANFIS model for each input sample by
propagating the inputs through the network according to the defined architecture.
3. Error Calculation: Calculate the error between the predicted output and the actual
output for each sample.
6. Convergence: Repeat steps 2-5 iteratively until convergence criteria are met (e.g., a
predefined number of iterations or a sufficiently small error threshold).
ANFIS is widely used in various fields such as system identification, control systems, time-
series prediction, and modeling of complex nonlinear systems due to its ability to capture both
explicit and implicit knowledge from data.
o Fuzzy logic can be used to model the rules that govern equipment health based
on factors such as temperature, vibration, and usage time.
o Linguistic rules might include statements like "If vibration is high and
temperature is high, then the machine is likely to fail soon."
o Fuzzy membership functions define the degree to which input variables belong
to various fuzzy sets (e.g., "low," "medium," "high").
2. Neural Network:
o A neural network can learn complex patterns and correlations in sensor data that
may not be explicitly captured by fuzzy rules.
o The neural network can analyze historical sensor data to predict the remaining
useful life of machinery or the likelihood of failure.
3. Hybrid Integration:
o Combine the output of the fuzzy logic system and the neural network using
weighted averaging or other fusion techniques.
o During the training phase, the parameters of both the fuzzy logic system (e.g.,
membership functions, rule bases) and the neural network (e.g., weights, biases)
are adjusted using a hybrid optimization algorithm.
5. Validation and Testing:
o Evaluate the hybrid model's performance on unseen sensor data to assess its
ability to accurately predict equipment health or failure.
6. Iterative Improvement:
o This might involve adding more fuzzy rules, adjusting membership functions,
or changing the neural network architecture.
By combining fuzzy logic and neural networks in a hybrid learning model for predictive
maintenance, manufacturers can leverage the interpretability of fuzzy logic while harnessing
the predictive power of neural networks to optimize maintenance schedules, reduce downtime,
and extend equipment lifespan.
UNIT IV
Researchers now favour a bottom-up approach where simple rules are combined with
adaptive systems to allow complex behaviours to emerge. This contrasts with earlier attempts
to manually encode intelligence, such as in expert systems. Examples of this approach include
neural networks and evolutionary computation.
History:
In the 1950s and 1960s, several computer scientists independently developed various
evolutionary computation methods. In the 1960s, Rechenberg introduced evolution strategies,
optimizing real-valued parameters using a parent-child mutation model, further developed by
Schwefel. Evolutionary programming, developed by Fogel, Owens, and Walsh, evolved finite-
state machines through random mutations. Holland invented genetic algorithms (GAs) in the
1960s, focusing on bit strings and using mutation and crossover for variation. Unlike evolution
strategies and evolutionary programming, Holland aimed to study natural adaptation formally.
Over time, the distinctions between different evolutionary computation methods, such as
evolution strategies, evolutionary programming, and genetic algorithms, have blurred, with
researchers increasingly interacting and integrating these approaches. This review focuses
mainly on genetic algorithms, examining their applications in business, science, and
education, and discussing their relevance to evolutionary biology. The review does not cover
the extensive theoretical foundations, which are detailed in the Foundations of Genetic
Algorithms proceedings.
2. Features of EC
1. Commercial Applications of Evolutionary Algorithms
Evolutionary algorithms are used in various commercial and scientific applications to search
through numerous possibilities for good, if not optimal, solutions, a concept termed
"satisficing" by Simon. Traditional search methods, such as hill-climbing, work well for many
problems, but evolutionary algorithms are particularly useful when there are many parameters and
high complexity.
Key Applications:
Drug Design: Evolutionary algorithms help design drugs by predicting how well ligands bind
to enzymes, crucial for creating inhibitors for HIV protease enzymes. Natural Selection Inc.
provided software to Agouron Pharmaceuticals that combines ligand-protein interaction models
with evolutionary programming to explore ligand-protein configurations.
Supply-Chain Management: Companies like Volvo and Deere & Company use evolutionary
algorithms for scheduling production. For instance, I2 Technologies' scheduling program evolves
schedules for plant production, optimizing inventory and manufacturing processes.
Stock Market Prediction: Financial institutions like Citibank and Swiss Bank use evolutionary
algorithms to predict stock market prices. These programs evolve by backcasting, predicting
recent data to refine their accuracy.
Evolvable Hardware (EHW): Asahi Microsystems developed an EHW chip for cellular
phones that adjusts parameters to meet performance specifications, improving yield rates and
reducing circuit size and power consumption. The lab of T Higuchi at Tsukuba demonstrated this
technology, leading to the production of tunable chips.
These examples illustrate the growing use of evolutionary algorithms in practical, high-stakes
environments, ranging from pharmaceuticals to manufacturing and finance. Additional
applications and studies can be found in journals such as Evolutionary Computation and IEEE
Transactions on Evolutionary Computation.
2. Using Genetic Programming for Optimal Foraging Strategies
Their model considered four variables: insect abundance, lizard sprint velocity,
and the coordinates of the insect in the lizard's view. A strategy function
determined whether an insect should be chased based on these variables.
Simulations involved assigning values to these variables and allowing the lizard
to chase insects for a set time period.
In one experiment, the lizard's viewing area was divided into three regions with
different escape probabilities for insects. The optimal strategy involved ignoring
insects in one region, chasing those in another region without escape, and chasing
selectively in a third region based on escape probability and distance.
Classification of EC – Advantages – Applications
Genetic Algorithms (GA): Genetic Algorithms are the most well-known form of EC.
They evolve a population of solutions, each represented as a bit-string (the genotype),
with a fitness function measuring the fitness of the bit-string within the context of the
problem.
Advantages: GAs are versatile and can be applied to a wide range of problem
domains. They are capable of solving many problems competently.
Applications: GAs have been applied in healthcare for enhancing diagnosis
precision and treatment optimization, in finance for improving predictive accuracy
in market trends and portfolio management, and in supply chain for optimizing
routes, reducing costs and delivery times.
Genetic Programming (GP): This is a variant of GA that evolves computer
programs, typically represented as tree structures.
Advantages: GP is powerful for performing symbolic regression and feature
classification. It saves time by processing large amounts of data much more
quickly than humans can.
Applications: GP has been used in conjunction with other forms of machine
learning.
Evolution Strategies (ES): These are optimization algorithms that use mechanisms
inspired by biological evolution, such as mutation, recombination, and selection.
Advantages: ESs are successful global optimization methods. They can optimize
in restricted solution spaces.
Applications: ESs are mainly used for simulation-based optimization, i.e.,
computerized models that require parametric optimization.
Differential Evolution (DE): This is a method that optimizes a problem by iteratively
improving candidate solutions with regard to a given measure of quality.
Advantages: DE is simple and easy to implement. It has strong robustness, fewer control
parameters, and better search capabilities.
Applications: DE has been applied to tackle diverse problems across various fields
and real-world applications.
Evolutionary Programming (EP): This is similar to ES, but focuses more on the
evolution of program structures.
Advantages: EP places greater emphasis on its own development, thus it has the
advantages of simple description, flexible use, high efficiency, strong robustness,
and a limited number of conditions.
Applications: EP has been effective in solving problems with a variety of
characteristics, and within many application domains.
Permutation-based Evolutionary Algorithms: These are used for combinatorial
optimization problems where the solution can be represented as a sequence or
permutation of numbers.
Advantages: Permutation-based encoding is used by many evolutionary algorithms
dealing with combinatorial optimization problems.
Applications: Permutation-based Evolutionary Algorithms have been applied in many
fields such as engineering, design, medicine, robotics, science, etc.
Memetic Algorithms (MA): These are based on the concept of a meme, which is
considered as a unit of information that can be replicated and selected.
Advantages: MAs combine exploration with exploitation abilities provided by the local
search, and they reduce the premature convergence to local optima due to a better
exploration of the solutions space.
Applications: MAs have been applied to solve several versatile real-world
optimization tasks, ranging from robotics, wireless networks, power systems, job shop
scheduling, to classification and training of artificial neural networks.
Estimation of Distribution Algorithms (EDA): These algorithms replace traditional
genetic operators by building and sampling a probabilistic model of promising
solutions.
Advantages: EDAs are among the most successful paradigms of EAs. They draw
inspiration from both evolutionary computation and machine learning.
Applications: EDAs have been effective in solving problems with a variety of
characteristics, and within many application domains.
Particle Swarm Optimization (PSO): This is a computational method that optimizes
a problem by iteratively trying to improve a candidate solution with regard to a given
measure of quality.
Advantages: PSO is a flexible and easy-to-implement algorithm that requires
relatively little hyperparameter tuning. It is a versatile tool for numerous
optimization problems.
Applications: PSO has been applied in healthcare, environmental, industrial,
commercial, smart-city, and other general applications.
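The update rule that drives PSO can be stated in a few lines. The sketch below
shows the standard velocity-and-position update for one particle; w, c1 and c2 are
the usual inertia, cognitive and social coefficients, and all names are illustrative
rather than taken from a particular library.

#include <cstdlib>
#include <vector>

// One PSO step for a single particle: the new velocity blends
// the previous velocity, a pull toward the particle's own best
// position (pbest), and a pull toward the swarm's best position
// (gbest); the position then moves by the velocity.
void pso_update(std::vector<double>& x, std::vector<double>& v,
                const std::vector<double>& pbest,
                const std::vector<double>& gbest,
                double w, double c1, double c2)
{
    for (size_t d = 0; d < x.size(); d++)
    {
        double r1 = std::rand() / (RAND_MAX + 1.0);
        double r2 = std::rand() / (RAND_MAX + 1.0);
        v[d] = w * v[d]
             + c1 * r1 * (pbest[d] - x[d])   // cognitive pull
             + c2 * r2 * (gbest[d] - x[d]);  // social pull
        x[d] += v[d];
    }
}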
Interactive Evolutionary Algorithms: These involve human interaction, often with
the human acting as the fitness function.
Advantages: Interactive Evolutionary Algorithms provide insight and guidance
beyond simply selecting parents for breeding.
Applications: Applications of Interactive Evolutionary Algorithms (IEAs) range
from capturing aesthetics in art and design to the personalisation of artefacts such
as medical devices.
Many more evolutionary algorithms also exist, including Gene Expression Programming,
Differential Evolution, Learning Classifier Systems, and Neuroevolution. These
algorithms make use of operators such as mutation and recombination, sometimes
applying both together.
One of the uses of genetic algorithms is selecting the right combination of variables to
build a predictive model. Selecting the right subset of variables is essentially a
combinatorial optimization problem (a minimal code sketch of this idea appears below).
The advantage of genetic algorithms is that they allow the best solution to emerge from
the best of prior solutions; the selection improves over time.
The whole idea behind genetic algorithms is to combine the different solutions generation
after generation so as to extract the best genes or variables from each solution. This
helps create better-fitted individuals.
Genetic algorithms are also used for hyperparameter tuning, for finding the maximum or
minimum of a function, and for searching for a suitable neural network architecture
(neuroevolution). They are likewise used in feature selection.
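As a minimal sketch of the variable-selection idea mentioned above, a chromosome can
be treated as a bit mask over the candidate variables; model_score() below is a
hypothetical stand-in for training and validating a model on the selected columns, so
the placeholder scoring shown here is for illustration only.

#include <string>
#include <vector>

// Hypothetical stand-in: a real implementation would train and
// validate a model on the selected columns and return its score.
double model_score(const std::vector<int>& selected)
{
    return 1.0 / (1.0 + selected.size());   // placeholder: rewards parsimony
}

// Fitness of a feature-selection chromosome: decode the bit
// mask into column indices, then score the resulting subset.
double feature_mask_fitness(const std::string& chromosome)
{
    std::vector<int> selected;
    for (size_t i = 0; i < chromosome.size(); i++)
        if (chromosome[i] == '1')
            selected.push_back((int)i);     // variable i enters the model
    if (selected.empty())
        return 0.0;                         // an empty model scores nothing
    return model_score(selected);
}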
The idea of genetic algorithms (GAs) is to generate a few random possible solutions
representing different variables, and then to combine the best of these solutions in an
iterative process. The basic genetic algorithm operations are selection (picking the
fittest solutions in a generation), crossover (creating two new individuals based on the
genes of two parent solutions), and mutation (randomly changing a gene in an individual).
Improvements are made by stochastic variation of programs and by selection according to
predefined criteria for judging the quality of a solution. Programs in genetic
programming systems thus evolve to solve prescribed automatic programming and machine
learning problems.
In essence, genetic programming is a heuristic search technique, akin to what is
commonly called 'hill climbing': it searches for an optimal, or at least suitable,
program within the space of all programs.
This evolutionary algorithm paradigm was first used by Lawrence J. Fogel in 1960 in an
attempt to use simulated evolution as a learning process for creating artificial
intelligence. He used finite-state machines as predictors and evolved them. Today,
evolutionary programming is a broad evolutionary computing dialect with no fixed
structure or representation, and it is becoming increasingly difficult to differentiate
it from evolution strategies.
Evolution strategies are optimization techniques based on the ideas of evolution. They
use natural, problem-dependent representations and mainly employ mutation and selection
as search operators. The operators are applied in a loop, one iteration of which is
known as a generation, and the sequence of generations continues until a termination
criterion is met. While most evolutionary algorithms work at the genotype level,
evolution strategies work at a behavioral level: since the physical expression is coded
directly, an individual's genes are not mapped to its physical expression.
This approach is followed to obtain strong causality: a small change in the coding gives
rise to a small change in the individual, and a large change in the coding causes a
large change in the individual.
In the domain of computer vision, evolutionary computation techniques have been employed
to evolve neural network architectures and parameters for image recognition tasks. By iteratively
optimizing the network's structure and connection weights, evolutionary algorithms
contribute to the development of robust and efficient image recognition systems.
In the realm of robotics, evolutionary strategies have been harnessed to optimize the
locomotion and control mechanisms of robotic systems. By evolving the parameters and
behaviors of robotic agents through simulated evolutionary processes, researchers have achieved
advancements in adaptive and resilient robotic locomotion strategies.
Genetic Algorithms
Genetic Algorithms (GAs) are adaptive heuristic search algorithms belonging to the
larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of
natural selection and genetics. They intelligently exploit random search, guided by
historical data, to direct the search into regions of better performance in the solution
space. They are commonly used to generate high-quality solutions for optimization and
search problems.
Genetic algorithms simulate the process of natural selection: species that can adapt to
changes in their environment survive, reproduce, and pass on to the next generation. In
simple words, they simulate "survival of the fittest" among the individuals of
successive generations to solve a problem. Each generation consists of a population of
individuals, and each individual represents a point in the search space and a possible
solution. Each individual is represented as a string of characters/integers/floats/bits,
analogous to a chromosome.
Genetic algorithms are based on an analogy with the genetic structure and behavior of
chromosomes in a population. The foundation of GAs rests on this analogy:
Individuals who are successful (fittest) mate to create more offspring than others.
Genes from the "fittest" parents propagate throughout the generation; that is, sometimes
parents create offspring that are better than either parent.
Search space
The population of individuals is maintained within the search space. Each individual
represents a solution in the search space for the given problem and is coded as a
finite-length vector of components (analogous to a chromosome). The variable components
are analogous to genes; thus a chromosome (individual) is composed of several genes
(variable components).
Fitness Score
A fitness score is given to each individual and shows the ability of that individual to
"compete". Individuals with optimal (or near-optimal) fitness scores are sought.
The GA maintains a population of n individuals (chromosomes/solutions) along with their
fitness scores. Individuals with better fitness scores are given more chances to
reproduce than others: they are selected to mate and produce better offspring by
combining the chromosomes of both parents. Because the population size is static, room
has to be created for the new arrivals, so some individuals die and are replaced by
newcomers, eventually creating a new generation once all the mating opportunities of the
old population are exhausted. It is hoped that over successive generations better
solutions will arrive while the least fit die out.
Each new generation has, on average, more "good genes" than the individuals of previous
generations; thus each new generation has better "partial solutions" than previous
generations. Once the offspring produced show no significant difference from the
offspring of previous populations, the population has converged, and the algorithm is
said to have converged to a set of solutions for the problem.
Once the initial generation is created, the algorithm evolves the generation using the
following operators:
1) Selection Operator: The idea is to give preference to individuals with good fitness
scores and allow them to pass their genes on to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are
selected using the selection operator, and crossover sites are chosen randomly; the
genes at these crossover sites are then exchanged, creating completely new individuals
(offspring). For example, with a single crossover site after the third bit, parents
110|010 and 101|100 produce offspring 110100 and 101010.
3) Mutation Operator: The key idea is to insert random genes into offspring to maintain
diversity in the population and so avoid premature convergence. For example, flipping
one randomly chosen bit changes 110100 into 111100.
Given a target string, the goal is to produce the target string starting from a random
string of the same length. The following implementation uses these analogies:
Characters A-Z, a-z, 0-9, and other special symbols are considered genes.
The fitness score is the number of characters that differ from the target string at each
index, so individuals with lower fitness values are given more preference.
// C++ program to create a target string, starting from
// a random string, using a simple genetic algorithm
#include <bits/stdc++.h>
using namespace std;

// Number of individuals in each generation
#define POPULATION_SIZE 100

// Valid Genes
const string GENES = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP"
                     "QRSTUVWXYZ 1234567890, .-;:_!\"#%&/()=?@${[]}";

// Target string to be generated
const string TARGET = "I love GeeksforGeeks";

// Return a random integer in the range [start, end]
int random_num(int start, int end)
{
    int range = (end - start) + 1;
    int random_int = start + (rand() % range);
    return random_int;
}

// Return a random valid gene (character)
char mutated_genes()
{
    int len = GENES.size();
    int r = random_num(0, len - 1);
    return GENES[r];
}

// Create a random chromosome (string of genes)
string create_gnome()
{
    int len = TARGET.size();
    string gnome = "";
    for (int i = 0; i < len; i++)
        gnome += mutated_genes();
    return gnome;
}

// Class representing an individual in the population
class Individual
{
public:
    string chromosome;
    int fitness;
    Individual(string chromosome);
    Individual mate(Individual parent2);
    int cal_fitness();
};

Individual::Individual(string chromosome)
{
    this->chromosome = chromosome;
    fitness = cal_fitness();
}

// Perform mating and produce a new offspring
Individual Individual::mate(Individual par2)
{
    // chromosome for offspring
    string child_chromosome = "";
    int len = chromosome.size();
    for (int i = 0; i < len; i++)
    {
        // random probability in [0, 1] (floating-point division)
        float p = random_num(0, 100) / 100.0f;

        // if p < 0.45, take the gene from parent 1
        if (p < 0.45)
            child_chromosome += chromosome[i];

        // if 0.45 <= p < 0.90, take the gene from parent 2
        else if (p < 0.90)
            child_chromosome += par2.chromosome[i];

        // otherwise insert a random gene (mutate),
        // for maintaining diversity
        else
            child_chromosome += mutated_genes();
    }
    // create a new Individual (offspring) from the child chromosome
    return Individual(child_chromosome);
}

// Fitness score: the number of characters that differ
// from the target string (lower is better)
int Individual::cal_fitness()
{
    int len = TARGET.size();
    int fitness = 0;
    for (int i = 0; i < len; i++)
    {
        if (chromosome[i] != TARGET[i])
            fitness++;
    }
    return fitness;
}

// Order individuals by increasing fitness (best first)
bool operator<(const Individual &ind1, const Individual &ind2)
{
    return ind1.fitness < ind2.fitness;
}

int main()
{
    srand((unsigned)(time(0)));

    // current generation
    int generation = 0;

    vector<Individual> population;
    bool found = false;

    // create the initial population
    for (int i = 0; i < POPULATION_SIZE; i++)
        population.push_back(Individual(create_gnome()));

    while (!found)
    {
        // sort the population in increasing order of fitness score
        sort(population.begin(), population.end());

        // fitness 0 means the target string has been reached
        if (population[0].fitness <= 0)
        {
            found = true;
            break;
        }

        // otherwise generate offspring for a new generation
        vector<Individual> new_generation;

        // Elitism: 10% of the fittest population goes
        // directly to the next generation
        int s = (10 * POPULATION_SIZE) / 100;
        for (int i = 0; i < s; i++)
            new_generation.push_back(population[i]);

        // the remaining 90% are produced by mating parents
        // drawn from the fittest half of the population
        s = (90 * POPULATION_SIZE) / 100;
        for (int i = 0; i < s; i++)
        {
            Individual parent1 = population[random_num(0, 50)];
            Individual parent2 = population[random_num(0, 50)];
            new_generation.push_back(parent1.mate(parent2));
        }
        population = new_generation;

        cout << "Generation: " << generation << "\t";
        cout << "String: " << population[0].chromosome << "\t";
        cout << "Fitness: " << population[0].fitness << "\n";
        generation++;
    }
    cout << "Generation: " << generation << "\t";
    cout << "String: " << population[0].chromosome << "\t";
    cout << "Fitness: " << population[0].fitness << "\n";
}
Output:
Generation: 1 String: tO{"-?=jH[k8=B4]Oe@} Fitness: 18
Generation: 2 String: tO{"-?=jH[k8=B4]Oe@} Fitness: 18
Generation: 3 String: .#lRWf9k_Ifslw #O$k_ Fitness: 17
Generation: 4 String: .-1Rq?9mHqk3Wo]3rek_ Fitness: 16
Generation: 5 String: .-1Rq?9mHqk3Wo]3rek_ Fitness: 16
Generation: 6 String: A#ldW) #lIkslw cVek) Fitness: 14
Generation: 7 String: A#ldW) #lIkslw cVek) Fitness: 14
Generation: 8 String: (, o x _x%Rs=, 6Peek3 Fitness: 13
.
.
.
Generation: 29 String: I lope Geeks#o, Geeks Fitness: 3
Generation: 30 String: I loMe GeeksfoBGeeks Fitness: 2
Generation: 31 String: I love Geeksfo0Geeks Fitness: 1
Generation: 32 String: I love Geeksfo0Geeks Fitness: 1
Note: since the algorithm starts from random strings each time, the output may differ
between runs.
As the output shows, the algorithm sometimes gets stuck near a local optimum. This can
be improved by refining the fitness-score calculation or by tweaking the mutation and
crossover operators.
Why use genetic algorithms? They are robust: unlike traditional AI systems, they do not
break on a slight change in the input or in the presence of noise. Their applications
include recurrent neural networks and mutation testing.
A genetic operator is an operator used in genetic algorithms to guide the algorithm towards a
solution to a given problem. There are three main types of operators (mutation, crossover and
selection), which must work in conjunction with one another in order for the algorithm to be
successful. Genetic operators are used to create and maintain genetic diversity (mutation
operator), combine existing solutions (also known as chromosomes) into new solutions
(crossover) and select between solutions (selection). In his book discussing the use of genetic
programming for the optimization of complex problems, computer scientist John Koza has also
identified an 'inversion' or 'permutation' operator; however, the effectiveness of this operator
has never been conclusively demonstrated and this operator is rarely discussed.
Mutation (or mutation-like) operators are said to be unary operators, as they only operate on
one chromosome at a time. In contrast, crossover operators are said to be binary operators, as
they operate on two chromosomes at a time, combining two existing chromosomes into one
new chromosome.
Operators
Genetic variation is a necessity for the process of evolution. Genetic operators used in genetic
algorithms are analogous to those in the natural world: survival of the fittest, or selection;
reproduction (crossover, also called recombination); and mutation.
Mutation
The mutation operator encourages genetic diversity among solutions and attempts to
prevent the genetic algorithm from converging to a local minimum by stopping the
solutions from becoming too close to one another. In mutating the current pool of
solutions, a given solution may change entirely from the previous solution; thus a
genetic algorithm can reach an improved solution solely through the mutation
operator.[1] Again, different methods of mutation may be used; these range from simple
bit mutation (flipping random bits in a binary-string chromosome with some low
probability) to more complex methods that replace genes in the solution with random
values chosen from a uniform or Gaussian distribution. As with the crossover operator,
the mutation method is usually chosen to match the representation of the solution within
the chromosome.
Combining operators
While each operator, working individually, acts to improve the solutions produced by the
genetic algorithm, the operators must work in conjunction with one another for the
algorithm to find a good solution. Using the selection operator on its own will tend to
fill the solution population with copies of the best solution from the population. If
the selection and crossover operators are used without the mutation operator, the
algorithm will tend to converge to a local minimum, that is, a good but sub-optimal
solution to the problem. Using the mutation operator on its own leads to a random walk
through the search space. Only by using all three operators together does the genetic
algorithm become a noise-tolerant hill-climbing algorithm, yielding good solutions to
the problem.
During each generation, three basic genetic operators, namely selection, crossover, and
mutation, are applied sequentially to each individual with certain probabilities.
GAs are computer programs that simulate the heredity and evolution of living organisms.
Because they are multi-point search methods, an optimum solution is possible even for
multimodal objective functions, and GAs are also applicable to problems with discrete
search spaces. Thus a GA is not only very easy to use but also a very powerful
optimization tool. In a GA, the search space consists of strings, each of which
represents a candidate solution to the problem; these strings are termed chromosomes.
The objective function value of each chromosome is called its fitness value. A
population is a set of chromosomes along with their associated fitness values, and
generations are the populations produced in successive iterations of the GA. How a
genetic algorithm searches a space of candidate solutions to identify the best one is
shown in the diagram.
The GA searches for better solutions through genetic operations: selection, crossover,
and mutation.
A. Selection Operations
1) Roulette Wheel Selection: Parents are selected according to their fitness; the better
the chromosomes are, the more chances they have to be selected. Imagine a roulette wheel
on which all the chromosomes in the population are placed, each occupying a slice sized
in proportion to its fitness, as in Figure 2. A chromosome with higher fitness will thus
be selected more often. (A minimal code sketch of this scheme appears after this list.)
2) Rank Selection: The previous selection method runs into problems when the fitness
values differ greatly. For example, if the best chromosome's fitness covers 90% of the
roulette wheel, the other chromosomes have very few chances of being selected. Rank
selection first sorts the population by fitness, and then every chromosome receives a
fitness value from this ranking: the worst gets fitness 1, the second worst 2, and so
on, with the best receiving fitness N (the number of chromosomes in the population).
After this, all chromosomes have a chance to be selected, with selection probability
proportional to a chromosome's rank in the sorted list rather than to its raw fitness.
This method can, however, lead to slower convergence, because the best chromosomes no
longer differ much from the others.
3) Elitism Selection: When creating a new population by crossover and mutation, there is
a real chance of losing the best chromosome. Elitism is the method that first copies the
best chromosome (or a few of the best chromosomes) into the new population; the rest is
done in the classical way. Elitism can very rapidly increase the performance of a GA
because it prevents the loss of the best solution found so far.
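As promised above, here is a minimal sketch of roulette-wheel selection. It assumes
fitness values where higher is better; for a minimization problem such as the earlier
string-matching example, the scores would first have to be inverted. The function name
is illustrative.

#include <cstdlib>
#include <vector>

// Roulette-wheel selection: each individual occupies a slice
// of the wheel proportional to its fitness; a random "spin"
// picks the slice it lands on and returns that index.
int roulette_select(const std::vector<double>& fitness)
{
    double total = 0.0;
    for (double f : fitness)
        total += f;                  // circumference of the wheel
    double spin = total * (std::rand() / (RAND_MAX + 1.0));
    double running = 0.0;
    for (size_t i = 0; i < fitness.size(); i++)
    {
        running += fitness[i];       // advance past slice i
        if (spin < running)
            return (int)i;
    }
    return (int)fitness.size() - 1;  // guard against rounding error
}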
B. Crossover Operations
The generation of successors in a GA is determined by a set of operators that recombine and mutate
selected members of the current population. The two most common operators are crossover
and mutation. The crossover operator produces two new offspring from two parent strings, by
copying selected bits from each parent. The bit at position i in each offspring is copied from the bit
at position i in one of the two parents. The choice of which parent contributes the bit for position
i is determined by an additional string called the crossover mask. There are three types
of crossover operators, namely single-point, two-point, and uniform crossover.
3) Uniform Crossover: Uniform crossover combines bits sampled uniformly from the two
parents, as illustrated in Figure 3. In this case the crossover mask is generated as a
random bit string, with each bit chosen at random and independently of the others.
C. Mutation Operations
In addition to the recombination operators that produce offspring by combining parts of
two parents, a second type of operator produces offspring from a single parent. In
particular, the mutation operator produces small random changes to the bit string by
choosing a single bit at random and changing its value. Mutation is often performed
after crossover, as in Figure 3.
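To make the mask idea concrete, here is a minimal sketch of uniform crossover and
single-bit mutation on bit-string chromosomes stored as strings of '0'/'1' characters;
the function names are illustrative, and the parents are assumed to have equal length.

#include <cstdlib>
#include <string>
#include <utility>

// Uniform crossover: for each position, a random mask bit
// decides which parent contributes the gene to each offspring.
std::pair<std::string, std::string>
uniform_crossover(const std::string& p1, const std::string& p2)
{
    std::string c1 = p1, c2 = p2;
    for (size_t i = 0; i < p1.size(); i++)
    {
        if (std::rand() % 2)   // mask bit = 1: swap the contributions
        {
            c1[i] = p2[i];
            c2[i] = p1[i];
        }
    }
    return {c1, c2};
}

// Point mutation: flip the value of one randomly chosen bit.
void mutate(std::string& chrom)
{
    if (chrom.empty())
        return;
    size_t i = std::rand() % chrom.size();
    chrom[i] = (chrom[i] == '0') ? '1' : '0';
}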
Variants
In a “classical” genetic algorithm, the genes are encoded in a fixed order. The meaning of a single
gene is determined by its position inside the string. We have seen in the previous chapter that
a genetic algorithm is likely to converge well if the optimization task can be divided into several
short building blocks. What, however, happens if the coding is chosen such that couplings
occur between distant genes? Of course, one-point crossover tends to disadvantage long
schemata (even if they have low order) relative to short ones.
Since the genes can be identified uniquely with the help of the index, genes may be
swapped arbitrarily without changing the meaning of the string. With appropriate
genetic operations,
which also change the order of the pairs, the GA could possibly group coupled genes together
automatically.
Due to the free arrangement of genes and the variable length of the encoding, we can,
however, run into problems which do not occur in a simple GA. First of all, it can happen that
there are two entries in a string which correspond to the same index, but have conflicting alleles.
The most obvious way to overcome this "over-specification" is positional preference: the
first entry that refers to a gene is taken. Figure 4.2 shows an example.
The reader may have observed that the genes with indices 3 and 5 do not occur at all in
the example in Figure 4.2. This problem of "under-specification" is more complicated,
and its solution is not as obvious as for over-specification; of course, a lot of
variants are reasonable. One approach is to check all possible combinations and to take
the best one (for k missing genes, there are 2^k combinations). With the objective of
reducing this effort, the use of so-called templates has been suggested for finding
specifications for the k missing genes; this is nothing other than applying a local
hill-climbing method, with random initial values, to the k missing genes.
Depending on the actual problem, other selection schemes than the roulette wheel can
be useful:
Linear rank selection: In the beginning, the potentially good individuals sometimes fill
the population too quickly, which can lead to premature convergence to local maxima. On
the other hand, refinement in the end phase can be slow, since the individuals have
similar fitness values. These problems can be overcome by taking the rank of the fitness
values, rather than the values themselves, as the basis for selection.
Tournament selection: Closely related to the problems above, it can be better not to use
the fitness values themselves. In this scheme, a small group of individuals is sampled
from the population, and the individual with the best fitness is chosen for
reproduction. This selection scheme is also applicable when the fitness function is
given only in implicit form, i.e. when we only have a comparison relation that
determines which of two given individuals is better. (A minimal code sketch of this
scheme appears below.)
Moreover, there is one "plug-in" which is frequently used in conjunction with any of the
three selection schemes discussed so far: elitism. The idea is to prevent the observed
best-fitted individual from dying out, simply by selecting it for the next generation
without any random experiment. Elitism is widely used for speeding up convergence.
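Here is the promised minimal sketch of tournament selection, assuming fitness values
where higher is better; note that the loop only ever compares two individuals at a time,
which is why the scheme also works when fitness is given only implicitly. The function
name and the use of a plain fitness vector are illustrative.

#include <cstdlib>
#include <vector>

// Tournament selection: sample k individuals at random and
// return the index of the best one. Only pairwise comparisons
// are needed, so an explicit fitness value is not essential.
int tournament_select(const std::vector<double>& fitness, int k)
{
    int best = (int)(std::rand() % fitness.size());
    for (int j = 1; j < k; j++)
    {
        int challenger = (int)(std::rand() % fitness.size());
        if (fitness[challenger] > fitness[best])   // higher is better here
            best = challenger;
    }
    return best;
}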
Adaptive Genetic Algorithms
Adaptive genetic algorithms are GAs whose parameters, such as the population size, the
crossover probability, or the mutation probability, are varied while the GA is running
(e.g. see [8]). A simple variant could be the following: the mutation rate is changed
according to changes in the population; the longer the population fails to improve, the
higher the mutation rate is set. Vice versa, it is decreased again as soon as an
improvement of the population occurs.
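This adaptive-mutation idea might be sketched as follows; the multipliers and bounds are
illustrative assumptions, not prescribed values.

#include <algorithm>

// Raise the mutation rate while the population stagnates,
// lower it again once an improvement occurs.
double adapt_mutation_rate(double rate, int stagnant_generations)
{
    if (stagnant_generations > 0)
        return std::min(0.5, rate * 1.1);   // stagnation: mutate more
    return std::max(0.001, rate * 0.9);     // improvement: mutate less
}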
As they use the fitness function only in the selection step, genetic algorithms are
blind optimizers which do not use any auxiliary information, such as derivatives or
other specific knowledge about the structure of the objective function. If such
knowledge is available, however, it is unwise and inefficient not to make use of it.
Several investigations have shown that a lot of synergy lies in the combination of
genetic algorithms and conventional methods. The basic idea is to divide the
optimization task into two complementary parts: the coarse, global optimization is done
by the GA, while local refinement is done by the conventional method (e.g.
gradient-based, hill climbing, greedy algorithms, simulated annealing, etc.). A number
of variants are reasonable:
1. The GA performs the coarse global search first; the conventional method is then
applied to refine the best solution(s) found by the GA.
2. The local method is integrated into the GA. For instance, every K generations the
population is doped with a locally optimal individual.
3. Both methods run in parallel: all individuals are continuously used as initial values
for the local method, and the locally optimized individuals are re-implanted into the
current generation.
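Variant 2 might be sketched generically as follows; Solution and local_search are
hypothetical placeholders for a problem-specific representation and hill climber.

#include <vector>

// Every K generations, the current best individual is refined
// by a problem-specific local search and re-inserted ("doped")
// into the population.
template <typename Solution, typename LocalSearch>
void dope_population(std::vector<Solution>& population,
                     int generation, int K,
                     LocalSearch local_search)
{
    if (K > 0 && generation % K == 0 && !population.empty())
        population[0] = local_search(population[0]);
}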
As already mentioned, the reproduction methods and the representations of the genetic
material were adapted through the billions of years of evolution [25]. Many of these adaptations
were able to increase the speed of adaptation of the individuals. We have seen several times that
the choice of the coding method and the genetic operators is crucial for the convergence of a GA.
Therefore, it is promising not to encode only the raw genetic information, but also some
additional information, for example, parameters of the coding function or of the genetic
operators. If this is done properly, the GA could automatically find its own optimal way
of representing and manipulating data.
Genetic algorithms (GAs) are search heuristics inspired by the process of natural selection.
They are used to solve optimization and search problems by mimicking the process of
biological evolution. Here are some notable applications of genetic algorithms:
1. Optimization Problems
Function Optimization: GAs are used to find the maximum or minimum of complex functions
in engineering, economics, and operations research.
Parameter Tuning: They help in tuning hyperparameters for machine learning models to enhance
performance.
2. Engineering Design
Structural Design: Used in designing structures such as bridges and aircraft to optimize for
weight, strength, and cost.
Control Systems: Optimizing control parameters for systems in robotics, aerospace, and
manufacturing.
3. Machine Learning
Feature Selection: Identifying the most relevant features in large datasets to improve model
accuracy and reduce computation time.
Neural Network Training: Optimizing the architecture and weights of neural networks.
4. Scheduling
Job-Shop and Timetable Scheduling: Sequencing jobs and allocating resources to minimize
completion times and conflicts.
5. Gaming
Strategy Development: Developing strategies for board games, video games, and
simulations.
Evolving Agents: Creating AI agents that can learn and adapt to new environments.
6. Biotechnology
DNA Sequencing: Aligning DNA sequences and finding motifs in bioinformatics.
Drug Discovery: Searching large molecular design spaces for promising candidate
compounds.
7. Robotics
Path Planning: Determining the optimal path for robots in dynamic environments.
Evolutionary Robotics: Designing robot structures and behaviors through evolutionary
processes.
8. Finance
Portfolio Management: Improving predictive accuracy for market trends and optimizing
portfolios.
9. Telecommunications
Network Design: Optimizing the layout and operation of communication networks.
Routing: Finding efficient routes for traffic through a network.
10. Environmental Science
Resource Management: Optimizing the use of natural resources like water and forests.
Pollution Control: Supporting monitoring and mitigation planning.
11. Arts and Music
Music Composition: Composing music using evolutionary algorithms to innovate new styles
and forms.
Advantages of Genetic Algorithms
Global Search Capability: They can avoid local optima and find global solutions.
Adaptability: GAs can adapt as the problem or environment changes.
Parallelism: They are naturally parallel and can be implemented on parallel hardware for
faster computations.
Limitations of Genetic Algorithms
Computational Cost: GAs can be computationally intensive, especially for large problems.
Parameter Sensitivity: The performance of GAs can be sensitive to the choice of
parameters (population size, crossover and mutation rates), which must be properly
controlled.
Support Vector Machine (SVM) is a supervised Machine Learning (ML) algorithm. There are
plenty of algorithms in ML, but SVM still receives special attention because of its
robustness in dealing with data. This section covers the essentials needed to apply SVM
to any kind of data.
Goal: create the best decision boundary (the hyperplane) that can segregate
n-dimensional space into classes, so that new data points can easily be placed in the
correct category.
SVM also works well as an out-of-the-box classifier.