AI&ML BM4251 Unit 1-5 Notes

UNIT – I

INTRODUCTION TO MACHINE LEARNING

1.1 Machine Learning
Machine Learning is a branch of Artificial Intelligence that allows machines to learn
and improve from experience automatically. It is defined as the field of study that gives
computers the capability to learn without being explicitly programmed. It is quite different
from traditional programming.

Machine learning is important because it gives enterprises a view of trends in
customer behavior and business operational patterns, as well as supports the development
of new products. Many of today's leading companies, such as Facebook, Google and Uber,
make machine learning a central part of their operations.
The lifecycle of a Machine Learning program is straightforward and can be summarized in the
following points:
1. Define a question
2. Collect data
3. Visualize data
4. Train the algorithm
5. Test the algorithm
6. Collect feedback
7. Refine the algorithm
8. Loop steps 4-7 until the results are satisfactory

To solve a problem on a computer, we need an algorithm. An algorithm is a sequence
of instructions that should be carried out to transform the input to output.
For example, one can devise an algorithm for sorting. The input is a set of numbers and the
output is their ordered list. For the same task, there may be various algorithms and we may be
interested in finding the most efficient one, requiring the least number of
instructions or memory or both.
For some tasks, however, we do not have an algorithm—for example, to tell spam
emails from legitimate emails. We know what the input is: an email document that in the
simplest case is a file of characters. We know what the output should be: a yes/no output
indicating whether the message is spam or not. We do not know how to transform the input
to the output. What can be considered spam changes in time and from individual to individual.
What we lack in knowledge, we make up for in data. We can easily compile thousands
of example messages, some of which we know to be spam, and what we want is to “learn” what
constitutes spam from them. In other words, we would like the computer (machine) to extract
automatically the algorithm for this task. There is no need to learn to sort
numbers, we already have algorithms for that; but there are many applications for which we do
not have an algorithm but do have example data. With advances in computer technology, we
currently have the ability to store and process large amounts of data, as well as to access it from
physically distant locations over a computer network. Most data acquisition devices are digital
now and record reliable data.

Think, for example, of a supermarket chain that has hundreds of stores all over a country
selling thousands of goods to millions of customers. The point of sale terminals record the
details of each transaction: date, customer identification code, goods bought and their amount,
total money spent, and so forth. This typically amounts to gigabytes of data every day. What
the supermarket chain wants is to be able to predict who are the likely customers for a product.
Again, the algorithm for this is not evident; it changes in time
and by geographic location. The stored data becomes useful only when it is analyzed and turned
into information that we can make use of, for example, to make predictions.

We do not know exactly which people are likely to buy this ice cream flavor, or the
next book of this author, or see this new movie, or visit this city, or click this link. If we knew,
we would not need any analysis of the data; we would just go ahead and write down the code.
But because we do not, we can only collect data and hope to extract the answers to these and
similar questions from data.

Definition of Machine learning


A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks T, as measured by P, improves
with experience E.

Examples

i) Handwriting recognition learning problem


• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications

ii) A robot driving learning problem


• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training experience E: A sequence of images and steering commands recorded while
observing a human driver

iii) A chess learning problem


• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself
1.2 Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four
components, namely, data storage, abstraction, generalization and evaluation. Figure 1.1
illustrates the various components and the steps involved in the learning process.
Figure 1.1: Components of the learning process (Data → Concepts → Inferences)


1. Data storage
Facilities for storing and retrieving huge amounts of data are an important component
of the learning process. Humans and computers alike utilize data storage as a foundation for
advanced reasoning.
• In a human being, the data is stored in the brain and data is retrieved using
electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar
devices to store data and use cables and other technology to retrieve data.

2. Abstraction
The second component of the learning process is known as abstraction. Abstraction is
the process of extracting knowledge about stored data. This involves creating general concepts
about the data as a whole. The creation of knowledge involves application of known models
and creation of new models. The process of fitting a model to a dataset is known as training.
When the model has been trained, the data is transformed into an abstract form that summarizes
the original information.

3. Generalization
The third component of the learning process is known as generalisation. The term
generalization describes the process of turning the knowledge about stored data into a form
that can be utilized for future action. These actions are to be carried out on tasks that are similar,
but not identical, to those that have been seen before. In generalization, the goal is to discover
those properties of the data that will be most relevant to future tasks.

4. Evaluation
Evaluation is the last component of the learning process. It is the process of giving
feedback to the user to measure the utility of the learned knowledge. This feedback is then
utilised to effect improvements in the whole learning process.
1.3 TYPES OF MACHINE LEARNING
In general, machine learning algorithms can be classified into three types.

• Supervised learning
• Unsupervised learning
• Reinforcement learning

Supervised learning


A training set of examples with the correct responses (targets) is provided and, based
on this training set, the algorithm generalises to respond correctly to all possible inputs. This
is also called learning from exemplars. Supervised learning is the machine learning task of
learning a function that maps an input to an output based on example input-output pairs.

In supervised learning, each example in the training set is a pair consisting of an input
object (typically a vector) and an output value. A supervised learning algorithm analyzes the
training data and produces a function, which can be used for mapping new examples. In the
optimal case, the function will correctly determine the class labels for unseen instances. Both
classification and regression problems are supervised learning problems. A wide range of
supervised learning algorithms are available, each with its strengths and weaknesses. There is
no single learning algorithm that works best on all supervised learning problems.

Remarks
Supervised learning is so called because the process of an algorithm learning from
the training dataset can be thought of as a teacher supervising the learning process. We know
the correct answers (that is, the correct outputs); the algorithm iteratively makes predictions on
the training data and is corrected by the teacher. Learning stops when the algorithm achieves
an acceptable level of performance.
Example
Consider the following data regarding patients entering a clinic. The data consists of the gender
and age of the patients and each patient is labeled as “healthy” or “sick”.

Gender   Age   Label
M        48    Sick
M        67    Sick
F        53    Healthy
M        49    Healthy
F        34    Sick
M        21    Healthy
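
To make the idea concrete, here is a minimal sketch (not part of the original notes) of
training a classifier on the clinic data above, assuming scikit-learn is available; the numeric
encoding of gender as M=0, F=1 is our own choice for illustration.

# Supervised learning sketch on the labeled clinic data (assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier

X = [[0, 48], [0, 67], [1, 53], [0, 49], [1, 34], [0, 21]]     # [gender, age]
y = ["Sick", "Sick", "Healthy", "Healthy", "Sick", "Healthy"]  # known labels (targets)

clf = DecisionTreeClassifier()
clf.fit(X, y)                   # the "teacher-supervised" training step

print(clf.predict([[1, 60]]))   # predicted label for a new 60-year-old female patient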

Unsupervised learning
Correct responses are not provided, but instead the algorithm tries to identify
similarities between the inputs so that inputs that have something in common are categorised
together. The statistical approach to unsupervised learning is known as density estimation.

Unsupervised learning is a type of machine learning algorithm used to draw inferences


from datasets consisting of input data without labeled responses. In unsupervised learning
algorithms, a classification or categorization is not included in the observations. There are no
output values and so there is no estimation of functions. Since the examples given to the learner
are unlabeled, the accuracy of the structure that is output by the algorithm cannot be evaluated.
The most common unsupervised learning method is cluster analysis, which is used for
exploratory data analysis to find hidden patterns or grouping in data.

Gender   Age
M        48
M        67
F        53
M        49
F        34
M        21

Based on this data, can we infer anything regarding the patients entering the clinic?
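
As an illustration (assuming scikit-learn, and a guessed cluster count of 2), the unlabeled
records could be grouped with k-means, an instance of the cluster analysis mentioned above:

# Unsupervised learning sketch: cluster the unlabeled clinic records.
from sklearn.cluster import KMeans

X = [[0, 48], [0, 67], [1, 53], [0, 49], [1, 34], [0, 21]]  # [gender, age], no labels

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # which cluster each patient was assigned to
print(km.cluster_centers_)  # the discovered group "prototypes"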

Reinforcement learning
This is somewhere between supervised and unsupervised learning. The algorithm gets
told when the answer is wrong, but does not get told how to correct it. It has to explore and try
out different possibilities until it works out how to get the answer right. Reinforcement learning
is sometimes called learning with a critic because of this monitor that scores the answer, but
does not suggest improvements.
Reinforcement learning is the problem of getting an agent to act in the world so as to
maximize its rewards. A learner (the program) is not told what actions to take as in most forms
of machine learning, but instead must discover which actions yield the most reward by trying
them. In the most interesting and challenging cases, actions may affect not only the immediate
reward but also the next situations and, through that, all subsequent rewards.

Example
Consider teaching a dog a new trick: we cannot tell it what to do, but we can
reward/punish it if it does the right/wrong thing. It has to find out what it did that made it get
the reward/punishment. We can use a similar method to train computers to do many tasks, such
as playing backgammon or chess, scheduling jobs, and controlling robot limbs. Reinforcement
learning is different from supervised learning. Supervised learning is learning from examples
provided by a knowledgeable expert.

1.4 EXAMPLES OF MACHINE LEARNING

Learning Associations

In the case of retail—for example, a supermarket chain—one application of machine
learning is basket analysis, which is finding associations between products bought by
customers: If people who buy X typically also buy Y, and if there is a customer who buys X
and does not buy Y, he or she is a potential Y customer. Once we find such customers, we can
target them for cross-selling.

In finding an association rule, we are interested in learning a conditional probability
of the form P(Y|X), where Y is the product we would like to condition on X,
which is the product or the set of products which we know that the customer has already
purchased. Let us say, going over our data, we calculate that P(chips|beer) = 0.7. Then, we can
define the rule:

70 percent of customers who buy beer also buy chips.

We may want to make a distinction among customers and toward this, estimate P(Y|X, D)
where D is the set of customer attributes, for example, gender, age, marital status, and
so on, assuming that we have access to this information. If this is a bookseller instead of a
supermarket, products can be books or authors. In the case of a Web portal, items correspond
to links to Web pages, and we can estimate the links a user is likely to click and use this
information to download such pages in advance for faster access.
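
A toy sketch (with made-up basket data, in plain Python) of how such a conditional
probability could be estimated from transactions:

# Estimate the association-rule probability P(chips | beer) from baskets.
baskets = [
    {"beer", "chips"}, {"beer", "chips", "milk"},
    {"beer", "milk"}, {"beer", "chips"}, {"milk", "bread"},
]

buys_beer = [b for b in baskets if "beer" in b]
p_chips_given_beer = sum("chips" in b for b in buys_beer) / len(buys_beer)
print(p_chips_given_beer)  # 0.75 on this toy data; 0.7 in the text's example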

Classification
A credit is an amount of money loaned by a financial institution, for example, a bank,
to be paid back with interest, generally in installments. It is important for the bank to be able to
predict in advance the risk associated with a loan, which is the probability that the customer
will default and not pay the whole amount back. This is both to make sure that the bank will
make a profit and also to not inconvenience a customer with a loan over his or her financial
capacity.

In credit scoring (Hand 1998), the bank calculates the risk given the amount of credit
and the information about the customer. The information about the customer includes data we
have access to and is relevant in calculating his or her financial capacity—namely, income,
savings, collaterals, profession, age, past financial history, and so forth. The bank has a record
of past loans containing such customer data and whether the loan was paid back or not. From
this data of particular applications, the aim is to infer a general rule coding the association
between a customer’s
attributes and his risk. That is, the machine learning system fits a model to the past data to be
able to calculate the risk for a new application and then decides to accept or refuse it
accordingly.

This is an example of a classification problem where there are two classes: low-risk
and high-risk customers. The information about a customer makes up the
input to the classifier whose task is to assign the input to one of the two classes.
After training with the past data, a classification rule learned may be
of the form

IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

for suitable values of θ1 and θ2 (see figure 1.1). This is an example of a discriminant;
it is a function that separates the examples of different classes. Having a rule like
this, the main application is prediction: once we have a rule that fits the past data, if
the future is similar to the past, then we can make correct predictions for novel instances. Given
a new application with a certain income and savings, we can easily decide whether it is
low-risk or high-risk. In some cases, instead of making a 0/1 (low-risk/high-risk) type decision,
we may want to calculate a probability, namely, P(Y|X), where X are the customer attributes
and Y is 0 or 1 for low-risk and high-risk respectively.
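
As a sketch, the learned discriminant above can be written directly as code; the threshold
values below are hypothetical stand-ins for whatever training on past loan data would produce:

# The IF-THEN discriminant as a function (thresholds are illustrative assumptions).
theta1, theta2 = 30_000, 10_000   # assumed income and savings thresholds

def classify(income, savings):
    return "low-risk" if income > theta1 and savings > theta2 else "high-risk"

print(classify(income=45_000, savings=15_000))  # low-risk
print(classify(income=25_000, savings=2_000))   # high-risk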

1.5 Applications of machine learning


Application of machine learning methods to large databases is called data mining. In data
mining, a large volume of data is processed to construct a simple model with valuable use, for
example, having high predictive accuracy.

The following is a list of some of the typical applications of machine learning.
1. In retail business, machine learning is used to study consumer behaviour.
2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.
3. In manufacturing, learning models are used for optimization, control, and troubleshooting.
4. In medicine, learning programs are used for medical diagnosis.
5. In telecommunications, call patterns are analyzed for network optimization and maximizing
the quality of service.
6. In science, large amounts of data in physics, astronomy, and biology can only be analyzed
fast enough by computers. The World Wide Web is huge; it is constantly growing and searching
for relevant information cannot be done manually.
7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.
8. It is used to find solutions to many problems in vision, speech recognition, and robotics.
9. Machine learning methods are applied in the design of computer-controlled vehicles to steer
correctly when driving on a variety of roads.
10. Machine learning methods have been used to develop programmes for playing games such
as chess, backgammon and Go.
1.6 Linear Models for Regression

Linear Regression

Linear regression is a type of supervised machine learning algorithm that computes the linear
relationship between the dependent variable and one or more independent features by fitting
a linear equation to observed data.
When there is only one independent feature, it is known as Simple Linear Regression, and
when there is more than one feature, it is known as Multiple Linear Regression.
Similarly, when there is only one dependent variable, it is considered Univariate Linear
Regression, while when there is more than one dependent variable, it is known
as Multivariate Regression.

Why Linear Regression is Important


The interpretability of linear regression is a notable strength. The model’s equation
provides clear coefficients that elucidate the impact of each independent variable on the
dependent variable, facilitating a deeper understanding of the underlying dynamics. Its
simplicity is a virtue, as linear regression is transparent, easy to implement, and serves as a
foundational concept for more complex algorithms.
Linear regression is not merely a predictive tool; it forms the basis for various
advanced models. Techniques like regularization and support vector machines draw
inspiration from linear regression, expanding its utility. Additionally, linear regression is a
cornerstone in assumption testing, enabling researchers to validate key assumptions about
the data.

Types of Linear Regression
There are two main types of linear regression:

Simple Linear Regression


This is the simplest form of linear regression, and it involves only one independent variable
and one dependent variable. The equation for simple linear regression is:
Y = β0 + β1X
where:
 Y is the dependent variable
 X is the independent variable
 β0 is the intercept
 β1 is the slope
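
A minimal fitting sketch with scikit-learn (assumed available; the data points are invented
for illustration):

# Fit Y = beta0 + beta1*X and recover the estimated intercept and slope.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])    # one independent variable
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])   # dependent variable

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)       # estimates of beta0 and beta1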

1.6.1 Multiple Linear Regression
This involves more than one independent variable and one dependent variable. The equation
for multiple linear regression is:
Y = β0 + β1X1 + β2X2 + … + βnXn
where:
 Y is the dependent variable
 X1, X2, …, Xn are the independent variables
 β0 is the intercept
 β1, β2, …, βn are the slopes

The goal of the algorithm is to find the best Fit Line equation that can predict the values
based on the independent variables.
In regression, a set of records with X and Y values is available, and these values are used to
learn a function that maps X to Y; this learned function can then be used to predict Y from an
unknown X. In regression, the function must predict a continuous value of Y given the
independent features X.

Best Fit Line


Our primary objective while using linear regression is to locate the best-fit line, which
implies that the error between the predicted and actual values should be kept to a minimum.
There will be the least error in the best-fit line.
The best Fit Line equation provides a straight line that represents the relationship between
the dependent and independent variables. The slope of the line indicates how much the
dependent variable changes for a unit change in the independent variable(s).

Linear Regression

Here Y is called a dependent or target variable and X is called an independent variable also
known as the predictor of Y. There are many types of functions or modules that can be used
for regression. A linear function is the simplest type of function. Here, X may be a single
feature or multiple features representing the problem.
Linear regression performs the task of predicting a dependent variable value (y) based on a given
independent variable (x). Hence, the name is Linear Regression. In the figure above, X
(input) is the work experience and Y (output) is the salary of a person. The regression line is
the best-fit line for our model.
We utilize the cost function to compute the best values in order to get the best fit line since
different values for weights or the coefficient of lines result in different regression lines.
Hypothesis function in Linear Regression
As we assumed earlier, our independent feature is the experience, i.e. X, and the
respective salary Y is the dependent variable. Let us assume there is a linear relationship
between X and Y; then the salary can be predicted using:
Ŷ = θ1 + θ2X
OR
ŷᵢ = θ1 + θ2xᵢ
Here,
 yᵢ ∈ Y (i = 1, 2, ⋯, n) are labels to data (supervised learning)
 xᵢ ∈ X (i = 1, 2, ⋯, n) are the input independent training data (univariate: one input
variable (parameter))
 ŷᵢ ∈ Ŷ (i = 1, 2, ⋯, n) are the predicted values.
The model gets the best regression fit line by finding the best θ1 and θ2 values.
 θ1: intercept
 θ2: coefficient of x

AI&ML 10 R.THANIGAIVEL
Once we find the best θ1 and θ2 values, we get the best-fit line. So when we are finally using
our model for prediction, it will predict the value of y for the input value of x.

How to update θ1 and θ2 values to get the best-fit line?


To achieve the best-fit regression line, the model aims to predict the target
value Ŷ such that the error difference between the predicted value Ŷ and the true
value Y is minimum. So, it is very important to update the θ1 and θ2 values, to reach the best
values that minimize the error between the predicted y value (ŷ) and the true y value (y).

minimize over θ1, θ2:  J(θ1, θ2) = (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)²

Cost function for Linear Regression

The cost function or the loss function is nothing but the error or difference between
the predicted value Ŷ and the true value Y.
In Linear Regression, the Mean Squared Error (MSE) cost function is employed, which
calculates the average of the squared errors between the predicted values ŷᵢ and the actual
values yᵢ. The purpose is to determine the optimal values for the intercept θ1 and the
coefficient of the input feature θ2 providing the best-fit line for the given data points. The
linear equation expressing this relationship is ŷᵢ = θ1 + θ2xᵢ.

MSE function can be calculated as:


Cost function: J(θ1, θ2) = (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)²

Utilizing the MSE function, the iterative process of gradient descent is applied to
update the values of θ1 and θ2. This ensures that the MSE value converges to the global
minima, signifying the most accurate fit of the linear regression line to the dataset.
This process involves continuously adjusting the parameters θ1 and θ2
based on the gradients calculated from the MSE. The final result is a linear regression line
that minimizes the overall squared differences between the predicted and actual values,
providing an optimal representation of the underlying relationship in the data.

Gradient Descent for Linear Regression

A linear regression model can be trained using the optimization algorithm gradient
descent by iteratively modifying the model’s parameters to reduce the mean squared error
(MSE) of the model on a training dataset. To update θ1 and θ2 values in order to reduce the
Cost function (minimizing RMSE value) and achieve the best-fit line the model uses Gradient
Descent. The idea is to start with random θ1 and θ2 values and then iteratively update the
values, reaching minimum cost.
A gradient is nothing but a derivative that defines the effects on outputs of the function with
a little bit of variation in inputs.
Let us differentiate the cost function J with respect to θ1 and θ2:

∂J/∂θ1 = (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)
∂J/∂θ2 = (2/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ) xᵢ

Finding the coefficients of a linear equation that best fits the training data is the objective of
linear regression. By moving in the direction of the negative gradient of the Mean Squared
Error with respect to the coefficients, the coefficients can be updated. If α is the learning rate,
the respective intercept and coefficient of X are updated as:

θ1 := θ1 − α (∂J/∂θ1)
θ2 := θ2 − α (∂J/∂θ2)
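
A from-scratch sketch of this procedure in NumPy, using the gradients above; the data and
the learning rate are illustrative assumptions:

# Gradient descent for yhat = theta1 + theta2*x, minimizing the MSE cost J.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

theta1, theta2 = 0.0, 0.0    # start from arbitrary values
alpha, n = 0.01, len(x)      # learning rate and number of samples

for _ in range(5000):
    y_hat = theta1 + theta2 * x
    d_theta1 = (2 / n) * np.sum(y_hat - y)        # dJ/dtheta1
    d_theta2 = (2 / n) * np.sum((y_hat - y) * x)  # dJ/dtheta2
    theta1 -= alpha * d_theta1
    theta2 -= alpha * d_theta2

print(theta1, theta2)        # approaches the least-squares solution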

Assumptions of Simple Linear Regression


Linear regression is a powerful tool for understanding and predicting the behavior of a
variable; however, it needs to meet a few conditions in order to produce accurate and
dependable results.
1. Linearity: The independent and dependent variables have a linear relationship with one
another. This implies that changes in the dependent variable follow those in the
independent variable(s) in a linear fashion. This means that there should be a straight line
that can be drawn through the data points. If the relationship is not linear, then linear
regression will not be an accurate model.

2. Independence: The observations in the dataset are independent of each other. This means
that the value of the dependent variable for one observation does not depend on the value
of the dependent variable for another observation. If the observations are not independent,
then linear regression will not be an accurate model.
3. Homoscedasticity: Across all levels of the independent variable(s), the variance of the
errors is constant. This indicates that the amount of the independent variable(s) has no
impact on the variance of the errors. If the variance of the residuals is not constant, then
linear regression will not be an accurate model.

Homoscedasticity in Linear Regression

4. Normality: The residuals should be normally distributed. This means that the residuals
should follow a bell-shaped curve. If the residuals are not normally distributed, then linear
regression will not be an accurate model.
Assumptions of Multiple Linear Regression
For Multiple Linear Regression, all four of the assumptions from Simple Linear Regression
apply. In addition to these, below are a few more:
1. No multicollinearity: There is no high correlation between the independent variables.
This indicates that there is little or no correlation between the independent variables.
Multicollinearity occurs when two or more independent variables are highly correlated
with each other, which can make it difficult to determine the individual effect of each
variable on the dependent variable. If there is multicollinearity, then multiple linear
regression will not be an accurate model.
2. Additivity: The model assumes that the effect of changes in a predictor variable on the
response variable is consistent regardless of the values of the other variables. This
assumption implies that there is no interaction between variables in their effects on the
dependent variable.
3. Feature Selection: In multiple linear regression, it is essential to carefully select the
independent variables that will be included in the model. Including irrelevant or redundant
variables may lead to overfitting and complicate the interpretation of the model.
4. Overfitting: Overfitting occurs when the model fits the training data too closely,
capturing noise or random fluctuations that do not represent the true underlying
relationship between variables. This can lead to poor generalization performance on new,
unseen data.
Multicollinearity
Multicollinearity is a statistical phenomenon that occurs when two or more independent
variables in a multiple regression model are highly correlated, making it difficult to assess
the individual effects of each variable on the dependent variable.
Detecting multicollinearity commonly involves two techniques (a code sketch of the VIF
approach follows the list):
 Correlation Matrix: Examining the correlation matrix among the independent variables
is a common way to detect multicollinearity. High correlations (close to 1 or -1) indicate
potential multicollinearity.
 VIF (Variance Inflation Factor): VIF is a measure that quantifies how much the variance
of an estimated regression coefficient increases if your predictors are correlated. A high
VIF (typically above 10) suggests multicollinearity.
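
A hand-rolled sketch of the VIF idea: regress each predictor on the others and take
1 / (1 − R²). The data matrix is invented so that x2 is nearly collinear with x1:

# Compute VIF for each predictor by hand (assumes NumPy and scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)            # all predictors except x_j
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"VIF for x{j + 1}: {1.0 / (1.0 - r2):.1f}")  # above 10 flags trouble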
Evaluation Metrics for Linear Regression
A variety of evaluation measures can be used to determine the strength of any linear
regression model. These assessment metrics often give an indication of how well the model
is producing the observed outputs.
The most common measurements are:
Mean Square Error (MSE)
Mean Squared Error (MSE) is an evaluation metric that calculates the average of the squared
differences between the actual and predicted values for all the data points. The difference is
squared to ensure that negative and positive differences don’t cancel each other out.
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Here,
 n is the number of data points.
 yi is the actual or observed value for the ith data point.
 ŷᵢ is the predicted value for the ith data point.
MSE is a way to quantify the accuracy of a model’s predictions. MSE is sensitive to outliers
as large errors contribute significantly to the overall score.
Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy of a regression
model. MAE measures the average absolute difference between the predicted values and
actual values.
Mathematically, MAE is expressed as:
MAE = (1/n) Σᵢ₌₁ⁿ |Yᵢ − Ŷᵢ|
Here,
 n is the number of observations
 Yᵢ represents the actual values.
 Ŷᵢ represents the predicted values.
Lower MAE value indicates better model performance. It is not sensitive to the outliers as
we consider absolute differences.
Root Mean Squared Error (RMSE)
The square root of the residuals’ variance is the Root Mean Squared Error. It describes how
well the observed data points match the expected values, or the model’s absolute fit to the
data.

In mathematical notation, it can be expressed as:


RMSE = √(RSS/n) = √( (1/n) Σᵢ₌₁ⁿ (yᵢ,actual − yᵢ,predicted)² )
Rather than dividing by the entire number of data points, one must divide the sum of the
squared residuals by the number of degrees of freedom to obtain an unbiased estimate. This
figure is then referred to as the Residual Standard Error (RSE).
In mathematical notation, it can be expressed as:
RSE = √( RSS/(n−2) ) = √( (1/(n−2)) Σᵢ₌₁ⁿ (yᵢ,actual − yᵢ,predicted)² )
RMSE is not as good of a metric as R-squared. Root Mean Squared Error can fluctuate when
the units of the variables vary, since its value is dependent on the variables’ units (it is not a
normalized measure).
Coefficient of Determination (R-squared)
R-Squared is a statistic that indicates how much variation the developed model can explain
or capture. It is always in the range of 0 to 1. In general, the better the model matches the
data, the greater the R-squared number.
In mathematical notation, it can be expressed as: R² = 1 − (RSS/TSS)
 Residual sum of Squares (RSS): The sum of squares of the residual for each data point
in the plot or data is known as the residual sum of squares, or RSS. It is a measurement of
the difference between the output that was observed and what was anticipated.
RSS = Σᵢ₌₁ⁿ (yᵢ − b0 − b1xᵢ)²
 Total Sum of Squares (TSS): The sum of the squared deviations of the data points from the
response variable’s mean is known as the total sum of squares, or TSS.

 TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
The R-squared metric is a measure of the proportion of variance in the dependent variable that is
explained by the independent variables in the model.
Adjusted R-Squared Error
Adjusted R2 measures the proportion of variance in the dependent variable that is explained
by independent variables in a regression model. Adjusted R-square accounts for the number of
predictors in the model and penalizes the model for including irrelevant predictors that don’t
contribute significantly to explaining the variance in the dependent variables.
Mathematically, adjusted R² is expressed as:
Adjusted R² = 1 − [ (1 − R²)(n − 1) / (n − k − 1) ]
Here,
 n is the number of observations
 k is the number of predictors in the model
 R² is the coefficient of determination
Adjusted R-square helps to prevent overfitting. It penalizes the model with additional
predictors that do not contribute significantly to explain the variance in the dependent
variable.
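
A short sketch computing all of these metrics with NumPy on toy actual/predicted values
(k, the number of predictors, is assumed to be 1 here):

# MSE, MAE, RMSE, R-squared and adjusted R-squared from scratch.
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])
n, k = len(y_true), 1

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(mse)
rss = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(mse, mae, rmse, r2, adj_r2)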

1.7 LINEAR BASIS FUNCTION MODELS

Given a set of input data of N samples {xₙ}, where n = 1, …, N, together with the corresponding
target values {tₙ}, the goal is to deduce the value of t for a new value of x. The set of input data
together with the corresponding target values t is known as the training data set.

One way to handle this is by constructing a function y(x) that maps x to t such that:
y(x) = t

for a new input value of x.


Then we can examine this model by finding the probability that the results are correct. This
means that we need to examine the probability of t given x

p(t|x)

Constructing the Linear Basis Function


The basic linear model for regression is a model that involves a linear combination of the input
variables:
y(w, x) = w0 + w1x1 + w2x2 + … + wDxD

where x = (x1, x2, … ,xD)T

This is what is generally known as linear regression.
The key attribute of this function is that it is a linear function of the parameters w0, w1, …, wD. It
is also a linear function of the input variable x. Being a linear function of the input variable x
limits the usefulness of the function. This is because most of the observations that may be
encountered do not necessarily follow a linear relationship. To solve this problem, consider
modifying the model to be a combination of fixed non-linear functions of the input variable.

If we assume that the non-linear function of the input variable is φ(x), then we can re-write the
original function as :

y(x, w) = w0 + w1φ1(x) + w2φ2(x) + … + wM−1φM−1(x)

Summing it up, we will have:

y(x, w) = w0 + Σⱼ₌₁^(M−1) wⱼφⱼ(x)

where the φⱼ(x) are known as basis functions.

The total number of parameters in this function, including the bias w0, will be M; therefore the
summation of terms runs from j = 1 to M − 1.
The parameter w0 is known as the bias parameter, which allows for a fixed offset in the data.
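
A sketch of a linear basis function model using polynomial bases φⱼ(x) = xʲ, fit by least
squares; the sine-plus-noise data is an illustrative assumption:

# Polynomial basis functions: phi_0(x) = 1 supplies the bias parameter w0.
import numpy as np

x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=20)

M = 4                                              # total number of parameters
Phi = np.column_stack([x ** j for j in range(M)])  # design matrix of basis functions
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)        # least-squares weights
print(w)                                           # w[0] is the bias parameter w0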

1.8 Bias-Variance Decomposition

The bias is defined as the difference between the ML model’s average prediction and the
correct value. High bias causes substantial inaccuracy in both training and testing data. To
prevent the problem of underfitting, it is advised that an algorithm be low biased at all times.

The data predicted with high bias is in a straight-line format, which does not fit the data in the
data set adequately. Underfitting of data is a term used to describe this type of fitting. This
occurs when the theory is overly simplistic or linear in form.

The variance of the model is the variability of model prediction for a particular data point,
which tells us about the dispersion of the data. The model with high variance has a very
complicated fit to the training data and so is unable to fit correctly on new data.

As a result, while such models perform well on training data, they have large error rates on test
data. When a model has a large variance, this is referred to as Overfitting of Data. Variability
should be reduced to a minimum while training a data model.
Bias and variance are negatively related, therefore it is essentially difficult to have an ML
model with both a low bias and a low variance. When we alter the ML method to better match
a specific data set, it results in reduced bias but increases variance. In this manner, the model
will fit the data set while increasing the likelihood of incorrect predictions.

The same is true when developing a low variance model with a bigger bias. The model will not
fully fit the data set, even though it will lower the probability of erroneous predictions. As a
result, there is a delicate balance between biases and variance.

When to use bias-variance decomposition

Since bias and variance are connected to underfitting and overfitting, decomposing the loss
into bias and variance helps us understand learning algorithms. Let’s understand certain
attributes.

 Low Bias: Tends to suggest fewer implications about the target function’s shape.
 High-Bias: Suggests additional assumptions about the target function’s shape.
 Low Variance: Suggests minor changes to the target function estimate when the
training dataset changes.
 High Variance: Suggests that changes to the training dataset cause considerable
variations in the target function estimate.

Theoretically, a model should have low bias and low variance but this is impossible to achieve.
So, an optimal bias and variance are acceptable. Linear models have low variance but high bias
and non-linear models have low bias but high variance.

Working of the bias-variance decomposition

The total error of a machine learning algorithm has three components: bias, variance and noise.
Decomposition is the process of deriving this total error; in this case we are taking the Mean
Squared Error (MSE).

Total error = Bias² + Variance + Irreducible error (noise)

Suppose we have a regression problem where we take in vectors x and try to make predictions
of a single value y. Suppose for the moment that we know the absolute true answer up to an
independent random noise term: y = f(x) + ε. The noise ε should be independent of any
randomness inherent in the vector and should have a mean of zero, so that the function f(x) is
the best possible guess.

The risk function R(h) = E[(y − h(x))²] is the cost function of the algorithm; when the loss is
the squared error, the risk is the expected squared error. The expectation, represented by “E”,
is taken over the random variables: it averages over the probability distributions involved in
producing the hypothesis “h”.

The data x and y are derived from the probability distribution on which the learner will be
trained. Since the weights are selected based on the training data, the weights that define h are
also obtained from the probability distribution. It can be difficult to determine this distribution,
but it does exist. The expectation function consolidates the losses of all potential weight values.

Carrying out the mathematical derivation, we observe that the total error decomposes at last
into the three components: bias, variance and irreducible error (noise).

Let’s understand this with an example.

In this example, we’re attempting to match a sine wave with lines, which are obviously not
realistic. On the left, we produced 50 distinct lines. The red line in the top right corner
represents the anticipated hypothesis which is an average of infinitely many possibilities. The
black curve depicts test locations along with the true function.

Because lines do not match sine waves well, we notice that most test points have a substantial
bias. Here the bias is the squared difference between the black and red curves.

Some of the test locations, however, exhibit a slight bias, where the sine wave crosses the red
line. The variance in the middle represents the predicted squared difference between a random
black line and the red line. The irreducible error is the predicted squared difference between a
random test point and the sine wave.
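
A simulation sketch of this example: fit many straight lines to noisy sine data and estimate
the bias² and variance terms at a grid of test points (sample sizes and the noise level are
assumptions):

# Bias-variance estimation by repeatedly fitting lines to a noisy sine wave.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 2 * np.pi, 50)
true_f = np.sin(x_test)                    # the true function at the test points

preds = []
for _ in range(200):                       # 200 independent training sets
    x = rng.uniform(0, 2 * np.pi, 30)
    y = np.sin(x) + rng.normal(scale=0.3, size=30)
    b1, b0 = np.polyfit(x, y, 1)           # fit a line (high bias for a sine)
    preds.append(b0 + b1 * x_test)

preds = np.array(preds)
bias_sq = np.mean((preds.mean(axis=0) - true_f) ** 2)  # squared bias term
variance = np.mean(preds.var(axis=0))                  # variance term
print(bias_sq, variance)    # the noise variance 0.3**2 is the irreducible error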

1.9 Bayesian Linear Regression

In the Bayesian viewpoint, we formulate linear regression using probability distributions rather
than point estimates. The response, y, is not estimated as a single value, but is assumed to be
drawn from a probability distribution. The model for Bayesian Linear Regression with the
response sampled from a normal distribution is:

y ∼ N(βᵀX, σ²I)

The output, y, is generated from a normal (Gaussian) distribution characterized by a mean and
variance. The mean for linear regression is the transpose of the weight matrix multiplied by the
predictor matrix. The variance is the square of the standard deviation σ (multiplied by the
Identity matrix because this is a multi-dimensional formulation of the model).

The aim of Bayesian Linear Regression is not to find the single “best” value of the model
parameters, but rather to determine the posterior distribution for the model parameters. Not only
is the response generated from a probability distribution, but the model parameters are assumed
to come from a distribution as well. The posterior probability of the model parameters is
conditional upon the training inputs and outputs:

P(β|y, X) = P(y|β, X) · P(β|X) / P(y|X)

Here, P(β|y, X) is the posterior probability distribution of the model parameters given the inputs
and outputs. This is equal to the likelihood of the data, P(y|β, X), multiplied by the prior
probability of the parameters and divided by a normalization constant. This is a simple
expression of Bayes Theorem, the fundamental underpinning of Bayesian Inference:

Posterior = (Likelihood × Prior) / Normalization

Let’s stop and think about what this means. In contrast to OLS, we have a
posterior distribution for the model parameters that is proportional to the likelihood of the data
multiplied by the prior probability of the parameters. Here we can observe the two primary
benefits of Bayesian Linear Regression.

1. Priors: If we have domain knowledge, or a guess for what the model parameters should be,
we can include them in our model, unlike in the frequentist approach which assumes
everything there is to know about the parameters comes from the data. If we don’t have any
estimates ahead of time, we can use non-informative priors for the parameters such as a
normal distribution.

2. Posterior: The result of performing Bayesian Linear Regression is a distribution of possible


model parameters based on the data and the prior. This allows us to quantify our uncertainty
about the model: if we have fewer data points, the posterior distribution will be more spread
out.

As the amount of data points increases, the likelihood washes out the prior, and in the case of
infinite data, the outputs for the parameters converge to the values obtained from OLS.

The formulation of model parameters as distributions encapsulates the Bayesian worldview: we


start out with an initial estimate, our prior, and as we gather more evidence, our model becomes
less wrong. Bayesian reasoning is a natural extension of our intuition. Often, we have an initial
hypothesis, and as we collect data that either supports or disproves our ideas, we change our
model of the world (ideally this is how we would reason)!

Implementing Bayesian Linear Regression

In practice, evaluating the posterior distribution for the model parameters is intractable for
continuous variables, so we use sampling methods to draw samples from the posterior in order
to approximate the posterior. The technique of drawing random samples from a distribution to
approximate the distribution is one application of Monte Carlo methods. There are a number of
algorithms for Monte Carlo sampling, with the most common being variants of Markov Chain
Monte Carlo (see this post for an application in Python).

Bayesian Linear Modeling Application

I’ll skip the code for this post (see the notebook for the implementation in PyMC3) but the basic
procedure for implementing Bayesian Linear Regression is: specify priors for the model
parameters (I used normal distributions in this example), create a model mapping the training
inputs to the training outputs, and then have a Markov Chain Monte Carlo (MCMC) algorithm
draw samples from the posterior distribution for the model parameters. The end result will be
posterior distributions for the parameters. We can inspect these distributions to get a sense of
what is occurring.
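
Rather than reproduce the PyMC3 notebook, here is a minimal conjugate-posterior sketch in
NumPy, assuming a known noise scale σ and a normal prior N(0, τ²I) on the parameters; the
simulated data reuses the OLS values quoted below purely for illustration:

# Closed-form posterior for Bayesian linear regression with known noise.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # bias column + 1 feature
beta_true = np.array([-21.83, 7.17])                        # illustrative "truth"
y = X @ beta_true + rng.normal(scale=5.0, size=50)

sigma, tau = 5.0, 100.0                       # assumed noise scale and prior scale
S = np.linalg.inv(np.eye(2) / tau**2 + X.T @ X / sigma**2)  # posterior covariance
m = S @ X.T @ y / sigma**2                                  # posterior mean

draws = rng.multivariate_normal(m, S, size=1000)  # sample many candidate lines
print(m)                  # near the OLS estimates once the data dominate the prior
print(draws.std(axis=0))  # posterior spread quantifies parameter uncertainty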

The first plots show the approximations of the posterior distributions of model parameters.
These are the result of 1000 steps of MCMC, meaning the algorithm drew 1000 samples from
the posterior distribution.

If we compare the mean values for the slope and intercept to those obtained from OLS (the
intercept from OLS was -21.83 and the slope was 7.17), we see that they are very similar.
However, while we can use the mean as a single point estimate, we also have a range of possible
values for the model parameters. As the number of data points increases, this range will shrink
and converge on a single value, representing greater confidence in the model parameters. (In
Bayesian inference a range for a variable is called a credible interval, which has a slightly
different interpretation from a confidence interval in frequentist inference.)

When we want to show the linear fit from a Bayesian model, instead of showing only one
estimate, we can draw a range of lines, with each one representing a different estimate of the
model parameters. As the number of datapoints increases, the lines begin to overlap because there is
less uncertainty in the model parameters.

In order to demonstrate the effect of the number of datapoints in the model, I used two models,
the first, with the resulting fits shown on the left, used 500 datapoints and the one on the right
used 15000 datapoints. Each graph shows 100 possible models drawn from the model parameter
posteriors.

Bayesian Linear Regression Model Results with 500 (left) and 15000 observations (right)

There is much more variation in the fits when using fewer data points, which represents a greater
uncertainty in the model. With all of the data points, the OLS and Bayesian Fits are nearly
identical because the priors are washed out by the likelihoods from the data. When predicting
the output for a single datapoint using our Bayesian Linear Model, we also do not get a single
value but a distribution. Following is the probability density plot for the number of calories
burned exercising for 15.5 minutes. The red vertical line indicates the point estimate from OLS.

Posterior Probability Density of Calories Burned from Bayesian Model

We see that the probability of the number of calories burned peaks around 89.3, but the full
estimate is a range of possible values.

1.10 Dimensionality Reduction


Dimensionality reduction is the process of reducing the number of features (or
dimensions) in a dataset while retaining as much information as possible. This can be done
for a variety of reasons, such as to reduce the complexity of a model, to improve the
performance of a learning algorithm, or to make it easier to visualize the data. There are
several techniques for dimensionality reduction, including principal component analysis
(PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA). Each
technique uses a different method to project the data onto a lower-dimensional space while
preserving important information.
Dimensionality reduction is a technique used to reduce the number of features in a dataset
while retaining as much of the important information as possible. In other words, it is a
process of transforming high-dimensional data into a lower-dimensional space that still
preserves the essence of the original data.
In machine learning, high-dimensional data refers to data with a large number of features or
variables. The curse of dimensionality is a common problem in machine learning, where the
performance of the model deteriorates as the number of features increases. This is because
the complexity of the model increases with the number of features, and it becomes more
difficult to find a good solution. In addition, high-dimensional data can also lead to
overfitting, where the model fits the training data too closely and does not generalize well to
new data.

Dimensionality reduction can help to mitigate these problems by reducing the complexity of
the model and improving its generalization performance. There are two main approaches to
dimensionality reduction: feature selection and feature extraction.
Feature Selection:
Feature selection involves selecting a subset of the original features that are most relevant to
the problem at hand. The goal is to reduce the dimensionality of the dataset while retaining
the most important features. There are several methods for feature selection, including filter
methods, wrapper methods, and embedded methods. Filter methods rank the features based
on their relevance to the target variable, wrapper methods use the model performance as the
criteria for selecting features, and embedded methods combine feature selection with the
model training process.
Feature Extraction:
Feature extraction involves creating new features by combining or transforming the original
features. The goal is to create a set of features that captures the essence of the original data
in a lower-dimensional space. There are several methods for feature extraction, including
principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed
stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original
features onto a lower-dimensional space while preserving as much of the variance as
possible.
Why is Dimensionality Reduction important in Machine Learning and Predictive
Modeling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these
features may overlap. In another condition, a classification problem that relies on both
humidity and rainfall can be collapsed into just one underlying feature, since both of the
aforementioned are correlated to a high degree. Hence, we can reduce the number of features
in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one
can be mapped to a simple 2-dimensional space, and a 1-D problem to a simple line. The
below figure illustrates this concept, where a 3-D feature space is split into two 2-D feature
spaces, and later, if found to be correlated, the number of features can be reduced even
further.

Components of Dimensionality Reduction

There are two components of dimensionality reduction:


 Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset which can be used to model the problem. It usually
involves three ways:
1. Filter
2. Wrapper
3. Embedded
 Feature extraction: This reduces the data in a high dimensional space to a lower
dimension space, i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear and non-linear, depending upon the method
used. The prime linear method, called Principal Component Analysis, or PCA, is discussed
below.
Principal Component Analysis
This method was introduced by Karl Pearson. It works on the condition that while the data
in a higher dimensional space is mapped to data in a lower dimension space, the variance of
the data in the lower dimensional space should be maximum.
It involves the following steps:
 Construct the covariance matrix of the data.
 Compute the eigenvectors of this matrix.
 Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might have been some
data loss in the process. But, the most important variances should be retained by the
remaining eigenvectors.
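
A from-scratch sketch following exactly these steps, reducing invented 3-D data to 2-D:

# PCA via the covariance matrix and its eigendecomposition (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                  # center the data first

cov = np.cov(Xc, rowvar=False)           # step 1: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # step 2: eigenvalues/eigenvectors

top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]  # step 3: largest eigenvalues
X_reduced = Xc @ top2                    # project onto the 2-D principal subspace
print(X_reduced.shape)                   # (100, 2)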
Advantages of Dimensionality Reduction
 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.
 Improved Visualization: High dimensional data is difficult to visualize, and
dimensionality reduction techniques can help in visualizing the data in 2D or 3D, which
can help in better understanding and analysis.
 Overfitting Prevention: High dimensional data may lead to overfitting in machine learning
models, which can lead to poor generalization performance. Dimensionality reduction can
help in reducing the complexity of the data, and hence prevent overfitting.
 Feature Extraction: Dimensionality reduction can help in extracting important features
from high dimensional data, which can be useful in feature selection for machine learning
models.
 Data Preprocessing: Dimensionality reduction can be used as a preprocessing step before
applying machine learning algorithms to reduce the dimensionality of the data and hence
improve the performance of the model.
 Improved Performance: Dimensionality reduction can help in improving the performance
of machine learning models by reducing the complexity of the data, and hence reducing
the noise and irrelevant information in the data.

Disadvantages of Dimensionality Reduction
 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes undesirable.

 PCA fails in cases where mean and covariance are not enough to define datasets.
 We may not know how many principal components to keep; in practice, some rules of
thumb are applied.
 Interpretability: The reduced dimensions may not be easily interpretable, and it may be
difficult to understand the relationship between the original features and the reduced
dimensions.
 Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially
when the number of components is chosen based on the training data.
 Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to outliers,
which can result in a biased representation of the data.
 Computational complexity: Some dimensionality reduction techniques, such as manifold
learning, can be computationally intensive, especially when dealing with large datasets.

Important points:

 Dimensionality reduction is the process of reducing the number of features in a dataset
while retaining as much information as possible.
 This can be done to reduce the complexity of a model, improve the performance of a
learning algorithm, or make it easier to visualize the data.
 Techniques for dimensionality reduction include: principal component analysis (PCA),
singular value decomposition (SVD), and linear discriminant analysis (LDA).
 Each technique projects the data onto a lower-dimensional space while preserving
important information.
 Dimensionality reduction is performed during the pre-processing stage, before building a
model, to improve the performance.
 It is important to note that dimensionality reduction can also discard useful information,
so care must be taken when applying these techniques.

UNIT 2
NEURAL NETWORKS
2.1 BIOLOGICAL NEURONS AND THEIR ARTIFICIAL NEURONS
Biological Neurons
Neurons are the basic functional units of the nervous system, and they generate electrical
signals called action potentials, which allows them to quickly transmit information over long
distances.
Almost all the neurons have three basic functions essential for the normal functioning of all the
cells in the body.
These are to:
1. Receive signals (or information) from outside.
2. Process the incoming signals and determine whether or not the information should be passed
along.
3. Communicate signals to target cells which might be other neurons or muscles or glands.
Now let us understand the basic parts of a neuron to get a deeper insight into how they actually
work…
A biological neuron is composed of three main parts, plus an external connection called the synapse:
1. Dendrite
Dendrites are responsible for receiving incoming signals from outside the cell.
2. Soma
The soma is the cell body, responsible for processing input signals and deciding whether the
neuron should fire an output signal.
3. Axon
The axon carries processed signals from the neuron to the relevant target cells.
4. Synapse
The synapse is the connection between an axon and the dendrites of other neurons.
Working of the parts
The task of receiving the incoming information is done by dendrites, and processing generally
takes place in the cell body. Incoming signals can be either excitatory — which means they
tend to make the neuron fire (generate an electrical impulse) — or inhibitory — which means
that they tend to keep the neuron from firing.
Most neurons receive many input signals throughout their dendritic trees. A single neuron may
have more than one set of dendrites and may receive many thousands of input signals. Whether
or not a neuron is excited into firing an impulse depends on the sum of all of the excitatory and
inhibitory signals it receives. The processing of this information happens in the soma, which is the
neuron cell body. If the neuron does end up firing, the nerve impulse, or action potential, is
conducted down the axon.
Towards its end, the axon splits up into many branches and develops bulbous swellings known
as axon terminals (or nerve terminals). These axon terminals make connections on target
cells.

Artificial Neurons
An artificial neuron, also known as a perceptron, is the basic unit of a neural network. In simple
terms, it is a mathematical function based on a model of biological neurons. It can also be seen
as a simple logic gate with binary outputs.
Each artificial neuron has the following main functions:
1. Takes inputs from the input layer
2. Weighs them separately and sums them up
3. Passes this sum through a nonlinear function to produce the output.

Basic perceptron diagram


The perceptron (neuron) consists of 4 parts:
1. Input values or One input layer
We pass input values to a neuron using this layer. It might be something as simple as a
collection of array values. It is similar to a dendrite in biological neurons.
2. Weights and Bias
Weights are a collection of array values that are multiplied by the respective input values.
We then take the sum of all these products, which is called the weighted sum. Next,
we add a bias value to the weighted sum to get the final value used for prediction by our neuron.

3. Activation Function
Activation Function decides whether or not a neuron is fired. It decides which of the two
output values should be generated by the neuron.
4. Output Layer
Output layer gives the final output of a neuron which can then be passed to other neurons
in the network or taken as the final output value.
Now, all the above concepts might seem like too much theoretical knowledge without any
practical insights, so let’s understand the working of an artificial neuron with an example.
Consider a neuron with two inputs (x1,x2) as shown below:

Single-layer neuron example


1. The values of the two inputs (x1, x2) are 0.8 and 1.2
2. We have a set of weights (1.0, 0.75) corresponding to the two inputs
3. Then we have a bias with value 0.5 which needs to be added to the weighted sum
The input to the activation function is then calculated as:

C = (x1 · w1) + (x2 · w2) + bias = (0.8 × 1.0) + (1.2 × 0.75) + 0.5 = 2.2
Now the combination (C) can be fed to the activation function. In this example we use the
Rectified Linear Unit (ReLU) activation function, defined as

f(x) = max(0, x)

that is, the output is 0 for any negative input and equal to the input itself otherwise.

In our case, the combination value we got was 2.2, which is greater than 0, so the output value of
our activation function will be 2.2.
This will be the final output value of our single layer neuron.
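A minimal sketch of this single-neuron computation in plain Python follows; the numbers mirror the example above, and the function and variable names are chosen only for illustration.

# Minimal sketch of the single-neuron example above.
def relu(x):
    """Rectified Linear Unit: returns 0 for negative inputs, x otherwise."""
    return max(0.0, x)

inputs = [0.8, 1.2]        # x1, x2
weights = [1.0, 0.75]      # w1, w2
bias = 0.5

# Weighted sum plus bias: (0.8 * 1.0) + (1.2 * 0.75) + 0.5 = 2.2
c = sum(x * w for x, w in zip(inputs, weights)) + bias

output = relu(c)
print(output)  # 2.2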

Biological Neuron vs. Artificial Neuron

Since we have learnt a bit about both biological and artificial neurons, we can now draw
comparisons between the two. (Figure: similarities between a biological and an artificial
neuron — the dendrites correspond to the inputs, the soma to the summation and activation,
the axon to the output, and the synapse to the weights.)
INTRODUCTION
Neural network learning methods provide a robust approach to approximating real-
valued, discrete-valued, and vector-valued target functions. For certain types of problems, such
as learning to interpret complex real-world sensor data, artificial neural networks are among
the most effective learning methods currently known. For example, the
BACKPROPAGATION algorithm described in this chapter has proven surprisingly successful
in many practical problems such as learning to recognize handwritten characters (LeCun et al.
1989), learning to recognize spoken words (Lang et al. 1990), and learning to recognize faces
(Cottrell 1990). One survey of practical applications is provided by Rumelhart et al. (1994).

Biological Motivation
The study of artificial neural networks (ANNs) has been inspired in part by the
observation that biological learning systems are built of very complex webs of interconnected
neurons. In rough analogy, artificial neural networks are built out of a densely interconnected
set of simple units, where each unit takes a number of real-valued inputs (possibly the outputs

of other units) and produces a single real-valued output (which may become the input to many
other units).

To develop a feel for this analogy, let us consider a few facts from neurobiology. The
human brain, for example, is estimated to contain a densely interconnected network of
approximately 10^11 neurons, each connected, on average, to 10^4 others. Neuron activity is
typically excited or inhibited through connections to other neurons. The fastest neuron
switching times are known to be on the order of 10^-3 seconds, quite slow compared to
computer switching speeds of 10^-10 seconds. Yet humans are able to make surprisingly
complex decisions, surprisingly quickly. For example, it requires approximately 10^-1 seconds to
visually recognize your mother. Notice the sequence of neuron firings that can take place
during this 0.1-second interval cannot possibly be longer than a few hundred steps, given the
switching speed of single neurons.

This observation has led many to speculate that the information-processing abilities of
biological neural systems must follow from highly parallel processes operating on
representations that are distributed over many neurons. One motivation for ANN systems is to
capture this kind of highly parallel computation based on distributed representations. Most
ANN software runs on sequential machines emulating distributed processes, although faster
versions of the algorithms have also been implemented on highly parallel machines and on
specialized hardware designed specifically for ANN applications.

While ANNs are loosely motivated by biological neural systems, there are many
complexities to biological neural systems that are not modeled by ANNs, and many features of
the ANNs we discuss here are known to be inconsistent with biological systems. For example,
we consider here ANNs whose individual units output a single constant value, whereas
biological neurons output a complex time series of spikes.
Historically, two groups of researchers have worked with artificial neural networks.
One group has been motivated by the goal of using ANNs to study and model biological
learning processes. A second group has been motivated by the goal of obtaining highly
effective machine learning algorithms, independent of whether these algorithms mirror
biological processes. Within this book our interest fits the latter group, and therefore we will
not dwell further on biological modeling. For more information on attempts to model biological
systems using ANNs, see, for example, Churchland and Sejnowski (1992); Zornetzer et al.
(1994); Gabriel and Moore (1990).

NEURAL NETWORK REPRESENTATIONS

A prototypical example of ANN learning is provided by Pomerleau's (1993) system


ALVINN, which uses a learned ANN to steer an autonomous vehicle driving
at normal speeds on public highways. The input to the neural network is a 30 x 32 grid of pixel
intensities obtained from a forward-pointed camera mounted on the vehicle. The network
output is the direction in which the vehicle is steered. The ANN is trained to mimic the
observed steering commands of a human driving the vehicle for approximately 5 minutes.
ALVINN has used its learned networks to successfully drive at speeds up to 70 miles per hour

and for distances of 90 miles on public highways (driving in the left lane of a divided public
highway, with other vehicles present).

The neural network representation used in one version of the ALVINN system
illustrates the kind of representation typical of many ANN systems. The network is shown on
the left side of the figure, with the input camera image depicted below it. Each node (i.e., circle)
in the network diagram corresponds to the output of a single network unit, and the lines entering
the node from below are its inputs. As can be seen, there are four units that receive inputs
directly from all of the 30 x 32 pixels in the image. These are called "hidden" units because
their output is available only within the network and is not available as part of the global
network output. Each of these four hidden units computes a single real-valued output based on
a weighted combination of its 960 inputs. These hidden unit outputs are then used as inputs to
a second layer of 30 "output" units. Each output unit corresponds to a particular steering
direction, and the output values of these units determine which steering direction is
recommended most strongly.

The diagrams on the right side of the figure depict the learned weight values associated
with one of the four hidden units in this ANN. The large matrix of black and white boxes on
the lower right depicts the weights from the 30 x 32 pixel inputs into the hidden unit. Here, a
white box indicates a positive weight, a black box a negative weight, and the size of the box
indicates the weight magnitude. The smaller rectangular diagram directly above the large
matrix shows the weights from this hidden unit to each of the 30 output units.

The network structure of ALVINN is typical of many ANNs. Here the individual units
are interconnected in layers that form a directed acyclic graph. In general, ANNs can be graphs
with many types of structures-acyclic or cyclic, directed or undirected. This chapter will focus
on the most common and practical ANN approaches, which are based on the
BACKPROPAGATION algorithm. The BACKPROPAGATION algorithm assumes the
network is a fixed structure that corresponds to a directed graph, possibly containing cycles.
Learning corresponds to choosing a weight value for each edge in the graph. Although certain
types of cycles are allowed, the vast majority of practical applications involve acyclic feed-
forward networks, similar to the network structure used by ALVINN.

2.2 LEARNING RULES

APPROPRIATE PROBLEMS FOR NEURAL NETWORKLEARNING


ANN learning is well-suited to problems in which the training data corresponds to
noisy, complex sensor data, such as inputs from cameras and microphones.

Neural network learning to steer an autonomous vehicle. The ALVINN system uses
BACKPROPAGATION to learn to steer an autonomous vehicle (photo at top) driving at
speeds up to 70 miles per hour. The diagram on the left shows how the image of a forward-
mounted camera is mapped to 960 neural network inputs, which are fed forward to 4 hidden
units, connected to 30 output units. Network outputs encode the commanded steering direction.
The figure on the right shows weight values for one of the hidden units in this network. The 30
x 32 weights into the hidden unit are displayed in the large matrix, with white blocks indicating
positive and black indicating negative weights. The weights from this hidden unit to the 30
output units are depicted by the smaller rectangular block directly above the large block. As
can be seen from these output weights, activation of this particular hidden unit encourages a
turn toward the left.

It is also applicable to problems for which more symbolic representations are often
used, such as the decision tree learning tasks discussed in Chapter 3. In these cases ANN and
decision tree learning often produce results of comparable accuracy. See Shavlik et al. (1991)
and Weiss and Kapouleas (1989) for experimental comparisons of decision tree and ANN
learning. The BACKPROPAGATION algorithm is the most commonly used ANN learning
technique. It is appropriate for problems with the following characteristics:

Instances are represented by many attribute-value pairs. The target function to be learned is
defined over instances that can be described by a vector of predefined features, such as the

pixel values in the ALVINN example. These input attributes may be highly correlated or
independent of one another. Input values can be any real values.

The target function output may be discrete-valued, real-valued, or a vector of several real- or


discrete-valued attributes. For example, in the ALVINN system the output is a vector of 30
attributes, each corresponding to a recommendation regarding the steering direction. The value
of each output is some real number between 0 and 1, which in this case corresponds to the
confidence in predicting the corresponding steering direction. We can also train a single
network to output both the steering command and suggested acceleration, simply by
concatenating the vectors that encode these two output predictions.

The training examples may contain errors. ANN learning methods are quite robust to noise
in the training data.

Long training times are acceptable. Network training algorithms typically require longer
training times than, say, decision tree learning algorithms. Training times can range from a few
seconds to many hours, depending on factors such as the number of weights in the network,
the number of training examples considered, and the settings of various learning algorithm
parameters.

Fast evaluation of the learned target function may be required. Although ANN learning times
are relatively long, evaluating the learned network, in order to apply it to a subsequent instance,
is typically very fast. For example, ALVINN applies its neural network several times per
second to continually update its steering command as the vehicle drives forward.

The ability of humans to understand the learned target function is not important. The
weights learned by neural networks are often difficult for humans to interpret. Learned neural
networks are less easily communicated to humans than learned rules.

The rest of this chapter is organized as follows: We first consider several alternative
designs for the primitive units that make up artificial neural networks (perceptrons, linear units,
and sigmoid units), along with learning algorithms for training single units. We then present
the BACKPROPAGATION algorithm for training multilayer networks of such units and
consider several general issues such as the representational capabilities of ANNs, nature of the
hypothesis space search, overfitting problems, and alternatives to the BACKPROPAGATION
algorithm.

A detailed example is also presented applying BACKPROPAGATION to face


recognition, and directions are provided for the reader to obtain the data and code to experiment
further with this application.

A learning rule enhances an Artificial Neural Network's performance by being applied over the
network. A learning rule updates the weights and bias levels of a network when certain
conditions are met during the training process. It is a crucial part of the development
of a neural network.
Types Of Learning Rules in ANN

1. Hebbian Learning Rule

Donald Hebb developed it in 1949 as an unsupervised learning algorithm for neural
networks. We can use it to update the weights between the nodes of a network. It is based on
the following observations:
 If two neighbor neurons are operating in the same phase at the same period of time, then
the weight between these neurons should increase.
 For neurons operating in the opposite phase, the weight between them should decrease.
 If there is no signal correlation, the weight does not change. The sign of the weight between
two nodes depends on the sign of the inputs at those nodes.
 When inputs of both the nodes are either positive or negative, it results in a strong positive
weight.
 If the input of one node is positive and negative for the other, a strong negative weight is
present.
Mathematical formulation:
Δw = α · xi · y
where Δw is the change in weight, α is the learning rate, xi is the input, and y is the output.
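A minimal sketch of one Hebbian update step in plain Python follows; the learning rate, inputs, and weights are arbitrary illustrations.

# Hebbian update: delta_w = alpha * x * y (one unsupervised step, element-wise).
alpha = 0.1                      # learning rate
x = [1.0, -0.5, 0.3]             # input vector
w = [0.2, 0.4, -0.1]             # current weights

# Output of the node: weighted sum of its inputs.
y = sum(wi * xi for wi, xi in zip(w, x))

# Strengthen weights whose input agrees in sign with the output.
w = [wi + alpha * xi * y for wi, xi in zip(w, x)]
print(w)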

2. Perceptron Learning Rule

It was introduced by Rosenblatt. It is an error-correcting rule for a single-layer
feedforward network. It is supervised in nature: it calculates the error between the desired
and the actual output, and the weights are adjusted only when an error is present.
It is computed as follows:
Assume (x1, x2, x3, …, xn) is the set of input values
and (w1, w2, w3, …, wn) is the set of weights
y = actual output
wo = initial weight
wnew = new weight
Δw = change in weight
α = learning rate
Actual output: y = Σ wi·xi
Learning signal: ej = ti − y (the difference between the desired and the actual output)
Δw = α · xi · ej
wnew = wo + Δw
Now, the output can be calculated on the basis of the input and the activation function applied
over the net input, and can be expressed as:
y = 1, if net input ≥ θ
y = 0, if net input < θ
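A minimal sketch of this update rule in Python follows; the threshold θ, the learning rate, and the tiny AND dataset are illustrative assumptions.

# Perceptron learning rule: w <- w + alpha * x * (t - y), one pass shown.
alpha, theta = 0.1, 0.5          # learning rate and firing threshold (assumed)
w = [0.0, 0.0]                   # initial weights

# Tiny supervised dataset: (input vector, target output) pairs for AND.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

for x, t in data:
    net = sum(wi * xi for wi, xi in zip(w, x))
    y = 1 if net >= theta else 0         # step activation
    e = t - y                            # error (learning signal)
    w = [wi + alpha * xi * e for wi, xi in zip(w, x)]

print(w)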

3. Delta Learning Rule

It was developed by Bernard Widrow and Marcian Hoff. It depends on supervised
learning and uses a continuous activation function. It is also known as the Least Mean Square
(LMS) method, and it minimizes the error over all the training patterns.
It is based on a gradient descent approach that continues indefinitely. It states that the
modification in the weight of a node is equal to the product of the error and the input, where
the error is the difference between the desired and the actual output.
It is computed as follows:
Assume (x1, x2, x3, …, xn) is the set of input values
and (w1, w2, w3, …, wn) is the set of weights
y = actual output
wo = initial weight
wnew = new weight
Δw = change in weight
Error = ti − y
Learning signal: ej = (ti − y) · y′
y = f(net input) = f(Σ wi·xi)
Δw = α · xi · ej = α · xi · (ti − y) · y′
wnew = wo + Δw
The updating of weights is done only if there is a difference between the target and the
actual output (i.e., an error) present:
Case I: when t = y, there is no change in weight
Case II: otherwise, wnew = wo + Δw
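A minimal sketch of one delta-rule update with a sigmoid activation follows; the training pattern, target, and learning rate are illustrative assumptions.

import math

# Delta (LMS) rule with a continuous activation: dw = alpha * x * (t - y) * y'
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

alpha = 0.5
w = [0.1, -0.2]
x, t = [1.0, 0.5], 1.0           # one training pattern and its target

net = sum(wi * xi for wi, xi in zip(w, x))
y = sigmoid(net)
y_prime = y * (1 - y)            # derivative of the sigmoid at net

# Gradient-descent step on the squared error.
w = [wi + alpha * xi * (t - y) * y_prime for wi, xi in zip(w, x)]
print(w)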

4. Correlation Learning Rule

The correlation learning rule follows a principle similar to the Hebbian learning rule:
if two neighboring neurons are operating in the same phase at the same period of time, the
weight between them should become more positive, and for neurons operating in opposite
phases, the weight between them should become more negative. Unlike the Hebbian rule,
however, the correlation rule is supervised in nature: the targeted response is used for the
calculation of the change in weight.
In mathematical form:
Δw = α · xi · tj
where Δw is the change in weight, α is the learning rate, xi is the input value, and tj is the target value.

5. Out Star Learning Rule

It was introduced by Grossberg and is a supervised training procedure.

The Out Star learning rule is applied when the nodes in a network are arranged in a layer.
Here the weights linked to a particular node should be equal to the targeted outputs for the
nodes connected through those same weights. The weight change is thus calculated as:
Δw = α · (t − y)
where α is the learning rate, y is the actual output, and t is the desired output for the n layer nodes.

6. Competitive Learning Rule

It is also known as the Winner-Takes-All rule and is unsupervised in nature. Here all the
output nodes compete with each other to represent the input pattern; the winner is the node
with the largest output and is given the output 1, while the rest are given 0.
There is a set of neurons with arbitrarily distributed weights, and the activation function is
applied to a subset of neurons. Only one neuron is active at a time. Only the winner's weights
are updated; the rest remain unchanged.
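A minimal sketch of a winner-takes-all update follows; the update form w_winner += α(x − w_winner) is one common choice, and the data here are illustrative assumptions.

# Winner-takes-all: only the unit with the largest response updates its weights.
alpha = 0.2
units = [[0.9, 0.1], [0.2, 0.8]]          # weight vectors of two output units
x = [1.0, 0.0]                            # one input pattern

# Winner = unit with the largest response (dot product with the input).
responses = [sum(wi * xi for wi, xi in zip(w, x)) for w in units]
winner = responses.index(max(responses))

# Move only the winner's weights toward the input; the rest are unchanged.
units[winner] = [wi + alpha * (xi - wi) for wi, xi in zip(units[winner], x)]
print("winner:", winner, "weights:", units)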

2.3 SINGLE LAYER PERCEPTRON CLASSIFIERS


PERCEPTRON
One type of ANN system is based on a unit called a perceptron. A perceptron takes a vector
of real-valued inputs, calculates a linear combination of these inputs, then outputs a 1 if the
result is greater than some threshold and -1 otherwise. More precisely, given inputs x1 through
xn, the output o(x1, …, xn) computed by the perceptron is

o(x1, …, xn) = 1 if w0 + w1x1 + w2x2 + … + wnxn > 0, and -1 otherwise

where each wi is a real-valued constant, or weight, that determines the contribution of input xi
to the perceptron output. Notice the quantity (−w0) is a threshold that
the weighted combination of inputs w1x1 + … + wnxn must surpass in order for
the perceptron to output a 1.

PERCEPTRON CLASSIFIER

A simple binary linear classifier called a perceptron generates predictions based on


the weighted average of the input data. Based on whether the weighted total exceeds a
predetermined threshold, a threshold function determines whether to output a 0 or a 1. One

of the earliest and most basic machine learning methods used for binary classification is the
perceptron. Frank Rosenblatt created it in the late 1950s, and it is a key component of more
intricate neural network topologies.

Single Layer Perceptron


Components of a Perceptron:

1. Input Features (x): Predictions are based on the characteristics or qualities of the input
data, the input features (x). A numerical value is used to represent each feature. In binary
classification, the two classes are commonly represented by the numbers 0 (negative
class) and 1 (positive class).
2. Input Weights (w): Each input feature has a weight (w), which establishes its
significance when formulating predictions. The weights are numerical values as well
and are either initialized to zeros or to small random values.
3. Weighted Sum (z): To calculate the weighted sum, take the dot product of the input
features (x) and their associated weights (w). Mathematically, it is written as z = Σ wi·xi.

4. Activation Function (Step Function): The activation function, which is commonly a
step function, is applied to the weighted sum (z). The step function is used to decide the
perceptron's output. The output is 1 (positive class) if z is greater than or equal to the
threshold, and 0 (negative class) otherwise.

Working of the Perceptron:


1. Initialization: The weights (w) are initially initialized, frequently using tiny random
values or zeros.

2. Prediction: The perceptron calculates the weighted total (z) of the input features
and weights in order to produce a prediction for a particular input.
3. Activation Function: Following the computation of the weighted sum (z), an
activation function is applied. The perceptron outputs 1 (positive class) if z is greater
than or equal to a specific threshold; otherwise, it outputs 0 (negative class), because the
activation function is a step function.
4. Updating Weights: Weights are updated if a misclassification, i.e., an inaccurate prediction,
is made by the perceptron. The weight update is carried out to reduce prediction
error in the future. Typically, the update rule involves shifting the weights in a way
that lowers the error. The perceptron learning rule, which is based on the discrepancy
between the expected and actual class labels, is the most widely used rule.
5. Repeat: Steps 2 through 4 are repeated for each input data point in the training dataset.
This procedure continues until the model converges and accurately categorizes the
training data, which may take a certain number of iterations.

Steps required for classification using Perceptron:


There are various steps involved in performing classification using the Perceptron algorithm
in Scikit-Learn:
1. Data preparation: Preprocess and load your dataset. A training set and a testing set
should be separated.
2. Add Required Libraries: Import Scikit-Learn along with the other necessary libraries.
3. Perceptron Model Construction: Set hyperparameters like the learning rate and
maximum iterations when creating a Perceptron classifier.
4. Training of the Model: Fit the training set of data to the perceptron model.
5. Make predictions: On the basis of the testing data, use the trained model to make
predictions.
6. Model’s performance evaluation: Utilize metrics such as accuracy, precision, recall,
and F1-score to evaluate the model’s performance.
7. Visualize the outcomes (optional): The decision boundary and the data points can be
shown to help you see how the model categorizes cases.
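A minimal sketch of these steps with scikit-learn follows; the synthetic dataset, split ratio, and hyperparameter values are illustrative assumptions.

# Steps 1-6 above, sketched with scikit-learn's Perceptron classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

# 1. Data preparation: a hypothetical classification dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 3. Model construction with a learning rate (eta0) and maximum iterations.
clf = Perceptron(eta0=0.1, max_iter=1000, random_state=42)

# 4. Training.
clf.fit(X_train, y_train)

# 5. Prediction and 6. evaluation.
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))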

STUDY LAB PROGRAM 3 FOR THIS TOPIC


ALGORITHM:
The provided code implements a basic perceptron algorithm. Here's a simplified breakdown:
Initialization: Randomly initializes weights and sets bias to zero.
Forward Pass: Given input data, computes the weighted sum of inputs plus bias.
Activation: Applies a step function (thresholding) to the weighted sum. If the sum is greater
than zero, it outputs 1; otherwise, it outputs 0.
Example Usage: Instantiates the perceptron with two input features and tests it with sample
input data.

This perceptron class serves as a simple binary classifier, but it's limited to linearly separable
problems. For more complex tasks, more advanced neural network architectures are typically
used.

PROGRAM
import tensorflow as tf

class Perceptron(tf.Module):
    def __init__(self, num_inputs):
        super(Perceptron, self).__init__()
        # Weights start as small random values; the bias starts at zero.
        self.weights = tf.Variable(tf.random.normal(shape=(num_inputs, 1)))
        self.bias = tf.Variable(tf.zeros(shape=(1,)))

    def __call__(self, inputs):
        # Forward pass: weighted sum of the inputs plus the bias.
        weighted_sum = tf.matmul(inputs, self.weights) + self.bias
        # Step activation: 1 if the sum is greater than zero, else 0.
        output = tf.where(weighted_sum > 0., 1., 0.)
        return output

# Example usage
num_inputs = 2
perceptron = Perceptron(num_inputs)

# Sample input
input_data = tf.constant([[0., 0.], [0., 1.], [1., 0.], [1., 1.]], dtype=tf.float32)

# Test the perceptron
output = perceptron(input_data)
print("Output:", output.numpy())

OUTPUT:
[[0.]
[0.]
[0.]
[0.]]

2.4 BACKPROPAGATION NETWORK

The BACKPROPAGATION algorithm learns the weights for a multilayer network, given a
network with a fixed set of units and interconnections. It employs gradient descent to attempt
to minimize the squared error between the network output values and the target values for these
outputs. This section presents the BACKPROPAGATION algorithm, and the following section
gives the derivation for the gradient descent weight update rule used by
BACKPROPAGATION.

Because we are considering networks with multiple output units rather than single units
as before, we begin by redefining E to sum the errors over all of the network output units:

E(w) = (1/2) Σd Σk∈outputs (tkd − okd)²

where outputs is the set of output units in the network, and tkd and okd are the target and
output values associated with the kth output unit and training example d.

The learning problem faced by BACKPROPAGATION is to search a large hypothesis


space defined by all possible weight values for all the units in the network. The situation can
be visualized in terms of an error surface similar to that shown for linear units in Figure 4.4.
The error in that diagram is replaced by our new definition of E, and the other dimensions of
the space correspond now to all of the weights associated with all of the units in the network.
As in the case of training a single unit, gradient descent can be used to attempt to find a
hypothesis to minimize E.

BACKPROPAGATION(training_examples, η, nin, nout, nhidden)

Each training example is a pair of the form ⟨x, t⟩, where x is the vector of network input
values, and t is the vector of target network output values. η is the learning rate (e.g., .05).
nin is the number of network inputs, nhidden the number of units in the hidden layer, and nout
the number of output units. The input from unit i into unit j is denoted xji, and the weight from
unit i to unit j is denoted wji.
 Create a feed-forward network with nin inputs, nhidden hidden units, and nout output units.
 Initialize all network weights to small random numbers (e.g., between -.05 and .05).
 Until the termination condition is met, Do
 For each ⟨x, t⟩ in training_examples, Do
Propagate the input forward through the network:
1. Input the instance x to the network and compute the output ou of every unit u in the
network.
Propagate the errors backward through the network:
2. For each network output unit k, calculate its error term δk:
δk ← ok(1 − ok)(tk − ok)     (T4.3)
3. For each hidden unit h, calculate its error term δh:
δh ← oh(1 − oh) Σk∈outputs wkh δk     (T4.4)
4. Update each network weight wji:
wji ← wji + Δwji, where Δwji = η δj xji     (T4.5)

TABLE 4.2 The stochastic gradient descent version of the BACKPROPAGATION algorithm
for feedforward networks containing two layers of sigmoid units.
One major difference in the case of multilayer networks is that the error surface can have
multiple local minima, in contrast to the single-minimum parabolic error surface shown in
Figure 4.4. Unfortunately, this means that gradient descent is guaranteed only to converge
toward some local minimum, and not necessarily the global minimum error. Despite this
obstacle, in practice BACKPROPAGATION has been found to produce excellent results in
many real-world applications. The BACKPROPAGATION algorithm is presented in Table
4.2. The algorithm as described here applies to layered feedforward networks containing two
layers of sigmoid units, with units at each layer connected to all units from the preceding layer.
This is the incremental, or stochastic, gradient descent version of BACKPROPAGATION. The
notation used here is the same as that used in earlier sections, with the following extensions:

An index (e.g., an integer) is assigned to each node in the network, where a "node" is either
an input to the network or the output of some unit in the network.
 xji denotes the input from node i to unit j, and wji denotes the corresponding weight.
 δn denotes the error term associated with unit n. It plays a role analogous to the quantity
(t − o) in our earlier discussion of the delta training rule. As we shall see later, δn = −∂Ed/∂netn.
Notice the algorithm in Table 4.2 begins by constructing a network with the desired number of
hidden and output units and initializing all network weights to small random values. Given this
fixed network structure, the main loop of the algorithm then repeatedly iterates over the training
examples. For each training example, it applies the network to the example, calculates the error
of the network output for this example, computes the gradient with respect to the error on this
example, then updates all weights in the network. This gradient descent step is iterated (often
thousands of times, using the same training examples multiple times) until the network
performs acceptably well.
The gradient descent weight-update rule (Equation (T4.5) in Table 4.2) is similar to the delta
training rule (Equation (4.10)). Like the delta rule, it updates each weight in proportion to the
learning rate η, the input value xji to which the weight is applied, and the error in the output
of the unit. The only difference is that the error (t − o) in the delta rule is replaced by a more
complex error term, δj. The exact form of δj follows from the derivation of the weight-tuning
rule given in Section 4.5.3. To understand it intuitively, first consider
how δk is computed for each network output unit k (Equation (T4.3) in the algorithm). δk is
simply the familiar (tk − ok) from the delta rule, multiplied by
the factor ok(1 − ok), which is the derivative of the sigmoid squashing function.
The δh value for each hidden unit h has a similar form (Equation (T4.4) in the

algorithm). However, since training examples provide target values tk only for network
outputs, no target values are directly available to indicate the error of hidden units' values.
Instead, the error term for hidden unit h is calculated by summing the error terms δk for each
output unit influenced by h, weighting each of the δk's by wkh, the weight from hidden unit h
to output unit k. This weight characterizes the degree to which hidden unit h is "responsible
for" the error in output unit k.
The algorithm in Table 4.2 updates weights incrementally, following the
presentation of each training example. This corresponds to a stochastic approximation to
gradient descent. To obtain the true gradient of E one would sum the δj xji values over all
training examples before altering weight values.

The weight-update loop in BACKPROPAGATION may be iterated thousands of times


in a typical application. A variety of termination conditions can be used to halt the procedure.
One may choose to halt after a fixed number of iterations through the loop, or once the error
on the training examples falls below some threshold, or once the error on a separate validation
set of examples meets some
criterion. The choice of termination criterion is an important one, because too few iterations
can fail to reduce error sufficiently, and too many can lead to overfitting the training data. This
issue is discussed in greater detail in a later section.
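A minimal NumPy sketch of the stochastic-gradient-descent BACKPROPAGATION procedure of Table 4.2 follows, for a two-layer network of sigmoid units. The XOR task, the layer sizes, the learning rate η = 0.5, and the fixed epoch count are illustrative assumptions, not part of the algorithm itself.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 3, 1
eta = 0.5                                   # learning rate (assumed)

# Initialize all weights (and biases) to small random numbers.
W1 = rng.uniform(-0.05, 0.05, (n_hidden, n_in))
b1 = rng.uniform(-0.05, 0.05, n_hidden)
W2 = rng.uniform(-0.05, 0.05, (n_out, n_hidden))
b2 = rng.uniform(-0.05, 0.05, n_out)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

for epoch in range(10000):
    for x, t in zip(X, T):
        # Propagate the input forward through the network.
        o_h = sigmoid(W1 @ x + b1)                 # hidden-unit outputs
        o_k = sigmoid(W2 @ o_h + b2)               # output-unit outputs
        # Propagate the errors backward (Equations T4.3 and T4.4).
        delta_k = o_k * (1 - o_k) * (t - o_k)
        delta_h = o_h * (1 - o_h) * (W2.T @ delta_k)
        # Update each network weight (Equation T4.5): dw_ji = eta * delta_j * x_ji.
        W2 += eta * np.outer(delta_k, o_h); b2 += eta * delta_k
        W1 += eta * np.outer(delta_h, x);   b1 += eta * delta_h

# Network outputs after training (one row per input pattern).
print(sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2[:, None]).T.round(2))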

ADDING MOMENTUM

Because BACKPROPAGATION is such a widely used algorithm, many variations


have been developed. Perhaps the most common is to alter the weight-update rule in Equation
(T4.5) in the algorithm by making the weight update on the nth iteration depend partially on the
update that occurred during the (n − 1)th iteration, as follows:

Δwji(n) = η δj xji + α Δwji(n − 1)

Here Δwji(n) is the weight update performed during the nth iteration through the main
loop of the algorithm, and 0 ≤ α < 1 is a constant called the momentum. Notice the first term
on the right of this equation is just the weight-update rule of Equation (T4.5) in the
BACKPROPAGATION algorithm. The second term on the right is new and is called the
momentum term. To see the effect of this momentum term, consider that the gradient descent
search trajectory is analogous to that of a (momentumless) ball rolling down the error surface.
The effect of α is to add momentum that tends to keep the ball rolling in the same direction
from one iteration to the next. This can sometimes have the effect of keeping the ball rolling
through small local minima in the error surface, or along flat regions in the surface where the
ball would stop if there were no momentum. It also has the effect of gradually increasing the
step size of the search in regions where the gradient is unchanging, thereby speeding
convergence.

4.5.2.2 LEARNING IN ARBITRARY ACYCLIC NETWORKS

The definition of BACKPROPAGATION presented in Table 4.2 applies only to
two-layer networks. However, the algorithm given there easily generalizes to feedforward
networks of arbitrary depth. The weight update rule seen in Equation (T4.5)
is retained, and the only change is to the procedure for computing δ values. In
general, the δr value for a unit r in layer m is computed from the δ values at the
next deeper layer m + 1 according to

δr = or(1 − or) Σs∈layer m+1 wsr δs

Notice this is identical to Step 3 in the algorithm of Table 4.2, so all we are really saying here
is that this step may be repeated for any number of hidden layers in the network.
It is equally straightforward to generalize the algorithm to any directed acyclic graph,
regardless of whether the network units are arranged in uniform layers as we have assumed up
to now. In the case that they are not, the rule for calculating δ for any internal unit (i.e., any
unit that is not an output) is

δr = or(1 − or) Σs∈Downstream(r) wsr δs

where Downstream(r) is the set of units immediately downstream from unit r in the network:
that is, all units whose inputs include the output of unit r. It is this general form of the weight-
update rule that we derive in Section 4.5.3.

4.5.3 DERIVATION OF THE BACKPROPAGATION RULE

This section presents the derivation of the BACKPROPAGATION weight-tuning rule. It
may be skipped on a first reading, without loss of continuity.
The specific problem we address here is deriving the stochastic gradient descent rule
implemented by the algorithm in Table 4.2. Recall from Equation (4.11) that stochastic gradient
descent involves iterating through the training examples one at a time, for each training
example d descending the gradient of the error Ed with respect to this single example. In other
words, for each training example d every weight wji is updated by adding to it Δwji:

Δwji = −η ∂Ed/∂wji     (4.21)

where Ed is the error on training example d, summed over all output units in the network:

Ed = (1/2) Σk∈outputs (tk − ok)²

Here outputs is the set of output units in the network, tk is the target value of unit k for training
example d, and ok is the output of unit k given training example d.
The derivation of the stochastic gradient descent rule is conceptually straightforward, but
requires keeping track of a number of subscripts and variables. We will follow the notation
shown in Figure 4.6, adding a subscript j to denote the jth unit of the network, as follows:
xji = the ith input to unit j
wji = the weight associated with the ith input to unit j
netj = Σi wji xji (the weighted sum of inputs for unit j)
oj = the output computed by unit j
tj = the target output for unit j
σ = the sigmoid function
outputs = the set of units in the final layer of the network
Downstream(j) = the set of units whose immediate inputs include the output of unit j
We now derive an expression for ∂Ed/∂wji in order to implement the stochastic
gradient descent rule seen in Equation (4.21). To begin, notice that weight wji can influence the
rest of the network only through netj. Therefore, we can use the chain rule to write

∂Ed/∂wji = (∂Ed/∂netj)(∂netj/∂wji) = (∂Ed/∂netj) xji     (4.22)

Given Equation (4.22), our remaining task is to derive a convenient expression for ∂Ed/∂netj.
We consider two cases in turn: the case where unit j is an output unit
for the network, and the case where j is an internal unit.
Case 1: Training Rule for Output Unit Weights. Just as wji can influence the
rest of the network only through netj, netj can influence the network only through oj.
Therefore, we can invoke the chain rule again to write

∂Ed/∂netj = (∂Ed/∂oj)(∂oj/∂netj)     (4.23)

To begin, consider just the first term in Equation (4.23). The derivatives
∂/∂oj (1/2) Σk (tk − ok)² will be zero for all output units k except when k = j. We therefore
drop the summation over output units and simply set k = j, giving

∂Ed/∂oj = −(tj − oj)     (4.24)

Next consider the second term in Equation (4.23). Since oj = σ(netj), the
derivative ∂oj/∂netj is just the derivative of the sigmoid function, which we have
already noted is equal to σ(netj)(1 − σ(netj)). Therefore,

∂oj/∂netj = oj(1 − oj)     (4.25)

Substituting expressions (4.24) and (4.25) into (4.23), we obtain

∂Ed/∂netj = −(tj − oj) oj(1 − oj)     (4.26)

and combining this with Equations (4.21) and (4.22), we have the stochastic gradient descent
rule for output units:

Δwji = η (tj − oj) oj(1 − oj) xji     (4.27)

Note this training rule is exactly the weight update rule implemented by Equations (T4.3) and
(T4.5) in the algorithm of Table 4.2. Furthermore, we can see
now that δk in Equation (T4.3) is equal to the quantity −∂Ed/∂netk. In the remainder
of this section we will use δi to denote the quantity −∂Ed/∂neti for an arbitrary unit i.

Case 2: Training Rule for Hidden Unit Weights. In the case where j is an internal, or hidden,
unit in the network, the derivation of the training rule for wji must take into account the indirect
ways in which wji can influence the network outputs and hence Ed. For this reason, we will
find it useful to refer to the set of all units immediately downstream of unit j in the network
(i.e., all units whose direct inputs include the output of unit j). We denote this set of units by
Downstream(j). Notice that netj can influence the network outputs (and therefore Ed) only
through the units in Downstream(j). Therefore, we can write

∂Ed/∂netj = Σk∈Downstream(j) (∂Ed/∂netk)(∂netk/∂netj)
          = Σk∈Downstream(j) −δk (∂netk/∂oj)(∂oj/∂netj)
          = Σk∈Downstream(j) −δk wkj oj(1 − oj)

Rearranging terms and using δj to denote −∂Ed/∂netj, we have

δj = oj(1 − oj) Σk∈Downstream(j) δk wkj

and

Δwji = η δj xji

which is precisely the general rule from Equation (4.20) for updating internal unit weights in
arbitrary acyclic directed graphs. Notice Equation (T4.4) from Table 4.2 is just a special case
of this rule, in which Downstream(j) = outputs.

2.5 GENERALIZED DELTA RULE:

The delta rule in an artificial neural network is a specific kind of backpropagation that assists in
refining the machine learning/artificial intelligence network, making associations between
inputs and outputs across layers of artificial neurons. The delta rule is also called the delta
learning rule.

Generally, backpropagation has to do with recalculating input weights for artificial neurons
utilizing a gradient technique. Delta learning does this by using the difference between a target
activation and an obtained activation. By using a linear activation function, network
connections are balanced. Another approach to explain the Delta rule is that it uses an error
function to perform gradient descent learning.

The delta rule compares the actual output with a target output; the technology tries to discover
a match, and the program makes changes accordingly. The actual execution of the delta rule
will vary depending on the network and its composition. Still, by applying a linear activation
function, the delta rule can be useful in refining certain kinds of neural networks with specific
kinds of backpropagation.

The delta rule was introduced by Widrow and Hoff, and it is one of the most significant
learning rules that depend on supervised learning.

This rule states that the change in the weight of a node is equivalent to the product of error and
the input.

Mathematical equation:

The following equations give the mathematical form of the delta learning rule:

∆w = µ · x · z

∆w = µ · (t − y) · x

Here,

∆w = weight change,

µ = the constant and positive learning rate,

x = the input value from the pre-synaptic neuron,

z = (t − y), the difference between the desired output t and the actual output y. The above-
mentioned mathematical rule can be used only for a single output unit.

The different weight updates can be determined with respect to these two cases:

Case 1 - When t ≠ y:

w(new) = w(old) + ∆w

Case 2 - When t = y:

No change in weight

For a given input vector, the output vector is compared with the target (correct) answer. If the
difference is zero, no learning takes place; otherwise, the weights are adjusted to reduce this
difference. If the set of input patterns is drawn from an independent set, then arbitrary
associations can be learned using the delta learning rule. The rule has been examined for
networks with linear activation functions and no hidden units. The plot of error squared
versus weight is a paraboloid in n-space. Since the proportionality constant is negative, the
graph of such a function is concave upward and has a minimum value. The vertex of the
paraboloid represents the point of minimum error, and the weight vector corresponding to this
point is the ideal weight vector. We can utilize the delta learning rule with both a single output
unit and several output units. Applying the delta learning rule to diminish the difference
between the actual and expected output amounts to finding and reducing an error.

2.6 ASSOCIATIVE MEMORY

An associative memory network refers to a content-addressable memory structure that associates
a relationship between a set of input patterns and output patterns. A content-addressable
memory structure is a kind of memory structure that enables the recollection of data based on
the degree of similarity between the input pattern and the patterns stored in the memory.

Let's understand this concept with an example:

The figure given below illustrates a memory containing the names of various people. If the
given memory is content addressable, an incorrect key string such as "Albert Einstien" is
sufficient to recover the correct name "Albert Einstein."

In this sense, this type of memory is robust and fault-tolerant, because this kind of memory
model has some form of error-correction capability.

Note: An associative memory is accessed by its content, as opposed to an explicit address as in
a traditional computer memory system. The memory enables the recollection of information
based on incomplete knowledge of its contents.

There are two types of associative memory: auto-associative memory and hetero-associative
memory.

Auto-associative memory:

An auto-associative memory recovers a previously stored pattern that most closely relates to
the current pattern. It is also known as an auto-associative correlator.

Consider x[1], x[2], x[3], …, x[M] to be the stored pattern vectors, and let x[m] be one
element of these vectors, representing characteristics obtained from the patterns. The auto-
associative memory will return the pattern vector x[m] when presented with a noisy or
incomplete version of x[m].

Hetero-associative memory:

In a hetero-associative memory, the recovered pattern is generally different from the input
pattern, not only in type and format but also in content. It is also known as a hetero-
associative correlator.

Consider a number of key-response pairs {a(1), x(1)}, {a(2), x(2)}, …, {a(M), x(M)}.
The hetero-associative memory will return the pattern vector x(m) when a noisy or incomplete
version of a(m) is given.

Neural networks are usually used to implement these associative memory models, called neural
associative memory (NAM). The linear associator is the simplest artificial neural associative
memory.

These models follow distinct neural network architecture to memorize data.

Working of Associative Memory:

Associative memory is a depository of associated patterns in some form. If the depository is
triggered with a pattern, the associated pattern pair appears at the output. The input could be
an exact or partial representation of a stored pattern.

If the memory is presented with an input pattern, say α, the associated pattern ω is
recovered automatically.

These are the terms which are related to the Associative memory network:

Encoding or memorization:

Encoding or memorization refers to building an associative memory. It implies constructing an
association weight matrix W such that when an input pattern is given, the stored pattern
connected with the input pattern is recovered.

(Wij)k = (pi)k (qj)k

Where,

(Pi)k represents the ith component of pattern pk, and

(qj)k represents the jth component of pattern qk

Where,

i = 1, 2, …, m and j = 1, 2, …, n.

Constructing the association weight matrix W is accomplished by adding the individual
correlation matrices Wk, i.e.,

W = α Σ(k=1 to M) Wk

where α = a constructing (proportionality) constant.
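A minimal NumPy sketch of this encoding step follows, storing bipolar pattern pairs as a sum of outer products and recalling with a sign threshold; the patterns and the choice α = 1 are illustrative assumptions.

import numpy as np

# Hypothetical bipolar (+1/-1) key-response pattern pairs.
P = np.array([[1, -1, 1], [-1, 1, 1]])     # input patterns p_k
Q = np.array([[1, 1], [-1, 1]])            # associated output patterns q_k

# Encoding: W = alpha * sum_k outer(p_k, q_k), with alpha = 1 assumed.
W = sum(np.outer(p, q) for p, q in zip(P, Q))

# Recall: present a key pattern and threshold the weighted sum.
recalled = np.sign(P[0] @ W)
print(recalled)                            # recovers Q[0] for these patterns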

Errors and noise:

The input pattern may contain errors and noise, or may be an incomplete version of some
previously encoded pattern. If a corrupted input pattern is presented, the network will recover
the stored pattern that is closest to the actual input pattern. The existence of noise or errors
results only in a relative decrease rather than a total degradation in the efficiency of the
network. Thus, associative memories are robust and fault-tolerant because they involve many
processing units performing highly parallel and distributed computations.

Performance Measures:

The measures of associative memory performance with respect to correct recovery are memory
capacity and content-addressability. Memory capacity can be defined as the maximum
number of associated pattern pairs that can be stored and correctly recovered. Content-
addressability refers to the ability of the network to recover the correct stored pattern.

If input patterns are mutually orthogonal, perfect recovery is possible. If stored input patterns
are not mutually orthogonal, non-perfect recovery can happen due to intersection among the
patterns.

Associative memory models:

The linear associator is the simplest and most widely used associative memory model. It is a
collection of simple processing units which together have a quite complex collective
computational capability and behavior. The Hopfield model computes its output recursively
in time until the system becomes stable. Hopfield networks are constructed using bipolar units
and a learning procedure. The Hopfield model is an auto-associative memory proposed by John
Hopfield in 1982. Bidirectional Associative Memory (BAM) and the Hopfield model are
some other popular artificial neural network models used as associative memories.

Network architectures of Associate Memory Models:

The neural associative memory models pursue various neural network architectures to
memorize data. The network comprises either a single layer or two layers. The linear associator

model refers to a feed-forward type network, comprises of two layers of different processing
units- The first layer serving as the input layer while the other layer as an output layer. The
Hopfield model refers to a single layer of processing elements where each unit is associated
with every other unit in the given network. The bidirectional associative memory
(BAM) model is the same as the linear associator, but the associations are bidirectional.

The neural network architectures of these given models and the structure of the
corresponding association weight matrix w of the associative memory are depicted.

Linear Associator model (two layers):

The linear associator model is a feed-forward type network where produced output is
in the form of single feed-forward computation. The model comprises of two layers of
processing units, one work as an input layer while the other work as an output layer. The input
is directly associated with the outputs, through a series of weights. The connections carrying
weights link each input to every output. The addition of the products of the weights and the
input is determined in each neuron node. The architecture of the linear associator is given
below.

All p input units are connected to all q output units via the association weight matrix

W = [wij]p×q

where wij describes the strength of the unidirectional association of the ith input unit to the
jth output unit.

The connection weight matrix stores the z different associated pattern pairs {(Xk,Yk); k=
1,2,3,…,z}. Constructing an associative memory is building the connection weight
matrix w such that if an input pattern is presented, the stored pattern associated with the input
pattern is recovered.

2.7 ADAPTIVE RESONANCE THEORY

The Adaptive Resonance Theory (ART) was incorporated as a hypothesis for human
cognitive data handling. The hypothesis has prompted neural models for pattern recognition
and unsupervised learning. ART system has been utilized to clarify different types of cognitive
and brain data.

The Adaptive Resonance Theory addresses the stability-plasticity dilemma (stability can be
defined as the nature of memorizing the learning, and plasticity refers to the fact that the
system is flexible enough to gain new information): how can learning proceed in response to
huge input patterns without losing stability for irrelevant patterns? In other words, the
stability-plasticity dilemma is concerned with how a system can adapt to new data while
keeping what it learned before. For such a task, a feedback mechanism is included among the
ART neural network layers. In this neural network, the data, in the form of processing-element
outputs, is reflected back and forth between layers. If an appropriate pattern is built up,
resonance is reached, and adaptation can occur during this period.

It can be defined as the formal analysis of how to overcome the learning instability
experienced by a competitive learning model, which led to the presentation of an expanded
hypothesis, called adaptive resonance theory (ART). This formal investigation indicated that
a specific type of top-down learned feedback and matching mechanism could significantly
overcome the instability issue. It was understood that top-down attentional mechanisms, which
had prior been found through an investigation of connections among cognitive and
reinforcement mechanisms, had similar characteristics as these code-stabilizing mechanisms.
In other words, once it was perceived how to solve the instability issue formally, it also turned
out to be certain that one did not need to develop any quantitatively new mechanism to do so.
One only needed to make sure to incorporate previously discovered attentional mechanisms.
These additional mechanisms empower code learning to self-stabilize in response to an
essentially arbitrary input system. Grossberg presented the basic principles of the adaptive
resonance theory. A category of ART called ART1 has been described as an arrangement of
ordinary differential equations by Carpenter and Grossberg. These theorems can predict both
the order of search as a function of the learning history of the system and the input patterns.

ART1 is an unsupervised learning model primarily designed for recognizing binary


patterns. It comprises an attentional subsystem, an orienting subsystem, a vigilance parameter,
and a reset module, as shown in the figure below. The vigilance parameter has a huge
effect on the system: higher vigilance produces more detailed memories. The ART1 attentional subsystem
comprises two competitive networks, the comparison field layer L1 and the recognition field
layer L2, two control gains, Gain1 and Gain2, and two short-term memory (STM) stages, S1
and S2. Long-term memory (LTM) traces between S1 and S2 multiply the signal in these
pathways.

Gain control enables L1 and L2 to distinguish the current stage of the running cycle.
The STM reset wave inhibits active L2 cells when mismatches between bottom-up and top-down
signals occur at L1. The comparison layer receives the binary external input and passes it to the
recognition layer, which is responsible for matching it to a classification category. This outcome
is passed back to the comparison layer to determine whether the category matches the input vector.

If there is a match, then a new input vector is read, and the cycle begins once again. If
there is a mismatch, then the orienting system is in charge of preventing the previous category
from obtaining a new category match in the recognition layer. The two gains control the
activity of the recognition and the comparison layer, respectively. The reset wave specifically
and enduringly inhibits the active L2 cell until the current input is stopped. The offset of the
input pattern ends its processing at L1 and triggers the offset of Gain2. The Gain2 offset causes
a consistent decay of STM at L2 and thereby prepares L2 to encode the next input pattern
without bias.

ART1 Implementation process:

ART1 is a self-organizing neural network having input and output neurons mutually coupled
using bottom-up and top-down adaptive weights that perform recognition. To start, the
system is first trained as per the adaptive resonance theory by inputting reference pattern data,
in the form of a 5×5 matrix, into the neurons for clustering within the output neurons. Next,
the maximum number of nodes in L2 is defined, followed by the vigilance parameter. The
input pattern registers itself as short-term memory activity over a field of nodes L1. Combining
and separating pathways from L1 to the coding field L2, each weighted by an adaptive long-term
memory trace, transform the input into a net signal vector T. Internal competitive dynamics at
L2 further transform T, creating a compressed code, or content-addressable memory. With
strong competition, activation is concentrated at the L2 node that receives the maximal
L1 → L2 signal. The process is divided into four phases: comparison, recognition, search, and
learning.

Advantages of Adaptive Resonance Theory (ART):

It can be coordinated and utilized with different techniques to give more precise outcomes.

It can be used in different fields such as face recognition, embedded systems, robotics, target
recognition, medical diagnosis, signature verification, etc.

It shows stability and is not disturbed by the wide range of inputs provided to it.

It has benefits over competitive learning: competitive learning cannot include new clusters
when considered necessary.

Application of ART:

ART stands for Adaptive Resonance Theory. ART neural networks, used for fast, stable
learning and prediction, have been applied in many areas. Applications include target
recognition, face recognition, medical diagnosis, signature verification, and mobile robot control.

Target recognition:

A fuzzy ARTMAP neural network can be used for the automatic classification of targets
based on their radar range profiles. Tests on synthetic data show that fuzzy ARTMAP can
yield substantial savings in memory requirements compared with k-nearest-neighbor (kNN)
classifiers. The utilization of multiwavelength profiles mainly improves the performance of
both kinds of classifiers.

Medical diagnosis:

Medical databases present many of the challenges found in general information
management settings, where speed, usability, efficiency, and accuracy are the prime concerns.
A direct objective of improved computer-assisted medicine is to help deliver intensive care in
situations that may be less than ideal. Working on these issues has stimulated several ART
architecture developments, including ARTMAP-IC.

Signature verification:

Automatic signature verification is a well-known and active area of research with
various applications such as bank check confirmation and ATM access. The training of the
network is done using ART1, which uses global features as the input vector; the verification
and recognition phase uses a two-step process. In the first step, the input vector is matched
against the stored reference vector, which was used as the training set, and in the second step,
cluster formation takes place.

Mobile robot control:

Nowadays, we encounter a wide range of robotic devices. Their programming, the
artificial intelligence part, is still a field of research. The human brain is an interesting model
for such an intelligent system. Inspired by the structure of the human brain, the artificial
neural network emerged. Like the brain, an artificial neural network contains numerous simple
computational units, neurons, that are interconnected to allow signals to pass from neuron to
neuron. Artificial neural networks are used to solve different problems, with good outcomes
compared to other decision algorithms.

Limitations of ART:

Some ART networks are inconsistent, as their results depend on the order of the training data
or on the learning rate.

ART does not guarantee stability in forming clusters.

UNIT -3

FUZZY LOGIC SYSTEM


3.1 BASICS OF FUZZY LOGIC THEORY

The term fuzzy refers to things that are not clear or are vague. In the real world we often
encounter situations where we cannot determine whether a state is true or false; there, fuzzy
logic provides very valuable flexibility for reasoning. In this way, we can account for the
inaccuracies and uncertainties of any situation.

Fuzzy Logic is a form of many-valued logic in which the truth values of variables
may be any real number between 0 and 1, instead of just the traditional values of true or false.
It is used to deal with imprecise or uncertain information and is a mathematical method for
representing vagueness and uncertainty in decision-making.

Fuzzy Logic is based on the idea that in many cases, the concept of true or false is too
restrictive, and that there are many shades of gray in between. It allows for partial truths,
where a statement can be partially true or false, rather than fully true or false.

Fuzzy Logic is used in a wide range of applications, such as control systems, image
processing, natural language processing, medical diagnosis, and artificial intelligence.

The fundamental concept of Fuzzy Logic is the membership function, which defines the
degree of membership of an input value to a certain set or category. The membership function
is a mapping from an input value to a membership degree between 0 and 1, where 0 represents
non-membership and 1 represents full membership.
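As an illustration, a triangular membership function maps a crisp input to a degree between 0 and 1; the following minimal sketch assumes a triangle with feet at a and c and peak at b (the names and numbers are illustrative):

    def triangular(x, a, b, c):
        """Degree of membership of x in a triangular fuzzy set with
        feet at a and c and peak at b (a <= b <= c)."""
        if x <= a or x >= c:
            return 0.0
        if x == b:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)

    # e.g. membership of 22 degrees in a "warm" set spanning 15..30 with peak 25
    print(triangular(22, 15, 25, 30))   # 0.7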

Fuzzy Logic is implemented using Fuzzy Rules, which are if-then statements that
express the relationship between input variables and output variables in a fuzzy way. The
output of a Fuzzy Logic system is a fuzzy set, which is a set of membership degrees for each
possible output value.

In summary, Fuzzy Logic is a mathematical method for representing vagueness and
uncertainty in decision-making; it allows for partial truths, and it is used in a wide range of
applications. It is based on the concept of the membership function, and its implementation is
done using fuzzy rules.

In a Boolean system, the truth value 1.0 represents absolute truth and 0.0 represents
absolute falsehood. A fuzzy system is not limited to these absolute values: in fuzzy logic there
are also intermediate values, which are partially true and partially false.

ARCHITECTURE

Its Architecture contains four parts :

 RULE BASE: It contains the set of rules and the IF-THEN conditions provided by the
experts to govern the decision-making system, on the basis of linguistic information.
Recent developments in fuzzy theory offer several effective methods for the design and
tuning of fuzzy controllers. Most of these developments reduce the number of fuzzy rules.

 FUZZIFICATION: It is used to convert inputs, i.e., crisp numbers, into fuzzy sets. Crisp
inputs are the exact inputs measured by sensors and passed into the control system for
processing, such as temperature, pressure, and rpm.

 INFERENCE ENGINE: It determines the matching degree of the current fuzzy input with
respect to each rule and decides which rules are to be fired according to the input field.
Next, the fired rules are combined to form the control actions.

 DEFUZZIFICATION: It is used to convert the fuzzy sets obtained by the inference
engine into a crisp value. There are several defuzzification methods available, and the
best-suited one is used with a specific expert system to reduce the error.
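A minimal sketch of how these four parts fit together for a one-input temperature controller; the membership ramps and the two rules are made-up illustrations, not a prescribed design:

    # Illustrative wiring of the four blocks above for one input, one output.

    def fuzzify(temp):
        """FUZZIFICATION: crisp temperature -> degrees of membership."""
        cold = max(0.0, min(1.0, (20.0 - temp) / 10.0))
        hot = max(0.0, min(1.0, (temp - 20.0) / 10.0))
        return {"cold": cold, "hot": hot}

    RULES = [  # RULE BASE: (antecedent label, consequent crisp setpoint)
        ("cold", 90.0),   # IF temperature is cold THEN heater power is high
        ("hot", 10.0),    # IF temperature is hot  THEN heater power is low
    ]

    def infer_and_defuzzify(memberships):
        """INFERENCE ENGINE + DEFUZZIFICATION: fire each rule with the matching
        degree of its antecedent, then take the weighted average of consequents."""
        num = sum(memberships[label] * out for label, out in RULES)
        den = sum(memberships[label] for label, out in RULES)
        return num / den if den else 0.0

    print(infer_and_defuzzify(fuzzify(14.0)))  # mostly cold -> high heater power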

Membership function

Definition: A membership function is a graph that defines how each point in the input space is
mapped to a membership value between 0 and 1. The input space is often referred to as the
universe of discourse or universal set (U), which contains all the possible elements of concern
in each particular application.

There are largely three types of fuzzifiers:

 Singleton fuzzifier

 Gaussian fuzzifier

 Trapezoidal or triangular fuzzifier

What is Fuzzy Control?

 It is a technique to embody human-like thinking into a control system.

 It may not be designed to give accurate reasoning but it is designed to give acceptable
reasoning.

 It can emulate human deductive thinking, that is, the process people use to infer
conclusions from what they know.

 Any uncertainties can be easily dealt with the help of fuzzy logic.

Advantages of Fuzzy Logic System

 This system can work with any type of input, whether it is imprecise, distorted, or noisy
information.

 The construction of Fuzzy Logic Systems is easy and understandable.

 Fuzzy logic comes with mathematical concepts of set theory and the reasoning of that is
quite simple.

 It provides a very efficient solution to complex problems in all fields of life as it resembles
human reasoning and decision-making.

 The algorithms can be described with little data, so little memory is required.

Disadvantages of Fuzzy Logic Systems

 Many researchers proposed different ways to solve a given problem through fuzzy logic
which leads to ambiguity. There is no systematic approach to solve a given problem
through fuzzy logic.

 Proof of its characteristics is difficult or impossible in most cases because every time we
do not get a mathematical description of our approach.

 As fuzzy logic works on precise as well as imprecise data, accuracy is often compromised.

Application

 It is used in the aerospace field for altitude control of spacecraft and satellites.

 It has been used in the automotive system for speed control, traffic control.

 It is used for decision-making support systems and personal evaluation in the large
company business.

 It has application in the chemical industry for controlling the pH, drying, chemical
distillation process.

 Fuzzy logic is used in Natural language processing and various intensive applications in
Artificial Intelligence.

 Fuzzy logic is extensively used in modern control systems such as expert systems.

 Fuzzy Logic is used with Neural Networks as it mimics how a person would make
decisions, only much faster. This is done by aggregating data and changing it into more
meaningful data by forming partial truths as fuzzy sets.

3.2 CRISP SETS AND FUZZY SETS

Set: A set is defined as a collection of objects, which share certain characteristics.

Classical set

1. Classical set is a collection of distinct objects. For example, a set of students passing
grades.

2. Each individual entity in a set is called a member or an element of the set.

3. The classical set is defined in such a way that the universe of discourse is split into two
groups, members and non-members; hence, in classical sets, no partial membership exists.

4. Let A be a given set. The membership (characteristic) function used to define set A is given by

χA(x) = 1 if x ∈ A, and χA(x) = 0 if x ∉ A.

5. Operations on classical sets: for two sets A and B on universe X,

 Union: A ∪ B = {x | x ∈ A or x ∈ B}. This operation is also called logical OR.

 Intersection: A ∩ B = {x | x ∈ A and x ∈ B}. This operation is also called logical AND.

 Complement: A̅ = {x | x ∉ A, x ∈ X}.

 Difference: A − B = {x | x ∈ A and x ∉ B}.

6. Properties of classical sets: for two sets A and B on universe X,

 Commutativity: A ∪ B = B ∪ A and A ∩ B = B ∩ A.

 Associativity: A ∪ (B ∪ C) = (A ∪ B) ∪ C and A ∩ (B ∩ C) = (A ∩ B) ∩ C.

 Distributivity: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) and A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

 Idempotency: A ∪ A = A and A ∩ A = A.

 Identity: A ∪ ∅ = A, A ∩ X = A, A ∩ ∅ = ∅, and A ∪ X = X.

 Transitivity: if A ⊆ B and B ⊆ C, then A ⊆ C.

Fuzzy set:

1. Fuzzy set is a set having degrees of membership between 1 and 0. Fuzzy sets are
represented with tilde character(~). For example, Number of cars following traffic signals
at a particular time out of all cars present will have membership value between [0,1].

2. Partial membership exists when member of one fuzzy set can also be a part of other fuzzy
sets in the same universe.

3. The degree of membership or truth is not same as probability, fuzzy truth represents
membership in vaguely defined sets.

4. A fuzzy set A~ in the universe of discourse, U, can be defined as a set of ordered pairs,

A~ = {(x, μA~(x)) | x ∈ U},

where μA~(x) is the degree of membership of x in A~.

5. When the universe of discourse, U, is discrete and finite, fuzzy set A~ is commonly written as

A~ = μA~(x1)/x1 + μA~(x2)/x2 + ... = Σi μA~(xi)/xi,

where the summation and division signs denote collection and association, not arithmetic.

6. Fuzzy sets also satisfy every property of classical sets.

7. Common operations on fuzzy sets: given two fuzzy sets A~ and B~,

 Union: fuzzy set C~ is the union of fuzzy sets A~ and B~: μC~(x) = max(μA~(x), μB~(x)).

 Intersection: fuzzy set D~ is the intersection of fuzzy sets A~ and B~: μD~(x) = min(μA~(x), μB~(x)).

 Complement: fuzzy set E~ is the complement of fuzzy set A~: μE~(x) = 1 − μA~(x).

8. Some other useful operations on fuzzy sets:

 Algebraic sum: μA~+B~(x) = μA~(x) + μB~(x) − μA~(x) · μB~(x).

 Algebraic product: μA~·B~(x) = μA~(x) · μB~(x).

 Bounded sum: μA~⊕B~(x) = min(1, μA~(x) + μB~(x)).

 Bounded difference: μA~⊖B~(x) = max(0, μA~(x) + μB~(x) − 1).
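These operations are easy to verify on discrete membership vectors; here is a minimal NumPy sketch with illustrative μ values over a four-element universe:

    import numpy as np

    mu_a = np.array([0.2, 0.7, 1.0, 0.4])   # membership values of A~ over a finite universe
    mu_b = np.array([0.5, 0.3, 0.8, 0.9])   # membership values of B~

    union        = np.maximum(mu_a, mu_b)        # mu_C = max(mu_A, mu_B)
    intersection = np.minimum(mu_a, mu_b)        # mu_D = min(mu_A, mu_B)
    complement_a = 1.0 - mu_a                    # mu_E = 1 - mu_A
    alg_sum      = mu_a + mu_b - mu_a * mu_b     # algebraic sum
    alg_product  = mu_a * mu_b                   # algebraic product
    bounded_sum  = np.minimum(1.0, mu_a + mu_b)  # bounded sum
    bounded_diff = np.maximum(0.0, mu_a + mu_b - 1.0)  # bounded difference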

Difference Between Crisp Set and Fuzzy Set

Crisp Set: A crisp set is a collection of distinct, well-defined objects. If X is a crisp set defined
over the universal set U, then for any element of U there are only two possibilities: either the
element comes from set X, or it does not.

Fuzzy Set: A fuzzy set is a collection of elements having varying degrees of membership in the
set. The word "fuzzy" indicates vagueness; the gradation among degrees of membership
reflects the vagueness and ambiguity of a fuzzy set. The membership of each element of the
universe in the set is therefore measured by a function that captures this uncertainty and
ambiguity.

S.No | Crisp Set | Fuzzy Set

1 | A crisp set defines the value as either 0 or 1. | A fuzzy set defines the value between 0 and 1, including both 0 and 1.

2 | It is also called a classical set. | It specifies the degree to which something is true.

3 | It shows full membership. | It shows partial membership.

4 | Eg1. She is 18 years old. Eg2. Rahul is 1.6 m tall. | Eg1. She is about 18 years old. Eg2. Rahul is about 1.6 m tall.

5 | Crisp sets are used in digital design. | Fuzzy sets are used in fuzzy controllers.

6 | It is a bi-valued function logic. | It is an infinite-valued function logic.

7 | Full membership means totally true/false, yes/no, 0/1. | Partial membership means true to false, yes to no, 0 to 1.

(Figures: graphical representation of a crisp set and of a fuzzy set.)

Fuzzy sets follow the same properties as crisp sets; because of this, and because the
membership values of a crisp set are a subset of the interval [0, 1], classical sets can be viewed
as a special case of fuzzy sets.

3.3 BASIC SET OF OPERATIONS

3.3.1 UNION SET

The union of two sets is a set containing all elements that are in A or in B (possibly both).
For example, {1, 2} ∪ {2, 3} = {1, 2, 3}. Thus, we can write x ∈ (A ∪ B) if and only if
(x ∈ A) or (x ∈ B). Note that A ∪ B = B ∪ A. In Figure 1.4, the union of sets A and B is
shown by the shaded area in the Venn diagram.

Fig. 1.4 - The shaded area shows the set A ∪ B.

Similarly, we can define the union of three or more sets. In particular, if A1, A2, A3, ..., An
are n sets, their union A1 ∪ A2 ∪ A3 ∪ ... ∪ An is a set containing all elements that are in at
least one of the sets. We can write this union more compactly as

⋃_{i=1}^{n} Ai.

For example, if A1 = {a, b, c}, A2 = {c, h}, A3 = {a, d}, then ⋃i Ai = A1 ∪ A2 ∪ A3 =
{a, b, c, h, d}. We can similarly define the union of infinitely many sets A1 ∪ A2 ∪ A3 ∪ ⋯.

3.4 INTERSECTION SET

The intersection of two sets A and B, denoted by A ∩ B, consists of all elements that are in
both A and B. For example, {1, 2} ∩ {2, 3} = {2}. In Figure 1.5, the intersection of sets A
and B is shown by the shaded area in a Venn diagram.

Fig. 1.5 - The shaded area shows the set A ∩ B.

More generally, for sets A1, A2, A3, ⋯, their intersection ⋂i Ai is defined as the set consisting
of the elements that are in all Ai. Figure 1.6 shows the intersection of three sets.

Fig. 1.6 - The shaded area shows the set A ∩ B ∩ C.

3.5 COMPLEMENT SET

The complement of a set A, denoted by Aᶜ or A̅, is the set of all elements that are in the
universal set S but are not in A. In Figure 1.7, A̅ is shown by the shaded area in a Venn
diagram.

3.6 T-NORM

In mathematics, a t-norm (also T-norm or, unabbreviated, triangular norm) is a kind of
binary operation used in the framework of probabilistic metric spaces and in multi-valued
logic, specifically in fuzzy logic. A t-norm generalizes intersection in a lattice and conjunction
in logic. The name triangular norm refers to the fact that in the framework of probabilistic
metric spaces, t-norms are used to generalize the triangle inequality of ordinary metric spaces.

Definition

A t-norm is a function T: [0, 1] × [0, 1] → [0, 1] that satisfies the following properties:

 Commutativity: T(a, b) = T(b, a)

 Monotonicity: T(a, b) ≤ T(c, d) if a ≤ c and b ≤ d

 Associativity: T(a, T(b, c)) = T(T(a, b), c)

 The number 1 acts as identity element: T(a, 1) = a

Since a t-norm is a binary algebraic operation on the interval [0, 1], infix algebraic notation is
also common, with the t-norm usually denoted by ∗.

The defining conditions of the t-norm are exactly those of a partially ordered abelian
monoid on the real unit interval [0, 1]. (Cf. ordered group.) The monoidal operation of any
partially ordered abelian monoid L is therefore by some authors called a triangular norm on L.
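For concreteness, the four prototypical t-norms from the literature can be written pointwise as below; this is a minimal sketch, and the assertion checks only the identity property T(a, 1) = a:

    # Each function satisfies commutativity, monotonicity, associativity,
    # and T(a, 1) = a, so each is a t-norm.

    def t_min(a, b):          # minimum t-norm
        return min(a, b)

    def t_product(a, b):      # product t-norm
        return a * b

    def t_lukasiewicz(a, b):  # Lukasiewicz t-norm
        return max(0.0, a + b - 1.0)

    def t_drastic(a, b):      # drastic t-norm: the pointwise smallest t-norm
        if a == 1.0:
            return b
        if b == 1.0:
            return a
        return 0.0

    for T in (t_min, t_product, t_lukasiewicz, t_drastic):
        assert abs(T(0.6, 1.0) - 0.6) < 1e-12   # 1 is the identity element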

Classification of t-norms

A t-norm is called continuous if it is continuous as a function, in the usual interval topology on
[0, 1]². (Similarly for left- and right-continuity.)

A t-norm is called strict if it is continuous and strictly monotone.

A t-norm is called nilpotent if it is continuous and each x in the open interval (0, 1) is nilpotent,
that is, there is a natural number n such that x ∗ ⋯ ∗ x (n times) equals 0.

A t-norm is called Archimedean if it has the Archimedean property, that is, if for
each x, y in the open interval (0, 1) there is a natural number n such
that x ∗ ⋯ ∗ x (n times) is less than or equal to y.

The usual partial ordering of t-norms is pointwise, that is,

T1 ≤ T2 if T1(a, b) ≤ T2(a, b) for all a, b in [0, 1].

As functions, pointwise larger t-norms are sometimes called stronger than those pointwise
smaller. In the semantics of fuzzy logic, however, the larger a t-norm, the weaker (in terms
of logical strength) conjunction it represents.

Properties of t-norms

The drastic t-norm is the pointwise smallest t-norm and the minimum is the pointwise largest
t-norm:

T_D(a, b) ≤ T(a, b) ≤ min(a, b) for any t-norm T and all a, b in [0, 1].

For every t-norm T, the number 0 acts as null element: T(a, 0) = 0 for all a in [0, 1].

A t-norm T has zero divisors if and only if it has nilpotent elements; each nilpotent element
of T is also a zero divisor of T. The set of all nilpotent elements is an interval [0, a] or
[0, a), for some a in [0, 1].

3.7 PROPERTIES OF CONTINUOUS T-NORMS (T-CONORM)

Although real functions of two variables can be continuous in each variable without being
continuous on [0, 1]², this is not the case with t-norms: a t-norm T is continuous if and only if
it is continuous in one variable, i.e., if and only if the functions fy(x) = T(x, y) are continuous
for each y in [0, 1]. Analogous theorems hold for left- and right-continuity of a t-norm.

A continuous t-norm is Archimedean if and only if 0 and 1 are its only idempotents.

A continuous Archimedean t-norm is strict if 0 is its only nilpotent element; otherwise it is
nilpotent. By definition, moreover, a continuous Archimedean t-norm T is nilpotent if and only
if each x < 1 is a nilpotent element of T.

Thus with a continuous Archimedean t-norm T, either all or none of the elements of
(0, 1) are nilpotent. If all elements in (0, 1) are nilpotent, then the t-norm is isomorphic to the
Łukasiewicz t-norm; i.e., there is a strictly increasing function f such that
T(x, y) = f⁻¹(max(0, f(x) + f(y) − 1)). If, on the other hand, there are no nilpotent elements
of T, the t-norm is isomorphic to the product t-norm. In other words, all nilpotent t-norms are
isomorphic, the Łukasiewicz t-norm being their prototypical representative; and all strict
t-norms are isomorphic, with the product t-norm as their prototypical example. The Łukasiewicz
t-norm is itself isomorphic to the product t-norm undercut at 0.25, i.e., to the function
p(x, y) = max(0.25, x · y) on [0.25, 1]².

For each continuous t-norm, the set of its idempotents is a closed subset of [0, 1]. Its
complement—the set of all elements that are not idempotent—is therefore a union of
countably many non-overlapping open intervals. The restriction of the t-norm to any of
these intervals (including its endpoints) is Archimedean, and thus isomorphic either to the
Łukasiewicz t-norm or the product t-norm. For such x, y that do not fall into the same open
interval of non-idempotents, the t-norm evaluates to the minimum of x and y. These
conditions actually give a characterization of continuous t-norms, called the Mostert–
Shields theorem, since every continuous t-norm can in this way be decomposed, and the
described construction always yields a continuous t-norm. The theorem can also be
formulated as follows:

A t-norm is continuous if and only if it is isomorphic to an ordinal sum of the
minimum, Łukasiewicz, and product t-norms. A similar characterization theorem for
non-continuous t-norms is not known (not even for left-continuous ones); only some
non-exhaustive methods for the construction of t-norms have been found.

3.8 FUZZY RELATIONS

Fuzzy relations also map elements of one universe, say X, to those of another universe, say Y,
through the Cartesian product of the two universes. However, the "strength" of the relation
between ordered pairs of the two universes is not measured with the characteristic function,
but rather with a membership function expressing various "degrees" of strength of the
relation on the unit interval [0, 1]. Hence, a fuzzy relation R~ is a mapping from the Cartesian
space X × Y to the interval [0, 1], where the strength of the mapping is expressed by the
membership function of the relation for ordered pairs from the two universes, μR~(x, y).

Cardinality of Fuzzy Relations

Since the cardinality of fuzzy sets on any universe is infinity, the cardinality of a fuzzy
relation between two or more universes is also infinity.

Operations on Fuzzy Relations

Let R~ and S~ be fuzzy relations on the Cartesian space X × Y. Then the following operations
apply for the membership values of the various set operations:

Union: μR~∪S~(x, y) = max(μR~(x, y), μS~(x, y)).

Intersection: μR~∩S~(x, y) = min(μR~(x, y), μS~(x, y)).

Complement: μR~ᶜ(x, y) = 1 − μR~(x, y).

Properties of Fuzzy Relations

Just as for crisp relations, the properties of commutativity, associativity, distributivity,
involution, and idempotency all hold for fuzzy relations. Moreover, De Morgan's principles
hold for fuzzy relations just as they do for crisp (classical) relations, and the null relation, O,
and the complete relation, E, are analogous to the null set and the whole set in set-theoretic
form, respectively. Fuzzy relations are not constrained, as is the case for fuzzy sets in
general, by the excluded middle axioms. Since a fuzzy relation R~ is also a fuzzy set, there
is overlap between a relation and its complement; hence,

R~ ∪ R~ᶜ ≠ E and R~ ∩ R~ᶜ ≠ O.
As seen in the foregoing expressions, the excluded middle axioms for relations do not result,
in general, in the null relation, O, or the complete relation, E.

Fuzzy Cartesian Product and Composition

Because fuzzy relations in general are fuzzy sets, we can define the Cartesian product to be
a relation between two or more fuzzy sets. Let A~ be a fuzzy set on universe X and B~ be a
fuzzy set on universe Y; then the Cartesian product between fuzzy sets A~ and B~ will result
in a fuzzy relation R~, which is contained within the full Cartesian product space:

A~ × B~ = R~ ⊂ X × Y,

where the fuzzy relation R~ has membership function

μR~(x, y) = μA~×B~(x, y) = min(μA~(x), μB~(y)).

The Cartesian product defined by A~ × B~ = R~ is implemented in the same fashion as the
cross product of two vectors. Again, the Cartesian product is not the same operation as the
arithmetic product: in the case of two-dimensional relations (r = 2), the former employs the
idea of pairing of elements among sets, whereas the latter uses actual arithmetic products
between elements of sets. Each of the fuzzy sets can be thought of as a vector of membership
values, each value associated with a particular element in each set. For example, for a fuzzy
set (vector) A~ that has four elements, hence a column vector of size 4 × 1, and a fuzzy set
(vector) B~ that has five elements, hence a row vector of size 1 × 5, the resulting fuzzy
relation R~ will be represented by a matrix of size 4 × 5, i.e., it will have four rows and five
columns. This result is illustrated in the following example.
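A small numeric sketch of this pairing, with illustrative membership values: the outer minimum of a 4-element column vector and a 5-element row vector yields the 4 × 5 relation matrix described above.

    import numpy as np

    mu_a = np.array([0.2, 0.5, 1.0, 0.7])        # fuzzy set A~ on X (4 elements)
    mu_b = np.array([0.1, 0.4, 0.6, 0.9, 1.0])   # fuzzy set B~ on Y (5 elements)

    # Cartesian product A~ x B~ = R~: pair every x with every y and take the
    # minimum of their membership values, giving a 4 x 5 relation matrix.
    R = np.minimum.outer(mu_a, mu_b)
    print(R.shape)   # (4, 5)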

3.9 FUZZYIF-THEN-ELSE RULES

Fuzzy if-then rules are a fundamental component of fuzzy logic systems. These rules are used
to model relationships between input variables and output variables in a fuzzy inference system.
Each rule consists of two main parts: the antecedent (the "if" part) and the consequent (the
"then" part).

Here's how they work:

1. Antecedent (If Part): This part of the rule specifies the conditions under which the rule
should be applied. It typically involves one or more fuzzy sets defined over the input
variables. These fuzzy sets are often characterized by membership functions that
describe the degree to which each input variable satisfies the conditions.

2. Consequent (Then Part): This part of the rule specifies the action or conclusion to be
taken if the conditions specified in the antecedent are met. It typically involves one or
more fuzzy sets defined over the output variables, along with corresponding
membership functions that describe the degree to which the conclusion should be
applied.

For example, consider a simple fuzzy if-then rule in a temperature control system:

Rule 1: If Temperature is Cold then Heater Power is High.

In this rule:

 Antecedent: "Temperature is Cold" is a fuzzy set defined over the variable "Temperature."

 Consequent: "Heater Power is High" is a fuzzy set defined over the variable "Heater
Power."

The membership functions associated with these fuzzy sets determine the degree to which the
conditions are satisfied and the conclusions are applied. So, if the temperature is moderately
cold, the membership value for "Temperature is Cold" might be, say, 0.6, and this would
influence the degree to which the heater power should be set to high.

In a fuzzy inference system, multiple fuzzy if-then rules are combined and applied to input data
using fuzzy reasoning techniques to determine appropriate output values.
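As a worked illustration of Rule 1 above, suppose fuzzification assigns "Temperature is Cold" a degree of 0.6; the following minimal Mamdani-style sketch clips the consequent set at that firing strength and defuzzifies by centroid (the output universe and the ramp shape are assumptions made for the example):

    import numpy as np

    # Universe of discourse for the output variable "heater power" (0..100 %).
    power = np.linspace(0.0, 100.0, 101)

    # Membership of "Heater Power is High": a simple ramp from 50 % to 100 %.
    mu_high = np.clip((power - 50.0) / 50.0, 0.0, 1.0)

    # Fuzzification gave "Temperature is Cold" a degree of 0.6.
    firing_strength = 0.6

    # Mamdani implication: clip the consequent set at the firing strength.
    clipped = np.minimum(mu_high, firing_strength)

    # Centroid defuzzification of the (single-rule) aggregated output set.
    crisp_power = (power * clipped).sum() / clipped.sum()
    print(round(crisp_power, 1))   # crisp heater-power setting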


3.10 FUZZY REASONING

Fuzzy reasoning is a method of computing that deals with reasoning that is approximate
rather than fixed and exact. It's particularly useful in situations where traditional binary
true/false logic isn't adequate because of the presence of uncertainty or ambiguity.

Here's a brief rundown:

1. Fuzzy Logic: Unlike classical logic, which operates on the principle of binary
true/false, fuzzy logic allows for degrees of truth. Instead of "true" or "false," statements
can be "partially true" or "mostly true" on a scale from 0 to 1.

2. Fuzzy Sets: Fuzzy logic uses fuzzy sets, which allow elements to have partial
membership in a set. For example, in classical set theory, an element is either in the set
or not. In fuzzy set theory, membership is a matter of degree.

3. Membership Functions: These functions determine how much an element belongs to
a fuzzy set. They assign a membership value to each element based on its degree of
membership in the set.

4. Fuzzy Inference Systems: These are systems that use fuzzy logic to represent and
process data. They consist of fuzzy input variables, fuzzy inference rules, a fuzzy
inference engine, and fuzzy output variables.

5. Applications: Fuzzy reasoning is widely used in control systems, pattern recognition,
decision analysis, expert systems, and many other fields where imprecision and
uncertainty are present.

In essence, fuzzy reasoning provides a framework for dealing with problems that are too
complex or uncertain to be adequately addressed by traditional logic.

3.11 Neuro-Fuzzy Modeling- ANFIS

3.11.1 Adaptive Neuro-Fuzzy Inference System:

Adaptive Neuro-Fuzzy Inference System, or adaptive network-based fuzzy
inference system (ANFIS), is a kind of artificial neural network that is based on the Takagi–
Sugeno fuzzy inference system. The technique was developed in the early 1990s. Since it
integrates both neural networks and fuzzy logic principles, it has the potential to capture the
benefits of both in a single framework.

Its inference system corresponds to a set of fuzzy IF–THEN rules that have learning capability
to approximate nonlinear functions; hence, ANFIS is considered to be a universal
estimator.[4] To use ANFIS in a more efficient and optimal way, one can use the best
parameters obtained by a genetic algorithm. It has uses in intelligent situational-aware energy
management systems.

3.12 ANFIS architecture

It is possible to identify two parts in the network structure, namely the premise and
consequence parts. In more detail, the architecture is composed of five layers. The first layer
takes the input values and determines the membership functions belonging to them; it is
commonly called the fuzzification layer. The membership degrees of each function are computed
using the premise parameter set, namely {a, b, c}. The second layer is responsible for
generating the firing strengths of the rules; due to its task, it is denoted the "rule layer". The
role of the third layer is to normalize the computed firing strengths by dividing each value by
the total firing strength. The fourth layer takes as input the normalized values and the
consequence parameter set {p, q, r}. The values returned by this layer are the defuzzified
ones, and those values are passed to the last layer to return the final output.

Fuzzification layer

The first layer of an ANFIS network shows the difference from a vanilla neural
network. Neural networks in general operate with a data pre-processing step in which
the features are converted into normalized values between 0 and 1. An ANFIS neural network
doesn't need a sigmoid function; it does the pre-processing step by converting numeric
values into fuzzy values.[9]

Here is an example: suppose the network gets as input the distance between two points in
2D space. The distance is measured in pixels and can take values from 0 up to 500 pixels.
Converting the numerical values into fuzzy numbers is done with the membership function,
which consists of semantic descriptions like near, middle, and far.[10] Each possible linguistic
value is given by an individual neuron. The neuron "near" fires with a value from 0 to 1 if
the distance falls within the category "near", while the neuron "middle" fires if the
distance falls in that category. The input value "distance in pixels" is thus split across three
different neurons for near, middle, and far.
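A minimal sketch of this fuzzification step for the distance example, with illustrative triangular ramps and assumed breakpoints at 0, 250, and 500 pixels:

    def fuzzify_distance(d):
        """Convert a pixel distance (0..500) into firing values for the three
        linguistic neurons "near", "middle", and "far"; the breakpoints are
        illustrative choices, not prescribed by ANFIS itself."""
        near   = max(0.0, (250.0 - d) / 250.0)
        middle = max(0.0, 1.0 - abs(d - 250.0) / 250.0)
        far    = max(0.0, (d - 250.0) / 250.0)
        return {"near": near, "middle": middle, "far": far}

    print(fuzzify_distance(100.0))  # mostly "near", partly "middle"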

APPLICATIONS:

The Adaptive Neuro-Fuzzy Inference System (ANFIS) has found applications across
various domains due to its ability to model complex systems, handle uncertainties, and adapt
to changing environments. Here are some notable applications:

1. Control Systems: ANFIS is widely used in modeling and control applications,
including industrial process control, robotics, and automotive control systems. It can
effectively learn the nonlinear dynamics of these systems and provide accurate control
strategies.

2. Financial Forecasting: ANFIS has been applied in financial time series analysis for
tasks such as stock price prediction, foreign exchange rate forecasting, and credit risk
assessment. Its ability to capture nonlinear relationships and adapt to changing market
conditions makes it valuable in financial modeling.

3. Pattern Recognition: ANFIS is used in pattern recognition tasks such as speech
recognition, image processing, and biometric identification. It can learn complex
patterns from input data and make accurate classifications or predictions.

4. Medical Diagnosis and Healthcare: ANFIS is employed in medical diagnosis systems
for disease diagnosis, patient monitoring, and treatment optimization. It can integrate
heterogeneous data sources and provide decision support for healthcare professionals.

5. Power Systems: ANFIS is utilized in power systems engineering for load forecasting,
fault detection, and energy management. It can handle the nonlinear and uncertain
nature of power system dynamics and assist in optimizing system performance.

6. Environmental Modeling: ANFIS is applied in environmental modeling for tasks such
as air quality prediction, water quality assessment, and ecological modeling. It can
analyze complex environmental data and aid in decision-making for environmental
management and planning.

7. Process Optimization: ANFIS is used in industrial process optimization for tasks such
as parameter tuning, product quality control, and energy efficiency improvement. It can
learn from historical process data and suggest optimal operating conditions.

8. Marketing and Customer Relationship Management (CRM): ANFIS is employed
in marketing analytics and CRM systems for customer segmentation, churn prediction,
and recommendation systems. It can analyze customer behavior patterns and assist in
targeted marketing strategies.

9. Traffic and Transportation Engineering: ANFIS is utilized in traffic flow prediction,
congestion detection, and transportation planning. It can model complex traffic patterns
and assist in optimizing traffic management strategies.

10. Agricultural Systems: ANFIS is applied in precision agriculture for tasks such as crop
yield prediction, pest detection, and irrigation scheduling. It can analyze agricultural
data and provide recommendations for optimal crop management practices.

These are just a few examples of the diverse range of applications where ANFIS has been
successfully employed. Its versatility and effectiveness make it a valuable tool in addressing
complex problems across various domains.

3.13 NEURO-FUZZY MODELING -HYBRID LEARNING ALGORITHM

In neuro-fuzzy modeling, a hybrid learning algorithm combines the advantages of both neural
networks and fuzzy logic to create robust and adaptive models. Here's an overview of a typical
hybrid learning algorithm used in neuro-fuzzy modeling:

1. Initialization:

o Initialize the parameters of the fuzzy logic system, such as membership
functions and rule bases.

o Initialize the parameters of the neural network, such as weights and biases.

2. Fuzzy System Identification:

o Utilize the available data to identify the structure and parameters of the fuzzy
logic system.

o Determine the number of fuzzy rules, shape of membership functions, and rule
bases using techniques like clustering or expert knowledge.

3. Neural Network Training:

o Train the neural network using backpropagation or other optimization
algorithms to learn the relationship between inputs and outputs.

o The input to the neural network can be the outputs of the fuzzy logic system or
a combination of original inputs and fuzzy logic outputs.

4. Integration of Fuzzy Logic and Neural Network:

o Combine the outputs of the fuzzy logic system and the neural network to
generate the final output.

o This integration can be done by simple averaging, weighted averaging, or more
sophisticated fusion techniques.

5. Adaptation and Refinement:

o Refine the parameters of both the fuzzy logic system and the neural network
based on feedback from the model's performance.

o Use techniques like genetic algorithms, particle swarm optimization, or
reinforcement learning to optimize the parameters further.

6. Validation and Testing:

o Evaluate the performance of the hybrid model on unseen data to ensure its
generalization capability.

o Fine-tune the model as necessary based on validation results.

7. Iterative Improvement:

o Iterate through steps 3 to 6, adjusting the model architecture and parameters as
needed to improve performance.

By combining the strengths of fuzzy logic and neural networks, the hybrid learning algorithm
in neuro-fuzzy modeling can capture complex relationships in data while maintaining
interpretability and adaptability. This approach is particularly useful for problems with
nonlinearities, uncertainties, and incomplete information.

3.13.1 ANFIS HYBRID LEARNING MODEL:

The Adaptive Neuro-Fuzzy Inference System (ANFIS) is a popular neuro-fuzzy modeling
technique that combines the adaptive capabilities of neural networks with the interpretability
of fuzzy logic systems. ANFIS is particularly useful for modeling complex systems where the
relationships between inputs and outputs are not well-defined or understood.

ANFIS architecture typically consists of five layers:

1. Input Layer: This layer receives the input variables/features of the system. Each node
in this layer represents one input variable.

2. Fuzzy Inference Layer: In this layer, each node computes the degree of membership
of the input variables to each fuzzy set. This layer performs fuzzification, converting
crisp inputs into fuzzy values based on predefined membership functions.

3. Rule Layer: Here, the firing strength of each rule is calculated by combining the
degrees of membership from the fuzzy inference layer. Each node represents a rule in
the fuzzy rule base.

4. Consequent Layer: This layer computes the output of each rule by multiplying the
firing strength of the rule (from the rule layer) with the consequent parameters (typically
linear coefficients).

5. Output Layer: The output layer aggregates the outputs of all rules to produce the final
output of the system.

The parameters of the membership functions and the linear coefficients in ANFIS are adapted
through a hybrid learning algorithm, which typically involves a combination of gradient-based
methods and least squares optimization. This hybrid learning algorithm aims to minimize the
error between the actual and predicted outputs of the system.

The hybrid learning algorithm in ANFIS can be summarized as follows:

1. Initialization: Initialize the parameters of the membership functions and the
consequent parameters randomly or using some heuristic method.

2. Forward Pass: Compute the output of the ANFIS model for each input sample by
propagating the inputs through the network according to the defined architecture.

3. Error Calculation: Calculate the error between the predicted output and the actual
output for each sample.

4. Backpropagation: Update the parameters of the membership functions and the
consequent parameters using a gradient-based optimization algorithm (e.g., gradient
descent) to minimize the error.

5. Parameter Adjustment: Refine the parameters further using a least-squares
optimization approach, which adjusts the parameters to minimize the overall error
across all samples.

6. Convergence: Repeat steps 2-5 iteratively until convergence criteria are met (e.g., a
predefined number of iterations or a sufficiently small error threshold).
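A minimal sketch of this hybrid loop for a one-input, two-rule first-order Sugeno ANFIS, assuming Gaussian membership functions and a toy sine target; the least-squares pass solves the consequent parameters exactly, and a numerical gradient stands in for backpropagation on the premise parameters (all constants are illustrative):

    import numpy as np

    X = np.linspace(-2, 2, 40)
    Y = np.sin(X)                        # toy target standing in for training data

    # Premise parameters: centers and widths of two Gaussian membership functions.
    centers = np.array([-1.0, 1.0])
    sigmas = np.array([1.0, 1.0])

    def forward(x, p):
        """Forward pass: p holds consequent parameters, rule output = p*x + r."""
        w = np.exp(-0.5 * ((x[:, None] - centers) / sigmas) ** 2)  # layers 1-2
        wn = w / w.sum(axis=1, keepdims=True)                      # layer 3
        f = p[:, 0] * x[:, None] + p[:, 1]                         # layer 4
        return (wn * f).sum(axis=1)                                # layer 5

    for epoch in range(50):
        # Least-squares pass: solve consequent parameters given the premises.
        w = np.exp(-0.5 * ((X[:, None] - centers) / sigmas) ** 2)
        wn = w / w.sum(axis=1, keepdims=True)
        A = np.hstack([wn * X[:, None], wn])       # columns: [p1, p2, r1, r2]
        theta, *_ = np.linalg.lstsq(A, Y, rcond=None)
        p = np.column_stack([theta[:2], theta[2:]])
        # Gradient pass: nudge premise parameters by numerical gradient.
        eps, lr = 1e-4, 0.05
        for params in (centers, sigmas):
            for i in range(2):
                params[i] += eps
                e_plus = ((forward(X, p) - Y) ** 2).mean()
                params[i] -= 2 * eps
                e_minus = ((forward(X, p) - Y) ** 2).mean()
                params[i] += eps
                params[i] -= lr * (e_plus - e_minus) / (2 * eps)

    print(((forward(X, p) - Y) ** 2).mean())   # training MSE after 50 epochs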

ANFIS is widely used in various fields such as system identification, control systems, time-
series prediction, and modeling of complex nonlinear systems due to its ability to capture both
explicit and implicit knowledge from data.

3.13.2 CASE STUDY OF HYBRID LEARNING MODEL

Predictive Maintenance in Manufacturing:

1. Fuzzy Logic System:

o Fuzzy logic can be used to model the rules that govern equipment health based
on factors such as temperature, vibration, and usage time.

o Linguistic rules might include statements like "If vibration is high and
temperature is high, then the machine is likely to fail soon."

o Fuzzy membership functions define the degree to which input variables belong
to various fuzzy sets (e.g., "low," "medium," "high").

2. Neural Network:

o A neural network can learn complex patterns and correlations in sensor data that
may not be explicitly captured by fuzzy rules.

o The neural network can analyze historical sensor data to predict the remaining
useful life of machinery or the likelihood of failure.

3. Hybrid Integration:

o Combine the output of the fuzzy logic system and the neural network using
weighted averaging or other fusion techniques.

o The weights can be adjusted based on the performance of each component in
predicting equipment health or failure; a minimal fusion sketch follows this list.

4. Training and Adaptation:

o During the training phase, the parameters of both the fuzzy logic system (e.g.,
membership functions, rule bases) and the neural network (e.g., weights, biases)
are adjusted using a hybrid optimization algorithm.

o This could involve a combination of gradient-based optimization for the neural
network and evolutionary algorithms for optimizing the parameters of the fuzzy
logic system.

5. Validation and Testing:

o Evaluate the hybrid model's performance on unseen sensor data to assess its
ability to accurately predict equipment health or failure.

o Performance metrics such as precision, recall, or F1 score can be used to
quantify the model's accuracy.

6. Iterative Improvement:

o Based on validation results, refine the model architecture and parameters
iteratively to improve predictive performance.

o This might involve adding more fuzzy rules, adjusting membership functions,
or changing the neural network architecture.
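A minimal sketch of the weighted-average fusion referenced in step 3, with illustrative weights and scores (the function name and all numbers are made up for demonstration):

    def fuse_health_scores(fuzzy_score, nn_score, w_fuzzy=0.4, w_nn=0.6):
        """Weighted-average fusion of the fuzzy-logic and neural-network
        outputs; scores are failure likelihoods in [0, 1], and the weights
        would in practice be tuned against validation performance."""
        return (w_fuzzy * fuzzy_score + w_nn * nn_score) / (w_fuzzy + w_nn)

    # e.g. the fuzzy rules say 0.8 risk while the network predicts 0.55
    print(fuse_health_scores(0.8, 0.55))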

By combining fuzzy logic and neural networks in a hybrid learning model for predictive
maintenance, manufacturers can leverage the interpretability of fuzzy logic while harnessing
the predictive power of neural networks to optimize maintenance schedules, reduce downtime,
and extend equipment lifespan.

UNIT IV

EVOLUTIONARY COMPUTATION AND GENETIC ALGORITHMS


Evolutionary Computation (EC) – Features of EC – Classification of EC – Advantages –
Applications. Genetic Algorithms: Introduction – Biological Background – Operators in GA –
GA Algorithm – Classification of GA – Applications

1. Evolutionary Computation (EC)

Evolutionary computation is a branch of computer science that leverages concepts from
biological evolution to solve complex computational problems. These problems often involve
searching through extensive possibilities, such as finding optimal hardware layouts,
predictive equations for financial markets, or control rules for robots in changing
environments. Traditional programming approaches struggle with these tasks, as the rules
for intelligence and adaptation are too intricate for manual encoding.

Researchers now favour a bottom-up approach where simple rules are combined with
adaptive systems to allow complex behaviours to emerge. This contrasts with earlier attempts
to manually encode intelligence, such as in expert systems. Examples of this approach include
neural networks and evolutionary computation.

Biological evolution is particularly inspiring because it effectively searches through
numerous possibilities to find solutions that enable organisms to survive and adapt. The process
of evolution, driven by simple rules of random variation and natural selection, results in the
rich complexity of life. Evolutionary computation mimics these principles through
evolutionary algorithms, with genetic algorithms (GAs) being the most prominent method. This
review focuses on genetic algorithms within the broader context of evolutionary computation.

History:

In the 1950s and 1960s, several computer scientists independently developed various
evolutionary computation methods. In the 1960s, Rechenberg introduced evolution strategies,
optimizing real-valued parameters using a parent-child mutation model, further developed by
Schwefel. Evolutionary programming, developed by Fogel, Owens, and Walsh, evolved finite-
state machines through random mutations. Holland invented genetic algorithms (GAs) in the
1960s, focusing on bit strings and using mutation and crossover for variation. Unlike evolution
strategies and evolutionary programming, Holland aimed to study natural adaptation formally.

Over time, the distinctions between different evolutionary computation methods, such as
evolution strategies, evolutionary programming, and genetic algorithms, have blurred, with
researchers increasingly interacting and integrating these approaches. This review focuses
mainly on genetic algorithms, examining their applications in business, science, and
education, and discussing their relevance to evolutionary biology. The review does not cover
the extensive theoretical foundations, which are detailed in the Foundations of Genetic
Algorithms proceedings.

2. Features of EC
1. Commercial Applications of Evolutionary Algorithms

Evolutionary algorithms are used in various commercial and scientific applications to search
through numerous possibilities for good, if not optimal, solutions, a concept termed
"satisficing" by Simon. Traditional search methods, such as hill-climbing, work well for many
problems, but evolutionary algorithms are particularly useful when there are many parameters and
high complexity.

Key Applications:

Drug Design: Evolutionary algorithms help design drugs by predicting how well ligands bind
to enzymes, crucial for creating inhibitors for the HIV protease enzyme. Natural Selection Inc.
provided software to Agouron Pharmaceuticals that combines ligand-protein interaction models
with evolutionary programming to explore ligand-protein configurations.

Supply-Chain Management: Companies like Volvo and Deere & Company use evolutionary
algorithms for scheduling production. For instance, I2 Technologies' scheduling program evolves
schedules for plant production, optimizing inventory and manufacturing processes.

Stock Market Prediction: Financial institutions like Citibank and Swiss Bank use evolutionary
algorithms to predict stock market prices. These programs evolve by backcasting, predicting
recent data to refine their accuracy.

Evolvable Hardware (EHW): Asahi Microsystems developed an EHW chip for cellular
phones that adjusts parameters to meet performance specifications, improving yield rates and
reducing circuit size and power consumption. The lab of T. Higuchi at Tsukuba demonstrated this
technology, leading to the production of tunable chips.
These examples illustrate the growing use of evolutionary algorithms in practical, high-stakes
environments, ranging from pharmaceuticals to manufacturing and finance. Additional
applications and studies can be found in journals such as Evolutionary Computation and IEEE
Transactions on Evolutionary Computing.

2. Using Genetic Programming for Optimal Foraging Strategies

Koza, Rice, and Roughgarden utilized genetic programming, a form of
evolutionary algorithm, to explore optimal foraging strategies and the evolution
of complex behaviors from simple components. Building on Roughgarden's
research on Anolis lizards, they aimed to devise strategies maximizing food
capture per unit time.

Their model considered four variables: insect abundance, lizard sprint velocity,
and the coordinates of the insect in the lizard's view. A strategy function
determined whether an insect should be chased based on these variables.
Simulations involved assigning values to these variables and allowing the lizard
to chase insects for a set time period.

In one experiment, the lizard's viewing area was divided into three regions with
different escape probabilities for insects. The optimal strategy involved ignoring
insects in one region, chasing those in another region without escape, and chasing
selectively in a third region based on escape probability and distance.

The genetic programming algorithm, developed by Koza, represented
individuals as computer programs encoded as parse trees. This approach
facilitated the evolution of optimal strategies, even in complex environments.

Classification of EC – Advantages – Applications

 Genetic Algorithms (GA): Genetic Algorithms are the most well-known form of EC.
They evolve a population of solutions, each represented as a bit-string (the genotype),
with a fitness function measuring the fitness of the bit-string within the context of the
problem.
 Advantages: GAs are versatile and can be applied to a wide range of problem
domains. They are capable of solving many problems competently.
 Applications: GAs have been applied in healthcare for enhancing diagnosis
precision and treatment optimization, in finance for improving predictive accuracy
in market trends and portfolio management, and in supply chain for optimizing
routes, reducing costs and delivery times.
 Genetic Programming (GP): This is a variant of GA that evolves computer
programs, typically represented as tree structures.

 Advantages: GP is powerful for performing symbolic regressions and feature
classifications. It saves time by processing large amounts of data much more
quickly than humans can.
 Applications: GP has been used in conjunction with other forms of machine
learning.
 Evolution Strategies (ES): These are optimization algorithms that use mechanisms
inspired by biological evolution, such as mutation, recombination, and selection.
 Advantages: ESs are successful global optimization methods. They can optimize
in restricted solution spaces.
 Applications: ESs are mainly used for the simulation-based optimization, i.e.,
computerized models that require parametric optimization.
 Differential Evolution (DE): This is a method that optimizes a problem by iteratively
improving candidate solutions with regard to a given measure of quality.
 Advantages: DE is simple and easy to implement. It has strong robustness, fewer control
parameters, and better search capabilities.
 Applications: DE has been applied to tackle diverse problems across various fields
and real-world applications.
 Evolutionary Programming (EP): This is similar to ES, but focuses more on the
evolution of program structures.
 Advantages: EP places greater emphasis on its own development, thus it has the
advantages of simple description, flexible use, high efficiency, strong robustness,
and a limited number of conditions.
 Applications: EP has been effective in solving problems with a variety of
characteristics, and within many application domains.
 Permutation-based Evolutionary Algorithms: These are used for combinatorial
optimization problems where the solution can be represented as a sequence or
permutation of numbers.
 Advantages: Permutation-based encoding is used by many evolutionary algorithms
dealing with combinatorial optimization problems.
 Applications: Permutation-based Evolutionary Algorithms have been applied in many
fields such as engineering, design, medicine, robotics, science, etc.
 Memetic Algorithms (MA): These are based on the concept of a meme, which is
considered as a unit of information that can be replicated and selected.
 Advantages: MAs combine exploration with exploitation abilities provided by the local
search, and they reduce the premature convergence to local optima due to a better
exploration of the solutions space.

 Applications: MAs have been applied to solve several versatile real-world
optimization tasks, ranging from robotics, wireless networks, power systems, job shop
scheduling, to classification and training of artificial neural networks.
 Estimation of Distribution Algorithms (EDA): These algorithms replace traditional
genetic operators by building and sampling a probabilistic model of promising
solutions.
 Advantages: EDAs are a most successful paradigm of EAs. They draw inspiration
from evolutionary computation and machine learning.
 Applications: EDAs have been effective in solving problems with a variety of
characteristics, and within many application domains.
 Particle Swarm Optimization (PSO): This is a computational method that optimizes
a problem by iteratively trying to improve a candidate solution with regard to a given
measure of quality.
 Advantages: PSO is a flexible and easy-to-implement algorithm that doesn’t require
hyperparameter tuning. It is a versatile tool for numerous optimization problems.
 Applications: PSO has been applied in health-care, environmental, industrial,
commercial, smart city, and general aspects applications.
 Interactive Evolutionary Algorithms: These involve human interaction, often in the
role of fitness function.
 Advantages: Interactive Evolutionary Algorithms provide insight and guidance
beyond simply selecting parents for breeding.
 Applications: Applications of Interactive Evolutionary Algorithms (IEAs) range
from capturing aesthetics in art and design to the personalisation of artefacts such as
medical devices.

Types of evolutionary algorithms

There are various types of evolutionary algorithms. Here are the most significant ones:

Genetic Algorithms (GA)
Genetic Programming (GP)
Evolutionary Programming (EP)
Evolutionary Strategies (ES)

Many more evolutionary algorithms also exist. These include Gene Expression Programming,
Differential Evolution, Learning Classifier Systems, and Neuroevolution.

Genetic algorithms (GA)


Genetic algorithms are the most popular type of evolutionary algorithms. They find solutions
to problems as strings of numbers. Most of these strings are binary, but the most effective ones
tend to show something about the problem in question.

These algorithms make use of operators like mutation and recombination. Sometimes they use
both operators together.

One of the uses of genetic algorithms is selecting the right combination of variables to build a
predictive model. Selecting the right subset of variables is essentially a combinatory and
optimization problem.

The advantage of genetic algorithms is that it makes it possible for the best solution to emerge
from the best of prior solutions. It improves the selection over time.

The whole idea behind genetic algorithms is to combine the different solutions generation
after generation so as to extract the best genes or variables from each solution. This helps
create better-fitted individuals.

Genetic algorithms are also used for tuning hyperparameters, finding the maximum or minimum
of a function, or searching for a correct neural network architecture (neuroevolution). They are
also used in feature selection.

The idea of genetic algorithms (GA) is to generate a few random possible solutions that represent
different variables, and then combine the best possible solutions in an iterative process. The
basic genetic algorithm operations are selection (picking the fittest solutions in a generation),
cross-over (creating two new individuals based on the genes of two parent solutions), and
mutation (randomly changing a gene in an individual).
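A minimal sketch of these three operations on bit strings, maximizing the toy "one-max" fitness (the population size, string length, and rates are illustrative):

    import random

    random.seed(1)

    def fitness(bits):                 # toy fitness: count of ones ("one-max")
        return sum(bits)

    def select(pop):                   # tournament selection: the fitter of two
        a, b = random.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    def crossover(p1, p2):             # one-point cross-over of two parents
        cut = random.randrange(1, len(p1))
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

    def mutate(bits, rate=0.01):       # flip each gene with a small probability
        return [1 - b if random.random() < rate else b for b in bits]

    pop = [[random.randint(0, 1) for _ in range(30)] for _ in range(20)]
    for generation in range(50):
        nxt = []
        while len(nxt) < len(pop):
            c1, c2 = crossover(select(pop), select(pop))
            nxt += [mutate(c1), mutate(c2)]
        pop = nxt
    print(max(fitness(ind) for ind in pop))   # best fitness after 50 generations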

Genetic Programming (GP)


Here, the solutions to problems are computer programs. The ability of these computer programs
to solve computational problems is what determines their fitness.

Genetic Programming (GP) is essentially an automatic programming technique which favors
the evolution of computer programs that solve, or at least approximately solve, problems. It
involves essentially ‘breeding’ programs by continuously improving an initially random set
of programs.

Improvements are made by stochastic variation of programs and selection in line with some
predefined criteria for judging the quality of a solution. Programs of genetic programming
systems essentially evolve to solve predescribed automatic programming and machine learning
problems.

In its essence, genetic programming is a heuristic search technique that is commonly called ‘hill
climbing’. It involves searching for an optimal or at least a suitable program among the space of
all programs.

Evolutionary Programming (EP)


This is not too different from Genetic Programming. However, in Evolutionary Programming,
the programs that need to be optimized have a fixed structure, while the numeric parameters can
evolve.

This evolutionary algorithm paradigm was first used by Lawrence J. Fogel in 1960 in an attempt
to use simulated evolution as a learning process seeking to create artificial intelligence. He
used finite-state machines as predictors and evolved them. Right now, evolutionary programming
is a wide evolutionary computing dialect that has no fixed structure or representation. It is
becoming increasingly difficult to differentiate evolutionary programming from evolutionary
strategies.

The main operator of evolutionary programming is mutation. In evolutionary programming,
members of the population are seen as part of a specific species rather than members of the same
species. Every parent generates an offspring by using a (μ + μ) survivor selection.

Evolutionary Strategies (ES)


Evolutionary strategies usually work by making use of self-adaptive mutation rates. They work with
vectors of real numbers as representations of solutions.

Evolutionary strategies are optimization techniques that are based on the ideas of evolution.
They use natural problem-dependent representations and mainly make use of mutation and
selection as search operators. The operators are applied in a loop, an iteration of which is known
as a generation. The sequence of generations continues till a termination criterion is met. Most
evolutionary algorithms work on a genotype level, but evolutionary strategies work on a behavioral
level.

Since the physical expression is coded directly, an individual’s genes are not mapped to
its physical expression.

This approach is followed to give rise to a strong causality so that a small change in the coding
gives rise to a small change in the individual and a large change in the coding causes a large
change in the individual.
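
To make the mutate-and-select loop concrete, here is a minimal sketch of a (1+1) evolution
strategy in C++ with a self-adaptive mutation step size. The sphere objective, the log-normal
adaptation constant 0.2, the iteration count and all names are illustrative assumptions, not a
prescribed implementation.

// Minimal (1+1) evolution strategy sketch: a real-valued parent vector is
// mutated with Gaussian noise, and the mutation step size itself evolves.
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Toy objective to minimize: the sphere function sum(x_i^2).
double sphere(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v * v;
    return s;
}

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);

    std::vector<double> parent(5, 3.0);   // start away from the optimum
    double sigma = 1.0;                   // self-adapted step size
    double best = sphere(parent);

    for (int gen = 0; gen < 2000; ++gen) {
        // Mutate the step size first (log-normal self-adaptation),
        // then mutate every coordinate with the new step size.
        double child_sigma = sigma * std::exp(0.2 * gauss(rng));
        std::vector<double> child = parent;
        for (double& v : child) v += child_sigma * gauss(rng);

        // (1+1) survivor selection: the child replaces the parent only
        // if it is better; its step size survives together with it.
        double f = sphere(child);
        if (f < best) {
            best = f;
            parent = child;
            sigma = child_sigma;
        }
    }
    std::cout << "best objective value: " << best << "\n";
    return 0;
}

Note how the step size survives only together with a successful child; this is the self-adaptation
of mutation rates mentioned above.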

Example 1: evolving neural networks for image recognition

In the domain of computer vision, evolutionary computation techniques have been employed
to evolve neural network architectures and parameters for image recognition tasks. By iteratively
optimizing the network's structure and connection weights, evolutionary algorithms
contribute to the development of robust and efficient image recognition systems.

Example 2: genetic algorithms in financial modeling

Financial modeling often involves complex optimization problems, such as portfolio
optimization and risk management. Genetic algorithms, a subset of evolutionary
computation, offer powerful tools for solving such intricately interconnected and dynamic
problems, contributing to more effective decision-making strategies within the financial domain.

Example 3: evolutionary strategies in robotics

In the realm of robotics, evolutionary strategies have been harnessed to optimize the
locomotion and control mechanisms of robotic systems. By evolving the parameters and
behaviors of robotic agents through simulated evolutionary processes, researchers have achieved
advancements in adaptive and resilient robotic locomotion strategies.

Genetic Algorithms: Introduction – Biological Background – Operators in GA – GA
Algorithm – Classification of GA – Applications

Genetic Algorithms
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the
larger family of evolutionary algorithms. Genetic algorithms are based on the ideas of natural
selection and genetics. They are an intelligent exploitation of random search, provided with
historical data, to direct the search into regions of better performance in the solution space. They
are commonly used to generate high-quality solutions for optimization and search problems.

Genetic algorithms simulate the process of natural selection: species that can adapt to
changes in their environment survive, reproduce, and go on to the next generation.
In simple words, they simulate “survival of the fittest” among individuals of consecutive
generations to solve a problem. Each generation consists of a population of individuals, and each
individual represents a point in the search space and a possible solution. Each individual is
represented as a string of characters/integers/floats/bits. This string is analogous to the
chromosome.

Foundation of Genetic Algorithms

Genetic algorithms are based on an analogy with the genetic structure and behavior of
chromosomes of the population. Following is the foundation of GAs based on this analogy –

Individuals in the population compete for resources and mate

Those individuals who are successful (fittest) then mate to create more offspring than others

Genes from the “fittest” parents propagate throughout the generation; that is, sometimes
parents create offspring that are better than either parent.

Thus each successive generation becomes better suited to its environment.

Search space

A population of individuals is maintained within the search space. Each individual represents
a solution in the search space for the given problem. Each individual is coded as a finite-length
vector of components (analogous to a chromosome), and these variable components are analogous
to genes. Thus a chromosome (individual) is composed of several genes (variable components).

Fitness Score

A fitness score is given to each individual to show its ability to
“compete”. Individuals with optimal (or near-optimal) fitness scores are sought.

The GA maintains a population of n individuals (chromosomes/solutions) along with their
fitness scores. Individuals with better fitness scores are given a greater chance to reproduce
than others: they are selected to mate and produce better offspring by combining the
chromosomes of the parents. Since the population size is static, room has to be made for the
new arrivals, so some individuals die and are replaced by newcomers, eventually creating a new
generation once all the mating opportunities of the old population are exhausted. The hope is
that over successive generations better solutions will emerge while the least fit die out.

Each new generation has, on average, more “good genes” than the individuals of previous
generations; thus each new generation contains better “partial solutions” than the previous ones.
Once the offspring produced show no significant difference from those produced by previous
populations, the population has converged, and the algorithm is said to have converged to a set of
solutions for the problem.

Operators of Genetic Algorithms

Once the initial generation is created, the algorithm evolves the generation using following
operators
1) Selection Operator: The idea is to give preference to individuals with good fitness scores
and allow them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are
selected using the selection operator, and crossover sites are chosen randomly; the genes at
these crossover sites are then exchanged, creating completely new individuals (offspring).

3) Mutation Operator: The key idea is to insert random genes into offspring to maintain diversity
in the population and avoid premature convergence.

The whole algorithm can be summarized as –

1. Randomly initialize population

2. Determine fitness of population

3. Until convergence repeat:


a) Select parents from population

b) Crossover and generate new population

c) Perform mutation on new population

d) Calculate fitness for new population

Example problem and solution using Genetic Algorithms

Given a target string, the goal is to produce the target string starting from a random string of the same
length. In the following implementation, the following analogies are made –

Characters A-Z, a-z, 0-9, and other special symbols are considered as genes

A string generated from these characters is considered a chromosome/solution/individual

The fitness score is the number of characters that differ from the target string at the corresponding
index, so an individual with a lower fitness value is given more preference.

// C++ program to create target string, starting from
// random string using Genetic Algorithm
#include <bits/stdc++.h>
using namespace std;

// Number of individuals in each generation
#define POPULATION_SIZE 100

// Valid Genes
const string GENES = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP"
                     "QRSTUVWXYZ 1234567890, .-;:_!\"#%&/()=?@${[]}";

// Target string to be generated
const string TARGET = "I love GeeksforGeeks";

// Function to generate random numbers in given range
int random_num(int start, int end)
{
    int range = (end - start) + 1;
    int random_int = start + (rand() % range);
    return random_int;
}

// Create a random gene for mutation
char mutated_genes()
{
    int len = GENES.size();
    int r = random_num(0, len - 1);
    return GENES[r];
}

// Create chromosome, i.e. a string of genes
string create_gnome()
{
    int len = TARGET.size();
    string gnome = "";
    for (int i = 0; i < len; i++)
        gnome += mutated_genes();
    return gnome;
}

// Class representing an individual in the population
class Individual
{
public:
    string chromosome;
    int fitness;
    Individual(string chromosome);
    Individual mate(Individual parent2);
    int cal_fitness();
};

Individual::Individual(string chromosome)
{
    this->chromosome = chromosome;
    fitness = cal_fitness();
}

// Perform mating and produce new offspring
Individual Individual::mate(Individual par2)
{
    // chromosome for offspring
    string child_chromosome = "";

    int len = chromosome.size();
    for (int i = 0; i < len; i++)
    {
        // random probability (note the floating-point division;
        // integer division here would break the probabilities)
        float p = random_num(0, 100) / 100.0f;

        // if prob is less than 0.45, insert gene from parent 1
        if (p < 0.45)
            child_chromosome += chromosome[i];

        // if prob is between 0.45 and 0.90, insert gene from parent 2
        else if (p < 0.90)
            child_chromosome += par2.chromosome[i];

        // otherwise insert random gene (mutate), for maintaining diversity
        else
            child_chromosome += mutated_genes();
    }

    // create new Individual (offspring) using generated chromosome
    return Individual(child_chromosome);
}

// Calculate fitness score: the number of characters in the
// string which differ from the target string
int Individual::cal_fitness()
{
    int len = TARGET.size();
    int fitness = 0;
    for (int i = 0; i < len; i++)
    {
        if (chromosome[i] != TARGET[i])
            fitness++;
    }
    return fitness;
}

// Overloading the < operator so individuals sort by fitness
bool operator<(const Individual &ind1, const Individual &ind2)
{
    return ind1.fitness < ind2.fitness;
}

// Driver code
int main()
{
    srand((unsigned)(time(0)));

    // current generation
    int generation = 0;

    vector<Individual> population;
    bool found = false;

    // create initial population
    for (int i = 0; i < POPULATION_SIZE; i++)
    {
        string gnome = create_gnome();
        population.push_back(Individual(gnome));
    }

    while (!found)
    {
        // sort the population in increasing order of fitness score
        sort(population.begin(), population.end());

        // if the individual with the lowest fitness score is 0,
        // we have reached the target; break the loop
        if (population[0].fitness <= 0)
        {
            found = true;
            break;
        }

        // otherwise generate new offspring for the new generation
        vector<Individual> new_generation;

        // Perform elitism: 10% of the fittest population
        // goes directly to the next generation
        int s = (10 * POPULATION_SIZE) / 100;
        for (int i = 0; i < s; i++)
            new_generation.push_back(population[i]);

        // For the remaining 90%, parents drawn from the 50
        // fittest individuals mate to produce offspring
        s = (90 * POPULATION_SIZE) / 100;
        for (int i = 0; i < s; i++)
        {
            int r = random_num(0, 50);
            Individual parent1 = population[r];
            r = random_num(0, 50);
            Individual parent2 = population[r];
            Individual offspring = parent1.mate(parent2);
            new_generation.push_back(offspring);
        }
        population = new_generation;

        cout << "Generation: " << generation << "\t";
        cout << "String: " << population[0].chromosome << "\t";
        cout << "Fitness: " << population[0].fitness << "\n";

        generation++;
    }
    cout << "Generation: " << generation << "\t";
    cout << "String: " << population[0].chromosome << "\t";
    cout << "Fitness: " << population[0].fitness << "\n";
    return 0;
}

Output:
Generation: 1 String: tO{"-?=jH[k8=B4]Oe@} Fitness: 18
Generation: 2 String: tO{"-?=jH[k8=B4]Oe@} Fitness: 18
Generation: 3 String: .#lRWf9k_Ifslw #O$k_ Fitness: 17
Generation: 4 String: .-1Rq?9mHqk3Wo]3rek_ Fitness: 16
Generation: 5 String: .-1Rq?9mHqk3Wo]3rek_ Fitness: 16
Generation: 6 String: A#ldW) #lIkslw cVek) Fitness: 14
Generation: 7 String: A#ldW) #lIkslw cVek) Fitness: 14
Generation: 8 String: (, o x _x%Rs=, 6Peek3 Fitness: 13
.
.
.
Generation: 29 String: I lope Geeks#o, Geeks Fitness: 3
Generation: 30 String: I loMe GeeksfoBGeeks Fitness: 2
Generation: 31 String: I love Geeksfo0Geeks Fitness: 1
Generation: 32 String: I love Geeksfo0Geeks Fitness: 1

Generation: 34 String: I love GeeksforGeeks Fitness: 0

Note: the algorithm starts from random strings each time, so the output may differ between runs.

As we can see from the output, the algorithm sometimes gets stuck at a local optimum. This
can be improved by refining the fitness-score calculation or by tweaking the mutation and
crossover operators.

Why use Genetic Algorithms?

They are robust.

They provide optimisation over large state spaces.

Unlike traditional AI systems, they do not break on slight changes in input or in the presence of noise.

Application of Genetic Algorithms

Genetic algorithms have many applications, some of them are –

Recurrent neural networks

Mutation testing

Code breaking

Filtering and signal processing

Learning fuzzy rule bases, etc.

A genetic operator is an operator used in genetic algorithms to guide the algorithm towards a
solution to a given problem. There are three main types of operators (mutation, crossover and
selection), which must work in conjunction with one another in order for the algorithm to be
successful. Genetic operators are used to create and maintain genetic diversity (mutation
operator), combine existing solutions (also known as chromosomes) into new solutions
(crossover) and select between solutions (selection). In his book discussing the use of genetic
programming for the optimization of complex problems, computer scientist John Koza has also
identified an 'inversion' or 'permutation' operator; however, the effectiveness of this operator
has never been conclusively demonstrated and this operator is rarely discussed.

Mutation (or mutation-like) operators are said to be unary operators, as they only operate on
one chromosome at a time. In contrast, crossover operators are said to be binary operators, as
they operate on two chromosomes at a time, combining two existing chromosomes into one
new chromosome.

Operators

Genetic variation is a necessity for the process of evolution. Genetic operators used in genetic
algorithms are analogous to those in the natural world: survival of the fittest, or selection;
reproduction (crossover, also called recombination); and mutation.
Selection


Selection operators give preference to better solutions (chromosomes), allowing them to
pass on their 'genes' to the next generation of the algorithm. The best solutions are determined
using some form of objective function (also known as a 'fitness function' in genetic algorithms),
before being passed to the crossover operator. Different methods for choosing the best solutions
exist, for example, fitness proportionate selection and tournament selection; different methods
may choose different solutions as being 'best'. The selection operator may also simply pass the
best solutions from the current generation directly to the next generation without being
mutated; this is known as elitism or elitist selection.

Crossover


Crossover is the process of taking more than one parent solution (chromosome) and
producing a child solution from them. By recombining portions of good solutions, the genetic
algorithm is more likely to create a better solution. As with selection, there are a number of
different methods for combining the parent solutions, including the edge recombination
operator (ERO) and the 'cut and splice crossover' and 'uniform crossover' methods. The
crossover method is often chosen to closely match the chromosome's representation of the
solution; this may become particularly important when variables are grouped together as
building blocks, which might be disrupted by a non-respectful crossover operator. Similarly,
crossover methods may be particularly suited to certain problems; the ERO is generally
considered a good option for solving the travelling salesman problem.

Mutation


The mutation operator encourages genetic diversity amongst solutions and attempts to
prevent the genetic algorithm converging to a local minimum by stopping the solutions becoming
too close to one another. In mutating the current pool of solutions, a given solution may change
entirely from the previous solution. By mutating the solutions, a genetic algorithm can reach
an improved solution solely through the mutation operator.[1] Again, different methods of
mutation may be used; these range from a simple bit mutation (flipping random bits in a binary
string chromosome with some low probability) to more complex mutation methods, which may
replace genes in the solution with random values chosen from the uniform distribution or the
Gaussian distribution. As with the crossover operator, the mutation method is usually chosen to
match the representation of the solution within the chromosome.

Combining operators

While each operator acts to improve the solutions produced by the genetic algorithm
working individually, the operators must work in conjunction with each other for the algorithm
to be successful in finding a good solution. Using the selection operator on its own will tend to fill
the solution population with copies of the best solution from the population. If the selection
and crossover operators are used without the mutation operator, the algorithm will tend to
converge to a local minimum, that is, a good but sub-optimal solution to the problem. Using the
mutation operator on its own leads to a random walk through the search space. Only by using
all three operators together can the genetic algorithm become a noise-tolerant hill-climbing
algorithm, yielding good solutions to the problem.



GA Algorithm – Classification of GA – Applications

Genetic algorithms (GAs) are search algorithms based on the principles of natural selection and
genetics, introduced by John Holland in the 1970s and inspired by the biological evolution of
living beings. Genetic algorithms abstract the problem space as a population of individuals and
try to find the fittest individual by producing generations iteratively. A GA evolves a population
of initial individuals into a population of high-quality individuals, where each individual
represents a solution to the problem being solved. The quality of each individual is measured by
a fitness function, the quantitative representation of that individual's adaptation to its
environment. The procedure starts from an initial population of randomly generated individuals.

During each generation, three basic genetic operators are sequentially applied to each
individual with certain probabilities, i.e. selection, crossover and mutation.

GAs are computer programs that simulate the heredity and evolution of living
organisms. An optimum solution is possible even for multimodal objective functions, because
GAs are multi-point search methods. GAs are also applicable to discrete search-space
problems. Thus, a GA is not only very easy to use but also a very powerful optimization
tool. In a GA, the search space consists of strings, each representing a candidate
solution to the problem; these strings are termed chromosomes. The objective function value of
each chromosome is called its fitness value. A population is a set of chromosomes along with
their associated fitness values, and generations are the populations generated in successive
iterations of the GA. How a genetic algorithm searches a space of candidate solutions to identify
the best one is shown in the diagram.



Operators in GA

A GA searches for better solutions by means of genetic operations: selection, crossover and
mutation.

A. Selection Operation

The selection operation selects elite individuals in the current population as parents to generate
offspring. Fitness values are used as the criterion for judging whether individuals are elite.
There are many methods for selecting the best chromosomes, for example roulette wheel
selection, Boltzmann selection, tournament selection, rank selection, steady-state selection and
elitism selection. Some of them are described below.

1) Roulette Wheel Selection: Parents are selected according to their fitness. The better the
chromosomes are, the more chances they have of being selected. Imagine a roulette wheel on
which all the chromosomes in the population are placed, each occupying a slice sized in
proportion to its fitness, as in Figure 2. A chromosome with higher fitness will be selected more
often.
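
A minimal sketch of this wheel in C++ follows. It assumes non-negative fitness values where
higher is better; the function name is illustrative.

// Roulette wheel (fitness-proportionate) selection: each individual owns
// a slice of the wheel proportional to its fitness; spin and see where
// the ball lands. Returns the index of the selected individual.
#include <random>
#include <vector>

int roulette_select(const std::vector<double>& fitness, std::mt19937& rng) {
    double total = 0.0;                       // circumference of the wheel
    for (double f : fitness) total += f;

    std::uniform_real_distribution<double> spin(0.0, total);
    double ball = spin(rng);

    double running = 0.0;
    for (int i = 0; i < (int)fitness.size(); ++i) {
        running += fitness[i];                // walk the wheel slice by slice
        if (ball <= running) return i;
    }
    return (int)fitness.size() - 1;           // guard against rounding error
}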

2) Rank Selection: The previous selection method has problems when the fitness values differ
greatly. For example, if the best chromosome's fitness covers 90% of the roulette wheel, the
other chromosomes have very few chances of being selected. Rank selection first sorts the
population by fitness, and then every chromosome receives a fitness from this ranking: the worst
gets fitness 1, the second worst 2, and so on, with the best getting fitness N (the number of
chromosomes in the population). After this, all the chromosomes have a chance of being
selected, with the probability of selection proportional to a chromosome's rank in the sorted list
rather than its raw fitness. This method can, however, lead to slower convergence, because the
best chromosomes do not differ as much from the other ones.
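
Under the same illustrative assumptions, a sketch of rank selection looks like this; only the
selection weights change, from raw fitness to rank.

// Rank selection: sort the population by fitness, give the worst rank 1
// and the best rank N, then select proportionally to rank.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int rank_select(const std::vector<double>& fitness, std::mt19937& rng) {
    int n = (int)fitness.size();
    std::vector<int> order(n);                // indices sorted worst-to-best
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return fitness[a] < fitness[b]; });

    // Total rank mass is 1 + 2 + ... + N = N(N + 1)/2.
    std::uniform_real_distribution<double> spin(0.0, n * (n + 1) / 2.0);
    double ball = spin(rng);
    double running = 0.0;
    for (int r = 0; r < n; ++r) {
        running += r + 1;                     // rank of order[r] is r + 1
        if (ball <= running) return order[r];
    }
    return order.back();
}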

3) Elitism Selection: When creating a new population by crossover and mutation, there is a big
chance that we will lose the best chromosome. Elitism is the name of the method that first copies
the best chromosome (or a few of the best chromosomes) to the new population; the rest is done
in the classical way. Elitism can very rapidly increase the performance of a GA, because it
prevents losing the best solution found so far.

B. Crossover Operations

The generation of successors in a GA is determined by a set of operators that recombine and mutate
selected members of the current population. The two most common operators are crossover
and mutation. The crossover operator produces two new offspring from two parent strings, by
copying selected bits from each parent. The bit at position i in each offspring is copied from the bit
at position i in one of the two parents. The choice of which parent contributes the bit for position
i is determined by an additional string called the crossover mask. There are three types of
crossover operators, namely single-point, two-point and uniform crossover.

1) Single-Point Crossover: In single-point crossover, the crossover mask is always
constructed so that it begins with a string containing n contiguous 1s, followed by the
necessary number of 0s to complete the string. This results in offspring in which the first n bits
are contributed by one parent and the remaining bits by the second parent. Each time the single-
point crossover operator is applied, the crossover point n is chosen at random, and the
crossover mask is then created and applied. To illustrate, consider the single-point crossover
operator at the top of the figure and the topmost of the two offspring. This offspring takes its
first five bits from the first parent and its remaining six bits from the second parent, because the
crossover mask 11111000000 specifies these choices for each of the bit positions. The second
offspring uses the same crossover mask but switches the roles of the two parents; therefore, it
contains the bits that were not used by the first offspring.

2) Two-Point Crossover: In two-point crossover, offspring are created by substituting
intermediate segments of one parent into the middle of the second parent string. Put another
way, the crossover mask is a string beginning with n0 zeros, followed by a contiguous string
of n1 ones, followed by the necessary number of zeros to complete the string. Each time the
two-point crossover operator is applied, a mask is generated by randomly choosing the integers
n0 and n1. For instance, in the example shown in the figure, the offspring are created using a
mask for which n0 = 2 and n1 = 5. Again, the two offspring are created by switching the roles
played by the two parents.

3) Uniform Crossover: Uniform crossover combines bits sampled uniformly from the two
parents, as illustrated in Figure 3. In this case the crossover mask is generated as a
random bit string, with each bit chosen at random and independently of the others.

C. Mutation Operations

In addition to recombination operators that produce offspring by combining parts of two parents, a
second type of operator produces offspring from a single parent. In particular, the mutation
operator produces small random changes to the bit string by choosing a single bit at random, then
changing its value. Mutation is often performed after crossover as in Figure 3.
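
The following sketch ties the three crossover masks and point mutation together for bit-string
chromosomes. The helper names are illustrative; a '1' in the mask means "take this bit from
parent 1", matching the description above.

// Generic mask-driven crossover: the same routine realizes single-point,
// two-point and uniform crossover, depending on how the mask was built.
#include <random>
#include <string>

std::string crossover(const std::string& p1, const std::string& p2,
                      const std::string& mask) {
    std::string child = p1;
    for (std::size_t i = 0; i < p1.size(); ++i)
        child[i] = (mask[i] == '1') ? p1[i] : p2[i];
    return child;
}

// n contiguous 1s followed by 0s: single-point crossover at position n.
std::string single_point_mask(std::size_t len, std::size_t n) {
    return std::string(n, '1') + std::string(len - n, '0');
}

// n0 zeros, then n1 ones, then zeros: two-point crossover.
std::string two_point_mask(std::size_t len, std::size_t n0, std::size_t n1) {
    return std::string(n0, '0') + std::string(n1, '1')
         + std::string(len - n0 - n1, '0');
}

// Each mask bit chosen at random and independently: uniform crossover.
std::string uniform_mask(std::size_t len, std::mt19937& rng) {
    std::bernoulli_distribution coin(0.5);
    std::string mask(len, '0');
    for (char& b : mask) b = coin(rng) ? '1' : '0';
    return mask;
}

// Point mutation: flip one randomly chosen bit of the chromosome.
void mutate(std::string& bits, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pos(0, bits.size() - 1);
    std::size_t i = pos(rng);
    bits[i] = (bits[i] == '0') ? '1' : '0';
}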

Variants

Messy Genetic Algorithms

In a “classical” genetic algorithm, the genes are encoded in a fixed order. The meaning of a single
gene is determined by its position inside the string. We have seen in the previous chapter that
a genetic algorithm is likely to converge well if the optimization task can be divided into several
short building blocks. What, however, happens if the coding is chosen such that couplings
occur between distant genes? Of course, one-point crossover tends to disadvantage long
schemata (even if they have low order) over short ones.



Messy genetic algorithms try to overcome this difficulty by using a variable-length,
position-independent coding. The key idea is to append an index to each gene which
allows its position to be identified [14, 15]. A gene, therefore, is no longer represented as a single
allele value and a fixed position, but as a pair of an index and an allele. Figure
4.1 shows how this “messy” coding works for a string of length 6.

Since the genes can be identified uniquely with the help of the index, genes may be swapped
arbitrarily without changing the meaning of the string. With appropriate genetic operations,
which also change the order of the pairs, the GA could possibly group coupled genes together
automatically.

Due to the free arrangement of genes and the variable length of the encoding, we can,
however, run into problems which do not occur in a simple GA. First of all, it can happen that
there are two entries in a string which correspond to the same index but have conflicting alleles.
The most obvious way to overcome this “over-specification” is positional preference: the first
entry which refers to a gene is taken. Figure 4.2 shows an example (genes with indices 1 and 6
occur twice, and the first occurrence of each is taken).

The reader may have observed that the genes with indices 3 and 5 do not occur at all in
the example in Figure 4.2. This problem of “under-specification” is more complicated, and
its solution is not as obvious as for over-specification. Many variants are reasonable. One
approach is to check all possible combinations and take the best one (for k missing genes,
there are 2^k combinations). To reduce this effort, the use of so-called templates has been
suggested for finding specifications for the k missing genes; this is nothing other than applying a
local hill-climbing method, with random initial values, to the k missing genes.
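
As a concrete sketch of this coding, a chromosome can be held as a list of (index, allele) pairs
and decoded with positional preference. The representation below is an illustrative choice;
under-specified genes are simply left marked instead of being filled in by a template.

// Decode a "messy" chromosome: each pair is (gene index, allele value).
// Over-specification is resolved by positional preference (the first
// entry wins); genes never mentioned remain marked as under-specified.
#include <utility>
#include <vector>

std::vector<int> decode_messy(const std::vector<std::pair<int, int>>& pairs,
                              int num_genes) {
    std::vector<int> genes(num_genes, -1);    // -1 marks "not specified yet"
    for (const auto& [index, allele] : pairs)
        if (genes[index] == -1)               // positional preference
            genes[index] = allele;
    return genes;                             // remaining -1s: under-specified
}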



While messy GAs usually work with the same mutation operator as simple GAs (every
allele is altered with a low probability pM), the crossover operator is replaced by a more general
cut-and-splice operator, which also allows mating parents of different lengths. The
basic idea is to choose cut sites for both parents independently and to splice the four fragments.
Figure 4.3 shows an example.

Alternative Selection Scheme

Depending on the actual problem, other selection schemes than the roulette wheel can
be useful:

Linear rank selection: In the beginning, the potentially good individuals sometimes
fill the population too fast, which can lead to premature convergence into local maxima. On the
other hand, refinement in the end phase can be slow, since the individuals have similar fitness
values. Both problems can be overcome by taking the rank of the fitness values as the basis for
selection instead of the values themselves.

Tournament selection: Closely related to the problems above, it can be better not to use the
fitness values themselves. In this scheme, a small group of individuals is sampled from the
population, and the individual with the best fitness is chosen for reproduction. This selection
scheme is also applicable when the fitness function is given only in implicit form, i.e. when we
merely have a comparison relation that determines which of two given individuals is better.
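
A sketch of tournament selection follows, assuming higher fitness is better; the tournament
size k is an illustrative parameter.

// Tournament selection: sample k individuals at random and return the
// fittest. Only a "better than" comparison between individuals is needed,
// so this also works when fitness is given only implicitly.
#include <random>
#include <vector>

int tournament_select(const std::vector<double>& fitness, int k,
                      std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, (int)fitness.size() - 1);
    int best = pick(rng);
    for (int i = 1; i < k; ++i) {
        int challenger = pick(rng);
        if (fitness[challenger] > fitness[best]) best = challenger;
    }
    return best;
}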

Moreover, there is one “plug-in” which is frequently used in conjunction with any of the
selection schemes we know so far: elitism. The idea is to prevent the best individual observed
so far from dying out, by selecting it for the next generation without any random
experiment. Elitism is widely used for speeding up the convergence of a GA. It should,
however, be used with caution, because it can lead to premature convergence.

Adaptive Genetic Algorithms

Adaptive genetic algorithms are GAs whose parameters, such as the population size, the
crossover probability, or the mutation probability, are varied while the GA is running (e.g. see
[8]). A simple variant could be the following: the mutation rate is changed according to changes
in the population; the longer the population does not improve, the higher the mutation rate is
set. Vice versa, it is decreased again as soon as an improvement of the population occurs.
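
A sketch of such a rule, with illustrative base rate, growth factor and cap:

// Adaptive mutation rate: grows while the best fitness stagnates and is
// reset as soon as an improvement occurs. All constants are illustrative.
double adapt_mutation_rate(bool improved, int& stagnant_generations) {
    const double base_rate = 0.01;   // rate used right after an improvement
    const double max_rate  = 0.25;   // cap so mutation never dominates
    if (improved) {
        stagnant_generations = 0;    // decrease again on improvement
        return base_rate;
    }
    ++stagnant_generations;          // the longer no improvement occurs...
    double rate = base_rate * (1.0 + 0.1 * stagnant_generations);
    return rate < max_rate ? rate : max_rate;   // ...the higher the rate
}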

Hybrid Genetic Algorithms

As they use the fitness function only in the selection step, genetic algorithms are blind optimizers
which do not use any auxiliary information, such as derivatives or other specific knowledge
about the structure of the objective function. If such knowledge is available, however, it is
unwise and inefficient not to make use of it. Several investigations have shown that a lot of
synergy lies in the combination of genetic algorithms and conventional methods. The basic
idea is to divide the optimization task into two complementary parts: the coarse, global
optimization is done by the GA, while local refinement is done by a conventional method (e.g.
gradient-based search, hill climbing, a greedy algorithm, simulated annealing, etc.). A number of
variants are reasonable; a sketch of the first one follows the list:

1. The GA performs the coarse search first. After the GA has completed, local
refinement is done.

2. The local method is integrated in the GA. For instance, every K generations, the
population is doped with a locally optimal individual.

3. Both methods run in parallel: All individuals are continuously used as initial values
for the local method. The locally optimized individuals are re-implanted into the current
generation.
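
As a sketch of variant 1, under illustrative assumptions (a toy objective to maximize and a
Gaussian neighborhood move), local refinement of the best individual returned by the GA could
look like this:

// Hybrid step: the GA has done the coarse global search and produced
// 'best'; a simple stochastic hill climber now refines it locally.
#include <random>
#include <vector>

// Toy objective to maximize (illustrative stand-in for the real one).
double objective(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v * v;
    return -s;                                  // peak at the origin
}

std::vector<double> local_refine(std::vector<double> best, int steps) {
    std::mt19937 rng(1);
    std::normal_distribution<double> move(0.0, 0.05);
    double best_f = objective(best);
    for (int i = 0; i < steps; ++i) {
        std::vector<double> cand = best;
        for (double& v : cand) v += move(rng);  // small random neighbor
        double f = objective(cand);
        if (f > best_f) { best = cand; best_f = f; }   // keep improvements
    }
    return best;   // could be re-implanted into the GA population (variant 3)
}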

Self-Organizing Genetic Algorithms

As already mentioned, the reproduction methods and the representations of the genetic
material were adapted through the billions of years of evolution [25]. Many of these adaptations
were able to increase the speed of adaptation of the individuals. We have seen several times that
the choice of the coding method and the genetic operators is crucial for the convergence of a GA.
Therefore, it is promising not to encode only the raw genetic information, but also some additional
information, for example, parameters of the coding function or the genetic operators. If this is done
properly, the GA could find its own optimal way for representing and manipulating data
automatically.



Applications:

Genetic algorithms (GAs) are search heuristics inspired by the process of natural selection.
They are used to solve optimization and search problems by mimicking the process of
biological evolution. Here are some notable applications of genetic algorithms:

1. Optimization Problems

Function Optimization: GAs are used to find the maximum or minimum of complex functions
in engineering, economics, and operations research.

Parameter Tuning: They help in tuning hyperparameters for machine learning models to enhance
performance.

2. Engineering Design

Structural Design: Used in designing structures such as bridges and aircraft to optimize for
weight, strength, and cost.

Control Systems: Optimizing control parameters for systems in robotics, aerospace, and
manufacturing.

3. Machine Learning

Feature Selection: Identifying the most relevant features in large datasets to improve model
accuracy and reduce computation time.

Neural Network Training: Optimizing the architecture and weights of neural networks.

4. Scheduling

Job Scheduling: Allocating resources and scheduling jobs in manufacturing, project
management, and computing.

Timetabling: Creating optimal schedules for schools, universities, and events.

5. Game Playing and Artificial Intelligence

Strategy Development: Developing strategies for board games, video games, and
simulations.

Evolving Agents: Creating AI agents that can learn and adapt to new environments.

6. Biotechnology

DNA Sequencing: Aligning DNA sequences and finding motifs in bioinformatics.

Drug Discovery: Optimizing molecular structures for drug design.



7. Robotics

Path Planning: Determining the optimal path for robots in dynamic environments.

Evolutionary Robotics: Designing robot structures and behaviors through evolutionary processes.

8. Finance

Portfolio Optimization: Allocating assets in investment portfolios to maximize returns and
minimize risk.

Algorithmic Trading: Developing trading strategies that adapt to market conditions.

9. Telecommunications

Network Design: Optimizing the layout and operation of communication networks.

Routing Algorithms: Improving data routing efficiency in networks.

10. Image and Signal Processing

Image Segmentation: Dividing an image into segments for analysis.

Signal Filtering: Optimizing filters for signal processing applications.

11. Environmental Science

Resource Management: Optimizing the use of natural resources like water and forests.

Pollution Control: Designing strategies to minimize environmental impact.

12. Art and Music

Generative Art: Creating artworks through evolutionary processes.

Music Composition: Composing music using evolutionary algorithms to innovate new styles
and forms.

Key Advantages of Genetic Algorithms

Global Search Capability: They can avoid local optima and find global solutions.

Adaptability: GAs can adapt to changing environments or requirements.

Parallelism: They are naturally parallel and can be implemented on parallel hardware for faster
computation.

Challenges and Considerations

Computational Cost: GAs can be computationally intensive, especially for large problems.

Parameter Sensitivity: The performance of GAs can be sensitive to the choice of parameters
like population size, mutation rate, and crossover rate.

Premature Convergence: GAs can sometimes converge to suboptimal solutions if not
properly controlled.



UNIT V

ADVANCES AND APPLICATIONS

Support Vector Machine

Support Vector Machine (SVM) is one of the supervised Machine Learning (ML) algorithms.
There are plenty of algorithms in ML, but SVM always receives special attention because of
its robustness when dealing with data. This unit covers the essentials needed to apply SVM to
any kind of data.

1. SVM – Comes under Supervised ML

2. SVM can perform both Classification & Regression

3. Goal – Create the best decision boundary (a hyperplane) that can segregate n-dimensional
space into classes, so that new data points can easily be placed in the correct category
(see the sketch after this list).

4. Out-of-the-box classifier
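
To make the hyperplane idea concrete, here is a minimal sketch of the linear decision rule
sign(w·x + b) in C++. The weight vector w and bias b would come from training (finding the
maximum-margin hyperplane), which is not shown here; all names are illustrative.

// Classify a point by which side of the hyperplane w.x + b = 0 it lies on.
#include <vector>

int classify(const std::vector<double>& w, double b,
             const std::vector<double>& x) {
    double score = b;
    for (std::size_t i = 0; i < w.size(); ++i)
        score += w[i] * x[i];                 // dot product w.x
    return score >= 0 ? +1 : -1;              // side of the hyperplane
}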
