
Machine Learning Module - 1: There Are Three Main Types of Machine Learning

The document provides an introduction to machine learning, describing the three main types of machine learning and key concepts. It also discusses perspectives and issues in machine learning, the steps for designing learning systems, concepts of hypotheses, version space, and performance metrics for evaluating machine learning models.

Uploaded by

veereshs078
Copyright
© © All Rights Reserved

MACHINE LEARNING

MODULE – 1
INTRODUCTION :
Machine learning is a subfield of artificial intelligence that involves developing
algorithms and models that enable computers to learn from data and make predictions
or decisions without being explicitly programmed.
In traditional programming, a set of rules or instructions is provided to a computer to
execute a particular task. In machine learning, the computer is trained on a large
dataset and uses statistical techniques to identify patterns and relationships in the
data. The model is then used to make predictions or decisions on new, unseen data.

There are three main types of machine learning:


1. Supervised learning - In this type of learning, the algorithm is trained on labeled
data, which means the input data is already paired with the correct output. The model
learns to make predictions by finding patterns in the input-output pairs.
2. Unsupervised learning - In this type of learning, the algorithm is trained on
unlabeled data, which means there are no pre-defined output values. The model learns
to find patterns and relationships in the data without being explicitly told what to look
for.
3. Reinforcement learning - In this type of learning, the algorithm learns by interacting
with an environment and receiving feedback in the form of rewards or punishments.
The model learns to make decisions that maximize its reward over time.

PERSPECTIVES AND ISSUES IN MACHINE LEARNING :


PERSPECTIVES :
1. Automation: Machine learning has the potential to automate tasks that were
previously done by humans, increasing efficiency and productivity.
2. Personalization: Machine learning can be used to personalize experiences for
individual users, such as personalized recommendations on e-commerce websites or
personalized news feeds.
3. Innovation: Machine learning can lead to the development of new products and
services that were previously not possible, such as self-driving cars and personalized
medicine.

ISSUES :
1. Bias: Machine learning algorithms can inherit biases from the data they are trained
on, which can lead to unfair or discriminatory outcomes.
2. Transparency: Machine learning algorithms can be complex and difficult to
understand, making it hard to explain how decisions are being made.
3. Privacy: Machine learning often involves the use of personal data, raising concerns
about how that data is collected, stored, and used.
4. Security: Machine learning models can be vulnerable to attacks, such as data
poisoning or model inversion attacks, which can compromise the security of the
system.
5. Accountability: Machine learning algorithms can make decisions that have
significant impacts on people's lives, raising questions about who is responsible for the
outcomes and how to hold them accountable.

DESIGNING LEARNING SYSTEMS :


The steps for designing a learning system are:

Step 1- Choosing the Training Experience: The first and most important task is to
choose the training data, or training experience, that will be fed to the machine
learning algorithm. The data or experience given to the algorithm has a significant
impact on the success or failure of the model, so it should be chosen wisely.
Step 2- Choosing the Target Function: The next important step is to decide exactly
what type of knowledge will be learned. In a game-playing program, for example, this
can be a NextMove function that takes the current board state and returns the best
move out of the set of legal moves.
For example: While playing chess, each time the opponent moves, the program looks at
the legal moves available in the resulting position and must decide which of them is
most likely to lead to success.

Step 3- Choosing a Representation for the Target Function: Once the program knows all
the possible legal moves, the next step is to choose a representation in which the
target function can be expressed, e.g. linear equations, a hierarchical graph
representation, or tabular form. Using this representation, the NextMove function
picks, out of the legal moves, the one expected to give the highest success rate.
For example: If the program has 4 possible moves in a chess position, it chooses the
one that the represented function rates as most likely to lead to success.

Step 4- Choosing a Function Approximation Algorithm: An optimized move cannot be
chosen from the raw training data alone. The algorithm has to work through a set of
training examples, use them to approximate the target function, and adjust that
approximation according to the feedback it receives.
For example: When chess games are fed to the algorithm as training data, some of its
moves lead to failure and others to success. From each failure or success it estimates
which move should be chosen next time and what its success rate is likely to be.

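Step 5- Final Design: The final design emerges after the system has worked through a
number of examples, failures and successes, and correct and incorrect decisions, each
time deciding what the next step should be. Example: Deep Blue, IBM's chess computer,
won a match against the world chess champion Garry Kasparov in 1997 and became the
first computer to defeat a reigning world champion in a match under standard
tournament time controls. (Deep Blue relied mainly on large-scale search and a
hand-tuned evaluation function, so it illustrates a finished game-playing system
rather than a purely learned one.)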

CONCEPTS OF HYPOTHESES :

In machine learning, a hypothesis is a proposed explanation or model for a set of data.


The goal of machine learning is to use available data to develop a hypothesis that
accurately predicts future observations or outcomes.

In supervised learning, a hypothesis takes the form of a function that maps input
variables to output variables. The function is learned from a set of labeled examples,
where the input variables and their corresponding output variables are known. The
goal is to find the function that best fits the training data, and that can be used to
predict the output variables for new, unseen input variables.

In unsupervised learning, a hypothesis takes the form of a model that captures the
underlying structure or patterns in the data. The model is learned from unlabeled data,
and the goal is to identify the structure or patterns that can be used to explain the
data.

Hypothesis space (H):

Hypothesis space is defined as the set of all possible legal hypotheses; hence it is also known as the
hypothesis set. A supervised machine learning algorithm searches this set for the hypothesis that best
describes the target function, i.e. that best maps inputs to outputs.

It is often constrained by the framing of the problem, the choice of model, and the choice of
model configuration.

Hypothesis (h):

It is defined as the approximate function that best describes the target in supervised
machine learning algorithms. It is primarily based on the data as well as on the bias and
restrictions applied to the data.

Hence a hypothesis (h) is a single candidate function that maps inputs to outputs and that
can be evaluated as well as used to make predictions.

For a linear model, the hypothesis (h) can be formulated as follows:

y = mx + b

Where,

y: range (the predicted output)

m: slope of the line, i.e. the change in y divided by the change in x

x: domain (the input)

b: intercept (constant)

VERSION SPACE :
A concept is complete if it covers all positive examples. A concept is consistent if it
covers none of the negative examples. The version space is the set of all complete and
consistent concepts. This set is convex and is fully defined by its least and most general
elements. The key idea in the CANDIDATE-ELIMINATION algorithm is to output a
description of the set of all hypotheses consistent with the training examples.
Representation :
The CANDIDATE-ELIMINATION algorithm finds all describable hypotheses that are
consistent with the observed training examples. In order to define this algorithm
precisely, we begin with a few basic definitions.
First, let us say that a hypothesis is consistent with the training examples if it correctly
classifies these examples.
Definition: A hypothesis h is consistent with a set of training examples D if and only if
h(x) = c(x) for each example (x, c(x)) in D.
Notice the key difference between this definition of consistent and our earlier
definition of satisfies. An example x is said to satisfy hypothesis h when h(x) = 1,
regardless of whether x is a positive or negative example of the target concept.
However, whether such an example is consistent with h depends on the target concept,
and in particular, whether h(x) = c(x).
The CANDIDATE-ELIMINATION algorithm represents the set of all hypotheses
consistent with the observed training examples. This subset of all hypotheses is called
the version space with respect to the hypothesis space H and the training examples D,
because it contains all plausible versions of the target concept.
Definition: The version space, denoted VS_{H,D}, with respect to hypothesis space H
and training examples D, is the subset of hypotheses from H consistent with the
training examples in D.
VS_{H,D} = {h ∈ H | Consistent(h, D)}

PERFORMANCE METRICS :

Confusion Matrix
It is the easiest way to measure the performance of a classification problem where the
output can be of two or more classes. A confusion matrix is a table with two dimensions,
“Actual” and “Predicted”. For a binary classifier its four cells are “True Positives (TP)”,
“True Negatives (TN)”, “False Positives (FP)” and “False Negatives (FN)”, as shown below −

                  Predicted 1             Predicted 0
Actual 1          True Positives (TP)     False Negatives (FN)
Actual 0          False Positives (FP)    True Negatives (TN)

The terms associated with the confusion matrix are explained as follows −

● True Positives (TP) − The case when both the actual class and the predicted class of
the data point are 1.
● True Negatives (TN) − The case when both the actual class and the predicted class of
the data point are 0.
● False Positives (FP) − The case when the actual class of the data point is 0 and the
predicted class is 1.
● False Negatives (FN) − The case when the actual class of the data point is 1 and the
predicted class is 0.

Accuracy
Accuracy is the fraction of predictions our model got right.
Accuracy = number of correct predictions / total number of predictions
i.e., Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision, widely used in document retrieval, is the fraction of the items our ML model
returned as positive that are actually positive (for example, the fraction of returned
documents that are relevant). We can easily calculate it from the confusion matrix
with the help of the following formula −
Precision = TP / (TP + FP)
Recall or Sensitivity
Recall is the fraction of the actual positives that our ML model correctly returned as
positive. We can easily calculate it from the confusion matrix with the help of the
following formula −
Recall = TP / (TP + FN)
Specificity
Specificity, in contrast to recall, is the fraction of the actual negatives that our ML
model correctly returned as negative. We can easily calculate it from the confusion
matrix with the help of the following formula −
Specificity = TN / (TN + FP)
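The four formulas above can be checked with a few lines of Python. This is a minimal sketch: the function name classification_metrics and the counts used in the example are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and specificity from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Return NaN instead of raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical counts for a binary classifier evaluated on 100 data points.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 85/100, precision is 40/45, recall is 40/50 and specificity is 45/50.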
AUC (Area Under ROC curve):
AUC (Area Under Curve)-ROC (Receiver Operating Characteristic) is a performance
metric for classification problems, based on varying the decision threshold. The ROC is
the curve traced out as the threshold changes, and the AUC measures separability. In
simple words, the AUC-ROC metric tells us how capable the model is of distinguishing
between the classes. The higher the AUC, the better the model: 1.0 means perfect
separation and 0.5 is no better than random guessing.
The curve is created by plotting TPR (True Positive Rate), i.e. sensitivity or recall,
against FPR (False Positive Rate), i.e. 1 − specificity, at various threshold values.
[Figure: ROC curve with the AUC as the area beneath it; TPR on the y-axis, FPR on the x-axis]

We can use roc_auc_score function of sklearn.metrics to compute AUC-ROC.
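To make the definition concrete without depending on any library, the following sketch computes the ROC points and the AUC in plain Python. The function names roc_points and roc_auc and the sample labels and scores are assumptions made for illustration; the AUC is computed as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Predict class 1 whenever the score reaches the current threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def roc_auc(y_true, scores):
    """AUC as the probability that a positive example outranks a negative one."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_points(y_true, scores))
print(roc_auc(y_true, scores))  # 3 of the 4 positive/negative pairs are ranked correctly
```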


ROC curve

An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve plots
two parameters:

● True Positive Rate
● False Positive Rate

BIAS VARIANCE DECOMPOSITION :


BIAS :
When a model makes predictions, there is a difference between the values it predicts
on average and the actual/expected values. This difference is known as bias error, or
error due to bias.

o Low Bias: A low bias model makes fewer assumptions about the form of the
target function.
o High Bias: A model with a high bias makes more assumptions and becomes
unable to capture the important features of our dataset. A high bias
model also cannot perform well on new data.

VARIANCE :
Variance describes how much the prediction of the target function would change if a
different training data set were used.

o Low Variance: There is a small variation in the prediction of the target function
with changes in the training data set.
o High Variance: There is a large variation in the prediction of the target
function with changes in the training data set.
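The decomposition named in this section's heading can be written out for squared error. The notation is an assumption, not taken from the notes: the data are assumed to follow y = f(x) + ε with zero-mean noise ε of variance σ², ĥ(x) is the model's prediction, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{h}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{h}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{h}(x) - \mathbb{E}[\hat{h}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Read this way, a high-bias model is wrong on average, while a high-variance model is wrong in a different way for each training set.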

DECISION TREE LEARNING :

o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
o It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents
the outcome.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
OR

What is a Decision Tree?

A decision tree is a supervised machine learning algorithm used to build classification
and regression models in the form of a tree structure.

In a decision tree, each:

● Node is a feature (attribute)
● Branch is a decision (rule)
● Leaf is an outcome (categorical or continuous)

Decision Tree for Boolean Functions Machine Learning:

Every variable in a Boolean function, such as A, B, C, etc., has two possible values:
True and False.

Every Boolean function evaluates to either True or False.

If the Boolean function is True we write YES (Y).

If the Boolean function is False we write NO (N).


What is an ID3 Algorithm?
ID3 stands for Iterative Dichotomiser 3.
It is a classification algorithm that follows a greedy approach: at each step it selects
the best attribute, i.e. the one that yields the maximum Information Gain (IG) or,
equivalently, the minimum weighted Entropy (H) after the split.

What is Entropy and Information gain?


Entropy is a measure of the amount of uncertainty in the dataset S. The mathematical
representation of entropy is shown here -
H(S) = − ∑ p(i) · log₂(p(i)), summed over all classes i ∈ I
Where,

● S - The current dataset for which entropy is being calculated (changes at every
iteration of the ID3 algorithm).
● I - The set of classes in S {example - I = {yes, no}}
● p(i) - The proportion of the number of elements in class i to the number of elements
in set S.
In ID3, the weighted average entropy after splitting is calculated for each remaining
attribute. The attribute with the smallest such entropy is used to split the set S on
that particular iteration.

Entropy = 0 implies it is of pure class, that means all are of same category.

Information Gain IG(A) tells us how much uncertainty in S was reduced after
splitting set S on attribute A. The mathematical representation of information gain is
shown here –

IG(A) = H(S) − ∑ p(t) · H(t), summed over all subsets t ∈ T

Where,

● H(S) - Entropy of set S.
● T - The subsets created from splitting set S by attribute A.
● p(t) - The proportion of the number of elements in t to the number of elements in
set S.
● H(t) - Entropy of subset t.
In ID3, information gain can be calculated (instead of entropy) for each remaining
attribute. The attribute with the largest information gain is used to split the set S on
that particular iteration.
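The two quantities defined above can be sketched in Python as follows. The function names entropy and information_gain, the dictionary layout of the rows, and the single "wind" attribute in the example are assumptions made for illustration; the class counts (9 yes, 5 no) are chosen to match the worked example later in these notes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(S): sum over classes of -p(i) * log2(p(i))."""
    total = len(labels)
    return sum(-(count / total) * log2(count / total)
               for count in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """IG(A): H(S) minus the weighted entropy of the subsets produced by attribute A."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(subset) / total * entropy(subset)
                   for subset in subsets.values())
    return entropy(labels) - weighted


# Hypothetical data: 8 "weak" rows (6 yes, 2 no) and 6 "strong" rows (3 yes, 3 no).
rows = [{"wind": "weak"}] * 8 + [{"wind": "strong"}] * 6
labels = ["yes"] * 6 + ["no"] * 2 + ["yes"] * 3 + ["no"] * 3

print(round(entropy(labels), 3))
print(round(information_gain(rows, labels, "wind"), 3))
```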

What are the steps in ID3 algorithm?


The steps in ID3 algorithm are as follows:
1. Calculate the entropy of the dataset.
2. For each attribute/feature:
2.1. Calculate the entropy for each of its categorical values.
2.2. Calculate the information gain for the feature.
3. Select the feature with maximum information gain and split on it.
4. Repeat on each resulting subset until we get the desired tree.
ID3 ALGORITHM :
ID3(Examples, Target_attribute, Attributes)

● Examples are the training examples.
● Target_attribute is the attribute whose value is to be predicted by the tree.
● Attributes is a list of other attributes that may be tested by the learned decision
tree.
● Returns a decision tree that correctly classifies the given Examples.

● Create a Root node for the tree
● If all Examples are positive, Return the single-node tree Root, with label = +/Yes
● If all Examples are negative, Return the single-node tree Root, with label = -/No
● If Attributes is empty, Return the single-node tree Root, with label = most
common value of Target_attribute in Examples
● Otherwise Begin
    o A <--- the attribute from Attributes that best* classifies Examples
    o The decision attribute for Root <--- A
    o For each possible value, Vi, of A,
        ♦ Add a new tree branch below Root, corresponding to the test A = Vi
        ♦ Let Examples_Vi be the subset of Examples that have value Vi for A
        ♦ If Examples_Vi is empty
            Then below this new branch add a leaf node with label = most
            common value of Target_attribute in Examples
        ♦ Else below this new branch add the subtree
            ID3(Examples_Vi, Target_attribute, Attributes - {A})
● End
● Return Root

* The best attribute is the one with the highest information gain.
Here, the dataset has binary classes (yes and no), where 9 out of 14 examples are "yes"
and 5 out of 14 are "no".

Complete entropy of dataset is:

H(S) = - p(yes ) * log2(p(yes )) - p(no) * log2(p(no))


= - (9/14) * log2(9/14) - (5/14) * log2(5/14)
= - (-0.41) - (-0.53)
= 0.94

For each attribute of the dataset, let's follow step 2 of the procedure above (log denotes log2 throughout) -

First Attribute - Outlook

Categorical values - sunny, overcast and rain


H(Outlook=sunny) = -(2/5)*log(2/5)-(3/5)*log(3/5) =0.971
H(Outlook=rain) = -(3/5)*log(3/5)-(2/5)*log(2/5) =0.971
H(Outlook=overcast) = -(4/4)*log(4/4)-0 = 0

Average Entropy Information for Outlook -


I(Outlook) = p(sunny) * H(Outlook=sunny) + p(rain) * H(Outlook=rain) + p(overcast) *
H(Outlook=overcast)
= (5/14)*0.971 + (5/14)*0.971 + (4/14)*0
= 0.693

Information Gain = H(S) - I(Outlook)

= 0.94 - 0.693
= 0.247

Second Attribute - Temperature


Categorical values - hot, mild, cool
H(Temperature=hot) = -(2/4)*log(2/4)-(2/4)*log(2/4) = 1
H(Temperature=cool) = -(3/4)*log(3/4)-(1/4)*log(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log(4/6)-(2/6)*log(2/6) = 0.9179
Average Entropy Information for Temperature -
I(Temperature) = p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) +
p(cool)*H(Temperature=cool)
= (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811
= 0.9108

Information Gain = H(S) - I(Temperature)


= 0.94 - 0.9108
= 0.0292

Third Attribute - Humidity

Categorical values - high, normal


H(Humidity=high) = -(3/7)*log(3/7)-(4/7)*log(4/7) = 0.985
H(Humidity=normal) = -(6/7)*log(6/7)-(1/7)*log(1/7) = 0.592

Average Entropy Information for Humidity -

I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
= (7/14)*0.985 + (7/14)*0.592
= 0.788

Information Gain = H(S) - I(Humidity)

= 0.94 - 0.788
= 0.152

Fourth Attribute - Wind

Categorical values - weak, strong


H(Wind=weak) = -(6/8)*log(6/8)-(2/8)*log(2/8) = 0.811
H(Wind=strong) = -(3/6)*log(3/6)-(3/6)*log(3/6) = 1

Average Entropy Information for Wind -


I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
= (8/14)*0.811 + (6/14)*1
= 0.892

Information Gain = H(S) - I(Wind)


= 0.94 - 0.892
= 0.048

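Here, the attribute with maximum information gain is Outlook, so it becomes the root of
the decision tree, with one branch for each of its values (sunny, overcast and rain).

Here, when Outlook == overcast, it is of pure class (Yes), so that branch ends in a leaf
labelled Yes.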

Now, finding the best attribute for splitting the data with Outlook=Sunny {Dataset
rows = [1, 2, 8, 9, 11]}.

Complete entropy of Sunny is -

H(Sunny) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (2/5) * log2(2/5) - (3/5) * log2(3/5)
= 0.971

First Attribute - Temperature

Categorical values - hot, mild, cool


H(Sunny, Temperature=hot) = -0-(2/2)*log(2/2) = 0
H(Sunny, Temperature=cool) = -(1)*log(1)- 0 = 0
H(Sunny, Temperature=mild) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1
Average Entropy Information for Temperature -
I(Sunny, Temperature) = p(Sunny, hot)*H(Sunny, Temperature=hot) + p(Sunny, mild)*H(Sunny,
Temperature=mild) + p(Sunny, cool)*H(Sunny, Temperature=cool)
= (2/5)*0 + (1/5)*0 + (2/5)*1
= 0.4

Information Gain = H(Sunny) - I(Sunny, Temperature)


= 0.971 - 0.4
= 0.571
Second Attribute - Humidity

Categorical values - high, normal


H(Sunny, Humidity=high) = - 0 - (3/3)*log(3/3) = 0
H(Sunny, Humidity=normal) = -(2/2)*log(2/2)-0 = 0

Average Entropy Information for Humidity -


I(Sunny, Humidity) = p(Sunny, high)*H(Sunny, Humidity=high) + p(Sunny, normal)*H(Sunny,
Humidity=normal)
= (3/5)*0 + (2/5)*0
= 0

Information Gain = H(Sunny) - I(Sunny, Humidity)


= 0.971 - 0
= 0.971

Third Attribute - Wind

Categorical values - weak, strong


H(Sunny, Wind=weak) = -(1/3)*log(1/3)-(2/3)*log(2/3) = 0.918
H(Sunny, Wind=strong) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1

Average Entropy Information for Wind -


I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny, strong)*H(Sunny, Wind=strong)
= (3/5)*0.918 + (2/5)*1
= 0.9508

Information Gain = H(Sunny) - I(Sunny, Wind)


= 0.971 - 0.9508
= 0.0202

Here, the attribute with maximum information gain is Humidity, so it is placed under
the Outlook = Sunny branch of the decision tree.
Here, when Outlook = Sunny and Humidity = High, it is a pure class of category "no". And
when Outlook = Sunny and Humidity = Normal, it is again a pure class of category "yes".
Therefore, we don't need to do further calculations.

Now, finding the best attribute for splitting the data with Outlook=Rain {Dataset
rows = [4, 5, 6, 10, 14]}.

Complete entropy of Rain is -

H(Rain) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (3/5) * log2(3/5) - (2/5) * log2(2/5)
= 0.971

First Attribute - Temperature

Categorical values - mild, cool


H(Rain, Temperature=cool) = -(1/2)*log(1/2)- (1/2)*log(1/2) = 1
H(Rain, Temperature=mild) = -(2/3)*log(2/3)-(1/3)*log(1/3) = 0.918
Average Entropy Information for Temperature -
I(Rain, Temperature) = p(Rain, mild)*H(Rain, Temperature=mild) + p(Rain, cool)*H(Rain,
Temperature=cool)
= (3/5)*0.918 + (2/5)*1
= 0.9508

Information Gain = H(Rain) - I(Rain, Temperature)


= 0.971 - 0.9508
= 0.0202

Second Attribute - Wind

Categorical values - weak, strong


H(Rain, Wind=weak) = -(3/3)*log(3/3)-0 = 0
H(Rain, Wind=strong) = 0-(2/2)*log(2/2) = 0

Average Entropy Information for Wind -

I(Rain, Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain, Wind=strong)
= (3/5)*0 + (2/5)*0
= 0

Information Gain = H(Rain) - I(Rain, Wind)


= 0.971 - 0
= 0.971
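Here, the attribute with maximum information gain is Wind, so it is placed under the
Outlook = Rain branch of the decision tree.

Here, when Outlook = Rain and Wind = Strong, it is a pure class of category "no". And
when Outlook = Rain and Wind = Weak, it is again a pure class of category "yes".
And this is our final desired tree for the given dataset: Outlook is at the root;
Overcast leads directly to Yes; Sunny is split on Humidity (High: No, Normal: Yes);
and Rain is split on Wind (Strong: No, Weak: Yes).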
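The worked example can be reproduced with a short ID3 sketch. The dataset table itself is not reproduced in these notes, so the 14 rows below are an assumption: they are the standard PlayTennis rows, which agree with every count and row index used in the calculations above. The names ROWS, DATA, entropy, gain and id3 are likewise illustrative.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
# Assumed dataset (rows 1 to 14); the last column is the target "Play".
ROWS = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]
DATA = [dict(zip(ATTRIBUTES + ["Play"], row)) for row in ROWS]


def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def gain(examples, attribute, target):
    total = len(examples)
    weighted = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        weighted += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - weighted


def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # pure node: return the class as a leaf
        return labels[0]
    if not attributes:                 # nothing left to test: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Branch only on values that occur in the examples, so no subset is empty.
    for value in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, target, remaining)
    return tree


tree = id3(DATA, "Play", ATTRIBUTES)
print(round(gain(DATA, "Outlook", "Play"), 3))
print(tree)
```

The nested dictionary it prints has Outlook at the root, Humidity under sunny and Wind under rain, which is the tree derived by hand above.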
Issues in Decision tree learning :
● Avoiding overfitting the data.
● Incorporating continuous-valued attributes.
● Alternative measures for selecting attributes.
● Handling training examples with missing attribute values.
● Handling attributes with differing costs.
INDUCTIVE BIAS IN DECISION TREE :
● Given a collection of training examples, there are typically many decision trees
consistent with these examples.
● Describing the inductive bias of ID3 therefore consists of describing the basis by
which it chooses one of these consistent hypotheses over the others.
● Which of these decision trees does ID3 choose?
● It chooses the first acceptable tree it encounters in its simple-to-complex,
hill-climbing search through the space of possible trees.
1) Approximate inductive bias of ID3:
o Shorter trees are preferred over longer trees.
2) A closer approximation to the inductive bias of ID3:
o Shorter trees are preferred over longer trees.
o Trees that place high information gain attributes close to the root are preferred
over those that do not.
FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS :
How can we use the more-general-than partial ordering to organize the search for a
hypothesis consistent with the observed training examples? One way is to begin with
the most specific possible hypothesis in H, then generalize this hypothesis each time it
fails to cover an observed positive training example. To be more precise about how the
partial ordering is used, consider the FIND-S algorithm.
Algorithm
● Initialize h to the most specific hypothesis in H
● For each positive training instance x
    o For each attribute constraint ai in h
        ♦ If the constraint ai is satisfied by x
          then do nothing
        ♦ Else replace ai in h by the next more general constraint that is
          satisfied by x
● Output hypothesis h
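The FIND-S steps above can be sketched in Python for hypotheses that are conjunctions of attribute constraints. The function name find_s, the use of "?" for "any value", and the four weather-style training examples are assumptions made for illustration and are not taken from these notes.

```python
def find_s(examples):
    """Return the maximally specific hypothesis covering all positive examples.

    examples: list of (attribute_tuple, is_positive) pairs.
    A hypothesis is a tuple whose entries are a required value or "?" (any value).
    None stands for the most specific hypothesis, which covers no instance.
    """
    h = None
    for x, is_positive in examples:
        if not is_positive:
            continue  # FIND-S ignores negative examples
        if h is None:
            h = list(x)  # first positive example: adopt its values as constraints
        else:
            # Keep constraints the example satisfies, generalize the rest to "?".
            h = [hv if hv == xv else "?" for hv, xv in zip(h, x)]
    return tuple(h) if h is not None else None


# Hypothetical training examples: (Sky, AirTemp, Humidity, Wind, Water, Forecast).
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
hypothesis = find_s(examples)
print(hypothesis)
```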
CANDIDATE-ELIMINATION LEARNING ALGORITHM :
The CANDIDATE-ELIMINATION algorithm computes the version space containing all
hypotheses from H that are consistent with an observed sequence of training
examples.
It begins by initializing the version space to the set of all hypotheses in H; that is, by
initializing the G boundary set to contain the most general hypothesis in H
G0 <--- {(?, ?, ?, ?, ?, ?)}
and initializing the S boundary set to contain the most specific (least general)
hypothesis
S0 <--- {(Ø, Ø, Ø, Ø, Ø, Ø)}
These two boundary sets delimit the entire hypothesis space, because every other
hypothesis in H is both more general than S0 and more specific than G0. As each
training example is considered, the S and G boundary sets are generalized and
specialized, respectively, to eliminate from the version space any hypotheses found
inconsistent with the new training example. After all examples have been processed,
the computed version space contains all the hypotheses consistent with these
examples and only these hypotheses.
The algorithm is summarized below.
● Initialize G to the set of maximally general hypotheses in H
● Initialize S to the set of maximally specific hypotheses in H
● For each training example d, do
    o If d is a positive example
        ♦ Remove from G any hypothesis inconsistent with d
        ♦ For each hypothesis s in S that is not consistent with d
            - Remove s from S
            - Add to S all minimal generalizations h of s such that
              h is consistent with d, and some member of G is
              more general than h
            - Remove from S any hypothesis that is more general
              than another hypothesis in S
    o If d is a negative example
        ♦ Remove from S any hypothesis inconsistent with d
        ♦ For each hypothesis g in G that is not consistent with d
            - Remove g from G
            - Add to G all minimal specializations h of g such that
              h is consistent with d, and some member of S is
              more specific than h
            - Remove from G any hypothesis that is less general than
              another hypothesis in G
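A compact Python sketch of the algorithm for conjunctive hypotheses is given below. The representation is an assumption: a hypothesis is a tuple whose entries are a required value, "?" for any value, or None for Ø. The attribute domains and the four training examples are hypothetical and serve only to exercise the code.

```python
def matches(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))


def more_general_or_equal(h1, h2):
    """True if every instance covered by h2 is also covered by h1."""
    if None in h2:  # h2 covers nothing, so anything is at least as general
        return True
    return all(a == "?" or a == b for a, b in zip(h1, h2))


def candidate_elimination(examples, domains):
    n = len(domains)
    S = {tuple([None] * n)}   # most specific boundary
    G = {tuple(["?"] * n)}    # most general boundary
    for x, positive in examples:
        if positive:
            G = {g for g in G if matches(g, x)}
            new_S = set()
            for s in S:
                if matches(s, x):
                    new_S.add(s)
                    continue
                # For conjunctive hypotheses the minimal generalization is unique.
                h = tuple(xv if sv is None else (sv if sv == xv else "?")
                          for sv, xv in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.add(h)
            S = {s for s in new_S
                 if not any(s != t and more_general_or_equal(s, t) for t in new_S)}
        else:
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                    continue
                # Minimal specializations: fix one "?" to a value that excludes x.
                for i, values in enumerate(domains):
                    if g[i] != "?":
                        continue
                    for v in values:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                new_G.add(h)
            G = {g for g in new_G
                 if not any(g != t and more_general_or_equal(t, g) for t in new_G)}
    return S, G


# Hypothetical attribute domains and training examples.
domains = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(examples, domains)
print("S =", S)
print("G =", G)
```

The version space is then every hypothesis lying between the returned S and G boundaries in the more-general-than ordering.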

Hypothesis: A supposition or proposed explanation made on the basis of limited
evidence as a starting point for further investigation.
Overfitting :
Overfitting is an undesirable machine learning behaviour that occurs when the machine
learning model gives accurate predictions for training data but not for new data. When
data scientists use machine learning models for making predictions, they first train the
model on a known data set. Then, based on this information, the model tries to predict
outcomes for new data sets. An over-fit model can give inaccurate predictions and
cannot perform well for all types of new data.

SOLUTION FOR OVERFITTING:


1. Early stopping
Early stopping pauses the training phase before the machine learning model
learns the noise in the data. However, getting the timing right is important; else
the model will still not give accurate results.
2. Pruning
Feature selection, here loosely called pruning, identifies the most important features
within the training set and eliminates irrelevant ones. (For decision trees, pruning
more specifically means cutting back branches that do not improve accuracy on
held-out data.)
For example, to predict if an image is an animal or human, you can look at
various input parameters like face shape, ear position, body structure, etc. You
may prioritize face shape and ignore the shape of the eyes.
3. Regularization
Regularization is a collection of training/optimization techniques that seek to
reduce overfitting. These methods try to eliminate those factors that do not
impact the prediction outcomes by grading features based on importance.
For example, mathematical calculations apply a penalty value to features with
minimal impact. Consider a statistical model attempting to predict the housing
prices of a city in 20 years. Regularization would give a lower penalty value to
features like population growth and average annual income but a higher penalty
value to the average annual temperature of the city.
4. Ensembling
Ensembling combines predictions from several separate machine learning
algorithms. Some models are called weak learners because their results are often
inaccurate.
Ensemble methods combine all the weak learners to get more accurate results.
They use multiple models to analyze sample data and pick the most accurate
outcomes.
The two main ensemble methods are bagging and boosting. Boosting trains
different machine learning models one after another to get the final result, while
bagging trains them in parallel.
5. Data augmentation
Data augmentation is a machine learning technique that changes the sample
data slightly every time the model processes it.
You can do this by changing the input data in small ways. When done in
moderation, data augmentation makes the training sets appear unique to the
model and prevents the model from memorizing the specific characteristics of
individual training samples.
For example, applying transformations such as translation, flipping, and rotation
to input images.
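Of the five remedies listed, early stopping (item 1) is the easiest to sketch in code. The rule below is one common variant, assumed here for illustration: stop once the validation loss has not improved for a given number of epochs (the patience) and keep the best epoch seen so far. The function name, the patience value and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss before stopping."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical validation losses: they fall at first, then rise as the model overfits.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_at = early_stopping_epoch(val_losses, patience=2)
print(stop_at)
```

Stopping too early leaves the model underfit, while a patience that is too large lets overfitting set in, which is the timing issue noted under item 1.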

DEALING WITH CONTINUOUS VALUES


These are the steps:
● Pool data into N bins to build a histogram of the feature. ...
● Replace the input feature with the bin index and treat it as a categorical input
(i.e. which bin does this value correspond to?);
● Use a lookup table to extract dense vector embeddings for each bin;
● Pass the vector embedding downstream.
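The binning steps above can be sketched in plain Python. Everything specific here is an assumption: equal-width bins, N = 4, the sample ages, and a randomly initialized lookup table standing in for embeddings that would normally be learned during training.

```python
import random


def make_bins(values, n_bins):
    """Return (lowest value, bin width) for equal-width bins over the data range."""
    lo, hi = min(values), max(values)
    return lo, (hi - lo) / n_bins


def bin_index(value, lo, width, n_bins):
    """Map a continuous value to the index of the bin it falls into."""
    if width == 0:
        return 0
    index = int((value - lo) / width)
    return min(max(index, 0), n_bins - 1)  # clamp so the maximum lands in the last bin


# Hypothetical continuous feature.
ages = [18, 22, 35, 41, 58, 63, 70]
n_bins = 4
lo, width = make_bins(ages, n_bins)

# Steps 1 and 2: histogram of the feature, then replace each value by its bin index.
indices = [bin_index(a, lo, width, n_bins) for a in ages]
histogram = [indices.count(b) for b in range(n_bins)]

# Step 3: a lookup table holding one dense vector per bin (random placeholder values).
random.seed(0)
embedding_dim = 3
lookup_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                for _ in range(n_bins)]

# Step 4: the vectors that would be passed downstream.
embedded = [lookup_table[i] for i in indices]
print(indices)
print(histogram)
```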
Machine Learning
Unit 2
Supervised Learning with KNN, ANN and SVM

Instance based Learning


❖K Nearest Neighbours
❖ The most basic instance-based method is the k-NEAREST
NEIGHBOUR algorithm.
❖ K nearest neighbours stores all available cases and classifies new
cases based on a similarity measure.
❖ One of the top data mining algorithms used today.
❖ The nearest neighbours of an instance are defined in terms of the
standard Euclidean distance.
Let us assume that (x1, y1) and (x2, y2) are two points in two-dimensional
space.
The Euclidean distance between them is given by:
d = √[(x2 – x1)^2 + (y2 – y1)^2]

Training algorithm:
● For each training example (x, f(x)), add the example to the list
training_examples

Classification algorithm:

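● Given a query instance xq to be classified,
● Let x1, ..., xk denote the k instances from training_examples that are
nearest to xq
● Return
f̂(xq) <--- argmax over v ∈ V of ∑ δ(v, f(xi)), summed over i = 1, ..., k

where V is the set of possible class labels, δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.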
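The training and classification steps above amount to storing the examples and taking a majority vote among the k nearest ones. The sketch below assumes Euclidean distance via math.dist (Python 3.8+); the function name knn_classify, the value k = 3 and the two-dimensional sample points are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(training_examples, xq, k=3):
    """Return the most common label among the k training examples nearest to xq."""
    # "Training" is just storing the examples; all work happens at query time.
    neighbours = sorted(training_examples, key=lambda ex: dist(ex[0], xq))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points (x, f(x)) in two-dimensional space.
training_examples = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(knn_classify(training_examples, (2, 2)))
print(knn_classify(training_examples, (6, 5)))
```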


Problems:
Artificial Neural Networks
➔Introduction

● Artificial Neural Networks (ANNs) are a class of machine learning
algorithms inspired by the structure and function of the human brain.
● ANNs consist of interconnected nodes or neurons that process and transmit
information. The nodes are organized into layers, with each layer
performing a specific function in the network's computation.
● The basic building block of an ANN is the neuron, which receives input
signals from other neurons or external sources, processes them using a
non-linear function, and produces an output signal that can be passed to
other neurons or used as the final output of the network. The connections
between neurons are weighted, and the weights are adjusted during
training to optimize the network's performance on a given task.
● ANNs can be used for a wide range of tasks, including classification,
regression, and pattern recognition. They have been successfully applied in
areas such as speech recognition, image recognition, natural language
processing, and game playing.
● Training an ANN typically involves feeding it a large set of training data and
adjusting the weights of the connections between neurons to minimize a
predefined loss function. This process is typically done using
backpropagation, which involves computing the gradient of the loss function
with respect to the network's weights and adjusting the weights accordingly
using gradient descent.
● ANNs can be designed with different architectures, including feedforward
networks, recurrent networks, convolutional networks, and others,
depending on the specific task and data being used. The choice of
architecture and hyperparameters can have a significant impact on the
performance of the network.
➔Perceptron

● Perceptrons are one of the simplest types of artificial neural networks
used in machine learning. They were introduced by Frank Rosenblatt in the
1950s and are designed to perform binary classification tasks, i.e., to classify
input data into one of two categories.
● One type of ANN system is based on a unit called a perceptron, as
shown below. A perceptron takes a vector of real-valued inputs,
calculates a linear combination of these inputs, then outputs a 1 if the
result is greater than some threshold and -1 otherwise.
● More precisely, given inputs x1 through xn, the output o(x1, . . . , xn)
computed by the perceptron is

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

where each wi is a real-valued constant, or weight, that determines
the contribution of input xi to the perceptron output. Notice the quantity (-w0)
is a threshold that the weighted combination of inputs w1x1 + . . . + wnxn must
surpass in order for the perceptron to output a 1.

The Perceptron Training Rule

● One way to learn an acceptable weight vector is to begin with random
weights, then iteratively apply the perceptron to each training example,
modifying the perceptron weights whenever it misclassifies an example.
● This process is repeated, iterating through the training examples as
many times as needed until the perceptron classifies all training
examples correctly.
● Weights are modified at each step according to the perceptron training
rule, which revises the weight wi associated with input xi according to
the rule

$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\,(t - o)\,x_i$$

● Here t is the target output for the current training example.
● o is the output generated by the perceptron.
● η is a positive constant called the learning rate (usually set to some small
value like 0.1).
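A small sketch of the perceptron training rule on the logical AND function, using targets of +1 and -1. Two choices here are assumptions for the example rather than part of the notes: the weights start at zero instead of random values so the run is reproducible, and training stops after `max_epochs` passes even if errors remain.

```python
def predict(weights, x):
    """Perceptron output: 1 if w0 + w1*x1 + ... + wn*xn > 0, otherwise -1."""
    total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i on every misclassified example."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * (n_inputs + 1)          # weights[0] is w0 (its input is fixed at 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = predict(weights, x)
            if o != t:
                errors += 1
                weights[0] += eta * (t - o)
                for i, xi in enumerate(x, start=1):
                    weights[i] += eta * (t - o) * xi
        if errors == 0:                       # every training example classified correctly
            break
    return weights

and_examples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights = train_perceptron(and_examples)
```

Because AND is linearly separable, the loop ends with all four examples classified correctly; on data that is not linearly separable the rule would keep changing the weights until `max_epochs` is reached.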
➔Multilayer Networks and Backpropagation

Backpropagation
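1. Backpropagation, or backward propagation of errors, is an algorithm
that propagates the error measured at the output nodes back toward the
input nodes in order to work out how each weight should be adjusted. It is an
important mathematical tool for improving the
accuracy of predictions in data mining and machine learning.

2. It minimizes the error E, summed over all of the network output units:

$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in \text{outputs}} (t_{kd} - o_{kd})^2$$

where D is the set of training examples, outputs is the set of output units, and
tkd and okd are the target and output values associated with the kth output unit
and training example d.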
3. The learning problem faced by BACKPROPAGATION is to search a
large hypothesis space defined by all possible weight values for all the
units in the network

Multilayer Networks

1. Multilayer networks learned by the BACKPROPAGATION algorithm are
capable of expressing a rich variety of nonlinear decision surfaces.

2. A typical multilayer network and decision surface is depicted below:

[Figure: a multilayer network and its decision surface for a vowel recognition task]

3. In this example, the speech recognition task involves distinguishing
among 10 possible vowels, all spoken in the context of "h-d" (i.e., "hid," "had,"
"head," "hood," etc.).

4. The input speech signal is represented by two numerical parameters
obtained from a spectral analysis of the sound, allowing us to easily visualize
the decision surface over the two-dimensional instance space.

5. It is possible for the multilayer network to represent highly nonlinear
decision surfaces that are much more expressive than the linear decision
surfaces of single units.

A Differentiable Threshold Unit:

1. Multiple layers of cascaded linear units still produce only linear functions,
and we prefer networks capable of representing highly nonlinear functions.

2. The sigmoid unit is a unit very much like a perceptron, but based on a
smoothed, differentiable threshold function.
3. The sigmoid unit is illustrated below:

[Figure: the sigmoid threshold unit]

The sigmoid unit first computes a linear combination of its inputs, then applies
a threshold to the result.

4. The threshold output is a continuous function of its input. More precisely,
the sigmoid unit computes its output o as

$$o = \sigma(\vec{w} \cdot \vec{x}), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$

where σ is the sigmoid function.

The term e^{-y} in the sigmoid function definition is sometimes replaced by
e^{-k·y}, where k is a positive constant that determines the steepness of the threshold.

➔Activation Units

In artificial neural networks, each neuron forms a weighted sum of
its inputs and passes the resulting scalar value through a function
referred to as an activation function.
1. Sigmoid

Sigmoid accepts a number as input and returns a number between 0 and 1. It is
simple to use and has many desirable qualities of an activation function:
nonlinearity, continuous differentiability, monotonicity, and a fixed output range.
It is mainly used in binary classification problems, where its output can be read
as the probability that the input belongs to a particular class.

Mathematically, it can be represented as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

2. ReLU (Rectified Linear Unit)

ReLU stands for Rectified Linear Unit and is one of the most commonly used
activation functions. It mitigates the vanishing gradient problem because its
gradient is exactly one for every positive input, so the unit does not saturate
on the positive side. For negative inputs, however, both the output and the slope
are zero. The range of ReLU is from 0 to infinity.

Mathematically, it can be represented as:

$$f(x) = \max(0, x)$$


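3. Leaky ReLU

Leaky ReLU is a modified version of the ReLU activation function that addresses
the dying ReLU problem (units stuck at zero output) by using a small positive
slope in the negative region. However, it is not yet clear whether the benefit
is consistent across tasks.

Mathematically, it can be represented as:

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases}$$

where α is a small positive constant such as 0.01.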
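The three activation functions in this section can be written in a few lines. This is a scalar sketch for illustration; the slope `alpha=0.01` for Leaky ReLU is a commonly used value and is an assumption here.

```python
import math

def sigmoid(x):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Passes positive inputs through unchanged and outputs 0 otherwise."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return x if x > 0 else alpha * x

outputs = [(sigmoid(v), relu(v), leaky_relu(v)) for v in (-2.0, 0.0, 2.0)]
```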


Support Vector Machines

Support Vector Machines (SVMs) are a type of machine learning algorithm used for
classification and regression analysis. SVMs work by finding a boundary or hyperplane that
maximally separates classes or groups of data in high dimensional space. This boundary is
chosen to have the largest margin or distance to the nearest data points of the different
classes, making the SVM robust to noise and outliers in data.

SVMs have many practical applications in fields such as image recognition, natural language
processing, and finance, to name a few. They are effective in solving linear and nonlinear
classification problems, as well as regression tasks. With kernel functions, SVMs can also
handle data that is not linearly separable.

One of the limitations of SVMs is that they can be computationally expensive when trained
on large datasets, or when the number of features is very high in the data. Additionally,
SVMs require careful tuning of hyperparameters such as the kernel function and
regularization coefficient in order to achieve optimal performance. Finally, the interpretability
of the SVM model is limited as compared to other machine learning models like decision
trees or linear regression.

➢Margin and Maximization


Support Vectors: These are the points that are closest to the hyperplane. The
separating line is defined with the help of these data points.

Margin: It is the distance between the hyperplane and the observations closest
to the hyperplane (support vectors). In SVM, a large margin is considered a good
margin.
In Support Vector Machine (SVM), margin refers to the distance between the
decision boundary and the closest data points from both classes. The larger
the margin, the better the SVM model's generalization ability and robustness
against overfitting. Therefore, SVM with a larger margin is more preferable as
it helps to minimize the classification error and generalize the model better.

Maximization is an essential aspect of SVM, which aims to find the hyperplane
that separates the two classes with the largest possible margin.
SVM tries to identify the hyperplane for which the distance between two
closest points from different classes is maximum. This process is known as
maximum margin classification, and it helps to create a more robust and
generalized SVM model.

In summary, margin and maximization are critical elements in SVM that help to
create a robust and generalized model with better classification accuracy. By
maximizing the margin, SVM finds the best separating hyperplane for the
given dataset, ensuring the model is more accurate in predicting the unseen
data.

➢The Primal Problem

The Primal Problem in Support Vector Machines (SVM) is a fundamental
optimization problem that involves finding the best hyperplane that separates
two classes of data points. Specifically, the goal is to find the hyperplane that
maximizes the margin, or the distance between the hyperplane and the
closest data points.

The primal problem can be formulated as a constrained optimization problem,
where the objective function is to maximize the margin and the constraints
ensure that all data points are correctly classified and lie on the correct side of
the hyperplane. This problem is solved using Lagrange multipliers to convert it
into its dual form, which is a quadratic programming problem that can be
efficiently solved using different algorithms.

The primal problem in SVM is essential because it serves as the foundation
for developing different SVM algorithms, such as linear SVM, non-linear SVM,
and kernel SVM. By solving the primal problem, one can obtain the optimal
hyperplane and predict the class of new data points based on their position
with respect to the hyperplane.
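For linearly separable data with labels y_i in {-1, +1}, the primal problem is commonly written as below. The symbols w (weight vector), b (bias), and n (number of training points) are notation assumed here; they do not appear in the notes above.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \qquad i = 1, \ldots, n
```

Under this scaling the margin equals 2 / ||w||, so minimizing ||w||^2 is equivalent to maximizing the margin.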

➢The Lagrangian Dual

To solve this optimization problem, we can use the Lagrangian dual technique.
We introduce a set of Lagrange multipliers, one for each constraint, and
construct a new function (the Lagrangian) by combining the objective function and the
constraints. We then obtain the dual problem by minimizing this function with
respect to the original variables and maximizing the result with
respect to the Lagrange multipliers.

The solution to the dual problem provides us with a set of values for the
Lagrange multipliers, which can in turn be used to calculate the optimal
decision boundary. This boundary is determined by the support vectors, which are
the data points that lie closest to the decision boundary and have non-zero
Lagrange multipliers.
The Lagrangian dual technique also has the advantage of allowing us to
handle non-linearly separable data by using the kernel trick. This involves
implicitly mapping the data into a higher-dimensional space where it can be linearly
separable, and then solving the dual problem in that space.

➢Solution to Lagrangian Dual


To solve the Lagrangian dual in SVM, we first need to convert the original
problem into a dual problem. This is done by writing the support vector
machine problem in its Lagrangian form, which involves adding one term per
constraint, weighted by a Lagrange multiplier, to the objective function.

Once we have the Lagrangian form of the problem, we take the partial
derivative of the function with respect to each original variable (the weights
and the bias) and set it to zero. Substituting the resulting equations back into
the Lagrangian leaves a problem in the Lagrange multipliers alone, which can
then be solved for them.

Once we have the Lagrange multipliers, we can use them to find the optimal
solution to the original problem. This is done by substituting the values of the
Lagrange multipliers back into the expressions for the weights and the bias.
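As a worked sketch of these steps, assume the hard-margin primal: minimize (1/2)||w||^2 subject to y_i(w · x_i + b) ≥ 1. The symbols w, b, and the multipliers α_i are notation assumed for this example.

```latex
% Lagrangian with one multiplier per constraint
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \right],
  \qquad \alpha_i \ge 0

% Setting the partial derivatives to zero
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

% Substituting back gives the dual problem
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0
```

Only the training points with α_i > 0 (the support vectors) contribute to w. Because the data enters the dual only through the dot products x_i · x_j, a kernel function can be substituted for them.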
MACHINE LEARNING UNIT-III PROBABILISTIC AND STOCHASTIC MODELS

Bayesian Learning
Bayes Theorem: It is a cornerstone of the Bayesian learning method. It is a way to calculate
the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h).

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

In many learning scenarios the learner considers some set of candidate hypotheses H and is
interested in finding the most probable hypothesis h ∈ H given the observed data D.
Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior
probability of each candidate hypothesis.
More precisely, we will say that hMAP is a MAP hypothesis provided

$$\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(h \mid D) \\
        &= \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} \\
        &= \arg\max_{h \in H} P(D \mid h)\, P(h)
\end{aligned}$$

In the final step we removed P(D) because it is a constant independent of h.

P(D|h) is often called the likelihood of D given h, and any hypothesis that maximizes P(D|h) is called
a maximum likelihood (ML) hypothesis:

$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$

BAYES OPTIMAL CLASSIFIER :-

The Bayes Optimal Classifier is a probabilistic model that predicts the most likely outcome
for a new instance. This section looks at the Bayes optimal classifier and, later, the Naive
Bayes classifier.
It is described using Bayes Theorem, which provides a systematic means of computing a
conditional probability. It is also related to Maximum a Posteriori (MAP), a
probabilistic framework for determining the most likely hypothesis for a training
dataset.

Take a hypothesis space that has 3 hypotheses h1, h2, and h3.
The posterior probabilities of the hypotheses are as follows:
h1 -> 0.4
h2 -> 0.3
h3 -> 0.3
Hence, h1 is the MAP hypothesis. (MAP => max posterior).
Suppose a new instance x is encountered, which is classified negative by h2 and h3 but
positive by h1.
Taking all hypotheses into account, the probability that x is positive is 0.4 and the
probability that it is negative is therefore 0.6.
In this case the most probable classification (negative) differs from the
classification generated by the MAP hypothesis (positive).
If the classification of the new example can be any value vj from a set V, the
probability P(vj|D) that the right classification for the new instance is vj is merely

$$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$

Any normalizing denominator can be omitted, since these values are only compared with
one another and all share the same denominator. The value vj for which P(vj|D) is maximum
is the best classification for the new instance.

A Bayes optimal classifier is a system that classifies new cases according to the equation

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$

This strategy maximizes the probability that the new instance will be classified correctly.

Consider an example for Bayes Optimal Classification:


Let there be 5 hypotheses h1 through h5.

Hypothesis   P(hi|D)   P(F|hi)   P(L|hi)   P(R|hi)
h1           0.4       1         0         0
h2           0.2       0         1         0
h3           0.1       0         0         1
h4           0.1       0         1         0
h5           0.2       0         1         0

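The MAP hypothesis (the one with the highest posterior, 0.4) therefore suggests that the
robot should proceed forward (F). Let's see what the Bayes optimal procedure suggests.

$$\begin{aligned}
P(F \mid D) &= \sum_i P(F \mid h_i)\, P(h_i \mid D) = 0.4 \\
P(L \mid D) &= \sum_i P(L \mid h_i)\, P(h_i \mid D) = 0.2 + 0.1 + 0.2 = 0.5 \\
P(R \mid D) &= \sum_i P(R \mid h_i)\, P(h_i \mid D) = 0.1
\end{aligned}$$

Thus, the Bayes optimal procedure recommends the robot turn left (L).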
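The same comparison can be checked with a short script. The posteriors and per-hypothesis predictions are taken from the table above, assuming its rows correspond to h1 through h5 in order; the variable and function names are chosen for this sketch.

```python
# P(hi|D) for h1..h5, in table order
posteriors = [0.4, 0.2, 0.1, 0.1, 0.2]

# P(v|hi) for each action v in {F, L, R}
predictions = [
    {"F": 1, "L": 0, "R": 0},
    {"F": 0, "L": 1, "R": 0},
    {"F": 0, "L": 0, "R": 1},
    {"F": 0, "L": 1, "R": 0},
    {"F": 0, "L": 1, "R": 0},
]

def bayes_optimal(posteriors, predictions):
    """Return the value v maximizing sum_i P(v|hi) * P(hi|D), plus all scores."""
    labels = predictions[0].keys()
    scores = {
        v: sum(p * pred[v] for p, pred in zip(posteriors, predictions))
        for v in labels
    }
    return max(scores, key=scores.get), scores

# MAP choice: follow only the single most probable hypothesis
map_index = max(range(len(posteriors)), key=lambda i: posteriors[i])
map_choice = max(predictions[map_index], key=predictions[map_index].get)

best, scores = bayes_optimal(posteriors, predictions)
```

Here `map_choice` is "F" while `best` is "L", which is the disagreement the example is meant to show.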

Gibbs Algorithm :

• The Bayes optimal classifier is quite costly to apply. It computes the posterior probabilities for every
hypothesis in H and combines the predictions of each hypothesis to classify each new
instance.
• An alternative (less optimal) method is the Gibbs algorithm, defined as follows:

1. Choose a hypothesis h from H at random, according to the posterior probability
distribution over H.
2. Use h to predict the classification of the next instance x.
• Under certain conditions the expected misclassification error for the Gibbs algorithm is at
most twice the expected error of the Bayes optimal classifier.
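A minimal sketch of the Gibbs algorithm, assuming each hypothesis is represented as a callable and the posterior probabilities are already known. The three constant hypotheses mirror the earlier h1, h2, h3 example; all names are illustrative.

```python
import random

def gibbs_classify(posteriors, hypotheses, x, rng):
    """Draw one hypothesis according to its posterior and use it to classify x."""
    h = rng.choices(hypotheses, weights=posteriors, k=1)[0]
    return h(x)

posteriors = [0.4, 0.3, 0.3]
hypotheses = [
    lambda x: "positive",   # h1
    lambda x: "negative",   # h2
    lambda x: "negative",   # h3
]

rng = random.Random(0)      # seeded so the run is repeatable
label = gibbs_classify(posteriors, hypotheses, x=None, rng=rng)

# Over many draws, about 60% of the answers should be "negative".
draws = [gibbs_classify(posteriors, hypotheses, None, rng) for _ in range(2000)]
negative_fraction = draws.count("negative") / len(draws)
```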

NAIVE BAYES CLASSIFIER


Bayes Theorem is the cornerstone of the Naïve Bayes classifier, because it provides a way to calculate
the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h).

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

$$\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(h \mid D) \\
        &= \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} \\
        &= \arg\max_{h \in H} P(D \mid h)\, P(h)
\end{aligned}$$

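The Bayesian approach to classifying a new instance is to assign the most probable target value,
vMAP, given the attribute values a1, a2, ..., an that describe the instance.

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, \ldots, a_n)$$

We can use Bayes theorem to rewrite the expression as

$$\begin{aligned}
v_{MAP} &= \arg\max_{v_j \in V} \frac{P(a_1, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, \ldots, a_n)} \\
        &= \arg\max_{v_j \in V} P(a_1, \ldots, a_n \mid v_j)\, P(v_j)
\end{aligned}$$

The Naïve Bayes classifier assumes that the attribute values are conditionally independent
given the target value, which gives

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)$$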

PROBLEMS

1. Estimate the conditional probabilities of each attribute {Outlook, Temp, Humidity, Wind}
for the classes {yes, no} using the data table given below. Using these probabilities,
estimate the probability values for the new instance.

Solution :

2. Estimate the conditional probabilities of each attribute {Color, Legs, Height, Smiley} for the
classes {M, H} using the data table given below. Using these probabilities, estimate
the probability values for the new instance
(Color = Green, Legs = 2, Height = Tall, Smiley = No).

Number Color Legs Height Smiley Species


1 White 3 Short Yes M
2 Green 2 Tall No M
3 Green 3 Short Yes M
4 White 3 Short Yes M
5 Green 2 Short No H
6 White 2 Tall No H
7 White 2 Tall No H
8 White 2 Short Yes H

Solution:
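One way to work this out by hand, counting directly from the table above (4 rows of species M, 4 rows of species H) and applying the Naive Bayes product without any smoothing:

```latex
\begin{aligned}
P(M)\,P(\text{Green}\mid M)\,P(\text{Legs}=2\mid M)\,P(\text{Tall}\mid M)\,P(\text{Smiley}=\text{No}\mid M)
  &= \tfrac{4}{8}\cdot\tfrac{2}{4}\cdot\tfrac{1}{4}\cdot\tfrac{1}{4}\cdot\tfrac{1}{4}
   = \tfrac{1}{256} \approx 0.0039 \\[4pt]
P(H)\,P(\text{Green}\mid H)\,P(\text{Legs}=2\mid H)\,P(\text{Tall}\mid H)\,P(\text{Smiley}=\text{No}\mid H)
  &= \tfrac{4}{8}\cdot\tfrac{1}{4}\cdot\tfrac{4}{4}\cdot\tfrac{2}{4}\cdot\tfrac{3}{4}
   = \tfrac{3}{64} \approx 0.0469
\end{aligned}
```

Since 0.0469 > 0.0039, the new instance would be classified as H. After normalizing, P(H | instance) = 12/13 ≈ 0.92.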

3. Classify the new instance (Red, SUV, Domestic) as stolen or not, based on the data set
given below.

Number Color Type Origin Stolen ?


1 Red Sports Domestic Yes
2 Red Sports Domestic No
3 Red Sports Domestic Yes
4 Yellow Sports Domestic No
5 Yellow Sports Imported Yes
6 Yellow SUV Imported No
7 Yellow SUV Imported Yes
8 Yellow SUV Domestic No
9 Red SUV Imported No
10 Red Sports Imported Yes
Solution:
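A possible way to check this problem programmatically is sketched below. It estimates every probability by simple relative frequency from the ten rows above, with no smoothing; the function names are illustrative and not from any library.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)   # (attribute index, label) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            cond_counts[(i, label)][value] += 1
    return class_counts, cond_counts

def classify(instance, class_counts, cond_counts):
    """Return the label maximizing P(v) * prod_i P(a_i | v), plus all scores."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                                    # prior P(v)
        for i, value in enumerate(instance):
            score *= cond_counts[(i, label)][value] / count      # P(a_i | v)
        scores[label] = score
    return max(scores, key=scores.get), scores

rows = [
    ("Red", "Sports", "Domestic"), ("Red", "Sports", "Domestic"),
    ("Red", "Sports", "Domestic"), ("Yellow", "Sports", "Domestic"),
    ("Yellow", "Sports", "Imported"), ("Yellow", "SUV", "Imported"),
    ("Yellow", "SUV", "Imported"), ("Yellow", "SUV", "Domestic"),
    ("Red", "SUV", "Imported"), ("Red", "Sports", "Imported"),
]
labels = ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"]

class_counts, cond_counts = train_naive_bayes(rows, labels)
prediction, scores = classify(("Red", "SUV", "Domestic"), class_counts, cond_counts)
```

The scores come out as 0.5 × 3/5 × 1/5 × 2/5 = 0.024 for "Yes" and 0.5 × 2/5 × 3/5 × 3/5 = 0.072 for "No", so the instance is classified as not stolen.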

4. Classify the new instance (A=0, B=1, C=0) using the data set given below.

Record A B C Class
1 0 0 0 +
2 0 0 1 -
3 0 1 1 -
4 0 1 1 -
5 0 0 1 +
6 1 0 1 +
7 1 0 1 -
8 1 0 1 -
9 1 1 1 +
10 1 0 1 +

Solution :
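A hand calculation for this problem, counting from the table above (5 records of class +, 5 of class −) and using plain relative frequencies:

```latex
\begin{aligned}
P(+)\,P(A=0\mid +)\,P(B=1\mid +)\,P(C=0\mid +)
  &= \tfrac{5}{10}\cdot\tfrac{2}{5}\cdot\tfrac{1}{5}\cdot\tfrac{1}{5} = 0.008 \\[4pt]
P(-)\,P(A=0\mid -)\,P(B=1\mid -)\,P(C=0\mid -)
  &= \tfrac{5}{10}\cdot\tfrac{3}{5}\cdot\tfrac{2}{5}\cdot\tfrac{0}{5} = 0
\end{aligned}
```

The instance is therefore classified as +. Note that the score for − is zero only because C=0 never occurs with class − in the table; a smoothed estimate (for example Laplace smoothing) would avoid a single zero count wiping out the whole product.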

EXPECTATION MAXIMIZATION :

In real-world applications of ML, it is very common that there are many relevant features
available for learning but only a small subset of them is observable.

1. The EM algorithm can be used when the model has latent (unobserved) variables.
2. This algorithm is the basis for many unsupervised clustering algorithms in the
field of Machine Learning.
3. Initially, a set of values for the parameters is chosen. A set of incomplete observed
data is given to the system with the assumption that the observed data comes from a
specific model.
a. The next step is known as the “EXPECTATION-STEP” (or) E-STEP. In this step we use the
observed data and the current parameters to estimate or guess the values of the missing or
incomplete data. It is basically used to update the latent variables.
b. The next step is known as the “MAXIMIZATION-STEP” (or) M-STEP. In this step, we use
the complete data generated in the preceding EXPECTATION-STEP to update
the values of the parameters. It is basically used to update the hypothesis.
4. In this step it is checked whether the values are converging (or) not.
If yes, then stop.
Otherwise repeat the E-STEP and M-STEP until convergence occurs.

ALGORITHM :

Step 1 : Given a set of incomplete data, consider a set of starting parameters.
Step 2 : EXPECTATION-STEP (E-STEP) : using the observed available data of the dataset, estimate
(guess) the values of the missing data.
Step 3 : MAXIMIZATION-STEP (M-STEP) : the complete data generated after the EXPECTATION-STEP
is used to update the parameters.
Step 4 : Repeat step 2 and step 3 until convergence.
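As an illustration of these four steps, the sketch below fits a mixture of two one-dimensional Gaussians, where the "missing data" is which component generated each point. Several things are assumptions made for the example: the synthetic data, the starting parameters, and the use of a fixed number of iterations in place of an explicit convergence test.

```python
import math
import random

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, n_iter=50):
    # Step 1: starting parameters (means at the data extremes, unit variances)
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # Step 2 (E-step): estimate the hidden component memberships
        resp = []
        for x in data:
            p = [weight[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # Step 3 (M-step): update the parameters from the completed data
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(spread, 1e-6)       # guard against a collapsed component
            weight[k] = nk / len(data)
    return mu, var, weight

rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(5.0, 1.0) for _ in range(200)])
mu, var, weight = em_two_gaussians(data)
```

With well-separated components like these, the estimated means end up close to the generating means of 0 and 5. A fuller implementation would stop when the log-likelihood changes by less than a tolerance.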

USES :
• Used to fill in missing data in a sample.
• Used as a basis of unsupervised learning of clusters.
• Used for estimating the parameters of a Hidden Markov Model
(HMM).

FLOWCHART :

START → INITIAL VALUES → EXPECTATION-STEP → MAXIMIZATION-STEP → IS CONVERGENCE REACHED?
NO: return to EXPECTATION-STEP
YES: STOP

ADVANTAGES :

• The likelihood is guaranteed not to decrease with each iteration.
• The E-STEP and M-STEP are often easy to implement for many problems.

DISADVANTAGES:

• It has slow convergence.
• It converges to a local optimum only.
Unit-4
Chapter1:- Unsupervised Learning

Hierarchical Clustering:
• Hierarchical clustering is an unsupervised learning technique that creates clusters in a
predefined order. The clusters are ordered in a top-down or bottom-up manner.
• In this type of clustering, similar clusters are grouped together and arranged in a hierarchy.
• It can be further divided into two types: agglomerative hierarchical clustering and divisive hierarchical
clustering.
• In this clustering we link pairs of clusters until all the data objects are connected in a hierarchy.

Non-Hierarchical Clustering:
• Non-hierarchical clustering forms new clusters by merging or splitting existing clusters.
• It does not follow a tree-like structure as hierarchical clustering does.
• This technique groups the data in order to maximize or minimize some evaluation criterion.
• In this method the data is partitioned into non-overlapping groups that have no hierarchical
relationships between themselves.
• K-means clustering is an effective method of non-hierarchical clustering.

Hierarchical Clustering                 Non-Hierarchical Clustering
• Less reliable.                        • More reliable.
• Slower.                               • Faster.
• Easier to read and understand.        • Difficult to read and understand.
• Relatively unstable.                  • Relatively stable.

Agglomerative Clustering:
• It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
• It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering.
• This clustering algorithm does not require us to specify the number of clusters.
• The bottom-up algorithm treats each data point as a singleton cluster at the outset and then successively
agglomerates pairs of clusters until all clusters have been merged into a single cluster that contains all the
data.
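The bottom-up procedure can be sketched as follows. The notes do not say how the distance between two clusters is measured, so single linkage (distance between the closest pair of points) is an assumption here, as are the sample points. This brute-force version is meant for small inputs only.

```python
import math

def single_linkage(c1, c2):
    """Distance between two clusters: the closest pair of points across them."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[p] for p in points]            # every point starts as a singleton
    merges = []
    while len(clusters) > 1:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
merges = agglomerative(points)
```

The merge history plays the role of the dendrogram: cutting it after a chosen number of merges yields a flat clustering with the corresponding number of clusters.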

Divisive Clustering:
• Also known as the top-down approach. This algorithm also does not require us to specify the number of clusters.
• Top-down clustering requires a method for splitting a cluster. It starts with one cluster that contains the whole
data set and proceeds by splitting clusters recursively until each data point is in its own singleton cluster.
K-means Clustering
K-means is a popular unsupervised machine learning algorithm used for clustering or grouping similar data points in a
dataset. The algorithm tries to partition a given dataset into a predefined number of clusters (K), where each data
point belongs to the cluster with the closest mean value.
K-means Algorithm for Clustering.
• The k-means algorithm is an unsupervised learning algorithm.
• Given a dataset of items, with certain features and values for these features, the algorithm will categorize the
items into k groups or clusters of similarity.
• To calculate the similarity, we can use the Euclidean distance, Manhattan distance, Hamming distance, or cosine
distance as the measurement.
• The pseudocode for implementing the k-means algorithm is given below.
Input: algorithm k-means [K: number of clusters, D: set of data points]
1) Choose K random data points as initial centroids [cluster centres].
2) Repeat until the cluster centres stabilize:
a) Allocate each point in D to the nearest of the K centroids.
b) Compute the centroid of each cluster using all points in the cluster.
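The pseudocode above might be turned into Python along these lines. It uses Euclidean distance; the fixed random seed, the `max_iter` cap, and the sample points are assumptions added for the sketch.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # 1) K random points as initial centroids
    for _ in range(max_iter):                   # 2) repeat until the centres stabilize
        clusters = [[] for _ in range(k)]
        for p in points:                        # a) assign each point to its nearest centroid
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = [                       # b) recompute each centroid as the cluster mean
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]        # keep the old centre if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

group_a = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)]
group_b = [(8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(group_a + group_b, k=2)
```

On well-separated data like this the two groups are recovered whatever the starting centroids are; on harder data the result can depend on the initial choice.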

Flow chart represents the algorithm.

Advantages of k-means algorithm


• The k-means algorithm is simple, easy to understand and easy to implement.
• It is also efficient: the time taken to cluster with k-means rises linearly with the number of data points.
• For compact, well-separated clusters, it often performs as well as more complex clustering algorithms.

Disadvantages of k-means algorithm


• The user needs to specify the value of k in advance.
• The process of finding the clusters may converge to a local minimum rather than the global minimum.
• It is not suitable for discovering clusters that are not spherical.
Bisecting k-means Algorithm
• The k-means algorithm has a few limitations, which are as follows:
(a) It only identifies spherical clusters; that is, it cannot identify clusters that are non-spherical or
of varying size and density.
(b) It suffers from local minima and has a problem when the data contains outliers.
• The bisecting k-means algorithm is a modification of the k-means algorithm. It copes better with clusters of
differing shape and size. This algorithm is convenient because
i) It beats k-means in entropy measurements.
ii) When k is large, bisecting k-means is more effective.
iii) While k-means is known to yield clusters of varied sizes, bisecting k-means results in clusters of
comparable sizes.

Algorithm
1) Initialize the list of clusters to contain a single cluster consisting of all points.
2) Repeat
Select and remove a cluster from the list of clusters. {Perform several trial bisections of the selected cluster.}
for i = 1 to number of trials do
bisect the selected cluster using basic k-means
end for
Select the two clusters from the bisection with the least total SSE [Sum of Squared Errors] and add them to the
list of clusters.
Until the list of clusters contains k clusters.

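K-means as special case of Expectation Maximization

K-means can be viewed as a limiting case of the EM algorithm for a mixture of Gaussians with equal
spherical covariances: assigning each point to its nearest centroid is a hard E-step, and recomputing
the centroids is the M-step.

Applications of K-means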


• Identifying crime localities
With data related to crimes available in specific localities in a city, the category of crime, the area of the crime, and the association
between the two can give quality insight into crime-prone areas within a city or a locality.
• Cyber-profiling criminals
Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber
profiling is derived from criminal profiles, which provide information on the investigation division to classify the types of criminals
who were at the crime scene.
• Call record detail analysis
This information provides greater insights about the customer’s needs when used with customer demographics. We can cluster
customer activities for 24 hours by using the unsupervised k-means clustering algorithm. It is used to understand segments of
customers concerning their usage by hours.
• Anomaly detection
Anomaly detection refers to methods that provide warnings of unusual behaviors that may compromise communication networks'
security and performance. Anomalous behaviors can be identified by comparing the distance between real data and cluster
centroids. Identifying network anomalies is essential for communication networks of enterprises or institutions. The goal is to
provide an early warning about an unusual behavior that can affect the security and the performance of a network.
• Rideshare Data Analysis
The publicly available Uber ride information dataset provides a large amount of valuable data around traffic, transit time, peak
pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also in providing insight into urban
traffic patterns and helping us plan for the cities of the future.
• Document Classification
Cluster documents in multiple categories based on tags, topics, and the content of the document. This is a very standard
classification problem and k-means is a highly suitable algorithm for this purpose. Initial processing of the documents is
needed to represent each document as a vector, using term frequency to identify commonly used terms that help classify the
document. The document vectors are then clustered to help identify similarities in document groups.
• Delivery Store Optimization
Optimize the process of good delivery using truck drones by using a combination of k-means to find the optimal number of launch
locations and a genetic algorithm to solve the truck route as a traveling salesman problem.
K-Medoids Algorithm
➢ K-Medoids is an unsupervised clustering algorithm in which data points called “medoids" act as the
cluster's center.
➢ A medoid is a point in the cluster whose sum of distances(also called dissimilarity) to all the objects
in the cluster is minimal.
➢ The distance can be the Euclidean distance, Manhattan distance, or any other suitable distance
function.
➢ Therefore, the K -medoids algorithm divides the data into K clusters by selecting K medoids from
our data sample.
Working of the Algorithm:-
1. Randomly select k points from the data( k is the number of clusters to be formed). These k points
would act as our initial medoids.
2. The distances between the medoid points and the non-medoid points are calculated, and each point is
assigned to the cluster of its nearest medoid.
3. Calculate the cost as the total sum of the distances(also called dissimilarities) of the data points from
the assigned medoid.
4. Swap one medoid point with a non-medoid point (from the same cluster as the medoid point) and
recalculate the cost.
5. If the cost with the new medoid point is higher than the previous cost, undo the swap;
otherwise keep it. Repeat step 4 until no swap reduces the cost, at which point the algorithm has converged.
Finally, we will have k medoid points with their clusters.
Manhattan Distance between two points (x1,y1) and (x2,y2) is given as:
Mdist= |x2-x1| + |y2-y1|
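A compact sketch of the procedure using the Manhattan distance. It differs from step 4 in one respect: for simplicity it tries swapping each medoid with every non-medoid point, not only points from the same cluster. The sample points, seed, and function names are assumptions.

```python
import random

def manhattan(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)              # step 1: random initial medoids
    cost = total_cost(points, medoids)           # steps 2-3: assign points and compute cost
    improved = True
    while improved:                              # steps 4-5: keep only cost-reducing swaps
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(points, trial)
                if trial_cost < cost:
                    medoids, cost = trial, trial_cost
                    improved = True
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: manhattan(p, m))].append(p)
    return medoids, clusters, cost

points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids, clusters, cost = k_medoids(points, k=2)
```

For these points the search settles on the medoids (1, 1) and (8, 8) with a total cost of 4. Because every trial swap recomputes the full cost, the running time grows quickly with the number of points.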

Advantages of K-Medoid
1. K- medoids algorithm is robust to outliers and noise as the medoids are the most central data point of the
cluster, such that its distance from other points is minimal.
2. K-Medoids algorithm can be used with arbitrarily chosen dissimilarity measures (e.g., cosine similarity) or any
distance metric.

Disadvantages of K-Medoid
1. The K-medoids algorithm has large time complexity. Therefore, it works efficiently for small datasets but
doesn't scale well for large datasets.
2. The K-Medoid algorithm is not suitable for clustering non-spherical (arbitrary-shaped) groups of objects.
Chapter2:- Association Mining

Apriori Algorithm:-
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on
databases that contain transactions. With the help of these association rules, it determines how strongly
or how weakly two objects are connected. This algorithm uses a breadth-first search and a hash tree to
calculate the itemset associations efficiently. It is an iterative process for finding the frequent itemsets
in a large dataset.

This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket
analysis and helps to find products that are likely to be bought together. It can also be used in the healthcare
field to find drug reactions for patients.
Frequent Itemset-
Frequent itemsets are those itemsets whose support is greater than or equal to the threshold value or user-specified
minimum support. It means that if the itemset {A, B} is frequent, then A and B individually must
also be frequent.
Suppose there are two transactions: A = {1,2,3,4,5} and B = {2,3,7}. In these two transactions, 2 and 3 are
the frequent items.
Steps for Apriori Algorithm
Below are the steps for the apriori algorithm:
Step-1: Select the minimum support and minimum confidence, and determine the support of the itemsets in the transactional database.
Step-2: Keep all itemsets whose support is at least the minimum support.
Step-3: From these frequent itemsets, find all rules whose confidence is at least the minimum confidence.
Step-4: Sort the rules in decreasing order of lift.
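
The level-wise search behind Step-1 and Step-2 can be sketched as follows. This is only a minimal illustration, not a reference implementation: the function name `apriori`, the use of support counts instead of percentages, and the `demo` transactions are assumptions made for this sketch, and rule generation and sorting by lift (Step-3 and Step-4) are left out.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the database per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: candidate 1-itemsets (C1) filtered into L1.
    items = {item for t in transactions for item in t}
    current = {c: n for c, n in count({frozenset([i]) for i in items}).items()
               if n >= min_support_count}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items()
                   if n >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transactions, invented for this sketch.
demo = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "E"}]
result = apriori(demo, min_support_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this toy dataset item E occurs only once, so it is dropped at the first level and never takes part in candidate generation.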

Apriori Algorithm Working


Example: Suppose we have the following dataset that has various transactions, and from this dataset, we
need to find the frequent itemsets and generate the association rules using the Apriori algorithm:

Solution:
Step-1: Calculating C1 and L1:
In the first step, we create a table that contains the support count (the frequency of each itemset in the dataset) of each individual item. This table is called the candidate set, or C1.
Next, we keep all the itemsets whose support count is at least the minimum support (2). This gives us the table for the frequent itemset L1.
All the itemsets except E have a support count greater than or equal to the minimum support, so only the itemset E is removed.

Step-2: Candidate Generation C2, and L2:


In this step, we generate C2 with the help of L1. In C2, we form all pairs of the items in L1.
After creating the pairs, we again find the support count from the main transaction table, i.e., how many times each pair occurs together in the given dataset. This gives the below table for C2:

Again, we compare the C2 support counts with the minimum support count, and the itemsets with a lower support count are eliminated from C2. This gives the below table for L2:

Step-3: Candidate generation C3, and L3:


For C3, we repeat the same two steps, but now we form the C3 table from itemsets of three items and calculate their support counts from the dataset. This gives the below table:

Now we create the L3 table. As the C3 table shows, only one itemset has a support count equal to the minimum support count. So L3 contains only one itemset, {A, B, C}.

Step-4: Finding the association rules for the subsets:


To generate the association rules, we first create a table of the possible rules from the frequent itemset {A, B, C}. For each rule X → Y, we calculate the confidence using the formula Confidence(X → Y) = sup(X ^ Y)/sup(X). After calculating the confidence for all rules, we exclude the rules whose confidence is below the minimum threshold (50%).

Rules       Support   Confidence
A^B → C     2         sup{(A^B)^C}/sup(A^B) = 2/4 = 0.5 = 50%
B^C → A     2         sup{(B^C)^A}/sup(B^C) = 2/4 = 0.5 = 50%
A^C → B     2         sup{(A^C)^B}/sup(A^C) = 2/4 = 0.5 = 50%
C → A^B     2         sup{C^(A^B)}/sup(C) = 2/5 = 0.4 = 40%
A → B^C     2         sup{A^(B^C)}/sup(A) = 2/6 = 0.33 = 33.33%
B → A^C     2         sup{B^(A^C)}/sup(B) = 2/7 = 0.2857 = 28.57%

As the given minimum confidence is 50%, the first three rules, A^B → C, B^C → A, and A^C → B, can be considered the strong association rules for the given problem.
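
The confidence values above can be recomputed from the support counts quoted in the rule table. The snippet below is a small sketch; the dictionary layout and the helper name `confidence` are choices made here for illustration.

```python
# Support counts as they appear in the confidence calculations above.
support = {
    frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 5,
    frozenset("AB"): 4, frozenset("BC"): 4, frozenset("AC"): 4,
    frozenset("ABC"): 2,
}

def confidence(antecedent, consequent):
    """Confidence(X -> Y) = sup(X and Y together) / sup(X)."""
    x, y = frozenset(antecedent), frozenset(consequent)
    return support[x | y] / support[x]

rules = [("AB", "C"), ("BC", "A"), ("AC", "B"), ("C", "AB"), ("A", "BC"), ("B", "AC")]
min_confidence = 0.5

# Keep only the rules that reach the minimum confidence.
strong = [(x, y) for x, y in rules if confidence(x, y) >= min_confidence]
for x, y in rules:
    print(f"{x} -> {y}: {confidence(x, y):.2%}")
```

Only the three rules with a two-item antecedent reach the 50% threshold.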

Advantages of Apriori Algorithm


• This is easy to understand algorithm
• The join and prune steps of the algorithm can be easily implemented on large datasets.

Disadvantages of Apriori Algorithm


• The Apriori algorithm is slow compared to other algorithms.
• The overall performance suffers because it scans the database multiple times.
• The time complexity and space complexity of the Apriori algorithm are O(2^D), which is very high. Here D represents the horizontal width of the database, i.e., the number of distinct items.

Association Rule Learning


Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data
item on another data item and maps accordingly so that it can be more profitable. It tries to find some interesting
relations or associations among the variables of dataset. It is based on different rules to discover the interesting
relations between variables in the database.

Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by big retailers to discover the associations between items. We can understand it through the example of a supermarket, where products that are frequently purchased together are placed together.

For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so these products are stored on the same shelf or nearby. Consider the below diagram:

Association rule learning can be divided into the below-mentioned types of algorithms:
• Apriori Algorithm
• F-P Growth Algorithm

Working of Association Rule Learning:-

Support-
In data mining, support refers to the relative frequency of an itemset in a dataset.
For example, if an itemset occurs in 5% of the transactions in a dataset, it has a support of 5%.
Support is often used as a threshold for identifying frequent itemsets in a dataset, which can then be used to generate association rules.
In general, the support of an itemset can be calculated using the formula:
Support(X) = (No. of transactions containing X) / (Total no. of transactions)
where X is the itemset for which you are calculating the support.
Confidence-
In data mining, confidence is a measure of the reliability of a given association rule. It is defined as the proportion of cases in which the association rule holds true, or in other words, the percentage of transactions containing the antecedent that also contain the consequent.
The confidence of a rule X → Y is calculated as follows:
Confidence(X → Y) = (No. of transactions containing X ^ Y) / (No. of transactions containing X)
where X and Y are the itemsets for which you are calculating the confidence.
For example, suppose we have a dataset of 1000 transactions in which the itemset {A, B} appears in 100 transactions and the itemset {A} appears in 200 transactions. The confidence of the rule A → B is then 100/200 = 0.5 = 50%.
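
The two measures translate directly into code. The following is a minimal sketch; the function names and the `baskets` data are hypothetical and serve only to illustrate the formulas.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical basket data, invented for this sketch.
baskets = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
    {"milk", "eggs"},
]
print(support({"bread"}, baskets))            # 0.75
print(support({"bread", "butter"}, baskets))  # 0.5
print(confidence({"bread"}, {"butter"}, baskets))
```

Here bread appears in three of the four baskets and bread with butter in two, so the rule bread → butter holds in two out of three cases.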

Applications of Association Rule Learning


• Market Basket Analysis
➢ Market Basket Analysis is a data mining technique used to uncover purchase patterns in any retail setting.
➢ In simple terms, Market Basket Analysis analyses the combinations of products that have been bought together.
➢ Market Basket Analysis mainly works with association rules of the form {IF} => {THEN}.
IF- The antecedent, i.e., an item found within the data.
THEN- The consequent, i.e., an item found in combination with the antecedent.

FP growth- FP tree
The FP-Growth Algorithm was proposed by Han. It is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure, named the frequent-pattern tree (FP-tree), for storing compressed and crucial information about frequent patterns. In his study, Han showed that the method outperforms other popular methods for mining frequent patterns, e.g., the Apriori Algorithm and Tree Projection. Some later works reported that FP-Growth performs better than other methods, including Eclat and Relim. The popularity and efficiency of the FP-Growth Algorithm have led to many studies that propose variations to improve its performance.

FP Growth Algorithm
The FP-Growth Algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.
This algorithm works as follows:
➢ First, it compresses the input database, creating an FP-tree instance to represent the frequent items.
➢ After this first step, it divides the compressed database into a set of conditional databases, each associated with one frequent pattern.
➢ Finally, each such database is mined separately.
Using this strategy, FP-Growth reduces the search costs by recursively looking for short patterns and then concatenating them into long frequent patterns.
For large databases, it may not be possible to hold the FP-tree in main memory. A strategy to cope with this problem is to partition the database into a set of smaller databases (called projected databases) and then construct an FP-tree from each of these smaller databases.

FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative information about
frequent patterns in a database. Each transaction is read and then mapped onto a path in the FP-tree. This
is done until all transactions have been read. Different transactions with common subsets allow the tree to
remain compact because their paths overlap.

A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the FP tree is to
mine the most frequent pattern. Each node of the FP tree represents an item of the item set.

The root node represents null, while the lower nodes represent the item sets. The associations of the
nodes with the lower nodes, that is, the item sets with the other item sets, are maintained while forming
the tree.

Han defines the FP-tree as the tree structure given below:


1. One root is labelled as "null" with a set of item-prefix subtrees as children and a frequent-item-
header table.
2. Each node in the item-prefix subtree consists of three fields:
o Item-name: registers which item is represented by the node;
o Count: the number of transactions represented by the portion of the path reaching the
node;
o Node-link: links to the next node in the FP-tree carrying the same item name or null if there
is none.
3. Each entry in the frequent-item-header table consists of two fields:
o Item-name: the same as in the node;
o Head of node-link: a pointer to the first node in the FP-tree carrying the item name.

Additionally, the frequent-item-header table can store the support count of each item. The below diagram is an example of a best-case scenario, which occurs when all transactions have the same itemset; the FP-tree is then only a single branch of nodes.
The worst-case scenario occurs when every transaction has a unique itemset. The space needed to store the tree is then greater than the space used to store the original dataset, because the FP-tree requires additional space to store the pointers between nodes and the counters for each item. The diagram below shows how a worst-case FP-tree might appear. As you can see, the tree's complexity grows with each transaction's uniqueness.

Algorithm by Han
Algorithm 1: FP-tree construction
Input: A transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. The first step is to scan the database to find the occurrences of the itemsets in the database. This step is the
same as the first step of Apriori. The count of 1-itemsets in the database is called support count or frequency
of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by
null.
3. The next step is to scan the database again and examine the transactions. Examine the first transaction and
find out the itemset in it. The itemset with the max count is taken at the top, and then the next itemset with
the lower count. It means that the branch of the tree is constructed with transaction itemsets in descending
order of count.
4. The next transaction in the database is examined. The itemsets are ordered in descending order of count. If
any itemset of this transaction is already present in another branch, then this transaction branch would share
a common prefix to the root.
This means that the common itemset is linked to the new node of another itemset in this transaction.
5. Also, the count of the itemset is incremented as it occurs in the transactions. The common node and new
node count are increased by 1 as they are created and linked according to transactions.
6. The next step is to mine the created FP Tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From this node, traverse the paths in the FP Tree. This path or set of paths is called a conditional pattern base.
A conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that occur with the lowest node (suffix).
7. Construct a Conditional FP Tree, formed by a count of itemsets in the path. The itemsets meeting the
threshold support are considered in the Conditional FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects and sorts the set of
frequent items, and the second constructs the FP-Tree.
Example

Support threshold = 50%, Confidence = 60%

Table 1:

Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4

Solution: Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3

Table 2: Count of each item

Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2

Table 3: Items sorted in descending order of count (I5 falls below min_sup = 3 and is not listed).

Item   Count
I2     5
I1     4
I3     4
I4     4

Build FP Tree
1. Create the root node, labelled null.
2. The first transaction, T1: I1, I2, I3, contains three items. Ordered by descending count, they are inserted as {I2:1}, {I1:1}, {I3:1}: I2 is linked as a child of the root, I1 is linked to I2, and I3 is linked to I1.
3. T2: I2, I3, I4. I2 is linked to the root, I3 is linked to I2, and I4 is linked to I3. This branch shares the I2 node, as it was already created for T1.
4. Increment the count of I2 by 1; I3 is linked as a new child of I2, and I4 is linked as a child of I3. The counts are {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch is created from the root, with I4 as a child of the root and I5 as a child of I4: {I4:1}, {I5:1}.
6. T4: I1, I2, I4. The sequence will be I2, I1, I4. I2 is already linked to the root node, so its count is incremented by 1. Similarly, I1 is incremented by 1, as it is already linked to I2 from T1, and I4 is added as a new child of I1. Thus {I2:3}, {I1:2}, {I4:1}.
7. T5: I1, I2, I3, I5. The sequence will be I2, I1, I3, I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
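
The two-scan construction can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than Han's exact procedure: the class `FPNode`, the function `build_fp_tree`, and the list-based header table (standing in for node-links) are names chosen here, ties in item count are broken alphabetically, and, unlike the walkthrough above, the infrequent item I5 is dropped before insertion instead of being pruned during mining.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item      # item name (None for the root)
        self.count = 0        # number of transactions passing through this node
        self.parent = parent
        self.children = {}    # item name -> FPNode

def build_fp_tree(transactions, min_sup):
    # Scan 1: count every item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_sup}

    def order(t):
        # Descending count; ties broken by item name (an assumption of this sketch).
        return sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))

    # Scan 2: insert each ordered transaction as a path starting at the root.
    root = FPNode(None)
    header = {i: [] for i in frequent}   # item -> all nodes carrying that item
    for t in transactions:
        node = root
        for item in order(t):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Transactions from Table 1.
table1 = [
    ["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
    ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"],
]
root, header = build_fp_tree(table1, min_sup=3)
i2 = root.children["I2"]
print(i2.count, i2.children["I1"].count, i2.children["I1"].children["I3"].count)  # 5 4 3
```

The printed counts correspond to the main branch I2 → I1 → I3 of the tree built from Table 1.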

Mining frequent items from an FP-tree:


1. The lowest node item, I5, is not considered, as it does not meet the minimum support count. Hence it is deleted.
2. The next lowest node is I4. I4 occurs in three branches below I2: {I2,I1,I3,I4:1}, {I2,I1,I4:1}, and {I2,I3,I4:1} (the fourth I4 node, from T3, hangs directly off the root and has an empty prefix). Taking I4 as the suffix, the prefix paths are {I2,I1,I3:1}, {I2,I1:1}, {I2,I3:1}; these form the conditional pattern base.
3. The conditional pattern base is treated as a transaction database, and an FP tree is constructed from it. The item counts are I2:3, I1:2, I3:2, so the conditional FP-tree contains only {I2:3}; I1 and I3 are not considered, as they do not meet the minimum support count.
4. This path generates the frequent pattern {I2,I4:3}.
5. For I3, the prefix paths are {I2,I1:3}, {I2:1}. This generates a two-node FP-tree, {I2:4, I1:3}, and the frequent patterns {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path is {I2:4}. This generates a single-node FP-tree, {I2:4}, and the frequent pattern {I2,I1:4}.

Item   Conditional Pattern Base             Conditional FP-tree   Frequent Patterns Generated
I4     {I2,I1,I3:1}, {I2,I1:1}, {I2,I3:1}   {I2:3}                {I2,I4:3}
I3     {I2,I1:3}, {I2:1}                    {I2:4, I1:3}          {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}
I1     {I2:4}                               {I2:4}                {I2,I1:4}
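
Because the example is small, the frequent patterns can also be cross-checked by counting every itemset directly in Table 1. The brute-force enumeration below is not FP-Growth; it is only an independent check, and the variable names are chosen for this sketch.

```python
from itertools import combinations

# Transactions from Table 1.
table1 = [
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"},
]
min_sup = 3

items = sorted({i for t in table1 for i in t})
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        # Support count: number of transactions containing the whole itemset.
        n = sum(1 for t in table1 if set(combo) <= t)
        if n >= min_sup:
            frequent[combo] = n

for combo, n in frequent.items():
    print(combo, n)
```

Any itemset of two or more items that this enumeration reports must also be produced by mining the FP-tree with the same minimum support count.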


Advantages of FP Growth Algorithm
• This algorithm needs to scan the database only twice, whereas Apriori scans the transactions in each iteration.
• Items are not paired to generate candidates in this algorithm, which makes it faster.
• The database is stored in a compact version in memory.
• It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of FP-Growth Algorithm


o The FP Tree is more cumbersome and difficult to build than the candidate tables of Apriori.
o Building the FP Tree may be expensive.
o The FP Tree may not fit in main memory when the database is large.

Dimensionality Reduction
o Dimensionality reduction is the process of reducing the number of features (or dimensions) in a dataset while retaining as much information as possible.
o This can be done for a variety of reasons:
- To reduce the complexity of a model.
- To improve the performance of a learning algorithm.
- To make it easier to visualize the data.
• A 3-D classification problem can be hard to visualize.
• The 3-D feature space can be split into two 2-D feature spaces, and later, if the features are found to be correlated, the number of features can be reduced even further.

There are two components of dimensionality reduction:


• Feature selection: In this, we try to find a subset of the original set of variables, or features, that can be used to model the problem. It usually involves three ways:
➢ Filter
➢ Wrapper
➢ Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e., a space with fewer dimensions.

There are several techniques for dimensionality reduction:-


principal component analysis (PCA)
singular value decomposition (SVD)

Principal Component Analysis (PCA):


• This method was introduced by Karl Pearson.
• It works on the condition that, when the data in a higher-dimensional space is mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximal.
It involves the following steps:
• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Use the eigenvectors corresponding to the largest eigenvalues to reconstruct a large fraction of the variance of the original data.
Hence, we are left with a smaller number of eigenvectors, and some information may have been lost in the process. However, the most important variances should be retained by the remaining eigenvectors.
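
The three steps can be followed by hand for 2-D data, where the eigenvalues of the covariance matrix have a closed form. The sketch below uses only the standard library; the function name `pca_2d` and the sample `points` are assumptions made for illustration, and a real application would typically rely on a numerical library instead.

```python
import math

def pca_2d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Step 1: sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Step 2: eigenvalues of a symmetric 2x2 matrix in closed form.
    mean, half = (a + c) / 2, math.hypot((a - c) / 2, b)
    lam1, lam2 = mean + half, mean - half
    # Eigenvector belonging to the largest eigenvalue, scaled to unit length.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # Step 3: keep only the coordinate along that eigenvector.
    projected = [x * vx + y * vy for x, y in centered]
    explained = lam1 / (lam1 + lam2)   # fraction of total variance retained
    return (vx, vy), projected, explained

# Hypothetical, strongly correlated sample data.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
direction, projected, explained = pca_2d(points)
print(direction, explained)
```

For this sample almost all of the variance lies along one direction, so a single component retains nearly all of the information.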

Advantages of Dimensionality Reduction


• It helps in data compression, and hence reduced storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.

Disadvantages of Dimensionality Reduction


• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define datasets.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied.

Singular Value Decomposition (SVD):


The Singular Value Decomposition (SVD) of a matrix is a factorization of that matrix into three matrices. It has some interesting algebraic properties and conveys important geometrical and theoretical insights about linear transformations. It also has some important applications in data science. This section outlines the mathematics behind SVD.

Mathematics behind SVD:


The SVD of an m×n matrix A is given by the formula:
A = U W V^{T}
where,
• U: m×n matrix of the orthonormal eigenvectors of AA^{T}.
• V^{T}: transpose of an n×n matrix containing the orthonormal eigenvectors of A^{T}A.
• W: an n×n diagonal matrix of the singular values, which are the square roots of the eigenvalues of A^{T}A.
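
The relationship between singular values and the eigenvalues of A^{T}A can be checked numerically for a small case. The sketch below handles only 2×2 matrices with the closed-form eigenvalues of a symmetric matrix; the function name and the matrix `A` are hypothetical and are not the matrix of the worked example.

```python
import math

def singular_values_2x2(A):
    """Singular values of a 2x2 matrix via the eigenvalues of A^T A."""
    (a, b), (c, d) = A
    # Entries of the symmetric matrix A^T A = [[p, q], [q, r]].
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    mean, half = (p + r) / 2, math.hypot((p - r) / 2, q)
    lam1 = mean + half
    lam2 = max(mean - half, 0.0)   # clamp tiny negative round-off
    # Singular values are the square roots of these eigenvalues.
    return math.sqrt(lam1), math.sqrt(lam2)

A = [[3.0, 0.0], [4.0, 5.0]]   # hypothetical matrix chosen for this sketch
s1, s2 = singular_values_2x2(A)
print(s1, s2)
```

For this matrix A^{T}A has eigenvalues 45 and 5, so the singular values are √45 and √5; their product equals |det A| = 15, which gives a quick sanity check.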
Examples
➢ Find the SVD for the matrix A =
➢ To calculate the SVD, First, we need to compute the singular values by finding eigenvalues of AA^{T}.

• The characteristic equation for the above matrix is :

so our singular values are:


• Now we find the right singular vectors, i.e., an orthonormal set of eigenvectors of A^{T}A. The eigenvalues of A^{T}A are 25, 9, and 0, and since A^{T}A is symmetric, we know that the eigenvectors will be orthogonal.

which can be row-reduced to:

Applications:
➢ Calculation of Pseudo-inverse
➢ Rank, Range, and Null space
➢ Curve Fitting Problem..
UNIT-5
GENETIC ALGORITHMS


The problem addressed by GAs is to search a space of candidate hypotheses to identify the best hypothesis.

In GAs, the "best hypothesis" is defined as the one that optimizes a predefined numerical measure for the problem at hand, called the hypothesis fitness.

For example: If the learning task is the problem of approximating an unknown function given training examples of its input and output, then fitness could be defined as the accuracy of the hypothesis over this training data.

If the task is to learn a strategy for playing chess, fitness could be defined as the number of games won by the individual when playing against other individuals in the current population.

Genetic algorithms typically share the following structure:

o The algorithm operates by iteratively updating a pool of hypotheses, called the population. On each iteration, all members of the population are evaluated according to the fitness function.
o A new population is then generated by probabilistically selecting the most fit
individuals from the current population.
o Some of these selected individuals are carried forward into the next
generation population intact.
o Others are used as the basis for creating new offspring individuals by applying
genetic operations such as crossover and mutation.

GA(Fitness, Fitness_threshold, p, r, m)

Fitness: A function that assigns an evaluation score, given a hypothesis.
Fitness_threshold: A threshold specifying the termination criterion.
p: The number of hypotheses to be included in the population.
r: The fraction of the population to be replaced by Crossover at each step.
m: The mutation rate.

• Initialize population: P ← Generate p hypotheses at random
• Evaluate: For each h in P, compute Fitness(h)
• While [max Fitness(h)] < Fitness_threshold do

Create a new generation, Ps:

1. Select: Probabilistically select (1 - r)p members of P to add to Ps. The probability Pr(h_i) of selecting hypothesis h_i from P is given by

Pr(h_i) = Fitness(h_i) / \sum_{j=1}^{p} Fitness(h_j)        (9.1)

2. Crossover: Probabilistically select (r · p)/2 pairs of hypotheses from P, according to Pr(h_i) given above. For each pair (h1, h2), produce two offspring by applying the Crossover operator. Add all offspring to Ps.

3. Mutate: Choose m percent of the members of Ps, with uniform probability. For each, invert one randomly selected bit in its representation.

4. Update: P ← Ps.

5. Evaluate: For each h in P, compute Fitness(h)

• Return the hypothesis from P that has the highest fitness.
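
The procedure above might be sketched in Python as follows. Several details are assumptions made for this sketch rather than part of the algorithm as stated: hypotheses are lists of 0/1 bits of a fixed `length`, `m` is given as a fraction rather than a percentage, selection is done with replacement, crossover is single-point, a `max_generations` cap guarantees termination, and the example fitness simply counts the 1-bits (a toy problem).

```python
import random

def ga(fitness, fitness_threshold, p, r, m, length, rng, max_generations=200):
    """Prototypical GA over bit-string hypotheses represented as lists of 0/1."""
    def select(pop, scores, k):
        # Fitness-proportionate selection; uniform if every score is zero.
        weights = scores if sum(scores) > 0 else [1] * len(pop)
        return rng.choices(pop, weights=weights, k=k)

    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(p)]
    for _ in range(max_generations):
        scores = [fitness(h) for h in pop]
        if max(scores) >= fitness_threshold:
            break
        # 1. Select: carry over about (1 - r) * p members unchanged (as copies).
        n_keep = p - 2 * round(r * p / 2)
        new_pop = [h[:] for h in select(pop, scores, n_keep)]
        # 2. Crossover: about r * p / 2 pairs, each producing two offspring.
        while len(new_pop) < p:
            h1, h2 = select(pop, scores, 2)
            cut = rng.randint(1, length - 1)
            new_pop += [h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]]
        # 3. Mutate: flip one random bit in a fraction m of the new generation.
        for h in rng.sample(new_pop, round(m * p)):
            i = rng.randrange(length)
            h[i] = 1 - h[i]
        # 4. Update.
        pop = new_pop
    return max(pop, key=fitness)

rng = random.Random(0)
best = ga(fitness=sum, fitness_threshold=20, p=50, r=0.6, m=0.05, length=20, rng=rng)
print(sum(best), "of 20 bits set")
```

With r = 0.6 and p = 50, twenty members are carried over and fifteen pairs are crossed, so the population size stays at fifty in every generation.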

• A prototypical genetic algorithm is described in Table 9.1. The inputs to this algorithm include the fitness function for ranking candidate hypotheses, a threshold defining an acceptable level of fitness for terminating the algorithm, the size of the population to be maintained, and parameters that determine how successor populations are to be generated: the fraction of the population to be replaced at each generation and the mutation rate.
• Notice that in this algorithm each iteration through the main loop produces a new generation of hypotheses based on the current population.
• First, a certain number of hypotheses from the current population are selected for inclusion in the next generation. These are selected probabilistically, where the probability of selecting hypothesis h_i is given by
Pr(h_i) = Fitness(h_i) / \sum_{j=1}^{p} Fitness(h_j)        (9.1)
• Thus, the probability that a hypothesis will be selected is proportional to its own fitness and is inversely proportional to the fitness of the other competing hypotheses in the current population.
• Once these members of the current generation have been selected for inclusion in the next generation population, additional members are generated using a crossover operation.
• Crossover, defined in detail in the next section, takes two parent hypotheses from the current generation and creates two offspring hypotheses by recombining portions of both parents.
• The parent hypotheses are chosen probabilistically from the current population, again using the probability function given by Equation (9.1).
• After new members have been created by this crossover operation, the new generation population contains the desired number of members. At this point, a certain fraction m of these members are chosen at random, and random mutations are performed to alter these members.
• This GA algorithm thus performs a randomized, parallel beam search for hypotheses that perform well according to the fitness function. In the following subsections, we describe in more detail the representation of hypotheses and the genetic operators used in this algorithm.

Representing Hypotheses
Hypotheses in GAs are often represented by bit strings, so that they can be easily manipulated by genetic operators such as mutation and crossover.

Example :

Consider the attribute Outlook, which can take on any of the three values Sunny, Overcast, or Rain. One
obvious way to represent a constraint on Outlook is to use a bit string of length three.

In which each bit position corresponds to one of its three possible values.

Placing a 1 in some position indicates that the attribute is allowed to take on the corresponding value.

For example :

The string 010 represents the constraint that Outlook must take on the second of these values,
Outlook = Overcast.

Similarly, the string 011 represents the more general constraint that allows two possible values, or

(Outlook = Overcast ∨ Rain).

Example :

Consider a second attribute, Wind, that can take on the value Strong or Weak. A rule precondition such as

(Outlook = Overcast ∨ Rain) ∧ (Wind = Strong)

can then be represented by the following bit string of length five:

Outlook Wind

011 10
Rule Postconditions (such as PlayTennis = yes) can be represented in a similar fashion. Thus, an entire rule
can be described by concatenating the bit strings describing the rule preconditions, together with the bit
string describing the rule Postcondition. For example, the rule

IF Wind = Strong THEN PlayTennis = yes

would be represented by the string

Outlook Wind PlayTennis

111 10 10

where the first three bits describe the "don't care" constraint on Outlook, the next two bits describe the
constraint on Wind, and the final two bits describe the rule postcondition
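
The encoding can be written as a small helper. This is a sketch for illustration: the table `ATTRIBUTES`, the function `encode`, and the ordering of the PlayTennis values (yes first, then no) are assumptions chosen so that the result matches the strings shown above.

```python
ATTRIBUTES = {
    "Outlook": ["Sunny", "Overcast", "Rain"],
    "Wind": ["Strong", "Weak"],
    "PlayTennis": ["yes", "no"],   # assumed value order
}

def encode(constraints, attributes=("Outlook", "Wind")):
    """One bit per attribute value; an unconstrained attribute becomes all 1s."""
    bits = []
    for name in attributes:
        # A missing constraint means "don't care": every value is allowed.
        allowed = constraints.get(name, ATTRIBUTES[name])
        bits.append("".join("1" if v in allowed else "0" for v in ATTRIBUTES[name]))
    return "".join(bits)

precondition = encode({"Outlook": ["Overcast", "Rain"], "Wind": ["Strong"]})
rule = (encode({"Wind": ["Strong"]})
        + encode({"PlayTennis": ["yes"]}, attributes=("PlayTennis",)))
print(precondition, rule)   # 01110 1111010
```

The second string is the rule IF Wind = Strong THEN PlayTennis = yes, with 111 expressing that Outlook is unconstrained.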

Genetic Operators
The crossover operator produces two new offspring from two parent strings, by copying selected bits from
each parent. The bit at position i in each offspring is copied from the bit at position i in one of the two
parents. The choice of which parent contributes the bit for position i is determined by an additional string
called the crossover mask.

Single-point crossover:
Consider the topmost of the two offspring in this case. This offspring takes its first five bits from the first parent and its remaining six bits from the second parent, because the crossover mask 11111000000 specifies these choices for each of the bit positions. The second offspring uses the same crossover mask but switches the roles of the two parents. Therefore, it contains the bits that were not used by the first offspring.

Two-point crossover:
In two-point crossover, offspring are created by substituting intermediate segments of one parent into the middle of the second parent string. Put another way, the crossover mask is a string beginning with n0 zeros, followed by a contiguous string of n1 ones, followed by the necessary number of zeros to complete the string. Each time the two-point crossover operator is applied, a mask is generated by randomly choosing the integers n0 and n1. For instance, in the example shown in Table 9.2, the offspring are created using a mask for which n0 = 2 and n1 = 5. Again, the two offspring are created by switching the roles played by the two parents.

Uniform crossover:
Uniform crossover combines bits sampled uniformly from the two parents, as illustrated in Table 9.2. In this case, the crossover mask is generated as a random bit string, with each bit chosen at random and independently of the others.
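
All three operators reduce to applying a mask, which the following sketch makes explicit. The function names are chosen here for illustration, and the two parent strings are assumed example values (Table 9.2 is not reproduced in this text).

```python
import random

def crossover(parent1, parent2, mask):
    """A mask bit of 1 copies from parent1 into the first child (parent2 into the second)."""
    child1 = "".join(a if m == "1" else b for a, b, m in zip(parent1, parent2, mask))
    child2 = "".join(b if m == "1" else a for a, b, m in zip(parent1, parent2, mask))
    return child1, child2

def single_point_mask(length, n):
    # n ones followed by zeros.
    return "1" * n + "0" * (length - n)

def two_point_mask(length, n0, n1):
    # n0 zeros, then n1 ones, then zeros to complete the string.
    return "0" * n0 + "1" * n1 + "0" * (length - n0 - n1)

def uniform_mask(length, rng):
    # Every bit chosen independently at random.
    return "".join(rng.choice("01") for _ in range(length))

p1, p2 = "11101001000", "00001010101"   # assumed parent strings
print(crossover(p1, p2, single_point_mask(11, 5)))   # ('11101010101', '00001001000')
print(crossover(p1, p2, two_point_mask(11, 2, 5)))   # ('00101000101', '11001011000')
print(crossover(p1, p2, uniform_mask(11, random.Random(0))))
```

Whatever the mask, the two children are complementary: at every position one child carries the bit of the first parent and the other child the bit of the second.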

Mutation
 The mutation operator produces small random changes to the bit string by choosing a single bit at
random, then changing its value.
 Mutation is often performed after crossover has been applied as in our prototypical algorithm from
Table 9.1.
 Some GA systems employ additional operators, especially operators that are specialized to the
particular hypothesis representation used by the system.

For example:- Grefenstette et al. (1991) describe a system that learns sets of rules for robot control. It uses
mutation and crossover, together with an operator for specializing rules. Janikow (1993) describes a system
that learns sets of rules using operators that generalize and specialize rules in a variety of directed ways (e.g.,
by explicitly replacing the condition on an attribute by "don't care").

Example: 11101001000 → 11101011000 (the seventh bit is changed from ‘0’ to ‘1’).
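
Point mutation is a one-line operation on a bit string. In this small sketch the function names are illustrative and positions are counted from 0, so the seventh bit has index 6.

```python
import random

def flip_bit(bits, index):
    """Return a copy of `bits` with the bit at `index` inverted."""
    flipped = "1" if bits[index] == "0" else "0"
    return bits[:index] + flipped + bits[index + 1:]

def mutate(bits, rng):
    """Invert a single bit chosen uniformly at random."""
    return flip_bit(bits, rng.randrange(len(bits)))

print(flip_bit("11101001000", 6))   # 11101011000
print(mutate("11101001000", random.Random(0)))
```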

Fitness Function
 The fitness function defines the criterion for ranking potential hypotheses and for
probabilistically selecting them for inclusion in the next generation population.
 If the task is to learn classification rules, then the fitness function typically has a component
that scores the classification accuracy of the rule over a set of provided training examples.
 Often other criteria may be included as well, such as the complexity or generality of the rule.
 When the bit-string hypothesis is interpreted as a complex procedure (e.g., when the bit string represents a collection of if-then rules that will be chained together to control a robotic device), the fitness function may measure the overall performance of the resulting procedure rather than the performance of individual rules.
Selection
 The probability that a hypothesis will be selected is given by the ratio of its fitness to the
fitness of other members of the current population as seen in Equation (9.1). This method is
sometimes called fitness proportionate selection, or roulette wheel selection. Other methods
for using fitness to select hypotheses have also been proposed.
 For example, in tournament selection, two hypotheses are first chosen at random from the
current population.
 With some predefined probability p the more fit of these two is then selected, and with probability
(1 - p) the less fit hypothesis is selected.
 Tournament selection often yields a more diverse population than fitness proportionate
selection.
 In another method called rank selection, the hypotheses in the current population are first
sorted by fitness.
 The probability that a hypothesis will be selected is then proportional to its rank in this
sorted list, rather than its fitness.
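
The three selection schemes can be compared side by side. The sketch below is illustrative only: the function names are chosen here, rank 1 is assigned to the least fit hypothesis, and the tournament draws two distinct members, which is one possible reading of "chosen at random".

```python
import random

def fitness_proportionate_probs(fitnesses):
    """Roulette wheel: probability equals the share of total fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def rank_probs(fitnesses):
    """Probability proportional to rank (1 = least fit, n = most fit)."""
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    ranks = [0] * len(fitnesses)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    return [r / total for r in ranks]

def tournament_select(population, fitnesses, p, rng):
    """Pick two members; return the fitter one with probability p."""
    i, j = rng.sample(range(len(population)), 2)
    better, worse = (i, j) if fitnesses[i] >= fitnesses[j] else (j, i)
    return population[better] if rng.random() < p else population[worse]

fitnesses = [1.0, 3.0, 6.0]
print(fitness_proportionate_probs(fitnesses))   # [0.1, 0.3, 0.6]
print(rank_probs(fitnesses))
print(tournament_select(["a", "b", "c"], fitnesses, p=0.8, rng=random.Random(0)))
```

Rank selection flattens large fitness differences: the fittest hypothesis here gets probability 3/6 instead of 0.6, and the least fit 1/6 instead of 0.1.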

Simple Application of The Genetic Algorithm

Genetic Algorithms are primarily used in optimization problems of various kinds, but they are
frequently used in other application areas as well.

In this section, we list some of the areas in which Genetic Algorithms are frequently used.
These are −

 Optimization − Genetic Algorithms are most commonly used in optimization problems, wherein we have to maximize or minimize a given objective function value under a given set of constraints.

 Economics − GAs are also used to characterize various economic models like the cobweb
model, game theory equilibrium resolution, asset pricing, etc.

 Neural Networks − GAs are also used to train neural networks, particularly recurrent neural
networks.

 Parallelization − GAs also have very good parallel capabilities, and prove to be very effective
means in solving certain problems, and also provide a good area for research.
 Image Processing − GAs are used for various digital image processing (DIP) tasks as well like
dense pixel matching.
 Vehicle routing problems − With multiple soft time windows, multiple depots and a
heterogeneous fleet.

 Scheduling applications − GAs are used to solve various scheduling problems as well,
particularly the time tabling problem.

 Machine Learning − Genetics-based machine learning (GBML) is a niche area in machine learning.

 Robot Trajectory Generation − GAs have been used to plan the path which a robot arm
takes by moving from one point to another.

 Parametric Design of Aircraft − GAs have been used to design aircraft by varying the
parameters and evolving better solutions.

 DNA Analysis − GAs have been used to determine the structure of DNA using spectrometric
data about the sample.

 Multimodal Optimization − GAs are well suited to multimodal optimization, in which we
have to find multiple optimum solutions, because they maintain a population of candidates.

 Traveling salesman problem and its applications − GAs have been used to solve the TSP, a
well-known combinatorial problem, using novel crossover and packing strategies.

Application of GA in Decision Tree

A decision tree is a flow-chart-like tree structure, where each internal node
(nonleaf node) denotes a test on an attribute, each branch represents an outcome
of the test, and each leaf node (or terminal node) holds a class label. The
topmost node in a tree is the root node. A typical decision tree is shown in
Figure 2. Decision tree algorithms commonly use one of three attribute
selection measures: information gain, gain ratio, and Gini index.
Genetic Algorithm Based Clustering
 Genetic Algorithm based clustering (GABC) is a method used in machine learning and data mining to
cluster data points based on their similarity. The method combines the principles of genetic
algorithms with clustering techniques to find the optimal number of clusters and their centroids.

 In GABC, the clustering problem is framed as an optimization problem where the goal is to find the
optimal set of clusters and their centroids that minimize the sum of squared distances between the
data points and their respective cluster centroids. The genetic algorithm is then used to search the
solution space and find the optimal solution.
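
The formulation above can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes 2-D points and a fixed number of clusters k (the search over the number of clusters is left out), and the function names, operators, and parameter values are all hypothetical choices.

```python
import random

def sse(centroids, points):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids)
               for px, py in points)

def ga_cluster(points, k=2, pop_size=30, generations=60, sigma=0.5, seed=0):
    rng = random.Random(seed)
    # Each chromosome is a list of k centroids, initialised from random data points.
    pop = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: sse(c, points))      # lower SSE = fitter
        next_pop = pop[:2]                          # elitism: keep the two best
        while len(next_pop) < pop_size:
            # Parents are drawn from the better half of the population.
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)
            # Uniform crossover: each centroid comes from one of the parents.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            # Mutation: occasionally nudge a centroid with Gaussian noise.
            child = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
                     if rng.random() < 0.2 else (x, y) for x, y in child]
            next_pop.append(child)
        pop = next_pop
    return min(pop, key=lambda c: sse(c, points))

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
best = ga_cluster(points)
print(best, sse(best, points))
```

Keeping the best chromosomes unchanged in each generation (elitism) means the best sum of squared distances never gets worse from one generation to the next.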

Single Objective and Bi-objective Optimization problems using GA

 A single objective optimization problem involves maximizing or minimizing a single objective
function.

Example of Single Objective Problem using GA:

Consider the following function:

f(x) = x^2 - 6x + 5

To find the minimum value of this function, we can use GA. We start by defining a chromosome
representation, which in this case can be a binary string encoding the value of x. We also need to define
the fitness function, which is based on the value of f(x); because this is a minimization problem, a lower
value of f(x) means a fitter individual.
We can then initialize a population of candidate solutions, and use selection, crossover, and mutation
operators to evolve the population over a number of generations. The fittest individuals in each generation
are selected for reproduction, and the process continues until a satisfactory solution is found. For
reference, f(x) = (x - 3)^2 - 4, so the true minimum is f(3) = -4.
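
A minimal sketch of such a GA for f(x) = x^2 - 6x + 5 is given below. The 16-bit encoding, the search interval [-10, 10], binary tournament selection, single-point crossover, bit-flip mutation, and all parameter values are illustrative assumptions; the notes do not prescribe any of them.

```python
import random

N_BITS, LO, HI = 16, -10.0, 10.0     # assumed encoding: 16 bits over [-10, 10]

def decode(bits):
    """Map a list of 0/1 bits to a real x in [LO, HI]."""
    value = int("".join(map(str, bits)), 2)
    return LO + (HI - LO) * value / (2 ** N_BITS - 1)

def f(x):
    return x ** 2 - 6 * x + 5

def run_ga(pop_size=40, generations=80, p_mut=1 / N_BITS, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(pop_size)]
    cost = lambda bits: f(decode(bits))            # lower cost = fitter
    for _ in range(generations):
        next_pop = [min(pop, key=cost)]            # elitism: carry over the best
        while len(next_pop) < pop_size:
            # Binary tournament selection of two parents.
            p1 = min(rng.sample(pop, 2), key=cost)
            p2 = min(rng.sample(pop, 2), key=cost)
            cut = rng.randrange(1, N_BITS)         # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each bit.
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return decode(min(pop, key=cost))

best_x = run_ga()
print(best_x, f(best_x))
```

The returned x should lie close to 3, where f reaches its minimum of -4; how close depends on the random seed and the number of generations.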

 A bi-objective optimization problem involves optimizing two conflicting objective functions
simultaneously.

Example of Bi-objective Optimization Problem using GA:

Consider the following bi-objective optimization problem:

Minimize f1(x, y) = (x-2)^2 + (y-1)^2

Minimize f2(x, y) = (x+2)^2 + (y-3)^2

To solve this problem using GA, we can use a multi-objective GA approach. We start by defining a
chromosome representation, which can be a binary string representing the values of x and y. We also need
to define the fitness function, which is a vector of the values of f1(x, y) and f2(x, y).

We can then initialize a population of candidate solutions, and use selection, crossover, and mutation
operators to evolve the population over a number of generations. In each generation, we need to use a
selection scheme that is capable of handling multiple objectives, such as ranking by Pareto dominance,
often combined with fitness sharing to keep the population spread along the front.

The fittest individuals in each generation are selected for reproduction, and the process continues until a
satisfactory set of non-dominated solutions is found. The set of non-dominated solutions represents the
trade-offs between the two objectives, and can be visualized using a Pareto front plot.
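
The Pareto-dominance test at the heart of such a selection step can be sketched as follows. Only the dominance check and the extraction of the non-dominated set are shown, not a full multi-objective GA; the helper names and the small hand-picked population are hypothetical.

```python
def f1(x, y):
    return (x - 2) ** 2 + (y - 1) ** 2

def f2(x, y):
    return (x + 2) ** 2 + (y - 3) ** 2

def objectives(ind):
    """Fitness vector (f1, f2) of an individual (x, y); both are minimised."""
    return (f1(*ind), f2(*ind))

def dominates(a, b):
    """True if a is no worse than b in both objectives and strictly better in one."""
    fa, fb = objectives(a), objectives(b)
    return (all(p <= q for p, q in zip(fa, fb))
            and any(p < q for p, q in zip(fa, fb)))

def non_dominated(population):
    """Individuals that no other member of the population dominates."""
    return [a for a in population
            if not any(dominates(b, a) for b in population)]

population = [(2, 1), (0, 2), (-2, 3), (5, 5), (0, 0)]
print(non_dominated(population))   # [(2, 1), (0, 2), (-2, 3)]
```

Here (2, 1) minimises f1, (-2, 3) minimises f2, and (0, 2) is a compromise between them; (5, 5) and (0, 0) are each worse than some other point in both objectives, so they are dropped.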

Using GA to emulate Gradient descent/ascent

 Genetic Algorithms (GA) and Gradient Descent (GD) are two different optimization techniques that can
be used to optimize a function. While GA is a population-based optimization method inspired by
biological evolution, GD is an iterative optimization method that relies on the calculation of the gradient
of the objective function.

 To emulate gradient descent or ascent with GA, the population is treated as a set of
candidate solutions, and the fitness of each candidate is the value of the objective function
at that point. The objective function thus serves as the fitness function.

 To emulate gradient descent, the GA algorithm can be modified by using selection, crossover, and
mutation operators that bias the search towards the direction of steepest descent. For example,
selection can be biased towards selecting individuals with lower fitness values, and mutation can be
biased towards decreasing the value of certain genes to move towards the minimum point.
 Similarly, to emulate gradient ascent, the GA algorithm can be modified by using selection, crossover,
and mutation operators that bias the search towards the direction of steepest ascent. For example,
selection can be biased towards selecting individuals with higher fitness values, and mutation can be
biased towards increasing the value of certain genes to move towards the maximum point.

 However, it's worth noting that while GA and GD are both optimization techniques, they differ in terms
of their efficiency and suitability for different types of problems. GD is generally more efficient for
continuous, differentiable functions, while GA can handle non-differentiable
functions or problems with a large number of local optima. Therefore, it's important to choose the
appropriate optimization technique based on the nature of the problem.
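
As a rough illustration of this comparison, the sketch below runs plain gradient descent and a small real-valued GA on the earlier example f(x) = x^2 - 6x + 5. The GA uses no gradient: selection keeps the candidates with the lowest f, and small Gaussian mutations supply local steps, so the population drifts downhill. The starting point, learning rate, averaging crossover, and other parameters are assumptions chosen for illustration.

```python
import random

def f(x):
    return x ** 2 - 6 * x + 5

def grad_f(x):
    return 2 * x - 6

def gradient_descent(x=-8.0, lr=0.1, steps=100):
    """Standard gradient descent using the analytic derivative."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

def ga_descent(x0=-8.0, pop_size=20, generations=200, sigma=0.3, seed=0):
    """Gradient-free GA: selection favours low f, mutation takes small steps."""
    rng = random.Random(seed)
    pop = [x0 + rng.gauss(0, sigma) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f)
        parents = pop[:pop_size // 4]              # keep the lowest-f quarter
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            # Averaging crossover followed by a small Gaussian mutation.
            children.append((a + b) / 2 + rng.gauss(0, sigma))
        pop = parents + children
    return min(pop, key=f)

print(gradient_descent(), ga_descent())
```

Both runs should end near x = 3. Gradient descent gets there with one derivative evaluation per step, while the GA spends many function evaluations per generation, which reflects the efficiency trade-off described above.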
