Machine Learning Module - 1: There Are Three Main Types of Machine Learning
MODULE – 1
INTRODUCTION :
Machine learning is a subfield of artificial intelligence that involves developing
algorithms and models that enable computers to learn from data and make predictions
or decisions without being explicitly programmed.
In traditional programming, a set of rules or instructions is provided to a computer to
execute a particular task. In machine learning, the computer is trained on a large
dataset and uses statistical techniques to identify patterns and relationships in the
data. The model is then used to make predictions or decisions on new, unseen data.
ISSUES :
1. Bias: Machine learning algorithms can inherit biases from the data they are trained
on, which can lead to unfair or discriminatory outcomes.
2. Transparency: Machine learning algorithms can be complex and difficult to
understand, making it hard to explain how decisions are being made.
3. Privacy: Machine learning often involves the use of personal data, raising concerns
about how that data is collected, stored, and used.
4. Security: Machine learning models can be vulnerable to attacks, such as data
poisoning or model inversion attacks, which can compromise the security of the
system.
5. Accountability: Machine learning algorithms can make decisions that have
significant impacts on people's lives, raising questions about who is responsible for the
outcomes and how to hold them accountable.
Step 1) Choosing the Training Experience: The first and most important task is to
choose the training data or training experience that will be fed to the machine
learning algorithm. It is important to note that the data or experience we feed to
the algorithm will have a significant impact on the success or failure of the model,
so the training data or experience should be chosen wisely.
Step 2) Choosing the target function: The next important step is choosing the target
function. It means that, according to the knowledge fed to the algorithm, the machine
learning system will choose a NextMove function which describes what type of legal
moves should be taken.
For example: while playing chess against an opponent, when the opponent moves, the
machine learning algorithm decides which of the possible legal moves should be taken
in order to succeed.
Step 5) Final Design: The final design is created at last, after the system has gone
through a number of examples, failures and successes, correct and incorrect decisions,
and what the next step should be. Example: Deep Blue, an ML-based intelligent
computer, won a chess match against the chess expert Garry Kasparov and became the
first computer to beat a human chess expert.
CONCEPTS OF HYPOTHESES :
In supervised learning, a hypothesis takes the form of a function that maps input
variables to output variables. The function is learned from a set of labeled examples,
where the input variables and their corresponding output variables are known. The
goal is to find the function that best fits the training data, and that can be used to
predict the output variables for new, unseen input variables.
In unsupervised learning, a hypothesis takes the form of a model that captures the
underlying structure or patterns in the data. The model is learned from unlabeled data,
and the goal is to identify the structure or patterns that can be used to explain the
data.
Hypothesis space is defined as a set of all possible legal hypotheses; hence it is also known as a
hypothesis set. It is used by supervised machine learning algorithms to determine the best possible
hypothesis to describe the target function or best maps input to output.
It is often constrained by choice of the framing of the problem, the choice of model, and the choice of
model configuration.
Hypothesis (h):
It is defined as the approximate function that best describes the target in supervised
machine learning algorithms. It is primarily based on data as well as bias and restrictions
applied to data.
Hence hypothesis (h) can be concluded as a single hypothesis that maps input to proper
output and can be evaluated as well as used to make predictions.
y = mx + b
Where,
y: output (range)
m: slope of the line, i.e., the change in y divided by the change in x
x: input (domain)
b: intercept (constant)
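For illustration, here is a minimal Python sketch (assuming NumPy is available) that learns such a linear hypothesis h(x) = mx + b from a small set of labeled examples by least squares; the data values below are made up:

import numpy as np

# Hypothetical 1-D training data (x, y); the learned hypothesis is h(x) = m*x + b.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

m, b = np.polyfit(x, y, deg=1)    # least-squares fit of a degree-1 polynomial
h = lambda x_new: m * x_new + b   # the single hypothesis h mapping input to output

print(f"h(x) = {m:.2f}x + {b:.2f}, prediction h(6) = {h(6.0):.2f}")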
VERSION SPACE :
A concept is complete if it covers all positive examples. A concept is consistent if it
covers none of the negative examples. The version space is the set of all complete and
consistent concepts. This set is convex and is fully defined by its least and most general
elements. The key idea in the CANDIDATE-ELIMINATION algorithm is to output a
description of the set of all hypotheses consistent with the training examples.
Representation :
The CANDIDATE-ELIMINATION algorithm finds all describable hypotheses that are
consistent with the observed training examples. In order to define this algorithm
precisely, we begin with a few basic definitions.
First, let us say that a hypothesis is consistent with the training examples if it correctly
classifies these examples.
Definition: A hypothesis h is consistent with a set of training examples D if and only if
h(x) = c(x) for each example (x, c(x)) in D.
Notice the key difference between this definition of consistent and our earlier
definition of satisfies. An example x is said to satisfy hypothesis h when h(x) = 1,
regardless of whether x is a positive or negative example of the target concept.
However, whether such an example is consistent with h depends on the target concept,
and in particular, whether h(x) = c(x).
The CANDIDATE-ELIMINATION algorithm represents the set of all hypotheses
consistent with the observed training examples. This subset of all hypotheses is called
the version space with respect to the hypothesis space H and the training examples D,
because it contains all plausible versions of the target concept.
Definition: The version space, denoted VS_{H,D}, with respect to hypothesis space H
and training examples D, is the subset of hypotheses from H consistent with the
training examples in D:
VS_{H,D} = { h ∈ H | Consistent(h, D) }
PERFORMANCE METRICS :
Confusion Matrix
It is the easiest way to measure the performance of a classification problem where the
output can be of two or more types of classes. A confusion matrix is nothing but a table
with two dimensions viz. “Actual” and “Predicted” and furthermore, both the dimensions
have “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)”, “False Negatives
(FN)” as shown below −
                    Predicted: Positive      Predicted: Negative
Actual: Positive    True Positives (TP)      False Negatives (FN)
Actual: Negative    False Positives (FP)     True Negatives (TN)
Accuracy
Accuracy is the fraction of predictions our model got right.
Accuracy = number of correct predictions / total number of predictions
i.e., Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision, often used in document retrieval, may be defined as the proportion of
returned documents that are actually correct. We can easily calculate it from the
confusion matrix with the help of the following formula −
Precision = TP / (TP + FP)
Recall or Sensitivity
Recall may be defined as the proportion of actual positives that are returned by our
ML model. We can easily calculate it from the confusion matrix with the help of the
following formula −
Recall = TP / (TP + FN)
Specificity
Specificity, in contrast to recall, may be defined as the proportion of actual
negatives that are correctly identified by our ML model. We can easily calculate it
from the confusion matrix with the help of the following formula −
Specificity = TN / (TN + FP)
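As an illustration, a small Python sketch that computes all four metrics from the confusion-matrix counts defined above (the counts in the example call are made up):

def classification_metrics(tp, tn, fp, fn):
    # The four metrics defined above, computed from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity / true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    return accuracy, precision, recall, specificity

print(classification_metrics(tp=45, tn=40, fp=5, fn=10))  # made-up counts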
AUC (Area Under ROC Curve):
AUC (Area Under Curve)-ROC (Receiver Operating Characteristic) is a performance
metric, based on varying threshold values, for classification problems. As the name
suggests, ROC is a probability curve and AUC measures separability. In simple words,
the AUC-ROC metric tells us about the capability of the model to distinguish between
the classes: the higher the AUC, the better the model.
Mathematically, the ROC curve is created by plotting TPR (True Positive Rate), i.e.,
Sensitivity or Recall, against FPR (False Positive Rate), i.e., 1 − Specificity, at
various threshold values.
[Figure: ROC curve with TPR on the y-axis and FPR on the x-axis; AUC is the area under this curve.]
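The following Python sketch (using NumPy, with made-up scores and labels) shows how the ROC points are obtained by sweeping the threshold, and how AUC can be approximated with the trapezoidal rule:

import numpy as np

# Hypothetical labels and scores, purely for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.7, 0.5])

P, N = y_true.sum(), (1 - y_true).sum()
tpr, fpr = [], []
for t in np.sort(np.unique(y_score))[::-1]:      # sweep thresholds high -> low
    pred = (y_score >= t).astype(int)
    tpr.append(((pred == 1) & (y_true == 1)).sum() / P)
    fpr.append(((pred == 1) & (y_true == 0)).sum() / N)

# AUC via the trapezoidal rule over the (FPR, TPR) points, anchored at (0,0) and (1,1).
xs, ys = [0.0] + fpr + [1.0], [0.0] + tpr + [1.0]
auc = sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2 for i in range(1, len(xs)))
print(f"AUC = {auc:.3f}")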
o Low Bias: A low bias model will make fewer assumptions about the form of the
target function.
o High Bias: A model with a high bias makes more assumptions, and the model
becomes unable to capture the important features of our dataset. A high bias
model also cannot perform well on new data.
Low variance means there is a small variation in the prediction of the target function
with changes in the training data set.
At the same time, High variance shows a large variation in the prediction of the target
function with changes in the training dataset.
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
o It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents
the outcome.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
OR
Every variable in a Boolean function, such as A, B, C, etc., has two possible values:
True and False.
Entropy(S) = − Σ_{i ∈ I} p(i) · log2 p(i)
Where,
S – the current dataset for which entropy is being calculated (this changes at every
iteration of the ID3 algorithm).
I – the set of classes in S (example: I = {yes, no}).
p(i) – the proportion of the number of elements in class i to the number of elements
in set S.
In ID3, entropy is calculated for each remaining attribute. The attribute with the
smallest (weighted) entropy after the split is used to split the set S on that
particular iteration.
Entropy = 0 implies a pure class, meaning all examples belong to the same category.
Information Gain IG(A) tells us how much uncertainty in S was reduced after
splitting set S on attribute A. The mathematical representation of information gain
is shown here −
IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Where,
Values(A) is the set of possible values of attribute A, and S_v is the subset of S
for which attribute A has value v.
ID3(Examples, Target_attribute, Attributes):
Create a Root node for the tree
If all Examples are positive, Return the single-node tree Root, with label = + / yes
If all Examples are negative, Return the single-node tree Root, with label = − / no
If Attributes is empty, Return the single-node tree Root, with label = most
common value of Target_attribute in Examples
Otherwise Begin
A ← the attribute from Attributes that best classifies Examples (highest information gain)
The decision attribute for Root ← A
For each possible value v of A:
Add a new tree branch below Root, corresponding to the test A = v
Let Examples_v be the subset of Examples that have value v for A
If Examples_v is empty
Then below this new branch add a leaf node with label = most
common value of Target_attribute in Examples
Else below this new branch add the subtree ID3(Examples_v, Target_attribute, Attributes − {A})
End
Return Root
Here, the dataset has binary classes (yes and no), where 9 out of 14 examples are
"yes" and 5 out of 14 are "no".
For each attribute of the dataset, let's follow step 2 of the pseudocode:
Here, the attribute with maximum information gain is Outlook. So, Outlook becomes
the root node of the decision tree built so far.
Now, finding the best attribute for splitting the data with Outlook = Sunny values
{Dataset rows = [1, 2, 8, 9, 11]}.
Here, the attribute with maximum information gain is Humidity. So, Humidity is added
below the Outlook = Sunny branch of the decision tree built so far.
Here, when Outlook = Sunny and Humidity = High, it is a pure class of category "no". And
When Outlook = Sunny and Humidity = Normal, it is again a pure class of category "yes".
Therefore, we don't need to do further calculations.
Now, finding the best attribute for splitting the data with Outlook = Rain values
{Dataset rows = [4, 5, 6, 10, 14]}.
Here, when Outlook = Rain and Wind = Strong, it is a pure class of category "no". And
When Outlook = Rain and Wind = Weak, it is again a pure class of category "yes".
And this is our final desired tree for the given dataset.
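To make these gain calculations concrete, here is a small Python sketch of the entropy and information-gain formulas used at each step, assuming the standard PlayTennis table (the class counts of 9 yes / 5 no reproduce the root entropy of about 0.94, and Outlook's gain of about 0.247):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum p(i) * log2 p(i) over the classes in S.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    # IG(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    total = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Outlook column and class labels from the standard 14-row PlayTennis dataset.
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(round(entropy(labels), 3))                    # 0.94
print(round(information_gain(outlook, labels), 3))  # 0.247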
Issues in Decision tree learning :
Avoiding overfitting the data.
Incorporating continuous-valued attributes.
Alternative measures for selecting attributes.
Handling training examples with missing attribute values.
Handling attributes with differing costs.
INDUCTIVE BIAS IN DECISION TREES :
Given a collection of training examples, there are typically many decision trees
consistent with these examples.
Describing the inductive bias of ID3 therefore consists of describing the basis by
which it chooses one of these consistent hypotheses over the others.
Which of these decision trees does ID3 choose?
It chooses the first acceptable tree it encounters in its simple-to-complex,
hill-climbing search through the space of possible trees.
1) Approximate inductive bias of ID3:
o Shorter trees are preferred over longer trees.
2) A closer approximation to the inductive bias of ID3:
o Shorter trees are preferred over longer trees. Trees that place
high-information-gain attributes close to the root are preferred over those
that do not.
FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS :
How can we use the more-general-than partial ordering to organize the search for a
hypothesis consistent with the observed training examples? One way is to begin with
the most specific possible hypothesis in H, then generalize this hypothesis each time
it fails to cover an observed positive training example. To be more precise about how
the partial ordering is used, consider the FIND-S algorithm.
Algorithm
Initialize h to the most specific hypothesis in H
For each positive training instance x
o For each attribute constraint a_i in h
If the constraint a_i is satisfied by x
then do nothing
Else replace a_i in h by the next more general constraint that is
satisfied by x
Output hypothesis h
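A minimal Python sketch of FIND-S for conjunctive hypotheses, using the standard EnjoySport-style textbook examples for illustration:

def find_s(examples):
    # FIND-S: examples is a list of (attribute_tuple, is_positive).
    h = None                              # most specific hypothesis (covers nothing)
    for x, positive in examples:
        if not positive:
            continue                      # FIND-S ignores negative examples
        if h is None:
            h = list(x)                   # first positive example: copy it
        else:
            # Replace any constraint that x fails to satisfy with '?'.
            h = [hi if hi == xi else '?' for hi, xi in zip(h, x)]
    return h

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
print(find_s(data))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']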
CANDIDATE – ELIMINATION LEARNING ALGORITHM :
The CANDIDATE-ELIMINATION algorithm computes the version space containing all
hypotheses from H that are consistent with an observed sequence of training
examples.
It begins by initializing the version space to the set of all hypotheses in H; that is, by
initializing the G boundary set to contain the most general hypothesis in H
G0 ← { ⟨?, ?, ?, ?, ?, ?⟩ }
and initializing the S boundary set to contain the most specific (least general)
hypothesis
S0 ← { ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩ }
These two boundary sets delimit the entire hypothesis space, because every other
hypothesis in H is both more general than S0 and more specific than G0. As each
training example is considered, the S and G boundary sets are generalized and
specialized, respectively, to eliminate from the version space any hypotheses found
inconsistent with the new training example. After all examples have been processed,
the computed version space contains all the hypotheses consistent with these
examples and only these hypotheses.
The algorithm is summarized below.
Initialize G to the set of maximally general hypotheses in H.
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
o If d is a positive example
♦ Remove from G any hypothesis inconsistent with d
♦ For each hypothesis s in S that is not consistent with d
Remove s from S
Add to S all minimal generalizations h of s such that
h is consistent with d, and some member of G is
more general than h
Remove from S any hypothesis that is more general
than another hypothesis in S
o If d is a negative example
♦ Remove from S any hypothesis inconsistent with d
♦ For each hypothesis g in G that is not consistent with d
Remove g from G
Add to G all minimal specializations h of g such that
h is consistent with d, and some member of S is
more specific than h
Remove from G any hypothesis that is less general than
another hypothesis in G
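The following is a compact Python sketch of CANDIDATE-ELIMINATION under the simplifying assumption of a conjunctive hypothesis space, where S can be kept as a single maximally specific hypothesis; attribute domains must be supplied so that minimal specializations of G can be enumerated:

def consistent(h, x):
    # h classifies instance x as positive.
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

def more_general(h1, h2):
    # True if h1 is more general than or equal to h2 ('0' = empty constraint).
    return all(a == '?' or a == b or b == '0' for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    n = len(domains)
    S = tuple('0' for _ in range(n))      # single maximally specific hypothesis
    G = {tuple('?' for _ in range(n))}    # set of maximally general hypotheses
    for x, positive in examples:
        if positive:
            # Remove from G hypotheses inconsistent with x; minimally generalize S.
            G = {g for g in G if consistent(g, x)}
            S = tuple(xi if si == '0' else (si if si == xi else '?')
                      for si, xi in zip(S, x))
        else:
            # Specialize members of G that wrongly cover the negative example.
            new_G = set()
            for g in G:
                if not consistent(g, x):
                    new_G.add(g)
                    continue
                for i in range(n):
                    if g[i] == '?':
                        for v in domains[i]:
                            # Keep only specializations that stay above S.
                            if v != x[i] and S[i] in ('0', v):
                                h = list(g)
                                h[i] = v
                                new_G.add(tuple(h))
            # Drop members that are less general than another member of G.
            G = {g for g in new_G
                 if not any(g2 != g and more_general(g2, g) for g2 in new_G)}
    return S, G

# EnjoySport data (Sky, AirTemp, Humidity, Wind, Water, Forecast):
domains = [('Sunny', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]
S, G = candidate_elimination(examples, domains)
print("S:", S)   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print("G:", G)   # {('Sunny', ?, ...), (?, 'Warm', ...)}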
Training algorithm:
● For each training example (x, f(x)), add the example to the list
training_examples.
Classification algorithm:
● Given a query instance x_q to be classified, let x_1 … x_k denote the k instances
from training_examples that are nearest to x_q.
● Return the estimate f̂(x_q) ← the most common value of f among these k nearest
training examples.
Backpropagation
1. Backpropagation, or backward propagation of errors, is an algorithm
that propagates error terms backward from the output nodes toward the
input nodes in order to compute weight updates. It is an important
mathematical tool for improving the accuracy of predictions in data
mining and machine learning.
2. Backpropagation learns the weights by minimizing the squared error between
the network outputs and the target values, summed over all of the network
output units and all training examples:
E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²
where t_kd and o_kd are the target and output values associated with the kth
output unit and training example d.
3. The learning problem faced by BACKPROPAGATION is to search a
large hypothesis space defined by all possible weight values for all the
units in the network
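As a concrete illustration of this weight-space search, here is a minimal NumPy sketch of gradient-descent backpropagation for a tiny two-layer sigmoid network trained on XOR; the architecture, learning rate, and epoch count are arbitrary illustrative choices, and convergence may vary with the random seed:

import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-2-1 sigmoid network learning XOR by batch gradient descent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)      # target values t

W1 = rng.normal(0, 0.5, (2, 2)); b1 = np.zeros(2)    # input -> hidden weights
W2 = rng.normal(0, 0.5, (2, 1)); b2 = np.zeros(1)    # hidden -> output weights

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

eta = 0.5                                            # learning rate
for epoch in range(20000):
    H = sigmoid(X @ W1 + b1)                         # forward pass, hidden layer
    O = sigmoid(H @ W2 + b2)                         # forward pass, output layer
    delta_o = (T - O) * O * (1 - O)                  # output-unit error terms
    delta_h = H * (1 - H) * (delta_o @ W2.T)         # hidden-unit error terms
    W2 += eta * H.T @ delta_o; b2 += eta * delta_o.sum(axis=0)
    W1 += eta * X.T @ delta_h; b1 += eta * delta_h.sum(axis=0)

print(np.round(O.ravel(), 2))   # should approach [0, 1, 1, 0]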
Multilayer Networks
[Figure: decision surfaces of a multilayer network trained to distinguish among 10
possible vowels, all spoken in the context of "h_d" (i.e., "hid," "had," ...).]
1. Multiple layers of cascaded linear units still produce only linear functions,
and we prefer networks capable of representing highly nonlinear functions.
2. The sigmoid unit is a unit very much like a perceptron, but based on a
smoothed, differentiable threshold function.
3. The sigmoid unit is illustrated below:
The sigmoid unit first computes a linear combination of its inputs, then applies
a threshold to the result:
o = σ(w · x), where σ(y) = 1 / (1 + e^(−y))
Where,
σ = the sigmoid function.
The term e^(−y) in the sigmoid function definition is sometimes replaced by
e^(−k·y), where k is a positive constant that determines the steepness of the
threshold.
➔ Activation Units
Sigmoid accepts a number as input and returns a number between 0 and 1. It's
simple to use and has all the desirable qualities of activation functions. It is
mainly used in binary classification problems, since the sigmoid function gives an
output between 0 and 1 that can be interpreted as a probability.
ReLU stands for Rectified Linear Unit and is one of the most commonly used
activation functions. It does not suffer from a vanishing gradient, because the
maximum value of the gradient of the ReLU function is one. It also solves the
problem of saturating neurons, since the slope is never zero for positive inputs.
Leaky ReLU is an upgraded version of the ReLU activation function that solves the
dying-ReLU problem, as it has a small positive slope in the negative area. But its
outputs for negative inputs are not always consistent across problems.
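The three activation functions discussed above can be written as a short NumPy sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)              # gradient is 1 for x > 0, 0 otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope alpha for negative inputs

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(relu(x))
print(leaky_relu(x))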
Support Vector Machines (SVMs) are a type of machine learning algorithm used for
classification and regression analysis. SVMs work by finding a boundary or hyperplane that
maximally separates classes or groups of data in high dimensional space. This boundary is
chosen to have the largest margin or distance to the nearest data points of the different
classes, making the SVM robust to noise and outliers in data.
SVMs have many practical applications in fields such as image recognition, natural language
processing, and finance to name a few. They are effective in solving linear and nonlinear
classification problems, as well as regression tasks. SVMs can also handle large datasets
and can be optimized through kernel functions to handle non-linearly separable data.
One of the limitations of SVMs is that they can be computationally expensive when trained
on large datasets, or when the number of features is very high in the data. Additionally,
SVMs require careful tuning of hyperparameters such as the kernel function and
regularization coefficient in order to achieve optimal performance. Finally, the interpretability
of the SVM model is limited as compared to other machine learning models like decision
trees or linear regression.
Support Vectors: the data points that lie closest to the hyperplane; the
separating line will be defined with the help of these data points.
Margin: it is the distance between the hyperplane and the observations (support
vectors) that lie closest to the hyperplane.
In Support Vector Machine (SVM), margin refers to the distance between the
decision boundary and the closest data points from both classes. The larger
the margin, the better the SVM model's generalization ability and robustness
against overfitting. Therefore, an SVM with a larger margin is preferable, as
it helps to minimize the classification error and generalize the model better.
In summary, margin and maximization are critical elements in SVM that help to
create a robust and generalized model with better classification accuracy. By
maximizing the margin, SVM finds the best separating hyperplane for the
given dataset, ensuring the model is more accurate in predicting the unseen
data.
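A minimal sketch using scikit-learn's SVC (assuming scikit-learn is available; the Iris data and hyperparameter values are only illustrative):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C is the regularization coefficient; the kernel handles non-linear separability.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("support vectors per class:", clf.n_support_)
print("test accuracy:", clf.score(X_test, y_test))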
To solve this optimization problem, we can use the Lagrangian dual technique.
We introduce a set of Lagrange multipliers, one for each constraint, and
construct a new function by combining the objective function and the
constraints. We then find the dual problem by maximizing this function with
respect to the Lagrange multipliers.
The solution to the dual problem provides us with a set of values for the
Lagrange multipliers, which can in turn be used to calculate the optimal
decision boundary. This boundary is given by the support vectors, which are
the data points that lie closest to the decision boundary and have non-zero
Lagrange multipliers.
The lagrangian dual technique also has the advantage of allowing us to
handle non-linearly separable data by using the kernel trick. This involves
transforming the data into a higher-dimensional space where it can be linearly
separable, and then solving the dual problem in that space.
Once we have the Lagrangian form of the problem, we can take the partial
derivative of the function with respect to each variable and set it to zero. This
will give us a set of equations that can be solved for the Lagrange multipliers.
Once we have the Lagrange multipliers, we can use them to find the optimal
solution to the original problem. This is done by substituting the values of the
Lagrange multipliers back into the original optimization problem.
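For reference, the optimization problem and its Lagrangian dual sketched above can be written out in standard textbook notation (hard-margin case; the α_i are the Lagrange multipliers):

\min_{\mathbf{w},b}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^2 \quad \text{subject to}\quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 \quad \forall i

L(\mathbf{w},b,\boldsymbol{\alpha}) = \tfrac{1}{2}\lVert \mathbf{w}\rVert^2 - \textstyle\sum_i \alpha_i \big[\, y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \,\big]

% Setting \partial L/\partial\mathbf{w} = 0 and \partial L/\partial b = 0 gives
% \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i and \sum_i \alpha_i y_i = 0, so the dual is:

\max_{\boldsymbol{\alpha}}\ \textstyle\sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j \quad \text{subject to}\quad \alpha_i \ge 0,\ \textstyle\sum_i \alpha_i y_i = 0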
MACHINE LEARNING UNIT-III PROBABILISTIC AND STOCHASTIC MODELS
Bayesian Learning
Bayes Theorem : It is a cornerstone of the Bayesian learning method. It is a way to
calculate the posterior probability P(h|D) from the prior probability P(h), together
with P(D) and P(D|h):
P(h|D) = P(D|h) · P(h) / P(D)
In many learning scenarios the learner considers some set of candidate hypotheses H
and is interested in finding the most probable hypothesis h ∈ H given the observed
data D.
Any such maximally probable hypothesis is called a maximum a posteriori (MAP)
hypothesis.
We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior
probability of each candidate hypothesis.
More precisely, we will say that h_MAP is a MAP hypothesis provided
h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) · P(h) / P(D) = argmax_{h∈H} P(D|h) · P(h)
The Bayes Optimal Classifier is a probabilistic model that predicts the most likely
outcome for a new situation. Below, we'll have a look at the Bayes optimal classifier
and the Naive Bayes classifier.
The Bayes Theorem, which provides a systematic means of computing a conditional
probability, is used to describe it. It’s also related to Maximum a Posteriori (MAP), a
probabilistic framework for determining the most likely hypothesis for a training
dataset.
Take a hypothesis space that has 3 hypotheses h1, h2, and h3.
The posterior probabilities of the hypotheses are as follows:
h1 -> 0.4
h2 -> 0.3
h3 -> 0.3
Hence, h1 is the MAP hypothesis. (MAP => max posterior).
Suppose a new instance x is encountered, which is classified negative by h2 and h3 but
positive by h1.
Taking all hypotheses into account, the probability that x is positive is .4 and the
probability that it is negative is therefore .6.
The classification generated by the MAP hypothesis is different from the most probable
classification in this case which is negative.
If the new example's probable classification can be any value vj from a set V, the
probability P(vj|D) that the right classification for the new instance is vj is merely
P(vj|D) = Σ_{hi∈H} P(vj|hi) · P(hi|D)
The denominator is omitted since we're only using this for comparison and all the
values of P(vj|D) would have the same denominator. The value vj for which P(vj|D) is
maximum is the best classification for the new instance.
A Bayes optimal classifier is a system that classifies new instances according to the
equation above. This strategy maximizes the probability that the new instance will be
classified correctly.
The MAP theory, therefore, argues that the robot should proceed forward (F). Let’s see
what the Bayes optimal procedure suggests.
Thus, the Bayes optimal procedure recommends the robot turn left.
Gibbs Algorithm :
• The Bayes optimal classifier is quite costly to apply. It computes the posterior
probability for every hypothesis in H and combines the predictions of all hypotheses
to classify each new instance.
• An alternative (less optimal) method is the Gibbs algorithm:
1. Choose a hypothesis h from H at random, according to the posterior probability
distribution over H.
2. Use h to predict the classification of the next instance x.
The Bayesian approach to classifying a new instance is to assign the most probable
target value, v_MAP, given the attribute values a1, a2, …, an that describe the
instance:
v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)
Under the naive assumption that the attribute values are conditionally independent
given the target value, this becomes the naive Bayes classifier:
v_NB = argmax_{vj ∈ V} P(vj) · Π_i P(ai | vj)
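As an illustration, here is a small Python sketch of this v_NB rule, trained on the table from problem 3 below (maximum-likelihood estimates without smoothing):

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    # Estimate P(v) and P(a_i | v) from a small categorical dataset.
    priors = Counter(labels)
    n = len(labels)
    cond = defaultdict(Counter)           # counts keyed by (attr_index, label)
    for x, v in zip(rows, labels):
        for i, a in enumerate(x):
            cond[(i, v)][a] += 1
    def classify(x):
        # v_NB = argmax_v P(v) * prod_i P(a_i | v)
        best, best_p = None, -1.0
        for v, cnt in priors.items():
            p = cnt / n
            for i, a in enumerate(x):
                p *= cond[(i, v)][a] / cnt
            if p > best_p:
                best, best_p = v, p
        return best, best_p
    return classify

# Records 1-10 of the (A, B, C, Class) table from problem 3.
rows = [(0,0,0), (0,0,1), (0,1,1), (0,1,1), (0,0,1),
        (1,0,1), (1,0,1), (1,0,1), (1,1,1), (1,0,1)]
labels = ['+', '-', '-', '-', '+', '+', '-', '-', '+', '+']
classify = train_naive_bayes(rows, labels)
print(classify((1, 0, 1)))   # ('+', 0.192) beats ('-', 0.12)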
PROBLEMS
Solution :
2. Estimate the conditional probabilities of each attribute {Color, Legs, Height,
Smiley} for the classes {M, H} using the data table given below. Using these
probabilities, estimate the probability values for the new instance
(Color = Green, Legs = 2, Height = Tall, Smiley = No).
Solution:
3. Classify the new instance (Red, SUV, Domestic) as stolen or not, based on the
data set given below.
Record A B C Class
1 0 0 0 +
2 0 0 1 -
3 0 1 1 -
4 0 1 1 -
5 0 0 1 +
6 1 0 1 +
7 1 0 1 -
8 1 0 1 -
9 1 1 1 +
10 1 0 1 +
Solution :
EXPECTATION MAXIMIZATION :
In real-world applications of ML, it is very common that many relevant features are
available for learning, but only a small subset of them are observable.
ALGORITHM :
1. Initialize the parameters with random values.
2. Expectation step (E-step): using the observed data and the current parameter
estimates, estimate the expected values of the missing or latent variables.
3. Maximization step (M-step): re-estimate the parameters so as to maximize the
expected likelihood computed in the E-step.
4. Repeat steps 2–3 until the parameter estimates converge.
USES :
• Used to fill missing data in a sample.
• Used as a basis of unsupervised learning of clusters.
• Used for the purpose of estimating the parameters of within Hidden Markov Model
(HMM).
FLOWCHART :
Start → Expectation step (E-step) → Maximization step (M-step) → convergence?
If no, return to the E-step; if yes, stop.
ADVANTAGES :
• The likelihood is guaranteed not to decrease at each iteration.
• The E-step and M-step are often easy to implement and have closed-form solutions
for many problems.
DISADVANTAGES:
• Convergence can be slow.
• It can converge to a local optimum, so the result depends on the initialization.
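As a quick illustration, scikit-learn's GaussianMixture runs exactly this E-step/M-step loop (assuming scikit-learn is available; the two 1-D Gaussian clusters below are synthetic):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hypothetical 1-D Gaussian clusters with a hidden (unobserved) assignment.
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("means:", gmm.means_.ravel())
print("converged:", gmm.converged_, "iterations:", gmm.n_iter_)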
Hierarchical Clustering:
• Hierarchical Clustering is basically an unsupervised learning technique which involves creating clusters in a
predefined order. The clusters are ordered in a top-down manner or bottom-up manner.
• In this type of clustering, similar clusters are grouped together and are arranged in a hierarchical manner.
• It can be further divided into two types of Agglomerative hierarchical clustering and Divisive hierarchical
clustering.
• In this clustering, we link pairs of clusters until all the data objects are grouped in a hierarchical manner.
Non-Hierarchical Clustering:
• Non-Hierarchical Clustering involves formation of new clusters by merging or splitting the clusters.
• It does not follow a tree like structure like in hierarchical clustering.
• This technique groups the data in order to maximize or minimize some evaluation criteria.
• In this method, the partitions are made such that the non-overlapping groups have no hierarchical
relationships between themselves.
• K-means clustering is an effective way of non-hierarchical clustering.
Agglomerative Clustering:
• It is also known as bottom-up approach or hierarchical agglomerative clustering (HAC).
• It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering.
• This clustering algorithm does not require us to specify the number of clusters.
• The bottom-up algorithm treats each data point as a singleton cluster at the outset and then successively
agglomerates pairs of clusters until all clusters have been merged into a single cluster that contains all the
data.
Divisive Clustering:
• Also known as the top-down approach. This algorithm also does not require us to specify the number of clusters.
• Top-down clustering requires a method for splitting a cluster that contains the whole data, and it proceeds
by splitting clusters recursively until individual data points have been separated into singleton clusters.
K-means Clustering
K-means is a popular unsupervised machine learning algorithm used for clustering or grouping similar data points
in a dataset. The algorithm tries to partition a given dataset into a predefined number of clusters (K), where
each data point belongs to the cluster with the closest mean value.
K-means Algorithm for Clustering.
• The K-means algorithm is an unsupervised learning algorithm.
• Given a dataset of items with certain features and values for those features, the algorithm will categorize the
items into k groups or clusters of similarity.
• To calculate the similarity, we can use the Euclidean distance, Manhattan distance, Hamming distance, or cosine
distance as the measurement.
• The pseudocode for implementing the k-means algorithm is given below.
Input: k (number of clusters), D (dataset of data points)
1) Choose k random data points as initial centroids (cluster centres).
2) Repeat till the cluster centres stabilize:
a) Allocate each point in D to the nearest of the k centroids.
b) Compute the centroid for each cluster using all points in the cluster.
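A direct NumPy translation of this pseudocode (Euclidean distance; the two-blob dataset is synthetic, for illustration):

import numpy as np

def kmeans(D, k, iters=100, seed=0):
    # Plain k-means following the pseudocode above.
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), size=k, replace=False)]       # step 1
    for _ in range(iters):                                         # step 2
        # 2a) allocate each point to the nearest centroid
        dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b) recompute each centroid from its cluster's points
        new = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break                                                  # centres stabilized
        centroids = new
    return centroids, labels

rng = np.random.default_rng(1)
D = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 4])
centroids, labels = kmeans(D, k=2)
print(centroids)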
Bisecting K-means Algorithm
1) Initialize the list of clusters to contain a single cluster consisting of all points.
2) Repeat:
Remove a cluster from the list of clusters. {Perform several trial bisections of the selected cluster.}
For i = 1 to number of trials:
bisect the selected cluster using basic K-means
End for
Select the two clusters from the bisection with the least total SSE [Sum of Squared Errors] and add them to
the list of clusters.
Until the list of clusters contains k clusters.
Advantages of K-Medoid
1. The K-medoids algorithm is robust to outliers and noise, as the medoids are the most central data points of the
cluster, such that their distance from the other points is minimal.
2. K-Medoids algorithm can be used with arbitrarily chosen dissimilarity measures (e.g., cosine similarity) or any
distance metric.
Disadvantages of K-Medoid
1. The K-medoids algorithm has large time complexity. Therefore, it works efficiently for small datasets but
doesn't scale well for large datasets.
2. The K-Medoid algorithm is not suitable for clustering non-spherical (arbitrary-shaped) groups of objects.
Chapter2:- Association Mining
Apriori Algorithm:-
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on
databases that contain transactions. With the help of these association rules, it determines how strongly
or how weakly two objects are connected. This algorithm uses a breadth-first search and a Hash Tree to
calculate the itemset associations efficiently. It is an iterative process for finding the frequent itemsets
in a large dataset.
This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket
analysis and helps to find those products that can be bought together. It can also be used in the healthcare
field to find drug reactions for patients.
Frequent Itemset-
Frequent itemsets are those itemsets whose support is greater than the threshold value or the user-specified
minimum support. It means that if {A, B} is a frequent itemset, then A and B individually should also be
frequent itemsets.
Suppose there are two transactions, A = {1,2,3,4,5} and B = {2,3,7}; in these two transactions, 2 and 3 are
the frequent itemsets.
Steps for Apriori Algorithm
Below are the steps for the apriori algorithm:
Step-1: Determine the support of the itemsets in the transactional database, and select the minimum support
and confidence.
Step-2: Take all the itemsets in the transactions with a support value higher than the minimum or selected
support value.
Step-3: Find all the rules of these subsets that have a higher confidence value than the threshold or
minimum confidence.
Step-4: Sort the rules in decreasing order of lift.
Solution:
Step-1: Calculating C1 and L1:
In the first step, we will create a table that contains the support count (the frequency of each itemset
in the dataset) of each itemset in the given dataset. This table is called the candidate set, C1.
Now, we will take out all the itemsets that have a support count greater than the minimum support (2). This
will give us the table for the frequent itemset L1.
Since all the itemsets have a support count greater than or equal to the minimum support, except E, the E
itemset will be removed.
Again, we need to compare the C2 support count with the minimum support count, and after comparing, any
itemset with a smaller support count will be eliminated from table C2. This gives us the table for L2.
Now we will create the L3 table. As we can see from the C3 table, there is only one combination of
itemsets that has a support count equal to the minimum support count. So, L3 will have only one
combination, i.e., {A, B, C}.
Association rule learning is one of the very important concepts of machine learning, and it is employed
in market basket analysis, web usage mining, continuous production, etc. Here, market basket analysis is a
technique used by various big retailers to discover the associations between items. We can understand it by taking
the example of a supermarket: in a supermarket, all products that are purchased together are put together.
For example, if a customer buys bread, he most likely will also buy butter, eggs, or milk, so these products are
stored within a shelf or mostly nearby.
Association rule learning can be divided into the following types of algorithms:
• Apriori Algorithm
• F-P Growth Algorithm
Support-
In data mining, support refers to the relative frequency of an itemset in a dataset.
For example, if an itemset occurs in 5% of the transactions in a dataset, it has a support of 5%.
Support is often used as a threshold for identifying frequent itemsets in a dataset, which can then be used to
generate association rules.
In general, the support of an itemset can be calculated using the formula:
Support(X) = number of transactions containing X / total number of transactions
Where X is the itemset for which you are calculating the support.
Confidence-
In data mining, confidence is a measure of the reliability of, or support for, a given association rule. It is
defined as the proportion of cases in which the association rule holds true, or, in other words, the percentage
of times that the items in the antecedent appear in the same transaction as the items in the consequent.
For example, suppose we have a dataset of 1000 transactions, the itemset {A, B} appears in 100 of those
transactions, and the itemset {A} appears in 200 of those transactions. The confidence of the rule A → B would
then be Confidence(A → B) = 100 / 200 = 0.5, i.e., 50%.
In general:
Confidence(X → Y) = number of transactions containing X ∪ Y / number of transactions containing X
Where X and Y are the itemsets for which you are calculating the confidence.
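Both measures in a short Python sketch, evaluated here on the six transactions used in the FP-growth example further below:

def support(transactions, itemset):
    # Support(X) = transactions containing X / total transactions
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    # Confidence(X -> Y) = Support(X ∪ Y) / Support(X)
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

transactions = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
                {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
print(support(transactions, {"I2", "I3"}))        # 4/6 ≈ 0.67
print(confidence(transactions, {"I2"}, {"I3"}))   # 4/5 = 0.80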
FP growth- FP tree
The FP-Growth algorithm was proposed by Han et al. in 2000. It is an efficient and scalable method for mining the
complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing
compressed and crucial information about frequent patterns, named the frequent-pattern tree (FP-tree). In his
study, Han proved that his method outperforms other popular methods for mining frequent patterns, e.g., the
Apriori algorithm and TreeProjection. In some later works, it was proved that FP-Growth performs better than other
methods, including Eclat and Relim. The popularity and efficiency of the FP-Growth algorithm have contributed to
many studies that propose variations to improve its performance.
FP Growth Algorithm
The FP-Growth Algorithm is an alternative way to find frequent item sets without using candidate generations, thus
improving performance. For so much, it uses a divide-and-conquer strategy. The core of this method is the usage of a
special data structure named frequent-pattern tree (FP-tree), which retains the item set association information.
This algorithm works as follows:
➢ First, it compresses the input database creating an FP-tree instance to represent frequent items.
➢ After this first step, it divides the compressed database into a set of conditional databases, each associated
with one frequent pattern.
➢ Finally, each such database is mined separately.
Using this strategy, the FP-Growth reduces the search costs by recursively looking for short patterns and then
concatenating them into the long frequent patterns.
In large databases, holding the FP tree in the main memory is impossible. A strategy to cope with this problem is to
partition the database into a set of smaller databases (called projected databases) and then construct an FP-tree
from each of these smaller databases.
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative information about
frequent patterns in a database. Each transaction is read and then mapped onto a path in the FP-tree. This
is done until all transactions have been read. Different transactions with common subsets allow the tree to
remain compact because their paths overlap.
A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the FP tree is to
mine the most frequent pattern. Each node of the FP tree represents an item of the item set.
The root node represents null, while the lower nodes represent the item sets. The associations of the
nodes with the lower nodes, that is, the item sets with the other item sets, are maintained while forming
the tree.
Algorithm by Han
Algorithm 1: FP-tree construction
Input: A transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. The first step is to scan the database to find the occurrences of the itemsets in the database. This step is the
same as the first step of Apriori. The count of 1-itemsets in the database is called support count or frequency
of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by
null.
3. The next step is to scan the database again and examine the transactions. Examine the first transaction and
find out the itemset in it. The itemset with the max count is taken at the top, and then the next itemset with
the lower count. It means that the branch of the tree is constructed with transaction itemsets in descending
order of count.
4. The next transaction in the database is examined. The itemsets are ordered in descending order of count. If
any itemset of this transaction is already present in another branch, then this transaction branch would share
a common prefix to the root.
This means that the common itemset is linked to the new node of another itemset in this transaction.
5. Also, the count of the itemset is incremented as it occurs in the transactions. The common node and new
node count are increased by 1 as they are created and linked according to transactions.
6. The next step is to mine the created FP Tree. For this, the lowest node is examined first, along with the links
of the lowest nodes. The lowest node represents the frequency pattern length 1. From this, traverse the path
in the FP Tree. This path or paths is called a conditional pattern base.
A conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring with the
lowest node (suffix).
7. Construct a Conditional FP Tree, formed by a count of itemsets in the path. The itemsets meeting the
threshold support are considered in the Conditional FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects and sorts the set of
frequent items, and the second constructs the FP-Tree.
Example
Table 1: Transaction database
Transaction   Items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4
Count of each item:
Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2
Items sorted in descending order of count (I5 is dropped as it does not meet the minimum support):
Item   Count
I2     5
I1     4
I3     4
I4     4
Build FP Tree
1. Considering the root node null.
2. The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1}, {I3:1}, where I2 is linked as a child,
I1 is linked to I2 and I3 is linked to I1.
3. T2: I2, I3, and I4 contain I2, I3, and I4, where I2 is linked to root, I3 is linked to I2 and I4 is linked to I3. But this
branch would share the I2 node as common as it is already used in T1.
4. Increment the count of I2 by 1, and I3 is linked as a child to I2, and I4 is linked as a child to I3. The count is
{I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node. Hence it will be
incremented by 1. Similarly I1 will be incremented by 1 as it is already linked with I2 in T1, thus {I2:3}, {I1:2},
{I4:1}.
7. T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
Dimensionality Reduction
o Dimensionality reduction is the process of reducing the number of features (or dimensions)
in a dataset while retaining as much information as possible.
o This can be done for a variety of reasons:
- To reduce the complexity of a model.
- To improve the performance of a learning algorithm.
- To make it easier to visualize the data.
• A 3-D classification problem can be hard to visualize.
• A 3-D feature space can be split into two 2-D feature spaces, and later, if the features are
found to be correlated, the number of features can be reduced even further.
Applications:
➢ Calculation of the pseudo-inverse
➢ Rank, range, and null space
➢ Curve fitting problems
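A short NumPy sketch showing dimensionality reduction via truncated SVD, which also yields the pseudo-inverse and rank mentioned in the applications list (the dataset is random, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # hypothetical 5-feature dataset
Xc = X - X.mean(axis=0)                   # center the features

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T                 # project onto the top-2 right singular vectors

# The same factorization also gives the pseudo-inverse and the rank:
pinv = Vt.T @ np.diag(1 / s) @ U.T
rank = int(np.sum(s > 1e-10))
print(X_reduced.shape, rank)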
UNIT-5
GENETIC ALGORITHMS
In GAs, the "best hypothesis" is defined as the one that optimizes a predefined
numerical measure for the problem at hand, called the hypothesis fitness.
GA(Fitness, Fitness_threshold, p, r, m)
Fitness: a function that assigns an evaluation score, given a hypothesis.
Fitness_threshold: a threshold specifying the termination criterion.
p: the number of hypotheses to be included in the population.
r: the fraction of the population to be replaced by Crossover at each step.
m: the mutation rate.
1. Select: probabilistically select (1 − r)·p members of P to add to Ps. The probability Pr(hi) of
selecting hypothesis hi from P is given by
Pr(hi) = Fitness(hi) / Σj Fitness(hj)      (9.1)
2. Crossover: probabilistically select (r·p)/2 pairs of hypotheses from P, according to Pr(hi) given by
Equation (9.1). For each pair ⟨h1, h2⟩, produce two offspring by applying the Crossover operator, and
add all offspring to Ps.
3. Mutate: choose m percent of the members of Ps with uniform probability. For each, invert
one randomly selected bit in its representation.
4. Update: P ← Ps.
A prototypical genetic algorithm is described in Table 9.1. The inputs to this algorithm include the
fitness function for ranking candidate hypotheses, a threshold defining an acceptable level of fitness
for terminating the algorithm, the size of the population to be maintained, and parameters that
determine how successor populations are to be generated: the fraction of the population to be
replaced at each generation and the mutation rate.
Notice in this algorithm each iteration through the main loop produces a new generation of
hypotheses based on the current population.
First, a certain number of hypotheses from the current population are selected for inclusion in the
next generation. These are selected probabilistically, where the probability of selecting hypothesis hi
is given by
Thus, the probability that a hypothesis will be selected is proportional to its own fitness and is
inversely proportional to the fitness of the other competing hypotheses in the current population.
Once these members of the current generation have been selected for inclusion in the next
generation population, additional members are generated using a crossover operation.
Crossover, defined in detail in the next section, takes two parent hypotheses from the current
generation and creates two offspring hypotheses by recombining portions of both parents.
The parent hypotheses are chosen probabilistically from the current population, again using the
probability function given by Equation (9.1).
After new members have been created by this crossover operation, the new generation population
now contains the desired number of members. At this point, a certain fraction m of these members
are chosen at random, and random mutations are performed to alter these members.
This GA algorithm thus performs a randomized, parallel beam search for hypotheses that perform
well according to the fitness function. In the following subsections, we describe in more detail the
representation of hypotheses and the genetic operators used in this algorithm.
Representing Hypotheses
Hypotheses in GAs are often represented by bit strings, so that they can be easily manipulated by genetic
operators such as mutation and crossover.
Example :
Consider the attribute Outlook, which can take on any of the three values Sunny, Overcast, or Rain. One
obvious way to represent a constraint on Outlook is to use a bit string of length three.
In which each bit position corresponds to one of its three possible values.
Placing a 1 in some position indicates that the attribute is allowed to take on the corresponding value.
For example :
The string 010 represents the constraint that Outlook must take on the second of these values,
Outlook = Overcast.
Similarly, the string 011 represents the more general constraint that allows two possible values,
Outlook = Overcast or Outlook = Rain.
Example :
Consider a second attribute, Wind, that can take on the values Strong or Weak. A rule precondition such as
(Outlook = Overcast ∨ Rain) ∧ (Wind = Strong) can then be represented by the following bit string of length five:
Outlook Wind
011 10
Rule postconditions (such as PlayTennis = yes) can be represented in a similar fashion. Thus, an entire rule
can be described by concatenating the bit strings describing the rule preconditions, together with the bit
string describing the rule postcondition. For example, the rule
IF Wind = Strong THEN PlayTennis = yes
would be represented by the string
Outlook Wind PlayTennis
111 10 10
where the first three bits describe the "don't care" constraint on Outlook, the next two bits describe the
constraint on Wind, and the final two bits describe the rule postcondition.
Genetic Operators
The crossover operator produces two new offspring from two parent strings, by copying selected bits from
each parent. The bit at position i in each offspring is copied from the bit at position i in one of the two
parents. The choice of which parent contributes the bit for position i is determined by an additional string
called the crossover mask.
Single-point crossover:
Consider the topmost of the two offspring in this case. This offspring takes its first five bits from the first
parent and its remaining six bits from the second parent, because the crossover mask 11111000000
specifies these choices for each of the bit positions. The second offspring uses the same crossover mask, but
switches the roles of the two parents. Therefore, it contains the bits that were not used by the first
offspring.
Two-point Crossover :
In two-point crossover, offspring are created by substituting intermediate segments of one parent into the
middle of the second parent string. Put another way, the crossover mask is a string beginning with n0 zeros,
followed by a contiguous string of n1 ones, followed by the necessary number of zeros to complete the
string. Each time the two-point crossover operator is applied, a mask is generated by randomly choosing the
integers n0 and n1. For instance, in the example shown in Table 9.2 the offspring are created using a mask for
which n0 = 2 and n1 = 5. Again, the two offspring are created by switching the roles played by the two
parents.
Uniform crossover :
Uniform crossover combines bits sampled uniformly from the two parents, as illustrated in Table 9.2. In this
case the crossover mask is generated as a random bit string with each bit chosen at random and independent
of the others.
Mutation
The mutation operator produces small random changes to the bit string by choosing a single bit at
random, then changing its value.
Mutation is often performed after crossover has been applied as in our prototypical algorithm from
Table 9.1.
Some GA systems employ additional operators, especially operators that are specialized to the
particular hypothesis representation used by the system.
For example:- Grefenstette et al. (1991) describe a system that learns sets of rules for robot control. It uses
mutation and crossover, together with an operator for specializing rules. Janikow (1993) describes a system
that learns sets of rules using operators that generalize and specialize rules in a variety of directed ways (e.g.,
by explicitly replacing the condition on an attribute by "don't care").
Fitness Function
The fitness function defines the criterion for ranking potential hypotheses and for
probabilistically selecting them for inclusion in the next generation population.
If the task is to learn classification rules, then the fitness function typically has a component
that scores the classification accuracy of the rule over a set of provided training examples.
Often other criteria may be included as well, such as the complexity or generality of the rule.
When the bit-string hypothesis is interpreted as a complex procedure (e.g., when the bit
string represents a collection of if-then rules that will be chained together to control a
robotic device), the fitness function may measure the overall performance of the resulting
procedure rather than performance of individual rules.
Selection
The probability that a hypothesis will be selected is given by the ratio of its fitness to the
fitness of other members of the current population as seen in Equation (9.1). This method is
sometimes called fitness proportionate selection, or roulette wheel selection. Other methods
for using fitness to select hypotheses have also been proposed.
For example, in tournament selection, two hypotheses are first chosen at random from the
current population.
With some predefined probability p the more fit of these two is then selected, and with probability
(1 - p) the less fit hypothesis is selected.
Tournament selection often yields a more diverse population than fitness proportionate
selection.
In another method called rank selection, the hypotheses in the current population are first
sorted by fitness.
The probability that a hypothesis will be selected is then proportional to its rank in this
sorted list, rather than its fitness.
Genetic Algorithms are primarily used in optimization problems of various kinds, but they are
frequently used in other application areas as well.
In this section, we list some of the areas in which Genetic Algorithms are frequently used.
These are −
Economics − GAs are also used to characterize various economic models like the cobweb
model, game theory equilibrium resolution, asset pricing, etc.
Neural Networks − GAs are also used to train neural networks, particularly recurrent neural
networks.
Parallelization − GAs also have very good parallel capabilities, and prove to be very effective
means in solving certain problems, and also provide a good area for research.
Image Processing − GAs are used for various digital image processing (DIP) tasks as well like
dense pixel matching.
Vehicle routing problems − With multiple soft time windows, multiple depots and a
heterogeneous fleet.
Scheduling applications − GAs are used to solve various scheduling problems as well,
particularly the timetabling problem.
Machine Learning − As already discussed, genetics-based machine learning (GBML) is a niche
area in machine learning.
Robot Trajectory Generation − GAs have been used to plan the path which a robot arm
takes by moving from one point to another.
Parametric Design of Aircraft − GAs have been used to design aircraft by varying the
parameters and evolving better solutions.
DNA Analysis − GAs have been used to determine the structure of DNA using spectrometric
data about the sample.
Multimodal Optimization − GAs are obviously very good approaches for multimodal
optimization in which we have to find multiple optimum solutions.
Traveling salesman problem and its applications − GAs have been used to solve the TSP,
a well-known combinatorial problem, using novel crossover and packing strategies.
In GABC, the clustering problem is framed as an optimization problem where the goal is to find the
optimal set of clusters and their centroids that minimize the sum of squared distances between the
data points and their respective cluster centroids. The genetic algorithm is then used to search the
solution space and find the optimal solution.
Example: consider the function f(x) = x^2 − 6x + 5.
To find the minimum value of this function, we can use GA. We start by defining a chromosome
representation, which in this case can be a binary string representing the values of x. We also need to define
the fitness function, which is simply the value of the function evaluated at a given value of x.
We can then initialize a population of candidate solutions, and use selection, crossover, and mutation
operators to evolve the population over a number of generations. The fittest individuals in each generation
are selected for reproduction, and the process continues until a satisfactory solution is found.
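A minimal Python sketch of this procedure for f(x) = x² − 6x + 5, with x encoded as a 5-bit string over [0, 31]; the population size, mutation rate, and generation count are arbitrary illustrative choices:

import random

f = lambda x: x * x - 6 * x + 5
fitness = lambda bits: -f(int(bits, 2))   # higher fitness = lower f

def select(pop):
    # Tournament selection between two random individuals.
    return max(random.sample(pop, 2), key=fitness)

def crossover(a, b):
    # Single-point crossover at a random position.
    p = random.randint(1, len(a) - 1)
    return a[:p] + b[p:], b[:p] + a[p:]

def mutate(bits, rate=0.05):
    # Flip each bit independently with the given probability.
    return "".join(c if random.random() > rate else str(1 - int(c)) for c in bits)

pop = ["".join(random.choice("01") for _ in range(5)) for _ in range(20)]
for gen in range(50):
    nxt = []
    while len(nxt) < len(pop):
        c1, c2 = crossover(select(pop), select(pop))
        nxt += [mutate(c1), mutate(c2)]
    pop = nxt

best = max(pop, key=fitness)
print(int(best, 2), f(int(best, 2)))   # typically converges to x = 3, f(3) = -4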
To solve this problem using GA, we can use a multi-objective GA approach. We start by defining a
chromosome representation, which can be a binary string representing the values of x and y. We also need
to define the fitness function, which is a vector of the values of f1(x) and f2(x).
We can then initialize a population of candidate solutions, and use selection, crossover, and mutation
operators to evolve the population over a number of generations. In each generation, we need to use a
selection operator that is capable of handling multiple objectives, such as Pareto dominance or fitness
sharing.
The fittest individuals in each generation are selected for reproduction, and the process continues until a
satisfactory set of non-dominated solutions is found. The set of non-dominated solutions represents the
trade-offs between the two objectives, and can be visualized using a Pareto front plot.
Genetic Algorithms (GA) and Gradient Descent (GD) are two different optimization techniques that can
be used to optimize a function. While GA is a population-based optimization method inspired by
biological evolution, GD is an iterative optimization method that relies on the calculation of the gradient
of the objective function.
Emulating GD/gradient ascent using GA can be achieved by representing the population as a set of
candidate solutions, and the fitness of each candidate is evaluated by calculating the value of the
objective function at that point. The objective function can be seen as the fitness function in this case.
To emulate gradient descent, the GA algorithm can be modified by using selection, crossover, and
mutation operators that bias the search towards the direction of steepest descent. For example,
selection can be biased towards selecting individuals with lower fitness values, and mutation can be
biased towards decreasing the value of certain genes to move towards the minimum point.
Similarly, to emulate gradient ascent, the GA algorithm can be modified by using selection, crossover,
and mutation operators that bias the search towards the direction of steepest ascent. For example,
selection can be biased towards selecting individuals with higher fitness values, and mutation can be
biased towards increasing the value of certain genes to move towards the maximum point.
However, it's worth noting that while GA and GD are both optimization techniques, they differ in terms
of their efficiency and suitability for different types of problems. GD is generally more efficient for
functions with continuous and differentiable derivatives, while GA can handle non-differentiable
functions or problems with a large number of local optima. Therefore, it's important to choose the
appropriate optimization technique based on the nature of the problem.