AD8552 ML Unit II
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
DIGITAL NOTES ON
AD8552 Machine Learning
Batch/Year : 2020-2024/III
Date : 08-08-2022
Signature :
Table of Contents

S No | Contents                                | Slide No
1    | Contents                                | 5
2    | Course Objectives                       | 7
3    | Pre Requisites (Course Names with Code) | 9
5    | Course Outcomes                         | 12
7    | Lecture Plan                            | 17
10   | Assignments                             | 73
11   | Part A (Q & A)                          | 75
12   | Part B Questions                        | 80
16   | Assessment Schedule                     | 90
COURSE OBJECTIVES
PRE REQUISITES
COURSE OUTCOMES

Course Outcome Statements in Cognitive Domain
(Table columns: Course Code | Course Outcome Statement | Cognitive/Affective Level of the Course Outcome | Expected Level of Attainment)
CO – PO/PSO Mapping

CO  | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3
CO2 |  2   2   1   -   -   -   -   -   -   -    -    -   |  2    2    1
CO3 |  2   1   1   -   -   -   -   -   -   -    -    -   |  2    1    -
CO4 |  2   2   1   -   -   -   -   -   -   -    -    -   |  2    1    1
CO5 |  2   2   1   -   -   -   -   -   -   -    -    -   |  2    1    1
LECTURE PLAN – UNIT II

S No | Topic                          | No. of Periods | Proposed Date          | Actual Lecture Date | Taxonomy Level (pertaining CO) | Mode of Delivery
1    | Linear methods – Regression    | 1              | 24.08.2022             |                     | K2                             | PPT
2    | Classification                 | 1              | 25.08.2022             |                     | K2                             | PPT
3    | Perceptron and Neural networks | 1              | 26.08.2022             |                     | K2                             | PPT
4    | Decision trees                 | 1              | 01.09.2022             |                     | K2                             | PPT
5    | Support vector machines        | 2              | 02.09.2022, 05.09.2022 |                     | K2                             | PPT
6    | Probabilistic model            | 2              | 06.09.2022, 07.09.2022 |                     | K2                             | PPT
7    | Unsupervised learning          | 2              | 08.09.2022, 09.09.2022 |                     | K2                             |
8    | Featurization                  | 1              | 13.09.2022             |                     | K2                             | PPT
ACTIVITY BASED LEARNING – UNIT II
(MODEL BUILDING/PROTOTYPE)

Crossword Puzzle
Down
1. A post-prediction adjustment, typically to account for prediction bias.
2. A TensorFlow programming environment in which operations run immediately.
4. Obtaining an understanding of data by considering samples, measurement, and visualization.
5. An ensemble approach to finding the decision tree that best fits the training data.
7. State-action value function.
8. Loss function based on the absolute value of the difference between the values that a model is predicting and the actual values of the labels.
10. A metric that your algorithm is trying to optimize.
11. The recommended format for saving and recovering TensorFlow models.
14. A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival.
15. When one number in your model becomes a NaN during training, which causes many or all other numbers in your model to eventually become a NaN.
16. In reinforcement learning, implementing Q-learning by using a table to store the Q-functions.
17. A popular Python machine learning API.

Across
3. In machine learning, a mechanism for bucketing categorical data.
6. The primary algorithm for performing gradient descent on neural networks.
9. Abbreviation for independently and identically distributed.
12. The more common label in a class-imbalanced dataset.
13. Applying a constraint to an algorithm to ensure one or more definitions of fairness are satisfied.
18. A process used, as part of training, to evaluate the quality of a machine learning model using the validation set.
19. A coefficient for a feature in a linear model, or an edge in a deep network.
20. A column-oriented data analysis API.
21. Abbreviation for generative adversarial network.
Lecture Notes – Unit II

UNIT II MACHINE LEARNING METHODS (11 periods)
Linear methods – Regression – Classification – Perceptron and Neural networks – Decision trees – Support vector machines – Probabilistic models – Unsupervised learning – Featurization
1. Introduction
In general, machine learning algorithms are divided into two types:
1. Supervised learning algorithms
2. Unsupervised learning algorithms
Supervised learning algorithms deal with problems that involve learning with guidance. In other words, the training data in supervised learning methods needs labelled samples: for a classification problem we need samples with class labels, for a regression problem we need samples with the desired output value for each sample, and so on. The underlying mathematical model learns its parameters from the labelled samples and is then ready to make predictions on samples that it has not seen, also called test samples.
Unsupervised learning deals with problems that involve data without labels. In some sense one can argue that this is not really a machine learning problem, as there is no past knowledge to be learned from. Unsupervised approaches try to find some structure or some form of trend in the training data. A common example of unsupervised learning is clustering.
Linear models are machine learning models that deal with linear data, or with nonlinear data that can be transformed into linear data using suitable transformations. Although these linear models are relatively simple, they illustrate fundamental concepts in machine learning theory and pave the way for more complex models. These linear models are the focus of this unit.
The models that operate on strictly linear data are called linear models, and the models that use some nonlinear transformation to map the original nonlinear data to linear data before processing it are called generalized linear models.
For supervised learning, the concept of linearity implies that the relationship between the input and output can be described using linear equations.
21
For unsupervised learning, the concept of linearity implies that the distributions that we can impose on the given data are defined using linear equations. It is important to note that the notion of linearity does not imply any constraints on the dimensions; hence we can have multivariate data that is strictly linear.
In the case of one-dimensional input and output, the equation of the relationship defines a straight line in two-dimensional space. In the case of two-dimensional input with one-dimensional output, the equation describes a plane in three-dimensional space, and so on.
1.1. Linear Regression
Linear regression is a classic example of a strictly linear model. It is also called polynomial fitting and is one of the simplest linear methods in machine learning.
1.2 Defining the Problem
The method of linear regression defines the following relationship between the input $x_i$ and the predicted output $\hat{y}_i$ in the form of a linear equation:

$\hat{y}_i = \sum_{j=1}^{n} x_{ij} \, w_j + w_0 \qquad (1)$

Here $\hat{y}_i$ is the predicted output when the actual output is $y_i$. The $w_j$, $j = 1, \ldots, n$, are called the weight parameters and $w_0$ is called the bias. Evaluating these parameters is the objective of training. The same equation can also be written in matrix form as

$\hat{y} = X \cdot w + w_0 \qquad (2)$

where $X = [x_i^T]$, $i = 1, \ldots, p$, and $w = [w_j]$, $j = 1, \ldots, n$. The problem is to find the values of all the weight parameters using the training data.
1.3 Solving the Problem
The most commonly used method to find the weight parameters is to minimize the mean square error between the predicted values and the actual values; this is called the least squares method. When the error is Gaussian distributed, this method yields an estimate called the maximum likelihood estimate or MLE, which is the best unbiased estimate one can find given the training data. The optimization problem can be defined as

$\min_{w} \sum_{i=1}^{p} \left( y_i - \hat{y}_i \right)^2 \qquad (3)$

Expanding the predicted value term, the full minimization problem to find the optimal weight vector $w^{lr}$ can be written as

$w^{lr} = \arg\min_{w} \sum_{i=1}^{p} \left( y_i - \sum_{j=1}^{n} x_{ij} w_j - w_0 \right)^2 \qquad (4)$

In general, the solution obtained by solving Eq. 4 gives the best unbiased estimate, but in some specific cases, where it is known that the error distribution is not Gaussian or the optimization problem is highly sensitive to noise in the data, the above procedure can result in what is called overfitting. In such cases, a mathematical technique called regularization is used.
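A minimal NumPy sketch of the least-squares solution of Eq. 4, using synthetic data (the data values and variable names are illustrative assumptions, not from the notes):

```python
import numpy as np

# Synthetic training data: p samples, n features (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # p x n input matrix
true_w, true_w0 = np.array([2.0, -1.0, 0.5]), 4.0
y = X @ true_w + true_w0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the bias w0 is estimated together with the weights
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution of Eq. 4: minimizes the sum of squared errors
w_lr, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print("estimated weights:", w_lr[:-1], "bias:", w_lr[-1])
```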
2.1 Regularization

Ridge regularization constrains the squared (L2) norm of the weights,

$\sum_{j=1}^{n} w_j^2 \le t \qquad (5)$

Using the Lagrangian approach, the joint optimization problem for ridge regression can be written as

$w^{Ridge} = \arg\min_{w} \sum_{i=1}^{p} \left( y_i - \sum_{j=1}^{n} x_{ij} w_j - w_0 \right)^2 + \lambda \sum_{j=1}^{n} w_j^2 \qquad (6)$

Lasso regularization instead constrains the absolute (L1) norm of the weights,

$\sum_{j=1}^{n} |w_j| \le t \qquad (7)$

where t is a constraint parameter. Using the Lagrangian approach, the joint optimization problem can be written as

$w^{Lasso} = \arg\min_{w} \sum_{i=1}^{p} \left( y_i - \sum_{j=1}^{n} x_{ij} w_j - w_0 \right)^2 + \lambda \sum_{j=1}^{n} |w_j| \qquad (8)$
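A hedged scikit-learn sketch of ridge and lasso fits corresponding to Eqs. 6 and 8; the synthetic data and the alpha values (which play the role of λ) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # squared (L2) penalty: shrinks the weights
lasso = Lasso(alpha=0.1).fit(X, y)   # absolute-value (L1) penalty: can zero out weights
print("ridge weights:", ridge.coef_)
print("lasso weights:", lasso.coef_)
```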
The function that performs such a transformation is called a basis function or link function. For example, logistic regression uses the logistic function as its basis function to transform the nonlinearity into linearity.
The logistic function is a special case in that it also maps the output into the range [0, 1], which is equivalent to a probability density function. Also, sometimes the response between input and output is monotonic, but not necessarily linear due to discontinuities. Such cases can also be converted into a linear space with the use of specially constructed basis functions. We will discuss logistic regression to illustrate the concept of GLM.
As the output of the logistic function is constrained between [0, 1], it can be treated as a probabilistic measure. Also, due to the symmetrical distribution of the output of the logistic function for inputs ranging from $-\infty$ to $\infty$, it is well suited for classification problems. Due to its validity in regression as well as classification problems, unlike linear regression, logistic regression is the most commonly used approach in the field of machine learning as the default first alternative.
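A minimal sketch of the logistic (sigmoid) link function and a scikit-learn logistic regression fit; the synthetic two-class data is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(z):
    """Logistic (sigmoid) link function: maps (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print("logistic(0) =", logistic(0.0))   # 0.5, the midpoint of the output range

# Toy binary-classification data (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.5, -0.2]]))  # class probabilities in [0, 1]
```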
2.4 k-Nearest Neighbor (KNN) Algorithm
It is one of the simplest algorithms in the field of machine learning, and it is apt to discuss it here. KNN is also a generic method that can be used as a classifier or a regressor.

Fig. 2.1 A distribution of input data, illustrating the concept of finding the nearest neighbors.

For regression, the prediction is the average of the outputs of the k nearest neighbors. This can be written in equation form as

$\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i \qquad (11)$

where $y_i$ is the output value of the i-th nearest neighbor. As can be seen, this is one of the simplest ways to define the input-to-output mapping.
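A small scikit-learn sketch of KNN used as a regressor, implementing the averaging in Eq. 11; k = 3 and the toy data are assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data (illustrative)
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=50)

# Eq. 11: the prediction is the average output of the k nearest training samples
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(knn.predict([[2.5]]))
```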
3.1 Introduction
Frank Rosenblatt (1928 – 1971) was an American psychologist notable in the field of
Artificial Intelligence.
In 1957 he started something really big. He "invented" a Perceptron program, on an
IBM 704 computer at Cornell Aeronautical Laboratory.
Scientists had discovered that brain cells (Neurons) receive input from our senses by
electrical signals.
The neurons, in turn, use electrical signals to store information and to make decisions based on previous input.
Frank had the idea that Perceptrons could simulate brain principles, with the ability to
learn and make decisions.
A Perceptron is an Artificial Neuron
It is the simplest possible Neural Network
Neural Networks are the building blocks of Machine Learning
3.2 Perceptron
The original Perceptron was designed to take a number of binary inputs, and produce
one binary output (0 or 1).
The idea was to use different weights to represent the importance of each input, and
that the sum of the values should be greater than a threshold value before making a
decision like true or false (0 or 1).
Geometrically, a single-layered perceptron with a linear mapping represents a linear plane in n dimensions. In n-dimensional space the input vector is represented as $(x_1, x_2, \ldots, x_n)$ or $x$. The coefficients or weights in n dimensions are represented as $(w_1, w_2, \ldots, w_n)$ or $w$. The equation of the perceptron in n dimensions is then written in vector form as

$x \cdot w = y$
Single-layer perceptron is an artificial neural net that comprises one layer for
computation. In such a perceptron type, the neural network performs the computation
directly from the input layer to the output. Such a perceptron does not contain any
hidden layer.
In such a type of perceptron, the input nodes are directly connected to the final layer. It
is easy for TensorFlow to run such algorithms. A node in the next layer carries the
weighted sum of various other inputs.
This is what a single layer perceptron will look like
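Alongside the figure, a minimal NumPy sketch of a single perceptron computing a weighted sum followed by a binary threshold; the weights implementing a 2-input AND gate are illustrative assumptions:

```python
import numpy as np

def perceptron_output(x, w, w0, threshold=0.0):
    """Single perceptron: weighted sum of the inputs followed by a binary threshold."""
    return 1 if np.dot(x, w) + w0 > threshold else 0

# Assumed weights that realize a 2-input AND gate
w, w0 = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron_output(np.array(x), w, w0))
```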
The network shown in Fig. 3.3 also emphasizes another important aspect of MLP called as
feedforward operation. The information that is entered from the input propagates through
each layer towards the output. There is no feedback of the information from any layer
backwards when the network is used for predicting the output in the form of regression or
classification.
Fig. 3.4 Activation function: sign.  Fig. 3.5 Activation function: tanh.
3.3.3 Training MLP
During the training process, the weights of the network are learned from the labelled
training data. Conceptually the process can be described as:
1. Present the input to the neural network.
2. All the weights of the network are assigned some default value.
3. The input is transformed into output by passing through each node or neuron in each layer.
4. The output generated by the network is then compared with the expected output or label.
5. The error between the prediction and label is then used to update the weights of each node.
6. The error is then propagated in backwards direction through every layer, to update the
weights in each layer such that they minimize the error.
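A hedged scikit-learn sketch of this training procedure in practice: MLPClassifier learns the weights by backpropagation on a toy dataset (the layer sizes and other settings are assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Toy non-linear dataset; the MLP's weights are updated by backpropagating the error
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(16, 16),  # two hidden layers
                    activation="tanh",
                    max_iter=2000,
                    random_state=0).fit(X, y)
print("training accuracy:", mlp.score(X, y))
```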
Hidden layers are not directly connected with the inputs and outputs. Each layer in an MLP transforms its input into a new dimensional space.
The hidden layers can have higher dimensionality than the actual input and thus they
can transform the input into even higher dimensional space
Hidden layers, simply put, are layers of mathematical functions each designed to
produce an output specific to an intended result.
Hidden layers allow for the function of a neural network to be broken down into specific
transformations of the data. Each hidden layer function is specialized to produce a
defined output
Radial basis function networks (RBFN), or radial basis function neural networks (RBFNN), are a variation of the feedforward neural networks (we will call them RBF networks to avoid confusion).
The RBF networks are characterized by three layers, input layer, a single hidden layer,
and output layer.
https://mccormickml.com/2013/08/15/radial-basis-function-network-rbfn-tutorial/
The input and output layers are linear weighting functions, and the hidden layer has a radial basis activation function instead of the sigmoid-type activation function used in a traditional MLP. The basis function is defined as

$f_{RBF}(x) = e^{-\beta (x - \mu)^2}$

The above equation is defined for a scalar input; µ is called the center and β represents the spread or variance of the radial basis function. The center lies in the input space. Figure 3.7 shows the plot of the basis function, which is similar to a Gaussian distribution.
Consider that the desired values of output form n number of clusters for the corresponding
clusters in the input space. Each node in the hidden layer can be thought of as a
representative of each transformation from input cluster to output cluster.
As can be seen from Fig. 3.7, the value of radial basis function reduces to 0 rather quickly
as the distance between the input and the center of the radial basis function µ increases
with respect to the spread ß.
Thus RBF network as a whole maps the input space to output space by linear combination
of outputs generated by each hidden RBF node.
It is important to choose these cluster centers carefully to make sure the input space is
mapped uniformly and there are no gaps.
If requirements for the RBF network are followed, it produces accurate predictions
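A minimal NumPy sketch of the radial basis activation described above; the center and spread values are illustrative assumptions:

```python
import numpy as np

def rbf(x, center, beta):
    """Gaussian radial basis activation: near 1 close to the center, decays towards 0 with distance."""
    return np.exp(-beta * np.sum((x - center) ** 2))

x = np.array([1.0, 2.0])
print(rbf(x, center=np.array([1.0, 2.0]), beta=0.5))  # 1.0 exactly at the center
print(rbf(x, center=np.array([4.0, 6.0]), beta=0.5))  # much smaller far from the center
```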
3.5 Overfitting and Regularization
A neural network has scope to improve its performance on the given training data by increasing the complexity of the network. Complexity can be increased by manipulating various factors such as:
1. Increasing number of hidden layers
2. Increasing the nodes in hidden layers
3. Using complex activation functions
4. Increasing the training epochs
Such improvements in training performance with arbitrary increase in complexity
typically lead to overfitting. Overfitting is a phenomenon where we try to model the
training data so accurately that in essence we just memorize the training data rather than
identifying the features and structure of it. Such memorization leads to significantly worse
performance on unseen data. However determining the optimal threshold where the
optimization should be stopped to keep the model generic enough is not trivial.
• For example, when the model learns signals as well as noises in the training data but
couldn’t perform appropriately on new data upon which the model wasn’t trained, the
condition/problem of overfitting takes place.
• Overfitting simply states that there is low error with respect to training dataset, and
high error with respect to test datasets.
• Regularization is the most widely used technique to penalize complex models in machine learning; it is deployed to reduce overfitting (i.e., to shrink the generalization error) by keeping the network weights small.
3.5.1 L1 and L2 Regularization
When you have a large number of features in your data set, you may wish to create a
less complex, more parsimonious model. Two widely used regularization techniques
used to address overfitting and feature selection are L1 and L2 regularization.
L1 Regularization, also called a lasso regression, adds the “absolute value of
magnitude” of the coefficient as a penalty term to the loss function.
L2 Regularization, also called a ridge regression, adds the “squared magnitude”
of the coefficient as the penalty term to the loss function.
The updated cost function C(x) with L1 or L2 regularization added to reduce overfitting can be written as

$C(x) = L(x) + \lambda \sum_{w \in W} |w| \qquad \text{(L1)}$

$C(x) = L(x) + \lambda \sum_{w \in W} w^2 \qquad \text{(L2)}$

where L(x) is the loss function that depends on the error in prediction, W stands for the vector of weights in the neural network, and λ controls the strength of the penalty.
The L1 norm tries to minimize the sum of absolute values of the weights while the L2
norm tries to minimize the sum of squared values of the weights.
L1 regularization requires less computation and is less sensitive to strong outliers, but it tends to drive many weights to exactly zero, producing sparse models.
L2 regularization is overall a better metric and provides slow weight decay towards zero, but it is more computation intensive.
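A hedged TensorFlow/Keras sketch of attaching L1 and L2 weight penalties to the layers of a hypothetical network; the layer sizes and penalty strengths are assumptions:

```python
import tensorflow as tf

# Hypothetical two-layer network; the regularizers add lambda*sum|w| or lambda*sum(w^2)
# to the loss L(x), giving the penalized cost C(x) described above.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l1(0.01)),  # L1 penalty
    tf.keras.layers.Dense(1, activation="sigmoid",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # L2 penalty
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```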
4. Decision Tree
4.1 Introduction
In decision analysis, a decision tree can be used to visually and explicitly represent
decisions and decision making. It uses a tree-like model of decisions. Though it is a commonly used tool in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning.
4.4 Regression
A regression tree partitions the input space into regions $R_k$ and predicts a constant value in each region,

$t(x) = r_k \;\; \forall x_i \in R_k$

where $r_k$ is a constant value of the output in region $R_k$.
If we define the optimization problem as minimizing the mean square error, then a simple calculation shows that the estimate for $r_k$ is the average of the training outputs falling in that region,

$\hat{r}_k = \frac{1}{|R_k|} \sum_{x_i \in R_k} y_i$

Let us denote the large tree grown in this way as $T_0$. The algorithm must then apply a pruning technique to reduce the tree size, finding the optimal trade-off that captures most of the structure in the data without overfitting it. This is achieved by optimizing a squared-error node impurity measure.
Fig. 4.4 The plot of decision metrics for a two-class problem. The x-axis shows the proportion of samples in class 1; the curves are scaled to fit, without loss of generality.
As the plot in Fig. 4.4 shows, each metric is a smooth, continuously differentiable function of the proportion and can be safely used in optimization.
4.6.3 Cross-Entropy or Deviance
Cross-entropy is an information-theoretic metric defined for a node as

$D = -\sum_{k} p_k \log p_k$

where $p_k$ is the proportion of samples in the node that belong to class k.
4.7 CHAID
Chi-square automatic interaction detector, or CHAID, is a decision tree technique that derives its origin from the statistical chi-square test for goodness of fit. It was first published by G. V. Kass in 1980, but some parts of the technique were already in use in the 1950s.
This test uses the chi-square distribution to compare a sample with a population and
predict at desired statistical significance whether the sample belongs to the population.
CHAID technique uses this theory to build a decision tree.
4.8.1 Steps
1. Start with the training data.
2. Choose the metric of choice (Gini index or cross-entropy).
3. Choose the root node, such that it splits the data with optimal values of metrics into two
branches.
4. Split the data into two parts by applying the decision rule of root node.
5. Repeat the steps 3 and 4 for each branch.
6. Continue the splitting process till leaf nodes are reached in all the branches with
predefined stop rule.
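A short scikit-learn sketch of these steps, growing a tree on the Iris dataset with the Gini criterion; the depth limit is an assumed stop rule:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
# criterion="gini" or "entropy" corresponds to the impurity metrics discussed above
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))   # textual view of the learned decision rules
```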
Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind an ensemble model is that a group of weak learners comes together to form a strong learner.
There are three main types of ensembles:
1. Bagging
2. Random forest
3. Boosting
4.9 Bagging Ensemble Trees
The term bagging finds its origins in Bootstrap Aggregation. Coincidentally, the literal meaning of bagging, i.e., putting multiple decision trees in a bag, is not too far from the way the technique works. Bagging can be described using the following steps:
1. Split the total training data into a predetermined number of sets by random sampling with replacement. With replacement means that the same sample can appear in multiple sets. Each such resampled set is called a bootstrap sample.
2. Train decision tree using CART or ID3 method using each of the data sets.
3. Each learned tree is called as a weak learner.
4. Aggregate all the weak learners by averaging the outputs of individual learners for the
case of regression and aggregate all the individual weak learners by voting for the case of
classification. The aggregation steps involve optimization, such that prediction error is
minimized.
5. The output of the aggregate or ensemble of the weak learners is considered as the
final output.
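A hedged scikit-learn sketch of these steps; the default base learner of BaggingClassifier is a decision tree, and the number of estimators is an assumption:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)
# 25 weak learners (decision trees by default), each trained on a bootstrap sample
# drawn with replacement; predictions are aggregated by voting.
bag = BaggingClassifier(n_estimators=25, bootstrap=True, random_state=0).fit(X, y)
print("training accuracy:", bag.score(X, y))
```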
4.10 Random Forest Tree
Random Forest is a flexible, easy to use machine learning algorithm that
produces, even without hyper-parameter tuning, a great result most of the time.
It is also one of the most used algorithms because of its simplicity and the fact that it can be used for both classification and regression tasks.
Random Forest is a supervised learning algorithm. It creates a forest and makes
it somehow random.
The forest it builds, is an ensemble of Decision Trees, most of the time trained
with the “bagging” method.
The general idea of the bagging method is that a combination of learning
models increases the overall result.
Random forest builds multiple decision trees and merges them together
to get a more accurate and stable prediction.
One big advantage of random forest is, that it can be used for both classification
and regression problems, which form the majority of current machine learning
systems.
Below you can see what a random forest with two trees would look like.
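Complementing the figure, a comparable forest can be trained with scikit-learn (the number of trees is an assumption):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# An ensemble of bagged decision trees with extra randomness in feature selection;
# the individual tree predictions are merged to give a more stable prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", forest.score(X, y))
```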
Support Vector Machine is responsible for finding the decision boundary to separate
different classes and maximize the margin.
Margins are the (perpendicular) distances between the line and those dots closest to the
line.
A hyperplane is an (n-1)-dimensional subspace of an n-dimensional space. For a 2-dimensional space, the hyperplane is 1-dimensional, which is just a line. For a 3-dimensional space, the hyperplane is 2-dimensional, i.e., a plane that slices the cube.
SVM Algorithm
Separable case – Infinite boundaries are possible to separate the data into two classes.
Non Separable case – Two classes are not separated but overlap with each other.
Separable case SVM
Let’s understand the working of SVM using an example. Suppose we have a dataset that has two classes (green and blue), and we want to classify a new data point as either blue or green.
To classify these points, we can have many decision boundaries, but the
question is which is the best and how do we find it? NOTE: Since we are plotting the data
points in a 2-dimensional graph we call this decision boundary a straight line but if we
have more dimensions, we call this decision boundary a “hyperplane”.
The best hyperplane is the one that has the maximum distance from both classes, and finding it is the main aim of SVM. This is done by considering different hyperplanes that classify the labels correctly and then choosing the one which is farthest from the data points, i.e., the one with the maximum margin.
Any hyperplane can be written mathematically as $w \cdot x + b = 0$.
The dots above this line are the points $(x_1, x_2)$ that satisfy $w \cdot x + b > 0$.
The distance between either dashed line and the solid line is the margin.
Non-Separable Case
In the linearly separable case, SVM tries to find the hyperplane that maximizes the margin, with the condition that both classes are classified correctly. But in reality, datasets are almost never linearly separable, so the condition of 100% correct classification by a hyperplane can never be met.
SVM addresses non-linearly separable cases by introducing two concepts: the soft margin and kernel tricks.
Let’s use an example. If we add one red dot to the green cluster, the dataset is no longer linearly separable.
Two solutions to this problem:
1.Soft Margin: try to find a line to separate, but tolerate one or few misclassified dots (e.g.
the dots circled in red)
2.Kernel Trick: try to find a non-linear decision boundary
Soft Margin
Two types of misclassifications are tolerated by SVM under soft margin:
1.The dot is on the wrong side of the decision boundary but on the correct side/ on the
margin (shown in left)
2.The dot is on the wrong side of the decision boundary and on the wrong side of the margin
(shown in right)
Applying Soft Margin, SVM tolerates a few dots to get misclassified and tries to balance the
trade-off between finding a line that maximizes the margin and minimizes the
misclassification.
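A hedged scikit-learn sketch of the soft margin: the C parameter controls how many misclassified points are tolerated (the synthetic data and the C values are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping classes. Small C tolerates more misclassified points (softer margin),
# large C penalizes them heavily (harder margin, fewer support vectors).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
soft = SVC(kernel="linear", C=0.1).fit(X, y)
hard = SVC(kernel="linear", C=100.0).fit(X, y)
print("support vectors (soft, hard):", len(soft.support_), len(hard.support_))
```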
Kernels in Support Vector Machine
The most interesting feature of SVM is that it can even work with a non-linear dataset. For this, we use the “kernel trick”, which makes it easier to classify the points. Suppose we have a dataset like this:
have a dataset like this:
Here we see we cannot draw a single line or say hyperplane which can classify the points
correctly. So what we do is try converting this lower dimension space to a higher dimension
space using some quadratic functions which will allow us to find a decision boundary that
clearly divides the data points. These functions which help us do this are called Kernels and
which kernel to use is purely determined by hyperparameter tuning.
So we basically need to find $x_1^2$, $x_2^2$ and $x_1 x_2$, and now we can see that the 2 original dimensions get converted into 5 dimensions.
2. Sigmoid kernel
We can use it as a proxy for neural networks. In its standard form, with kernel coefficient γ and constant r, the equation is

$K(X_1, X_2) = \tanh(\gamma \, X_1^{T} X_2 + r)$

It takes the input and maps it towards values of 0 and 1 so that the classes can be separated by a simple straight line.
3. RBF kernel
What it actually does is create non-linear combinations of the features to lift the samples onto a higher-dimensional feature space where a linear decision boundary can separate the classes. It is the most used kernel in SVM classification; the following formula describes it mathematically:

$K(X_1, X_2) = \exp\!\left(-\frac{\lVert X_1 - X_2 \rVert^2}{2\sigma^2}\right)$

where
1. $\sigma$ is the variance and our hyperparameter
2. $\lVert X_1 - X_2 \rVert$ is the Euclidean distance between two points $X_1$ and $X_2$
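A short scikit-learn sketch of an RBF-kernel SVM on a dataset that is not linearly separable; the gamma and C values are assumptions (gamma plays the role of $1/(2\sigma^2)$ in the formula above):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by any straight line in the original space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", rbf_svm.score(X, y))
```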
https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/
6. Probabilistic Models
Probabilistic methods try to assign some form of uncertainty to the unknown variables and
some form of belief probability to known variables and try to find the unknown values using
the extensive library of probabilistic models. The probabilistic models are mainly classified into
two types:
1. Generative
2. Discriminative
The difference between the two types is given in terms of the probabilities
that they deal with. If we have an observable input X and observable output Y , then the
generative models try to model the joint probability P(X; Y), while the discriminative models
try to model the conditional probability P(Y|X).
Discriminative models can be thought of as models that try to predict changes in the output based only on changes in the input.
Generative models are models that try to model changes in the output based on changes in the input as well as changes in the state.
The probabilistic approaches (discriminative as well as generative) are also divided between two schools of thought:
1. Maximum likelihood estimation
2. Bayesian approach
The maximum likelihood estimation or MLE approach deals with the problems at the
face value and parameterizes the information into variables. The values of the variables
that maximize the probability of the observed variables lead to the solution of the
problem.
Let us define the problem using formal notation. Let there be a function f(x; θ) that produces the observed output y, where x denotes the input and θ ∈ Θ represents a parameter vector that can be single- or multi-dimensional.
The MLE method defines a likelihood function denoted as L(y|θ). Typically the likelihood function is the joint probability of the parameters and the observed variables, L(y|θ) = P(y; θ). The objective is to find the optimal value of θ that maximizes the likelihood function, given by

$\theta^{MLE} = \arg\max_{\theta} L(y \mid \theta)$

or, equivalently, the log-likelihood,

$\theta^{MLE} = \arg\max_{\theta} \log L(y \mid \theta)$
Bayesian Approach
All the unknowns are modelled as random variables with known prior probability
distributions. Let us denote the conditional prior probability of observing the output y for
parameter vector θ as P(y|θ). The marginal probabilities of these variables are denoted
as P(y) and P(θ). The joint probability of the variables can be written in terms of
conditional and marginal probabilities as
$P(y, \theta) = P(y \mid \theta) \cdot P(\theta) \qquad (1)$

The same joint probability can also be given as

$P(y, \theta) = P(\theta \mid y) \cdot P(y) \qquad (2)$

Here the probability P(θ|y) is called the posterior probability. Combining Eqs. 1 and 2,

$P(\theta \mid y) \cdot P(y) = P(y \mid \theta) \cdot P(\theta)$

and rearranging the terms we get

$P(\theta \mid y) = \frac{P(y \mid \theta) \cdot P(\theta)}{P(y)} \qquad (3)$

Equation 3 is called Bayes’ theorem. This theorem gives the relationship between the posterior probability and the prior probability in a simple and elegant manner, and it is the foundation of the entire Bayesian framework.
Each term in the above equation is given a name,
P(θ) is called as prior,
P(y|θ) is called as likelihood,
P(y) is called as evidence, and P(θ|y) is called as posterior.
The Bayes’ estimate is based on maximizing the posterior. Hence, the optimization problem based on Bayes’ theorem can now be stated as

$\theta^{MAP} = \arg\max_{\theta} P(\theta \mid y) = \arg\max_{\theta} \frac{P(y \mid \theta) \cdot P(\theta)}{P(y)}$
Consider a coin-tossing experiment. The likelihood function is defined as L(y|θ) = P(y; θ), where y denotes the outcome of the trial and θ denotes the property of the coin in the form of the probability of getting a given outcome. Let the probability of getting a Head be h; the probability of getting a Tail is then 1 - h. The outcome of each toss is independent of the outcomes of the other tosses, hence the total likelihood of an experiment with k Heads in n tosses can be given as

$L = h^{k} (1 - h)^{\,n-k}$

So we can now proceed with the optimization problem as before. In order to maximize the posterior, we differentiate it with respect to h as before and set the derivative to zero.
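For this likelihood, setting the derivative to zero gives the closed-form estimate h = k/n. A minimal NumPy sketch (the simulated tosses are illustrative, and a flat prior is assumed so that maximizing the posterior reduces to maximizing the likelihood):

```python
import numpy as np

# Simulated coin-toss experiment (illustrative): 1 = Head, 0 = Tail
rng = np.random.default_rng(3)
tosses = rng.binomial(1, p=0.7, size=100)

# Maximizing h^k * (1-h)^(n-k) gives the closed-form estimate h = k/n
h_hat = tosses.sum() / tosses.size
print("estimated P(Head):", h_hat)
```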
The normal (Gaussian) distribution is used in a wide variety of applications, including error analysis. Another reason the normal distribution is popular is the central limit theorem: sums of many independent random variables tend towards a normal distribution.
Bernoulli Distribution
Binomial Distribution
The binomial distribution generalizes the Bernoulli distribution for multiple trials. It has two parameters, n and p: n is the number of trials of the experiment, and p is the probability of success in each trial. The probability of failure is q = 1 - p just as in the Bernoulli distribution, but it is not considered a separate third parameter. The pdf of the binomial distribution is given as

$P(X = k) = \binom{n}{k} p^{k} (1-p)^{\,n-k}$

where $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ is the number of ways of choosing k successes out of n trials.
The gamma distribution is also one of the most highly studied distributions in the theory of statistics. It forms a basic distribution for other commonly used distributions like the chi-squared distribution and the exponential distribution, which are special cases of the gamma distribution. It is defined in terms of two parameters, α and β. The pdf of the gamma distribution is given as

$f(x; \alpha, \beta) = \frac{\beta^{\alpha} x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}, \quad x > 0$

The cdf of the gamma distribution cannot be stated easily as a single-valued function, but rather is given as the sum of an infinite series.
The Poisson distribution models the number of events occurring in a fixed interval; its pdf is

$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$

where the single parameter λ is the average number of events in the interval. The cdf of the Poisson distribution is given as the partial sum

$F(k) = e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^{i}}{i!}$
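A hedged SciPy sketch evaluating the binomial, Poisson, and gamma densities discussed above; the parameter values are illustrative assumptions:

```python
from scipy import stats

# Binomial: probability of k = 3 successes in n = 10 trials with p = 0.5
print(stats.binom.pmf(3, n=10, p=0.5))

# Poisson: probability of k = 2 events when the average rate is lambda = 3
print(stats.poisson.pmf(2, mu=3.0))

# Gamma with shape alpha = 2 and rate beta = 1.5 (SciPy uses scale = 1/beta)
print(stats.gamma.pdf(2.0, a=2.0, scale=1.0 / 1.5))
```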
Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.
Below are some main reasons which describe the importance of Unsupervised Learning:
•Unsupervised learning is helpful for finding useful insights from the data.
•Unsupervised learning is much like how a human learns to think from their own experiences, which makes it closer to real AI.
•Unsupervised learning works on unlabelled and uncategorized data, which makes unsupervised learning more important.
•In the real world, we do not always have input data with corresponding outputs; to solve such cases, we need unsupervised learning.
Different aspects of unsupervised learning:
1. Clustering
2. Component Analysis
3. Self Organizing Maps (SOM)
4. Autoencoding neural networks
Clustering
• Clustering is essentially aggregating the samples in the form of groups. The criteria used
for deciding the membership to a group is determined by using some form of metric or
distance.
• Simple method of clustering is K-means clustering. The variable K denotes number of
clusters. The method expects the user to determine the value of K before starting to
apply the algorithm.
• k-Means Clustering
The k-means clustering algorithm can be summarized as follows:
1. Start with a default value of k, which is the number of clusters to find in the given data.
2. Randomly initialize the k cluster centers as k samples in training data, such that there are
no duplicates.
3. Assign each of the training samples to one of the k cluster centers based on a chosen
distance metric.
4. Once the classes are created, update the centers of each class as mean of all the
samples in that class.
5. Repeat steps 3 and 4 until there is no change in the cluster centers.
The distance metric used in the algorithm is typically the Euclidean distance.
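A short scikit-learn sketch of this procedure on synthetic blobs; k = 3 and the data are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# n_clusters is the user-chosen k; fitting alternates sample assignment and center updates
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", km.cluster_centers_)
```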
In most practical situations, where the clusters are not well separated or the number of naturally occurring clusters differs from the initialized value of k, the algorithm may not converge.
The cluster centers can keep oscillating between two different values in subsequent iterations, or they can keep shifting from one set of clusters to another. In such cases, multiple optimizations of the algorithm have been proposed.
Optimizations for k-Means Clustering
1. Change the stopping criterion from absolute “no change” to cluster centers to allow for a
small change in the clusters.
2. Restrict the number of iterations to a maximum number of iterations.
3. Find the number of samples in each cluster and if the number of samples is less than a
certain threshold, delete that cluster and repeat the process.
4. Find the intra-cluster distance versus inter-cluster distance, and if two clusters are too close
to each other relative to other clusters, merge them and repeat the process.
5. If some clusters are getting too big, apply a threshold of maximum number of samples in a
cluster and split the cluster into two or more clusters and repeat the process.
Component Analysis
Another important aspect of unsupervised machine learning is dimensionality reduction.
Component analysis methods are quite effective in this regard. It is similar to principle
component Analysis.
Independent Component Analysis (ICA)
ICA takes a very different and more probabilistic approach towards finding the core
dimensions in the data by making the assumption that given data is generated as a result
of combination of a finite set of independent components. These independent components
are not directly observable and hence sometimes referred to as latent components.
Mathematically, ICA can be defined for the given data $x_i$, $i = 1, 2, \ldots, n$, through the generative model $x_i = A \, s_i$, where $s_i$ is a vector of latent independent components and A is an unknown mixing matrix.
Self organizing maps, also called as self organizing feature maps present a neural network
based unsupervised learning system.
SOMs define a different type of cost function that is based on similarity in the
neighborhood. The idea here is to maintain the topological distribution of the data while
expressing it in smaller dimensions efficiently.
In order to illustrate the functioning of SOM, it is useful to take an actual example. Fig. 6.2 shows data distributed in 3 dimensions. This is synthetically generated data with an ideal distribution to illustrate the concept.
Figure 6.2 shows the original 3-dimensional distribution of the data and its 2-dimensional representation generated by SOM.
• The data is essentially a 2-dimensional plane folded into 3 dimensions. SOM unfolds the plane back into 2 dimensions, as shown in the bottom figure. With this unfolding, the topological behavior of the original distribution is still preserved: all the samples that are neighbors in the original distribution are still neighbors, and the relative distances of different points from one another are also preserved in that order.
Autoencoding neural networks or just autoencoders are a type of neural networks that
work without any labels and belong to the class of unsupervised learning.
The figure below shows the architecture of an autoencoding neural network. There is an input layer matching the dimensionality of the input, a hidden layer with reduced dimensionality, and an output layer with the same dimensionality as the input.
The target here is to regenerate the input at the output stage.
The network is trained to regenerate the input at the output layer. Thus the labels are
essentially same as input.
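A hedged TensorFlow/Keras sketch of such an autoencoder, trained with the input as its own label; the layer sizes are assumptions:

```python
import tensorflow as tf

# Minimal autoencoder sketch: a 20-dimensional input compressed to a 5-dimensional code
inputs = tf.keras.Input(shape=(20,))
hidden = tf.keras.layers.Dense(5, activation="relu")(inputs)      # reduced-dimension code
outputs = tf.keras.layers.Dense(20, activation="linear")(hidden)  # reconstruct the input
autoencoder = tf.keras.Model(inputs, outputs)

# The input itself serves as the label: autoencoder.fit(X, X, ...)
autoencoder.compile(optimizer="adam", loss="mse")
```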
Consider the UCI Adult Salary Predictor. The UCI Machine Learning Repository is one of the well-known resources for finding sample problems in machine learning; it hosts multiple data sets targeting a variety of different problems. The data presented here contains information collected by the census about the salaries of people from different work classes, age groups, locations, etc. The objective is to predict the salary, specifically a binary classification between salary greater than $50K/year and salary less than or equal to $50K/year.
Part A – Q & A
Unit II

PART - A
13. What are the advantages and disadvantages of decision trees? (CO2, K1)
Decision trees are very fast and efficient compared to KNN and other classification algorithms. They are easy to understand, interpret, and visualize. A decision tree can handle any type of data, whether numerical, categorical, or boolean. Normalization is not required for a decision tree.

14. What is SVM? Give an example. (CO2, K1)
Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing classification, regression and even outlier detection. The linear SVM classifier works by drawing a straight line between two classes.
Part B – Questions

1. Briefly explain Regression and its types with suitable examples. (CO2, K3)
5. Explain in detail support vector machines with suitable examples. (CO2, K3)
SUPPORTIVE ONLINE COURSES (NPTEL, Swayam, Coursera, Udemy, etc.)

S No | Course Provider | Course Title                     | Link
1    | NPTEL           | Introduction to Machine Learning | https://onlinecourses.nptel.ac.in/noc22_cs29
2    | Coursera        | Building Machine Learning Models | http://surl.li/crwpi
3    | Coursera        | Predictive Analytics             | https://www.coursera.org/learn/population-health-predictive-analytics
REAL TIME APPLICATIONS IN DAY TO DAY LIFE AND TO INDUSTRY
Contents Beyond the Syllabus
Apriori Algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find (k+1)-itemsets.
To improve the efficiency of level-wise generation of frequent itemsets, an important
property is used called Apriori property which helps by reducing the search space.
AprioriProperty –
All non-empty subset of frequent itemset must be frequent. The key concept of Apriori
algorithm is its anti-monotonicity of support measure. Apriori assumes that all subsets of a
frequent itemset must be frequent(Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Before we start understanding the algorithm, recall the basic definitions of support, confidence, and frequent itemset.
Consider the following dataset and we will find frequent itemsets and generate association
rules for them.
(II) compare candidate set item’s support count with minimum support count(here
min_support=2 if support_count of candidate set items is less than min_support then remove
those items). This gives us itemset L1.
Step-2: K=2
•Generate candidate set C2 using L1 (this is called join step). Condition of joining Lk-1 and
Lk-1 is that it should have (K-2) elements in common.
•Check whether all subsets of an itemset are frequent or not, and if not frequent, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
•Now find the support count of these itemsets by searching the dataset.
(II) compare candidate (C2) support count with minimum support count(here min_support=2 if
support_count of candidate set item is less than min_support then remove those items) this
gives us itemset L2.
Step-3:
•Generate candidate set C3 using L2 (join step). Condition of joining Lk-1 and Lk-1 is that it
should have (K-2) elements in common. So here, for L2, first element should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
•Check if all subsets of these itemsets are frequent or not and if not, then remove that
itemset.(Here subset of {I1, I2, I3} are {I1, I2},{I2, I3},{I1, I3} which are frequent. For {I2, I3,
I4}, subset {I3, I4} is not frequent so remove it. Similarly check for every itemset)
•find support count of these remaining itemset by searching in dataset.
(II) Compare candidate (C3) support count with minimum support count(here min_support=2
if support_count of candidate set item is less than min_support then remove those items) this
gives us itemset L3
Step-4:
•Generate candidate set C4 using L3 (join step). Condition of joining L k-1 and Lk-1 (K=4) is
that, they should have (K-2) elements in common. So here, for L3, first 2 elements (items)
should match.
•Check all subsets of these itemsets are frequent or not (Here itemset formed by joining L3
is {I1, I2, I3, I5} so its subset contains {I1, I3, I5}, which is not frequent). So no itemset in
C4
•We stop here because no frequent itemsets are found further
Thus, we have discovered all the frequent item-sets. Now generation of strong
association rule comes into picture. For that we need to calculate confidence of each
rule.
Confidence –
Confidence(A → B) = support_count(A ∪ B) / support_count(A). For example, a confidence of 60% for the rule {milk, bread} → {butter} means that 60% of the customers who purchased milk and bread also bought butter.
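A hedged sketch of frequent-itemset mining and rule generation, assuming the third-party mlxtend library is available; the transaction table and thresholds are illustrative assumptions, not the dataset from the notes:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions (illustrative items)
transactions = pd.DataFrame(
    [[1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 1]],
    columns=["milk", "bread", "butter", "jam"],
).astype(bool)

# Level-wise search for itemsets meeting the minimum support
frequent = apriori(transactions, min_support=0.4, use_colnames=True)

# Keep only rules whose confidence meets the threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```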
Assessment Schedule (Proposed Date & Actual Date)
Prescribed Text Books & Reference Books
TEXT BOOKS
1. Ameet V Joshi, Machine Learning and Artificial Intelligence, Springer Publications, 2020
2. John D. Kelleher, Brian Mac Namee, Aoife D'Arcy, Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples and Case Studies, MIT Press, 2015
REFERENCES
1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer Publications, 2011
2. Stuart Jonathan Russell, Peter Norvig, John Canny, Artificial Intelligence: A Modern Approach, Prentice Hall, 2020
3. John Paul Mueller, Luca Massaron, Machine Learning For Dummies, Wiley Publications, 2021
Mini Project Suggestions
a) An emergency room in a hospital measures 17 variables like blood pressure, age, etc. of
newly admitted patients. A decision has to be made whether to put the patient in an
ICU. Due to the high cost of ICU, only patients who may survive a month or more are
given higher priority. Such patients are labeled as “low-risk patients” and others are
labeled “high-risk patients”. The problem is to device a rule to classify a patient as a
“low-risk patient” or a “high-risk patient”.
b) A credit card company receives hundreds of thousands of applications for new cards.
The applications contain information regarding several attributes like annual salary, age,
etc. The problem is to devise a rule to classify the applicants into those who are credit-worthy, those who are not credit-worthy, and those who require further analysis.
c) Astronomers have been cataloguing distant objects in the sky using digital images
created using special devices. The objects are to be labeled as star, galaxy, nebula, etc.
The data is highly noisy and the objects are very faint. The problem is to devise a rule using which a distant object can be correctly labeled.
Thank you