ML Assignment 3 Solution
Kumbalgodu, Bangalore-74
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & MACHINE LEARNING
Assignment 3
Sl.No Questions CO
1 Explain linear regression.
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence it is called linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes according to
the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between
the variables. Consider the below image:
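As a minimal sketch of fitting such a sloped line (assuming scikit-learn and NumPy are available; the small experience-vs-salary dataset below is made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (x) vs. salary (y)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30000, 35000, 41000, 46000, 50000])

model = LinearRegression()
model.fit(X, y)  # learns the intercept b0 and slope b1 of the line y = b0 + b1*x

print(model.intercept_, model.coef_)  # fitted b0 and b1
print(model.predict([[6]]))           # prediction for a new value of x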
2 Explain polynomial regression.
o In the above image, we have a dataset that is arranged non-linearly. If we try to cover it
with a linear model, we can clearly see that it hardly covers any data points. A curve, on
the other hand, covers most of the data points, which corresponds to the polynomial model.
o Hence, if a dataset is arranged in a non-linear fashion, we should use the Polynomial
Regression model instead of Simple Linear Regression.
Equation of the Polynomial Regression Model:
Simple Linear Regression equation: y = b0+b1x .........(a)
Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)
Polynomial Regression equation: y = b0 + b1x + b2x^2 + b3x^3 + .... + bnx^n ..........(c)
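As a minimal sketch (again assuming scikit-learn; the curved dataset is made up), equation (c) can be fitted by expanding x into polynomial features and then applying ordinary linear regression:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical non-linear data: y roughly follows x^2
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.1, 4.2, 8.9, 16.1, 25.2])

poly = PolynomialFeatures(degree=2)        # builds the columns [1, x, x^2]
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)  # finds b0, b1, b2 of equation (c)
print(model.predict(poly.transform([[6]])))  # prediction for a new x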
3 Explain logistic regression with an example. 3
o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc.
However, instead of giving the exact values 0 and 1, it gives probabilistic values that lie
between 0 and 1.
o Logistic Regression is much like Linear Regression except in how it is used: Linear
Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification. The logistic
(sigmoid) function that produces this S-shaped curve is f(x) = 1 / (1 + e^(-x)).
Logistic regression is used to solve classification problems, and the most common use case
is binary logistic regression, where the outcome is binary (yes or no). In the real world, you can
see logistic regression applied across multiple areas and fields.
• In health care, logistic regression can be used to predict if a tumor is likely to be benign or
malignant.
• In the financial industry, logistic regression can be used to predict if a transaction is
fraudulent or not.
• In marketing, logistic regression can be used to predict if a targeted audience will respond
or not.
The three types of logistic regression
1. Binary logistic regression - When we have two possible outcomes, like our original
example of whether a person is likely to be infected with COVID-19 or not.
2. Multinomial logistic regression - When we have multiple outcomes, say if we build out
our original example to predict whether someone may have the flu, an allergy, a cold, or
COVID-19.
3. Ordinal logistic regression - When the outcome is ordered, like if we build out our
original example to also help determine the severity of a COVID-19 infection, sorting it
into mild, moderate, and severe cases.
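As a minimal sketch of binary logistic regression (assuming scikit-learn; the tumor-size data below is invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumor size vs. malignant (1) or benign (0)
X = np.array([[1.0], [1.5], [2.0], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict([[2.5]]))        # predicted class: 0 or 1
print(clf.predict_proba([[2.5]]))  # probabilistic values between 0 and 1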
4 Explain the gradient descent algorithm.
Instead of climbing up a hill, think of gradient descent as hiking down to the bottom of a valley.
This is a better analogy because it is a minimization algorithm that minimizes a given function.
The equation below describes what the gradient descent algorithm does:
b = a − γ∇f(a)
Here b is the next position of our climber, while a represents his current position. The minus sign
refers to the minimization part of the gradient descent algorithm. The gamma in the middle is a
weighting factor (the learning rate), and the gradient term ∇f(a) gives the direction of steepest
ascent, so subtracting it moves us in the direction of steepest descent.
So this formula basically tells us the next position we need to go to, which is in the direction of
steepest descent. Let's look at an example to really drive the concept home.
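Here is a minimal sketch in plain Python, using the illustrative function f(a) = a^2 (so ∇f(a) = 2a), chosen here only to show the update rule at work:

def gradient(a):
    # derivative of f(a) = a**2
    return 2 * a

a = 10.0      # current position of the climber
gamma = 0.1   # weighting factor (learning rate)

for step in range(50):
    b = a - gamma * gradient(a)  # b = a - γ∇f(a)
    a = b                        # move to the next position

print(a)  # ends up very close to 0.0, the bottom of the valley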
5 Explain Support Vector Machine with an example. 3
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category
in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider
the below diagram, in which two different categories are classified using a decision boundary or
hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose
we see a strange cat that also has some features of dogs. If we want a model that can accurately
identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We
first train our model with lots of images of cats and dogs so that it can learn the different features
of cats and dogs, and then we test it with this strange creature. Since SVM creates a decision
boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors),
it will look at the extreme cases of cat and dog. On the basis of the support vectors, it will classify
the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes by using a single straight line, then such data is termed linearly separable
data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset
cannot be classified by using a straight line, then such data is termed non-linear data, and
the classifier used is called the Non-linear SVM classifier.
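As a minimal sketch (assuming scikit-learn; the six 2-D points are made up), the two types differ only in the kernel argument:

import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D points belonging to two classes
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

linear_svm = SVC(kernel="linear").fit(X, y)  # straight-line hyperplane
nonlinear_svm = SVC(kernel="rbf").fit(X, y)  # curved decision boundary

print(linear_svm.support_vectors_)  # the extreme points that define the hyperplane
print(linear_svm.predict([[3, 3]]))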
6 Write a note on regularized linear models. 5
In machine learning, we often face the problem where our model behaves well on training data but
very poorly on test data. This happens when the model follows the training data too closely, i.e., it
overfits the data.
Regularization is a technique to reduce overfitting. The term regularization means the act of
bringing to uniformity.
Complex models can detect subtle patterns in the data, but if the data is noisy (contains irrelevant
information) or the dataset is too small, the model will end up detecting patterns in the noise itself.
When we use such a model to make predictions, the results will not be accurate and the error will
be larger than expected.
In linear regression, the final output is the weighted sum of the feature variables which is
represented by the below equation.
y = w0 + w1x1 + w2x2 + w3x3 + ... + wnxn
Here the weights w1, w2, ..., wn represent the importance of the features (x1, x2, ..., xn). A feature
is of high importance if it has a large weight associated with it.
The error in linear regression is the mean squared error, given by:
MSE = (1/n) Σ (y_i − ŷ_i)^2
where y_i is the actual value and ŷ_i is the predicted value for the i-th of n examples.
For a linear model, regularization is achieved by constraining the weights of the model. To
constrain the weights, we first need to understand how these weights are calculated. Weights are
calculated as per the cost function; for linear regression, the cost function is the mean squared
error. The weights are tweaked repeatedly, the MSE is calculated each time, and the set of weights
with the minimum MSE is taken as the final output.
To regularize the model, the regularization term will be added to the cost function.
Regularized Cost Function = MSE + Regularization term
Here we will see three different regularization terms to constrain the weights of the model, and
thus three different regularized linear regression algorithms:
1. Ridge Regression
2. Lasso Regression
3. Elastic Net
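As a minimal sketch (assuming scikit-learn; the data and the alpha values are illustrative), all three are drop-in replacements for plain linear regression:

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Hypothetical data with two features
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
y = np.array([3, 3, 7, 7, 11])

ridge = Ridge(alpha=1.0).fit(X, y)  # penalty on the sum of squared weights
lasso = Lasso(alpha=0.1).fit(X, y)  # penalty on the sum of absolute weights
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # a mix of both penalties

print(ridge.coef_, lasso.coef_, enet.coef_)  # constrained (shrunken) weights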
7 Explain Bayes theorem with an example. 5
The Bayes theorem is a mathematical formula for calculating conditional probability in probability
and statistics. In other words, it is used to figure out how likely an event is, given that another
related event has occurred. Bayes law or Bayes rule are other names for the theorem.
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two
random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.
It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).
The formula for the Bayes theorem can be written in a variety of ways. The following is the most
common version:
P(A ∣ B) = P(B ∣ A)P(A) / P(B)
P(A ∣ B) is the conditional probability of event A occurring, given that B is true.
P(B ∣ A) is the conditional probability of event B occurring, given that A is true.
P(A) and P(B) are the marginal (prior) probabilities of A and B, respectively.
Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This
is very useful in cases where we have good estimates of three of these terms and want to determine
the fourth one. Suppose we perceive the effect of some unknown cause and want to compute that
cause; then Bayes' rule becomes:
P(cause | effect) = P(effect | cause) P(cause) / P(effect)
Example-1: What is the probability that a patient has meningitis, given that they have a stiff neck?
Given Data:
A doctor is aware that the disease meningitis causes a patient to have a stiff neck 80% of the time.
He is also aware of some more facts, which are given as follows:
o The known probability that a patient has meningitis is 1/30,000.
o The known probability that a patient has a stiff neck is 2%.
Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has
meningitis, so we can calculate the following:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
Applying Bayes' theorem:
P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 ≈ 0.00133 = 1/750
Hence, we can assume that 1 patient out of 750 patients has meningitis given a stiff neck.
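The same calculation as a minimal sketch in plain Python (values taken from the example above):

def bayes(p_a_given_b, p_b, p_a):
    # P(b|a) = P(a|b) * P(b) / P(a)
    return p_a_given_b * p_b / p_a

p_meningitis_given_stiff_neck = bayes(0.8, 1 / 30000, 0.02)
print(p_meningitis_given_stiff_neck)  # ~0.00133, i.e., about 1 in 750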
8 Write a note on Bayes Optimal Classifier 5
The Bayes Optimal Classifier is a probabilistic model that makes the most probable prediction for
a new example, using the training data and the space of hypotheses.
It is described using the Bayes Theorem that provides a principled way for calculating a
conditional probability. It is also closely related to the Maximum a Posteriori: a probabilistic
framework referred to as MAP that finds the most probable hypothesis for a training dataset.
In practice, the Bayes Optimal Classifier is computationally expensive, if not intractable to
calculate, and instead, simplifications such as the Gibbs algorithm and Naive Bayes can be used to
approximate the outcome.
The Bayes optimal classification of a new instance is the value v_j that maximizes
Σ_i P(v_j | h_i) P(h_i | D), i.e., a vote over all hypotheses weighted by their posterior probabilities.
For example, let there be 5 hypotheses h1 through h5, each with a posterior probability P(h_i | D)
and each recommending that a robot turn left or right. Even if the single most probable hypothesis
recommends turning right, the hypotheses recommending left may together carry more posterior
weight. In that case, the Bayes optimal procedure recommends the robot turn left.
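A minimal sketch of this weighted vote in plain Python (the posterior values are made up for illustration, since the original table is not reproduced here):

# Hypothetical posteriors P(h|D) for h1..h5 and the action each recommends
posteriors = [0.4, 0.2, 0.15, 0.15, 0.1]
actions = ["right", "left", "left", "left", "left"]

# Sum the posterior probability behind each action
votes = {}
for p, action in zip(posteriors, actions):
    votes[action] = votes.get(action, 0.0) + p

print(max(votes, key=votes.get))  # "left" (0.6) beats "right" (0.4)

Note that the single most probable hypothesis (h1) recommends "right", yet the Bayes optimal vote picks "left".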
9 Elaborate on the Naïve Bayes Classifier. 5
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o The Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms, helping to build fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability
of an object.
o Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple, without
depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.
P(A) is Prior probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: Probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day according
to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
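As a minimal sketch (assuming scikit-learn; the tiny weather/Play dataset is invented to mirror the example), CategoricalNB performs these three steps internally:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical weather data: 0=Sunny, 1=Overcast, 2=Rainy; target: 1=Play, 0=Don't
X = np.array([[0], [0], [1], [2], [2], [1], [0], [2]])
y = np.array([0, 1, 1, 1, 0, 1, 1, 0])

clf = CategoricalNB()
clf.fit(X, y)  # builds the frequency and likelihood tables internally

print(clf.predict([[1]]))        # decision for an Overcast day
print(clf.predict_proba([[1]]))  # posterior probabilities from Bayes' theorem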
10 Write a note on Bayesian Belief Network 5
A Bayesian belief network is a key technology for dealing with probabilistic events and for solving
problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinions, and it consists
of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, which can be continuous or discrete.
o Each arc (directed arrow) represents a causal relationship or conditional dependency
between variables; an arc from node A to node B means that A is a parent of B.
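A minimal sketch in plain Python (the two-node network Rain → WetGrass and all probabilities are made up for illustration), showing the two parts at work, the directed graph structure and the conditional probability tables:

# Directed acyclic graph with one arc: Rain -> WetGrass
p_rain = 0.2                           # P(Rain = true)
p_wet_given = {True: 0.9, False: 0.1}  # P(WetGrass = true | Rain)

# Joint distribution via the chain rule: P(R, W) = P(R) * P(W | R)
joint = {}
for rain in (True, False):
    pr = p_rain if rain else 1 - p_rain
    for wet in (True, False):
        pw = p_wet_given[rain] if wet else 1 - p_wet_given[rain]
        joint[(rain, wet)] = pr * pw

# Inference: P(Rain = true | WetGrass = true)
p_wet = joint[(True, True)] + joint[(False, True)]
print(joint[(True, True)] / p_wet)  # 0.18 / 0.26 ≈ 0.692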
11 Explain concept learning.
Concept learning can be defined as: "A task of acquiring a potential hypothesis (solution) that best
fits the given training examples."
Concept learning is the task of inferring a Boolean-valued function from a set of training
examples. The purpose of inferring this function is to use it as a general rule for classifying unseen
data.
Concept learning is based on a type of learning called inductive learning. In inductive learning, the
learner learns by example. In other words, the learner discovers the rules of a particular concept by
learning from examples of that concept. For example, when a student teaches himself algebra, the
more he practices different types of examples and solutions, the more he understands the general
rules of algebra. The idea is the same for concept learning: a machine is taught different examples
of a concept, and by learning these examples, it will "discover" the general rule(s) that apply to
that concept.
Concept learning thus involves learning a function (which is a rule) from a set of training
examples.
Concept learning aims to find a function or rule that truly represents the particular concept being
learned. The function must be a true representation of the concept so that it can make accurate
classifications of unseen data. By "true representation", we mean that the function must be able to
approximate the true value of the target concept. The target concept refers to what we are trying to
classify. A Boolean-valued function, denoted c(x), takes on one of two possible values, indicating
whether or not an object belongs to the target concept. The aim is generally to determine whether
a certain object belongs to the category defined by the target concept.
According to the Inductive Learning Hypothesis, if a function can approximate the target concept
well enough over training examples, then it will be able to approximate the target concept well for
unseen examples.
For example, suppose an algebra learner has gained an understanding of the general rules of
algebra based on the examples they’ve practiced. In that case, they’ll be able to apply those rules
to solve any new problems that they encounter. Similarly, in concept learning, an inferred function
will be able to approximate and classify new data based on how well it has learned in the past.
Concept learning works in two ways. It works by:
1. Inferring a function from a set of training examples.
2. Searching to find the function that best fits the training examples.
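As a minimal sketch of such a search (plain Python, implementing the classic Find-S algorithm on a made-up attribute-value dataset), the learner generalizes its hypothesis just enough to cover every positive training example:

# Hypothetical training examples: (sky, temperature, humidity) -> positive?
examples = [
    (("sunny", "warm", "normal"), True),
    (("sunny", "warm", "high"), True),
    (("rainy", "cold", "high"), False),
    (("sunny", "warm", "high"), True),
]

# Start from the most specific hypothesis: accepts nothing yet
hypothesis = [None, None, None]

for attributes, positive in examples:
    if not positive:
        continue  # Find-S ignores negative examples
    for i, value in enumerate(attributes):
        if hypothesis[i] is None:
            hypothesis[i] = value  # adopt the first positive example's value
        elif hypothesis[i] != value:
            hypothesis[i] = "?"    # generalize: any value is acceptable here

print(hypothesis)  # ['sunny', 'warm', '?'], the inferred rule for the concept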