ML Module 2
Reference:
• James G., Witten D., Hastie T., Tibshirani R. “An Introduction to
Statistical Learning with Applications in R”, Springer Texts in
Statistics.
• Machine learning by Andrew Ng.
Linear Regression
Goal: to find the straight line that best fits the data.
Linear regression with one variable
$$h_\theta(x) = \theta_0 + \theta_1 x$$
Linear regression with one variable
h(x) = 0.5*x + 1
h(x) = 0.5*x + 0
h(x) = 0*x + 1.5
Linear regression with one variable
$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Cost Function
Linear regression with one variable:
Cost function
For $\theta_1 = 1$:
$$J(1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = 0$$
For $\theta_1 = 0.5$: $J(0.5) = \;?$
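For concreteness, here is a minimal Python (NumPy) sketch of this computation, using a small hypothetical dataset (the numbers are illustrative, not the ones behind the slide's plots):

```python
import numpy as np

# Hypothetical training data (for illustration only): y = x exactly
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def cost_J(theta1):
    """J(theta1) = (1/2m) * sum((h(x) - y)^2) for h(x) = theta1 * x."""
    h = theta1 * x
    return np.sum((h - y) ** 2) / (2 * m)

print(cost_J(1.0))   # 0.0 -> this hypothesis fits the toy data perfectly
print(cost_J(0.5))   # > 0 -> a worse fit gives a larger cost
```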
Linear regression with one variable:
Cost function
Evaluating the cost at $\theta_1 = 0$, $\theta_1 = 0.5$, and $\theta_1 = 1$ traces out the curve $J(\theta_1)$; for this data, $J(0.5) = 2.3$.
Linear regression with one variable:
Cost function
For this data the cost is minimized at $\theta_1 = 1$.
Example with two parameters: $\theta_0 = 50$, $\theta_1 = 0.06$.
• When we have a single parameter $\theta_1$, the cost function $J(\theta_1)$ is bowl-shaped.
• With two parameters $(\theta_0, \theta_1)$, the bowl-shaped surface $J(\theta_0, \theta_1)$ is usually visualized with a contour plot.
Linear regression with one variable:
Cost function
Linear regression with one variable:
Cost function
[Figure: example hypotheses h(x) (e.g. with parameters 800 and −0.15) plotted against the data, each corresponding to a point on the contour plot of J(θ0, θ1).]
Linear regression with one variable:
Cost function
Cost function:
Gradient descent algorithm
Start with some values of $(\theta_0, \theta_1)$ and keep changing them to reduce $J(\theta_0, \theta_1)$ until the algorithm converges to a local minimum. Starting from a different initial point, gradient descent may converge to a different local minimum.
Derivative term: $\frac{\partial}{\partial \theta_1} J(\theta_1)$ (the slope of the cost curve) determines the direction of each step.
• Gradient descent can converge to a local minimum, even with the learning rate α fixed.

$$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1), \qquad \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$
(update $\theta_0$ and $\theta_1$ simultaneously)
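A minimal sketch of these simultaneous updates for one-variable linear regression, assuming a small hypothetical dataset and learning rate:

```python
import numpy as np

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
m = len(x)

alpha = 0.05                 # learning rate
theta0, theta1 = 0.0, 0.0

for _ in range(2000):
    h = theta0 + theta1 * x                 # current predictions
    grad0 = np.sum(h - y) / m               # dJ/dtheta0
    grad1 = np.sum((h - y) * x) / m         # dJ/dtheta1
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)        # approaches the least-squares line for this data
```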
Multivariate Linear
Regression
Linear regression with multiple variables (multiple features): multiple linear regression
Linear Regression with multiple variables
Training set with $m$ training examples, $n = 4$ features $X_1, X_2, X_3, X_4$, and target $Y$.
Notation: $x^{(i)} \in \mathbb{R}^n$ is the (column) feature vector of the $i$-th training example, e.g. $X^{(3)}$ for the 3rd example, and $x_j^{(i)}$ is the value of feature $j$ in the $i$-th example.
Linear Regression with multiple variables
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
Example: $h_\theta(x) = \theta_0 + 3x_1 + 0.01x_2 + \dots + 2x_n$.
In vector form: $h_\theta(x) = \theta^T x$, where $\theta = [\theta_0\ \theta_1\ \dots\ \theta_n]^T$ and $x = [x_0\ x_1\ \dots\ x_n]^T$.
Gradient descent for multiple variables
With $x_0 = 1$, the cost function is
$$J(\theta) = J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Gradient descent for multiple variables
repeat {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$    (simultaneously update for every $j = 0, 1, \dots, n$)
}
$J(\theta)$ should decrease after every iteration.
• For sufficiently small α, J(θ) should decrease on every iteration.
• If α is too large, J(θ) may not decrease on every iteration and may fail to converge.
• Choose the value of α carefully (try, e.g., α = 0.001, 0.003, 0.01, 0.03, 0.3, 0.5).
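As a sketch of this advice, one might sweep several candidate values of α and check whether J(θ) decreases on every iteration (the data, features, and α grid below are illustrative assumptions):

```python
import numpy as np

def gradient_descent(X, y, alpha, n_iters=200):
    """Vectorized batch gradient descent; X already contains a column of ones (x0 = 1)."""
    m, n = X.shape
    theta = np.zeros(n)
    J_history = []
    for _ in range(n_iters):
        err = X @ theta - y
        J_history.append(np.sum(err ** 2) / (2 * m))   # cost before this update
        theta -= alpha * (X.T @ err) / m               # simultaneous update of all theta_j
    return theta, J_history

# Hypothetical data with two features plus the intercept column x0 = 1
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 2, 50), rng.uniform(0, 2, 50)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.1, 50)

for alpha in [0.001, 0.003, 0.01, 0.03, 0.3, 0.5]:
    theta, J_hist = gradient_descent(X, y, alpha)
    decreasing = all(b <= a for a, b in zip(J_hist, J_hist[1:]))
    print(f"alpha={alpha}: final J={J_hist[-1]:.4f}, monotonically decreasing={decreasing}")
```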
Non-linear regression
(Polynomial regression)
House Price Prediction
Example: a new feature can be created from existing ones, e.g. $X = X_1 \cdot X_2$ (using features $X_1$ and $X_2$ of the house).
Polynomial features: including powers of a feature, e.g. $X_1 = x$, $X_2 = x^2$, $X_3 = x^3$, gives
$$h_\theta(X) = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_3,$$
which fits a nonlinear (here cubic) curve with the same linear-regression machinery.
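A brief sketch of constructing polynomial features by hand and fitting them with ordinary least squares; the house sizes, prices, and scaling choice are illustrative assumptions:

```python
import numpy as np

# Hypothetical house sizes (in 1000 sq. ft.) and prices (illustration only)
size = np.array([0.8, 1.0, 1.5, 2.0, 2.5, 3.0])
price = np.array([150, 200, 260, 300, 320, 330], dtype=float)

# Polynomial features: x, x^2, x^3, plus the intercept column x0 = 1.
# Scaling each column keeps the columns comparable when the powers differ in magnitude.
X = np.column_stack([size, size**2, size**3])
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.column_stack([np.ones(len(size)), X])

# Least-squares fit (minimizes the same squared-error cost as gradient descent would)
theta = np.linalg.lstsq(X, price, rcond=None)[0]
print(theta)
```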
Regularization
Bias & Variance
Logistic Regression
• Logistic regression: A method that is used to predict qualitative
responses (known as classification).
Classification methods:
• Logistic Regression
• Linear discriminant analysis (LDA)
• K-nearest neighbor (KNN)

Logistic Regression
Classification boundary
Two examples of malignant tumors are misclassified.
We want $0 \le h_\theta(x) \le 1$. Logistic regression uses
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}}$$
Logistic regression model
Logistic regression: Decision boundary
Predict $Y = 1$ when $h_\theta(x) \ge 0.5$ (i.e. $\theta^T x \ge 0$); predict $Y = 0$ otherwise.
Logistic regression: Decision boundary
Nonlinear function:
$$h_\theta(x) = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}}$$
Logistic regression: Cost function
$$h_\theta(x) = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}}
\;\;\Rightarrow\;\;
\frac{h_\theta(x)}{1 - h_\theta(x)} = e^{\theta^T x}
\;\;\Rightarrow\;\;
\ln\!\left(\frac{h_\theta(x)}{1 - h_\theta(x)}\right) = \theta^T x$$
The left-hand side is called the log-odds or logit. We see that the
logistic regression model has a logit that is linear in X.
Logistic regression: Cost function
Linear Regression cost: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$.
Logistic Regression: plugging the sigmoid hypothesis into this squared-error cost makes $J(\theta)$ non-convex, so we want a convex cost function that gradient descent can minimize reliably.
Logistic regression: Cost function
Cost for a single training example:
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$
Logistic regression: Cost function
Putting it together:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\!\left(1 - h_\theta(x^{(i)})\right)\right]$$
Logistic regression: Cost function
To get $\theta$: minimize $J(\theta)$, e.g. with gradient descent $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$.
Logistic regression: Cost function
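A compact sketch putting the sigmoid hypothesis, the cross-entropy cost, and gradient descent together; the toy data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    h = sigmoid(X @ theta)
    eps = 1e-12                              # avoid log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Hypothetical 1-D data: class 1 tends to have larger x
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones(len(x)), x])    # add x0 = 1

theta, alpha = np.zeros(2), 0.1
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of J(theta)
    theta -= alpha * grad

print(cost(theta, X, y), theta)
print(sigmoid(X @ theta) >= 0.5)             # predicted labels (decision boundary at 0.5)
```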
Logistic regression:
Multi-class classification (one-vs.-all): train a separate binary classifier $h_\theta^{(k)}(x)$ for each class $k$ (class $k$ versus all the rest) and predict the class $k$ with the largest $h_\theta^{(k)}(x)$.
1. When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Logistic regression has the peculiar behavior that if a feature separates the classes perfectly, the coefficients go to infinity; it works better when the classes are not well separated. LDA does not suffer from this problem.
Linear Discriminant Analysis (LDA)
Bayes’ Classifier:
• The Bayes theorem states that
$$P(C_k \mid X) = \frac{P(C_k)\,P(X \mid C_k)}{\sum_{j=1}^{K} P(C_j)\,P(X \mid C_j)}$$
• This means the posterior probability $P(C_k \mid X)$ can be evaluated once we have an expression for the class-conditional probability density function (PDF) $P(X \mid C_k)$, as in the naïve Bayes classifier.
Linear Discriminant Analysis (LDA)
Naïve Bayes’ Classifier:
Expression of the probability density function (PDF) used in the naïve Bayes classifier:
$$P(x_i \mid C_k) = \frac{1}{\sigma_k\sqrt{2\pi}}\,\exp\!\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$$
where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters for the $k$-th class. For now, let us further assume that $\sigma_1^2 = \dots = \sigma_K^2$: that is, there is a shared variance term across all K classes, which for simplicity we can denote by $\sigma^2$.
LDA for p = 1
Substituting
$$P(x \mid C_k) = \frac{1}{\sigma_k\sqrt{2\pi}}\,\exp\!\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$$
into Bayes' theorem (stated earlier), we get
$$P(C_k \mid X = x) = \frac{P(C_k)\,\frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x - \mu_k)^2}{2\sigma^2}\right)}{\sum_{j=1}^{K} P(C_j)\,\frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x - \mu_j)^2}{2\sigma^2}\right)}$$
• Taking the log of the above eqn. and rearranging terms, it can be shown that this is equivalent to assigning the observation to the class for which
$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln\!\left[P(C_k)\right]$$
is largest.
LDA for p = 1
• Generally, for 2 classes we look at the log ratio of their posterior probabilities:
LDA for p = 1
(Prove it !...)
LDA for p = 1
• The mean and variance parameters for the two density functions are given in the figure.
• The two densities overlap, and so given that X = x, there is some uncertainty
about the class to which the observation belongs.
If we assume that an observation is equally likely to come from either class, i.e. $P(C_1) = P(C_2) = 0.5$, then the Bayes classifier assigns the observation to class 1 if $x < 0$ and to class 2 otherwise.
LDA for p = 1
• In practice, even if we are quite certain of our assumption that X is drawn from a Gaussian distribution within each class, we still have to estimate the parameters $\mu_1, \dots, \mu_K$, $\sigma^2$, and the priors $P(C_1), \dots, P(C_K)$.
• LDA approximates the Bayes classifier by using the following estimates:
$$\hat{P}(C_k) = \frac{n_k}{n}, \qquad \hat{\mu}_k = \frac{1}{n_k}\sum_{i:\,y_i = k} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n - K}\sum_{k=1}^{K}\sum_{i:\,y_i = k}\left(x_i - \hat{\mu}_k\right)^2$$
where $n$ is the total number of training observations and $n_k$ is the number belonging to the $k$-th class.
• Sometimes we have knowledge of the class membership probabilities $P(C_1), \dots, P(C_K)$, which can be used directly. In the absence of any additional information, LDA estimates $P(C_k)$ using the proportion of the training observations that belong to the $k$-th class.
LDA for p = 1
Using $\hat{P}(C_k) = n_k / n$ together with $\hat{\mu}_k$ and $\hat{\sigma}^2$, the estimated discriminant is
$$\hat{\delta}_k(x) = x\,\frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \ln\!\left[\hat{P}(C_k)\right]$$
and the LDA classifier assigns an observation $X = x$ to the class for which $\hat{\delta}_k(x)$ is largest.
• The word ‘linear’ in the classifier’s name stems from the fact
that the discriminant functions in the above are linear functions
of x.
• Summary (p = 1): the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance $\sigma^2$, and plugging estimates for these parameters into the Bayes classifier.
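A minimal sketch of LDA with p = 1 using the estimates and discriminant above; the two-class toy data are an illustrative assumption:

```python
import numpy as np

# Hypothetical 1-D training data for two classes (illustration only)
x = np.array([-2.1, -1.3, -0.4, -1.8, 0.9, 1.5, 2.2, 1.1])
y = np.array([ 0,    0,    0,    0,   1,   1,   1,   1 ])

classes = np.unique(y)
n, K = len(x), len(classes)

priors = np.array([np.mean(y == k) for k in classes])       # P_hat(C_k) = n_k / n
mus    = np.array([x[y == k].mean() for k in classes])      # mu_hat_k
sigma2 = sum(((x[y == k] - x[y == k].mean()) ** 2).sum()    # pooled variance sigma_hat^2
             for k in classes) / (n - K)

def delta(x0):
    """Linear discriminant delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2 sigma^2) + ln P(C_k)."""
    return x0 * mus / sigma2 - mus**2 / (2 * sigma2) + np.log(priors)

for x0 in [-1.0, 0.0, 1.0]:
    print(x0, "-> class", classes[np.argmax(delta(x0))])
```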
LDA for p > 1
• We now extend the LDA classifier to the case of multiple
predictors.
• However, the bell shape will be distorted if the predictors are correlated or have unequal variances, as illustrated in the right-hand panel of the figure. In this situation, the base of the bell will have an elliptical, rather than circular, shape.
Multivariate Gaussian distribution
• To indicate that a p-dimensional random variable X has a multivariate
Gaussian distribution, we write X ∼ N(μ, Σ). Here E(X) = μ is the mean of
X (a vector with p components), and Cov(X) = Σ is the p × p covariance
matrix of X.
Covariance Matrix
In the top row, all bivariate Gaussian distributions have ρ = 0 and look like a circle when the standard deviations are of equal size. The top-middle plot is stretched along X2, giving it an elliptical shape. The middle and bottom rows show how the distribution changes for negative (ρ = −0.3) and positive (ρ = 0.7) correlations.
LDA for p > 1
• In the case of p > 1 predictors, the LDA classifier assumes that the
observations in the k’th class are drawn from a multivariate Gaussian
distribution N(,Σ), where is a class-specific mean vector, and Σ is a
covariance matrix that is common to all K classes.
• Plugging the density function for the $k$-th class into Bayes' theorem and performing a little bit of algebra reveals that the Bayes classifier assigns an observation $X = x$ to the class for which
$$\delta_k(x) = x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln\!\left[P(C_k)\right]$$
is largest.
The Bayes classifier will classify an observation according to the region in which it
is located.
LDA for p > 1
• Once again, we need to estimate the unknown parameters and Σ; the
formulas are similar to those used in the 1D case (LDA with p = 1).
$$\delta_k(x) = x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln\!\left[P(C_k)\right]$$
• Note that $\delta_k(x)$ in the above eqn. is a linear function of $x$; i.e. the LDA decision rule depends on $x$ only through a linear combination of its elements. This is the reason for the word linear in LDA.
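A short sketch of this multivariate discriminant with a pooled covariance estimate; the simulated two-class data are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D data: two Gaussian classes sharing the same covariance (illustration only)
Sigma_true = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal([-1.5, 0.0], Sigma_true, size=50)
X1 = rng.multivariate_normal([ 1.5, 0.5], Sigma_true, size=50)
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

classes = np.unique(y)
priors = np.array([np.mean(y == k) for k in classes])
mus = np.array([X[y == k].mean(axis=0) for k in classes])
# Pooled (common) covariance estimate
Sigma = sum((X[y == k] - X[y == k].mean(axis=0)).T @ (X[y == k] - X[y == k].mean(axis=0))
            for k in classes) / (len(y) - len(classes))
Sigma_inv = np.linalg.inv(Sigma)

def lda_predict(x):
    """Assign x to the class with the largest linear discriminant delta_k(x)."""
    deltas = [x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(p)
              for mu, p in zip(mus, priors)]
    return classes[int(np.argmax(deltas))]

print(lda_predict(np.array([-1.0, 0.0])), lda_predict(np.array([2.0, 0.5])))
```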
20 observations drawn from each of the three classes are displayed, and the
resulting LDA decision boundaries are shown as solid black lines. Overall, the LDA
decision boundaries are pretty close to the Bayes decision boundaries, shown
again as dashed lines. The test error rates for the Bayes and LDA classifiers are
0.0746 and 0.0770, respectively. This indicates that LDA is performing well on this
data.
The ROC (receiver operating characteristic) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
ROC Curves
• The overall performance of a classifier is given by the area under the ROC
curve or AUC.
• Along the y-axis the true positive rate (TPR) and along the x-axis the false positive rate (FPR) are plotted. These are also called the sensitivity (plotted along the Y-axis) and 1 − specificity (plotted along the X-axis) of our classifier.
• An ideal ROC curve will touch the top left corner, so the larger the AUC the
better the classifier.
• For this data the AUC is 0.95, which is close to the maximum of 1, so it would be considered very good. We expect a classifier that performs no better than chance to have an AUC of 0.5 (when evaluated on an independent test set not used in model training).
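A small sketch of tracing a ROC curve by sweeping thresholds and approximating the AUC with the trapezoid rule; the scores and labels are hypothetical:

```python
import numpy as np

# Hypothetical classifier scores (e.g. estimated P(C_1 | x)) and true labels
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3, 0.2])
labels = np.array([1,    1,   1,   0,   1,    0,   1,   0,   0,   0 ])

thresholds = np.sort(np.unique(np.concatenate([[0.0, 1.0], scores])))[::-1]
tpr, fpr = [], []
P, N = labels.sum(), (1 - labels).sum()
for t in thresholds:
    pred = (scores >= t).astype(int)
    tpr.append(((pred == 1) & (labels == 1)).sum() / P)   # sensitivity (y-axis)
    fpr.append(((pred == 1) & (labels == 0)).sum() / N)   # 1 - specificity (x-axis)

tpr, fpr = np.array(tpr), np.array(fpr)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)     # trapezoid rule
print(f"AUC = {auc:.3f}")
```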
Like LDA, the QDA classifier results from assuming that the
observations from each class are drawn from a Gaussian distribution,
and plugging estimates for the parameters into Bayes’ theorem in
order to perform prediction.
Unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the $k$-th class is of the form $X \sim N(\mu_k, \Sigma_k)$, where $\Sigma_k$ is a covariance matrix for the $k$-th class.
Quadratic Discriminant Analysis (QDA)
• Under this assumption, the Bayes classifier assigns an observation $X = x$ to the class for which
$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\ln\!\left|\Sigma_k\right| + \ln \pi_k$$
is largest.
(Note: $\pi_k$ is equivalent to $P(C_k)$.)
• So the QDA classifier involves plugging estimates for $\Sigma_k$, $\mu_k$, and $\pi_k$ into this expression, and then assigning an observation $X = x$ to the class for which this quantity is largest.
Q. Why does it matter whether or not we assume that the K classes share
a common covariance matrix? In other words, why would one prefer LDA
to QDA, or vice-versa?
• Consequently, LDA is a much less flexible classifier than QDA, and so has
substantially lower variance. This can potentially lead to improved prediction
performance.
• But there is a trade-off: if LDA's assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias and give poor performance.
• LDA tends to be a better choice than QDA if there are relatively few training observations, so that reducing variance is crucial.
• In contrast, QDA is recommended if the training set is very large, so that the
variance of the classifier is not a major concern.
LDA Vs. QDA
• The figure illustrates the performance of LDA and QDA in two scenarios. In the left-
hand panel, the two Gaussian classes have a common correlation of 0.7 between
X1 and X2. As a result, the Bayes decision boundary is linear and is accurately
approximated by the LDA decision boundary. The QDA decision boundary is
inferior, because it suffers from higher variance without a corresponding decrease
in bias.
• In contrast, the right-hand panel displays a situation in which the orange class has
a correlation of 0.7 between the variables and the blue class has a correlation of
−0.7. Now the Bayes decision boundary is quadratic, and so QDA more accurately
approximates this Bayes decision boundary than does LDA.
K-nearest neighbor (KNN)
• The k-nearest neighbors algorithm (k-NN) is a non-parametric method that can be used for both classification and regression.
2. Lazy learning: no model is learned from the training data in advance; the learning process is postponed until a prediction is requested for a new instance.
• Let us say we have plotted the data points from our training set in a 2D feature space. Say we have a total of 6 data points (3 red and 3 blue).
• Red data points belong to ‘class1’ and blue data points belong to
‘class2’.
• The yellow data point in the feature space represents the new point for which a class is to be predicted. Obviously, we say it belongs to ‘class1’ (red points).
• Why?
Because its nearest neighbors belong to that class!
Then:
(i) For classification: the output is a class membership; the object is assigned to the class most common among its k nearest neighbors (majority vote).
(ii) For regression: the output is the property value for the object, for example the mean or median of the values of its k nearest neighbors.
KNN classification approach
• Classified by “majority vote” of its neighbors’ classes.
4. Sort the collection of distances and indices in ascending order by distance.
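A compact sketch of k-NN classification by majority vote with Euclidean distance; the toy training points (mirroring the red/blue example above) are hypothetical:

```python
import numpy as np
from collections import Counter

# Hypothetical training set: 3 points of 'class1' and 3 of 'class2' in 2-D
X_train = np.array([[1.0, 1.0], [1.5, 1.8], [1.2, 0.8],    # class1 ("red")
                    [4.0, 4.2], [4.5, 3.8], [3.8, 4.5]])   # class2 ("blue")
y_train = np.array(["class1", "class1", "class1", "class2", "class2", "class2"])

def knn_predict(x_new, k=3):
    # Compute the Euclidean distance from x_new to every training example.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Sort the distances (and their indices) in ascending order and keep the k nearest.
    nearest = np.argsort(dists)[:k]
    # Classify by majority vote among the k nearest neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([1.4, 1.2])))   # expected: class1
```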
To select the K that’s right for your data, we run the KNN algorithm
several times with different values of K and choose the K that reduces
the number of errors we encounter while maintaining the algorithm’s
ability to accurately make predictions on test data.
Choosing the right value for K
1. Using error curves: The figure below shows error curves for different
values of K for training and test data.
Choosing the right value for K
1. Using error curves: at low K values the model overfits the data (high variance), so the test error is high while the training error is low. At K = 1 the training error is always zero, because the nearest neighbor of each training point is the point itself. Thus, although training error is low, test error is high at low K values; this is overfitting. As we increase K, the test error is reduced.
Choosing the right value for K
But after a certain K value, bias (underfitting) is introduced and the test error rises again. So the test error is initially high (due to variance), then decreases and stabilizes, and with a further increase in K it increases again (due to bias). The K value at which the test error stabilizes at a low value is taken as the optimal K.
• From the above error curve we can choose K=8 for our KNN algorithm
implementation.
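A sketch of choosing K by comparing training and test error across several K values; the synthetic data and train/test split are illustrative assumptions, so the best K here need not be 8:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)

# Synthetic 2-class data in 2-D (illustration only)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
idx = rng.permutation(len(y))
train, test = idx[:140], idx[140:]

def knn_error(K, X_ref, y_ref, X_eval, y_eval):
    errors = 0
    for x, target in zip(X_eval, y_eval):
        nearest = np.argsort(np.linalg.norm(X_ref - x, axis=1))[:K]
        pred = Counter(y_ref[nearest]).most_common(1)[0][0]
        errors += int(pred != target)
    return errors / len(y_eval)

for K in [1, 3, 5, 8, 15, 30]:
    tr = knn_error(K, X[train], y[train], X[train], y[train])   # training error
    te = knn_error(K, X[train], y[train], X[test], y[test])     # test error
    print(f"K={K:2d}  train error={tr:.2f}  test error={te:.2f}")
```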
Choosing the right value for K
1.Logistic Regression
2.LDA
3.QDA
4.KNN
A Comparison of Classification Methods
• Though motivations differ, the logistic regression and LDA methods are
closely connected.
• Consider the two-class setting with p = 1 predictor, and let p1(x) and p2(x)
= 1−p1(x) be the probabilities that the observation X = x belongs to class 1
and class 2, respectively.
• Hence, both logistic regression and LDA produce linear decision boundaries.
• The only difference between the two approaches lies in the fact that β0 and
β1 are estimated using maximum likelihood, whereas c0 and c1 are
computed using the estimated mean and variance from a normal
distribution.
• This same connection between LDA and logistic regression also holds for
multidimensional data with p > 1.
A Comparison of Classification Methods
• KNN takes a completely different approach from the classifiers like LR,
LDA and QDA. In order to make a prediction for an observation X = x,
the K training observations that are closest to x are identified. Then X
is assigned to the class to which the plurality of these observations
belong.
• On the other hand, KNN does not tell us which predictors are important; it does not provide parameter estimates.
A Comparison of Classification Methods
Format of report:
1. Title,
2. Code,
3. Result (performance measures),
4. Comparison graph
The dashed lines are the Bayes decision boundaries. In other words, they represent the set of values $x$ for which $\delta_k(x) = \delta_l(x)$; i.e.
$$x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k = x^T \Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T \Sigma^{-1}\mu_l \quad \text{for } k \ne l.$$
The log term from $\delta_k(x)$ has disappeared because each of the three classes has the same number of training observations; i.e. $P(C_k)$ is the same for each class.