
ML- Module-2

Dr. Debanjali Bhattacharya


Syllabus and Reference

MODULE 2 - Linear Regression: Simple Linear Regression, Multiple Linear Regression. Classification: Logistic Regression, Linear Discriminant Analysis (LDA): Bayes Theorem for classification, Quadratic Discriminant Analysis (QDA), KNN Method, Comparison of classification methods.

Reference:
• James G., Witten D., Hastie T., Tibshirani R. “An Introduction to
Statistical Learning with Applications in R”, Springer Texts in
Statistics.
• Machine learning by Andrew Ng.
Linear Regression

• Uses the least-squares approach to fit the model to the dataset.

• Linear regression with one variable –> univariate (simple) linear regression.

• Linear regression with multiple variables –> multiple linear regression / multivariate linear regression.
Linear Regression

• Simple linear regression: a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X.

• It assumes that there is approximately a linear relationship between X and Y.
Linear Regression

Mathematically, we can write this linear relationship as Y ≈ β0 + β1·X.

• β0 and β1 are two unknown constants that represent the intercept and slope terms in the linear model.

• Together, β0 (intercept) and β1 (slope) are known as the model coefficients or parameters.
Linear Regression

• Once we have used our training data to produce estimates β̂0 and β̂1 for the model coefficients, we can predict future output by computing ŷ = β̂0 + β̂1·x, where ŷ indicates a prediction of Y on the basis of X = x.

• The hat symbol (ˆ) is used to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.
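As an illustration, the least-squares estimates and the resulting prediction can be computed directly; a minimal Python/NumPy sketch (the toy data here is made up purely for illustration):

```python
import numpy as np

# Toy data, made up purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 5.8, 8.3, 9.9])

# Closed-form least-squares estimates for y ~ beta0 + beta1*x:
#   beta1_hat = sum((x - x_bar)*(y - y_bar)) / sum((x - x_bar)^2)
#   beta0_hat = y_bar - beta1_hat * x_bar
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Prediction y_hat for a new value X = x0
x0 = 6.0
y_hat = beta0_hat + beta1_hat * x0
print(beta0_hat, beta1_hat, y_hat)
```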
Simulated data set. Left: The red line represents the true relationship, f(X) = 3X+2,
which is known as the population regression line. The blue line is the least squares
line; it is the least squares estimate for f(X) based on the observed data, shown in
black.
Right: The population regression line is again shown in red, and the least squares line
in dark blue. In light blue, ten least squares lines are shown, each computed on the
basis of a separate random set of observations. Each least squares line is different, but
on average, the least squares lines are quite close to the population regression line.

Goal: To find the straight line that best fits the data

 The regression line is the line that “best fits” the data.

 Best fit is defined by the method of least squares.


Linear Regression
• How do we estimate the coefficients?
Linear regression with one variable

Regression: predict a real-valued output


Linear regression with one variable

(x, y) – a single training example

(x^(i), y^(i)) – the i’th training example


Linear regression with one variable

Input x → hypothesis h → estimated value of y


Linear regression with one variable
How to represent ‘h’ ?

h_θ(x) = θ_0 + θ_1·x

(a straight line, with x on the horizontal axis and y on the vertical axis)
Linear regression with one variable

Examples of h(x) for different choices of (θ_0, θ_1):
h(x) = 0.5*x + 1
h(x) = 0.5*x + 0
h(x) = 0*x + 1.5
Linear regression with one variable

Hypothesis: h_θ(x) = θ_0 + θ_1·x

Cost function (squared error function):

J(θ_0, θ_1) = (1/2m) · Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) )^2


Linear regression with one variable:
Cost function

Worked example (with θ_0 fixed at 0): for θ_1 = 1, J(1) = (1/2m)·Σ_{i=1}^{m}( h_θ(x^(i)) − y^(i) )^2 = 0. What is J(0.5)?
Linear regression with one variable:
Cost function

For θ_1 = 0.5, J(0.5) ≈ 0.58. What is J(0)?
Linear regression with one variable:
Cost function

For θ_1 = 0, J(0) ≈ 2.3.
Linear regression with one variable:
Cost function

Plotting J(θ_1) for different values of θ_1 gives a bowl-shaped (convex) curve, minimized here at θ_1 = 1.
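The cost values quoted above are consistent with a toy training set {(1, 1), (2, 2), (3, 3)}; a minimal sketch that reproduces them under that assumption:

```python
import numpy as np

# Toy training set consistent with the quoted cost values (assumption).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x) - y)^2)."""
    m = len(y)
    h = theta0 + theta1 * x
    return np.sum((h - y) ** 2) / (2 * m)

# With theta0 fixed at 0:
print(compute_cost(0.0, 1.0, x, y))   # 0.0
print(compute_cost(0.0, 0.5, x, y))   # ~0.58
print(compute_cost(0.0, 0.0, x, y))   # ~2.33
```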


Linear regression with one variable:
Cost function

Example hypothesis: θ_0 = 50, θ_1 = 0.06.

• When we have a single parameter θ_1, the cost function J(θ_1) is bowl shaped.

• How will it look if we have multiple parameters θ_0, θ_1?
Linear regression with one variable:
Cost function

Contour plot: with two parameters (θ_0, θ_1), J(θ_0, θ_1) is a bowl-shaped 3-D surface, often visualized as a contour plot in which every point on the same contour has the same value of J.

[Figures: contour plots of J(θ_0, θ_1) paired with the corresponding hypotheses, e.g. h(x) = -0.15*x + 800 (θ_0 = 800, θ_1 = -0.15) and h(x) = 0*x + 360 (θ_0 = 360, θ_1 = 0); hypotheses far from the centre of the contours fit the data poorly, while the centre corresponds to the best-fitting line.]
Cost function:
Gradient descent algorithm

Outline:
• Start with some initial values of θ_0 and θ_1 (e.g. θ_0 = 0, θ_1 = 0).
• Keep changing θ_0 and θ_1 to reduce J(θ_0, θ_1), until we hopefully converge to a minimum.

Update rule (repeat until convergence, updating θ_0 and θ_1 simultaneously):

θ_j := θ_j − α · (∂/∂θ_j) J(θ_0, θ_1),  for j = 0 and j = 1

Depending on the starting point, gradient descent may converge to a different local minimum.
In the update rule, (∂/∂θ_j) J is the derivative term and α is the learning rate.

[Figures: J(θ_1) curves illustrating how the sign of the derivative term moves θ_1 toward the minimum.]
• Gradient descent can converge to a local minimum, even with the learning rate α fixed.

• As we approach a local minimum, gradient descent automatically takes smaller steps (the derivative term shrinks), so there is no need to decrease α over time.
Gradient descent algorithm for Linear regression (with one variable)

Repeat until convergence {
θ_0 := θ_0 − α · (∂/∂θ_0) J(θ_0, θ_1) = θ_0 − α · (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) )
θ_1 := θ_1 − α · (∂/∂θ_1) J(θ_0, θ_1) = θ_1 − α · (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) · x^(i)
}
Update θ_0 and θ_1 simultaneously. For the linear-regression cost function, J is convex, so gradient descent converges to the global minimum (the only local minimum).
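A minimal sketch of this update rule in Python (toy data assumed; not the lecture’s own code):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent for h(x) = theta0 + theta1*x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0              # start with some initial values
    for _ in range(n_iters):
        h = theta0 + theta1 * x            # current predictions
        grad0 = np.sum(h - y) / m          # dJ/dtheta0
        grad1 = np.sum((h - y) * x) / m    # dJ/dtheta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(x, y))              # approaches (theta0 ~ 0, theta1 ~ 1)
```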
Multivariate Linear Regression

 Linear regression with multiple variables (multiple features): multiple linear regression

Linear Regression with multiple variables

Notation (example with n = 4 features X1, X2, X3, X4 and response Y):
• m = number of training examples; n = number of features.
• x^(i) = the feature vector of the i’th training example; x^(i)_j = the value of feature j in the i’th example.
• For instance, x^(3) = [1534; 3; 2; 30] ∈ R^4. What is x^(3)_2 ?
Linear Regression with multiple variables

Hypothesis: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n

With the convention x_0 = 1, this can be written compactly as h_θ(x) = θ^T·x, where
θ = [θ_0 θ_1 … θ_n]^T and x = [x_0 x_1 … x_n]^T.
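A small illustration of the vectorized hypothesis h_θ(x) = θ^T·x; the feature and parameter values below are made up:

```python
import numpy as np

# Feature vector with the convention x0 = 1, plus n = 4 features (values made up).
x = np.array([1.0, 1534.0, 3.0, 2.0, 30.0])      # [x0, x1, x2, x3, x4]
theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])   # [theta0, ..., theta4], illustrative only

# Hypothesis h_theta(x) = theta^T x
h = theta @ x
print(h)
```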
Gradient descent for multiple variables

With the convention x_0 = 1, the cost function is

J(θ) = (1/2m) · Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) )^2

and the update rule (repeat until convergence, updating all θ_j simultaneously) is

θ_j := θ_j − α · (∂/∂θ_j) J(θ) = θ_j − α · (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) · x_j^(i),  for j = 0, 1, …, n
J(θ) should decrease after every iteration; plot J(θ) against the number of iterations to monitor convergence.

• Declare convergence if J(θ) decreases by less than some small threshold in one iteration.

• For sufficiently small α, J(θ) should decrease on every iteration.

• But if α is too small, gradient descent can be slow to converge.

• If α is too large, J(θ) may not decrease on each iteration and may not converge.

• Choose the value of α carefully (try values such as α = 0.001, 0.003, 0.01, 0.03, 0.3, 0.5).
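A minimal sketch (made-up data) of monitoring J(θ) across iterations for several candidate values of α:

```python
import numpy as np

def run_gd(X, y, alpha, n_iters=200):
    """Gradient descent on J(theta); returns theta and the history of J."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m
        theta = theta - alpha * grad
        history.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return theta, history

# Made-up data with an intercept column (x0 = 1).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

for alpha in [0.001, 0.003, 0.01, 0.03, 0.3, 0.5]:
    _, hist = run_gd(X, y, alpha)
    decreasing = all(b <= a for a, b in zip(hist, hist[1:]))
    print(f"alpha={alpha}: final J={hist[-1]:.4f}, J decreased every iteration: {decreasing}")
```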
Non-linear regression
(Polynomial regression)
House Price Prediction

A hypothesis that is linear in the original features, e.g. h_θ(X) = θ_0 + θ_1·X_1 + θ_2·X_2, may not fit the data well. We can instead construct new features from the original ones and fit, for example,

h_θ(X) = θ_0 + θ_1·X_1 + θ_2·X_2 + θ_3·X_3

Polynomial features: powers and products of the original features (e.g. x, x^2, x^3) can be included as extra features, so that a model that is linear in the parameters represents a non-linear function of the original input.
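A minimal sketch of constructing polynomial features and fitting a cubic hypothesis with a least-squares solver; the house-price data below is invented for illustration:

```python
import numpy as np

# Made-up house sizes (single original feature x) and prices.
x = np.array([50.0, 80.0, 100.0, 150.0, 200.0])
y = np.array([150.0, 210.0, 250.0, 320.0, 360.0])

# Polynomial features x, x^2, x^3 (with gradient descent these would need
# feature scaling; here a direct least-squares solver is used instead).
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# h_theta(X) = theta0 + theta1*x + theta2*x^2 + theta3*x^3
print(theta)
print(X @ theta)   # fitted values
```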
Regularization
Bias & Variance
Logistic regression
Logistic Regression
• Logistic regression: a method used to predict qualitative responses (i.e. for classification).

• Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a particular category, or class.

• Often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, and use these probabilities as the basis for making the classification. In this sense they also behave like regression methods.
• Some widely used techniques to predict a qualitative or categorical response are:

 Logistic Regression
 Linear discriminant analysis (LDA)
 K-nearest neighbor (KNN)
Logistic Regression

[Figure: tumour-size example with a linear-regression fit and a classification boundary; two examples of malignant tumours are misclassified.]

Applying linear regression to a classification problem is often not a great idea!
Using linear regression for classification

• To avoid this problem, we must model the probability p(X) = P(y = 1 | X) using a function that gives outputs between 0 and 1 for all values of X.
Logistic regression model

We want 0 ≤ h_θ(x) ≤ 1.

Logistic (sigmoid) model:

h_θ(x) = 1 / (1 + e^(−θ^T·x)) = e^(θ^T·x) / (1 + e^(θ^T·x))
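A minimal sketch of the sigmoid hypothesis (parameter values are illustrative only):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + e^(-z)), always strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(theta @ x)

theta = np.array([-1.0, 0.5])   # illustrative parameter values
x = np.array([1.0, 4.0])        # x0 = 1 plus one feature
print(h(theta, x))              # ~0.73, interpreted as P(y = 1 | x)
```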
Logistic regression: Decision boundary

Predict y = 1 when h_θ(x) ≥ 0.5, which happens exactly when θ^T·x ≥ 0; predict y = 0 otherwise. The set of points where θ^T·x = 0 is the decision boundary.

Non-linear decision boundary: if polynomial features are included, the decision boundary θ^T·x = 0 can be non-linear.
Logistic regression: Cost function

The hypothesis h_θ(x) = e^(θ^T·x) / (1 + e^(θ^T·x)) is a non-linear function of θ^T·x.

Rearranging gives the odds:

h_θ(x) / (1 − h_θ(x)) = e^(θ^T·x)

and taking the logarithm:

ln[ h_θ(x) / (1 − h_θ(x)) ] = θ^T·x

The left-hand side is called the log-odds or logit. We see that the logistic regression model has a logit that is linear in X.
Logistic regression: Cost function
Linear regression uses the squared-error cost J(θ) = (1/2m)·Σ ( h_θ(x^(i)) − y^(i) )^2.

For logistic regression we want a cost function that we can reliably minimize. Using the squared-error cost with the non-linear sigmoid hypothesis makes J(θ) non-convex, so gradient descent is not guaranteed to converge to the global minimum.

In logistic regression we therefore define a different cost function, which is convex, so that an optimization algorithm like gradient descent can find the global minimum.

Cost function for a single training example:

Cost(h_θ(x), y) = −ln( h_θ(x) )      if y = 1
Cost(h_θ(x), y) = −ln( 1 − h_θ(x) )  if y = 0

This cost function is convex, so we can apply an optimization algorithm like gradient descent.

Putting the two cases together:

J(θ) = −(1/m) · Σ_{i=1}^{m} [ y^(i)·ln( h_θ(x^(i)) ) + (1 − y^(i))·ln( 1 − h_θ(x^(i)) ) ]

To get θ, minimize J(θ), for example with gradient descent:
θ_j := θ_j − α · (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) )·x_j^(i)
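A minimal sketch of this cost function and its gradient-descent updates, on made-up data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Convex logistic-regression cost J(theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

def gradient_step(theta, X, y, alpha):
    """One simultaneous gradient-descent update of all theta_j."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    return theta - alpha * grad

# Made-up 1-D data (x0 = 1 in the first column); mostly class 0 for small x.
X = np.column_stack([np.ones(6), np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)
print(theta, cost(theta, X, y))
```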
Logistic regression: Multi-class classification (one-vs.-all)

[Figures: examples of multi-class problems with classes y = 1, 2, 3, 4 and y = 1, 2, 3.]

One-vs.-all: train a separate binary logistic regression classifier h_θ^(k)(x) for each class k, treating class k as the positive class and all other classes as the negative class. To classify a new input x, pick the class k that maximizes h_θ^(k)(x).
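A minimal one-vs.-all sketch using scikit-learn’s binary logistic regression on made-up data (scikit-learn can also handle multi-class problems directly; the explicit loop is written out to mirror the description above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2-D data with three classes (0, 1, 2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in range(3)])
y = np.repeat([0, 1, 2], 20)

# One-vs.-all: fit one binary classifier per class k (class k vs. the rest).
classifiers = {}
for k in np.unique(y):
    clf = LogisticRegression()
    clf.fit(X, (y == k).astype(int))
    classifiers[k] = clf

# Classify a new point by the class whose classifier is most confident.
x_new = np.array([[1.1, 0.9]])
scores = {k: clf.predict_proba(x_new)[0, 1] for k, clf in classifiers.items()}
print(max(scores, key=scores.get), scores)
```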
Linear Discriminant Analysis
(LDA)
Linear Discriminant Analysis (LDA)

• Logistic regression involves directly modeling the probability P(Y = k | X = x) using the logistic sigmoid function, for the case of two response classes.

• In statistical terms, we model the conditional distribution of the response Y, given the predictor(s) X.

• We now consider an alternative and less direct approach to estimating these probabilities.
Linear Discriminant Analysis (LDA)
• Alternative approach:
1. Model the distribution of the predictors X separately in each of the response classes (i.e. given Y).
2. Then use Bayes’ theorem to flip these around into estimates for P(Y = k | X = x).

• When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression.

• Why do we need another method, when we have logistic regression?
Why do we need another method, when we have logistic regression?

There are several reasons:

1. When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable. Logistic regression has the peculiar behaviour that if a feature separates the classes perfectly, the coefficients go to infinity; it works better when the classes are not perfectly separable. LDA does not suffer from this problem.

2. If n is small (a small dataset) and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.

3. LDA is popular when we have more than two response classes.

Linear Discriminant Analysis (LDA)
Bayes’ Classifier:
• The test error rate in a classification setting is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values.

• That is, we simply assign a test observation with predictor vector x to the class C_K for which P(C_K | X = x) is largest.

• P(C_K | X = x) is a conditional probability: the probability that the observation belongs to class C_K, conditional on the observed predictor vector x. This very simple classifier is called the Bayes classifier.
Linear Discriminant Analysis (LDA)
Bayes’ Theorem:
• Using Bayes’ theorem, the conditional (posterior) probability can be written as

(i) P(C_K | X) = P(C_K)·P(X | C_K) / P(X)

where n is the number of classes, K = 1, 2, …, n.

• By the law of total probability,

(ii) P(X) = Σ_{k=1}^{n} P(X | C_k)·P(C_k)

• Therefore,

P(C_K | X) = P(C_K)·P(X | C_K) / Σ_{k=1}^{n} P(X | C_k)·P(C_k)
Linear Discriminant Analysis (LDA)
Bayes’ Classifier:
• Bayes’ theorem states that

P(C_K | X) = P(C_K)·P(X | C_K) / Σ_{k=1}^{n} P(X | C_k)·P(C_k)

Here, P(C_K) denotes the prior probability that an observation belongs to the K’th class.

• In a two-class problem where there are only two possible response values, say class-1 or class-2, the Bayes classifier corresponds to predicting class-1 if P(C_1 | X = x) > 0.5, and class-2 otherwise.
Linear Discriminant Analysis (LDA)
Naïve Bayes’ Classifier:
• In ML, Naïve Bayes classifiers are a family of simple probabilistic classifiers based on Bayes’ theorem with strong (naïve) independence assumptions between features, i.e. each feature is assumed to be conditionally independent of every other feature given the category or class C.

• This means the class-conditional density factorizes:

P(X | C) = P(x_1, x_2, …, x_p | C) = Π_{j=1}^{p} P(x_j | C)

• The joint distribution of the sample is equal to the product of the conditional probabilities because x_1, …, x_p are assumed independent of each other given the class.
Linear Discriminant Analysis (LDA)
Using Bayes’ Theorem for Classification

• We know that the Bayes classifier assigns an observation to the class for which P(C_K | X = x) is largest.

• Therefore, if we can find a way to estimate P(X | C_K), then we can develop a classifier that approximates the Bayes classifier.

• Such an approach is LDA.

• LDA (a Gaussian Bayes classifier) is a statistical method for finding the linear combination of features that separates two or more classes of objects.
LDA for p = 1

• Assume that p = 1, that is, we have only one predictor.

• We would like to obtain an estimate for P(X | C_K) that we can plug into Bayes’ theorem in order to estimate P(C_K | X).

• We will then classify an observation to the class for which P(C_K | X) is greatest. In order to estimate P(X | C_K), we will first make some assumptions about its form.
LDA for p = 1

We assume that P(X | C_K) is normal or Gaussian. In the one-dimensional setting, the normal PDF takes the form

P(x | C_K) = (1 / (σ_K·√(2π))) · exp( −(x − μ_K)^2 / (2·σ_K^2) )

where μ_K and σ_K^2 are the mean and variance parameters for the K’th class.

For now, let us further assume that σ_1^2 = … = σ_K^2: that is, there is a shared variance term across all K classes, which for simplicity we can denote by σ^2.
LDA for p = 1
Substituting the normal density P(x | C_K) into Bayes’ theorem (stated earlier) gives

P(C_K | X = x) = P(C_K)·(1/(σ√(2π)))·exp( −(x − μ_K)^2 / (2σ^2) ) / Σ_{k=1}^{n} P(C_k)·(1/(σ√(2π)))·exp( −(x − μ_k)^2 / (2σ^2) )

Here, P(C_K) denotes the prior probability that an observation belongs to the K’th class, and P(C_K | X = x) denotes the posterior probability.
LDA for p = 1
• The Bayes classifier involves assigning an observation X = x to the class for which the posterior probability P(C_K | X = x) (obtained from the equation above) is largest.

• Taking the log of the above equation and discarding the terms that do not depend on the class, it can be shown that this is equivalent to assigning the observation to the class for which the discriminant score

δ_K(x) = x·μ_K/σ^2 − μ_K^2/(2σ^2) + ln[ P(C_K) ]

is largest.
LDA for p = 1
• Equivalently, for two classes we can look at the log ratio of their posterior probabilities:

ln[ P(C_1 | X = x) / P(C_2 | X = x) ] = δ_1(x) − δ_2(x)

and assign the observation to class 1 when this ratio is positive.
LDA for p = 1

• For instance, if K = 2 and the prior probability that an observation


belongs to a specific class is , then it can be shown that the Bayes
classifier assigns an observation to class 1 if
, and to class 2 otherwise.

(Prove it !...)
LDA for p = 1

• For instance, if K = 2 and the prior probability that an observation


belongs to a specific class is , then it can be shown that the Bayes
classifier assigns an observation to class 1 if
, and to class 2 otherwise.

• In this case, the Bayes decision boundary corresponds to the point


where
LDA for p = 1
• The two normal density functions that are displayed represent two distinct classes.

• The mean and variance parameters for the two density functions are as shown in the figure (equal variances, with class means placed symmetrically about zero).

• The two densities overlap, and so given that X = x, there is some uncertainty about the class to which the observation belongs. If we assume that an observation is equally likely to come from either class, that is P(C_1) = P(C_2), then from the equation above we see that the Bayes classifier assigns the observation to class 1 if x < 0 and to class 2 otherwise.
LDA for p = 1
• In practice, even if we are quite certain of our assumption that X is drawn from a Gaussian distribution within each class, we still have to estimate the parameters μ_1, …, μ_K, σ^2 and the priors P(C_1), …, P(C_K).

• LDA approximates the Bayes classifier by using the following estimates:

μ̂_K = (1/n_K) · Σ_{i: y_i = K} x_i

σ̂^2 = (1/(n − K)) · Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)^2

P̂(C_K) = n_K / n

where n is the total number of training observations and n_K is the number of training observations in the K’th class.

• Sometimes we have knowledge of the class membership probabilities P(C_1), …, P(C_K), which can be used directly. In the absence of any additional information, LDA estimates P(C_K) using the proportion of the training observations that belong to the K’th class.
LDA for p = 1
The LDA classifier plugs all these estimates into the discriminant function and assigns an observation X = x to the class for which

δ̂_K(x) = x·(μ̂_K / σ̂^2) − μ̂_K^2 / (2σ̂^2) + ln[ P̂(C_K) ]

is maximum.

• The word ‘linear’ in the classifier’s name stems from the fact that the discriminant functions δ̂_K(x) above are linear functions of x.

• Summary (p = 1): The LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance σ^2, and plugging estimates for these parameters into the Bayes classifier.
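A minimal sketch of the p = 1 LDA estimates and discriminant rule described above (simulated data; the function names are ad hoc):

```python
import numpy as np

def lda_1d_fit(x, y):
    """Estimate the p = 1 LDA parameters: class means, shared variance, priors."""
    classes = np.unique(y)
    n, K = len(y), len(classes)
    mu = {k: x[y == k].mean() for k in classes}
    prior = {k: np.mean(y == k) for k in classes}                          # n_k / n
    sigma2 = sum(np.sum((x[y == k] - mu[k]) ** 2) for k in classes) / (n - K)
    return mu, sigma2, prior

def lda_1d_predict(x0, mu, sigma2, prior):
    """Assign x0 to the class with the largest discriminant delta_k(x0)."""
    delta = {k: x0 * mu[k] / sigma2 - mu[k] ** 2 / (2 * sigma2) + np.log(prior[k])
             for k in mu}
    return max(delta, key=delta.get)

# Simulated 1-D data from two classes with a shared variance.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1.25, 1.0, 50), rng.normal(1.25, 1.0, 50)])
y = np.repeat([1, 2], 50)

mu, sigma2, prior = lda_1d_fit(x, y)
print(lda_1d_predict(-0.4, mu, sigma2, prior))   # expected: class 1 (x < 0)
```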
LDA for p > 1
• We now extend the LDA classifier to the case of multiple
predictors.

• To do this, we will assume that X = (X_1, …, X_p) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific multivariate mean vector and a common covariance matrix Σ.
Multivariate Gaussian distribution

• The multivariate Gaussian distribution assumes that each individual predictor


follows a one-dimensional normal distribution, with some correlation between
each pair of predictors.

• Two examples of multivariate Gaussian distributions with p = 2 are shown in


the following Figure:
Multivariate Gaussian distribution
• The left-hand panel of Figure illustrates an example in which Var(X1) = Var(X2)
and Corr(X1,X2) = 0; this surface has a characteristic bell shape. The base of bell
will have circular shape.

• However, the bell shape will be distorted if the predictors are correlated or
have unequal variances, as is illustrated in the right-hand panel of Figure. In this
situation, the base of the bell will have an elliptical, rather than circular shape
Multivariate Gaussian distribution
• To indicate that a p-dimensional random variable X has a multivariate
Gaussian distribution, we write X ∼ N(μ, Σ). Here E(X) = μ is the mean of
X (a vector with p components), and Cov(X) = Σ is the p × p covariance
matrix of X.

• Formally, the multivariate Gaussian density is defined as

f(x) = (1 / ((2π)^(p/2)·|Σ|^(1/2))) · exp( −(1/2)·(x − μ)^T·Σ^(−1)·(x − μ) )
Covariance Matrix
In the top row, all bivariate Gaussian distributions have ρ=0 and look like a circle for
standard deviations of equal size. The top middle plot is stretched along X2, giving it
an elliptical shape. The middle and last row show how the distribution changes for
negative (ρ=−0.3) and positive (ρ=0.7) correlations
LDA for p > 1
• In the case of p > 1 predictors, the LDA classifier assumes that the observations in the k’th class are drawn from a multivariate Gaussian distribution N(μ_k, Σ), where μ_k is a class-specific mean vector, and Σ is a covariance matrix that is common to all K classes.

• Plugging the density function for the k’th class into Bayes’ theorem and performing a little algebra reveals that the Bayes classifier assigns an observation X = x to the class for which

δ_K(x) = x^T·Σ^(−1)·μ_K − (1/2)·μ_K^T·Σ^(−1)·μ_K + ln[ P(C_K) ]

is largest.

• This is the vector/matrix version of the discriminant function that we have seen in LDA for p = 1.
LDA for p > 1
Note that there are three lines representing the Bayes decision boundaries
because there are three pairs of classes among the three classes: one Bayes
decision boundary separates class 1 from class 2, one separates class 1 from class
3, and one separates class 2 from class 3. These three Bayes decision boundaries
divide the predictor space into three regions.

The Bayes classifier will classify an observation according to the region in which it
is located.
LDA for p > 1
• Once again, we need to estimate the unknown parameters μ_1, …, μ_K and Σ; the formulas are similar to those used in the 1-D case (LDA with p = 1).

• To assign a new observation X = x, LDA plugs these estimates into the following equation and classifies x to the class for which δ̂_K(x) is largest:

δ_K(x) = x^T·Σ^(−1)·μ_K − (1/2)·μ_K^T·Σ^(−1)·μ_K + ln[ P(C_K) ]

• Note that δ_K(x) in the above equation is a linear function of x; i.e. the LDA decision rule depends on x only through a linear combination of its elements. This is the reason for the word linear in LDA.
20 observations drawn from each of the three classes are displayed, and the
resulting LDA decision boundaries are shown as solid black lines. Overall, the LDA
decision boundaries are pretty close to the Bayes decision boundaries, shown
again as dashed lines. The test error rates for the Bayes and LDA classifiers are
0.0746 and 0.0770, respectively. This indicates that LDA is performing well on this
data.
The ROC (receiver operating characteristic) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
ROC Curves
• The overall performance of a classifier is given by the area under the ROC curve, or AUC.

• The true positive rate (TPR), also called the sensitivity, is plotted along the y-axis, and the false positive rate (FPR), equal to 1 − specificity, is plotted along the x-axis.

• An ideal ROC curve will hug the top left corner, so the larger the AUC the better the classifier.

• For this data the AUC is 0.95, which is close to the maximum of 1, so it would be considered very good. A classifier that performs no better than chance is expected to have an AUC of about 0.5 (when evaluated on an independent test set not used in model training).

• ROC curves are useful for comparing different classifiers.
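A minimal sketch of computing an ROC curve and AUC with scikit-learn on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Made-up binary classification data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(1.5, 1.0, (200, 2))])
y = np.repeat([0, 1], 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # predicted P(y = 1 | x)

fpr, tpr, thresholds = roc_curve(y_te, scores)  # x-axis: FPR, y-axis: TPR
print("AUC on the held-out test set:", roc_auc_score(y_te, scores))
```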


Quadratic Discriminant Analysis (QDA)

• As we have discussed, LDA (for p > 1) assumes that the


observations within each class are drawn from a multivariate
Gaussian distribution with a class specific mean vector and a
covariance matrix which is common to all K classes.

• Quadratic discriminant analysis (QDA) provides an alternative


classification approach.
Quadratic Discriminant Analysis (QDA)

• Similarity with LDA:

Like LDA, the QDA classifier results from assuming that the
observations from each class are drawn from a Gaussian distribution,
and plugging estimates for the parameters into Bayes’ theorem in
order to perform prediction.

• Dissimilarity with LDA:

Unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the k’th class is of the form X ∼ N(μ_k, Σ_k), where Σ_k is a covariance matrix for the k’th class.
Quadratic Discriminant Analysis (QDA)
• Under this assumption, the Bayes classifier assigns an observation X = x to the class for which

δ_K(x) = −(1/2)·(x − μ_K)^T·Σ_K^(−1)·(x − μ_K) − (1/2)·ln|Σ_K| + ln(π_K)

is largest (note: π_K is equivalent to P(C_K)).

• So the QDA classifier involves plugging in estimates for Σ_K, μ_K and π_K, and then assigning an observation X = x to the class for which this quantity is largest.

• Unlike in LDA, in QDA the quantity x appears as a quadratic function in δ_K(x). This is why it is called QDA.
LDA Vs. QDA

Q. Why does it matter whether or not we assume that the K classes share
a common covariance matrix? In other words, why would one prefer LDA
to QDA, or vice-versa?

• The answer lies in the bias-variance trade-off.

• When there are p predictors, estimating a covariance matrix requires estimating p(p+1)/2 parameters in the case of LDA. QDA estimates a separate covariance matrix for each class, for a total of K·p(p+1)/2 parameters.

• With 50 predictors this is some multiple of 1,275 in the case of QDA, which is a lot of parameters compared to LDA.
LDA Vs. QDA
• Instead, by assuming that the K classes share a common covariance matrix, the LDA model becomes linear in x, which means there are K·p linear coefficients to estimate.

• Consequently, LDA is a much less flexible classifier than QDA, and so has substantially lower variance. This can potentially lead to improved prediction performance.

• But there is a trade-off: if the assumption that the K classes share a common covariance matrix is badly off, LDA can suffer from high bias and give poor performance.

• LDA tends to be a better than QDA if there are relatively few training
observations and so reducing variance is crucial.

• In contrast, QDA is recommended if the training set is very large, so that the
variance of the classifier is not a major concern.
LDA Vs. QDA
• Figure illustrates the performances of LDA and QDA in two scenarios. In the left-
hand panel, the two Gaussian classes have a common correlation of 0.7 between
X1 and X2. As a result, the Bayes decision boundary is linear and is accurately
approximated by the LDA decision boundary. The QDA decision boundary is
inferior, because it suffers from higher variance without a corresponding decrease
in bias.

• In contrast, the right-hand panel displays a situation in which the orange class has
a correlation of 0.7 between the variables and the blue class has a correlation of
−0.7. Now the Bayes decision boundary is quadratic, and so QDA more accurately
approximates this Bayes decision boundary than does LDA.
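A minimal sketch of the second scenario (class correlations 0.7 and −0.7) using scikit-learn’s LDA and QDA; the sample sizes and class means are assumptions chosen for illustration, and training accuracy is used only as a quick comparison:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

def sample(mean, rho, n=500):
    """Draw from a bivariate Gaussian with unit variances and correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return rng.multivariate_normal(mean, cov, size=n)

# One class with correlation 0.7, the other with -0.7: the true boundary is quadratic.
X = np.vstack([sample([0.0, 0.0], 0.7), sample([1.0, 1.0], -0.7)])
y = np.repeat([0, 1], 500)

for model in [LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()]:
    acc = model.fit(X, y).score(X, y)
    print(type(model).__name__, "training accuracy:", round(acc, 3))
```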
K-nearest neighbor (KNN)
• The k-nearest neighbours algorithm (k-NN) is a non-parametric method that can be used for both classification and regression.

• Simple, but a very powerful classification algorithm

• Classifies based on a similarity measure

• Whenever we have a new data to classify, we find its K nearest


neighbors from the training data

• All instances correspond to points in an n-dimensional Euclidean


space
K-nearest neighbor (KNN)
The algorithm’s learning is:

1. Instance-based learning: Here we do not learn weights from


training data to predict output (as in model-based algorithms)
but use entire training instances to predict output for unseen
data.

2. Lazy learning: the model is not learned from the training data in advance; the learning process is postponed until a prediction is requested for a new instance.

3. Non-parametric: in KNN, there is no predefined form of the mapping function.
K-nearest neighbor (KNN)
How does KNN works?

• Let us say we have plotted data points from our training set on
2D feature space. Say, we have a total of 6 data points (3 red
and 3 blue).

• Red data points belong to ‘class1’ and blue data points belong to
‘class2’.
• Yellow data point in a feature space represents the new point for
which a class is to be predicted. Obviously, we say it belongs to
‘class1’ (red points)

• Why?
Because its nearest neighbors belong to that class!

This is the principle behind K Nearest Neighbors!


K-nearest neighbor (KNN)

• Here, nearest neighbors are those data points that have


minimum distance in feature space from the new data point
(test data).

• K is the number of such data points we consider in our


implementation of the algorithm.
1-Nearest Neighbor
3-Nearest Neighbor
K-nearest neighbor (KNN)

• The distance metric and K value are two important considerations


while using the KNN algorithm.

• Euclidean distance is the most popular distance metric. One can


also use Hamming distance, Manhattan distance, Minkowski
distance as per the need.

• For predicting the class (or a continuous value) for a new data point, it considers all the data points in the training dataset: find the ‘K’ nearest neighbours (data points) of the new data point in feature space, together with their corresponding class labels.
K-nearest neighbor (KNN)

Then:

(i) For classification: the output is a class membership. An object is


classified by a plurality vote of its neighbors, with the object
being assigned to the class most common among its k nearest
neighbors (k is a positive integer, typically small). If k = 1, then
the object is simply assigned to the class of that single nearest
neighbor.

(ii) For regression: the output is the property value for the object.
For example, this value can be the mean or median of the values
of k nearest neighbors.
KNN classification approach
• Classified by “MAJORITY VOTES” for its neighbor classes .

• Assigned to the most common class amongst its K-nearest


neighbors (by measuring “distance” between data)
KNN algorithm
1. Load the data

2. Initialize K to your chosen number of neighbors

3. For each example in the data


3.1 Calculate the distance between the query example and the current
example from the data.
3.2 Add the distance and the index of the example to an ordered collection

4. Sort the ordered collection of distances and indices from smallest to largest (in
ascending order) by the distances

5. Pick the first K entries from the sorted collection

6. Get the labels of the selected K entries

7. If regression, return the mean of the K labels

8. If classification, return the mode of the K labels (a minimal code sketch follows below)
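A minimal from-scratch sketch of the steps above (the training points are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote of its k nearest neighbours (Euclidean)."""
    # Step 3: distance from the query to every training example
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Steps 4-5: sort by distance and keep the first k indices
    nearest = np.argsort(distances)[:k]
    # Steps 6-8: return the mode of the k labels (use the mean instead for regression)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Made-up 2-D training data: 3 points of class 1 and 3 points of class 2.
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]], dtype=float)
y_train = np.array([1, 1, 1, 2, 2, 2])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))   # -> class 1
print(knn_predict(X_train, y_train, np.array([6.5, 6.5]), k=3))   # -> class 2
```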


Example of k-NN classification. The test sample (green dot) should be
classified either to blue squares or to red triangles.
 If k = 3 (solid line circle) it is assigned to the red triangles because there
are 2 triangles and only 1 square inside the inner circle.
 If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares
vs. 2 triangles inside the outer circle).
 Choosing the right value for K

To select the K that’s right for your data, we run the KNN algorithm
several times with different values of K and choose the K that reduces
the number of errors we encounter while maintaining the algorithm’s
ability to accurately make predictions on test data.
 Choosing the right value for K

K is a crucial parameter in the KNN algorithm. Some suggestions for


choosing K Value are:

1. Using error curves: The figure below shows error curves for different
values of K for training and test data.
 Choosing the right value for K
1. Using error curves: At low K values, there is overfitting of data/high variance.
Therefore test error is high and train error is low. At K=1 in train data, the error is
always zero, because the nearest neighbor to that point is that point itself.
Therefore though training error is low test error is high at lower K values. This is
called overfitting. As we increase the value for K, the test error is reduced.
 Choosing the right value for K

1. Using error curves:

But after a certain K value, bias / underfitting is introduced and the test error goes up again. So initially the test error is high (due to variance), then it decreases and stabilizes, and with a further increase in K it increases again (due to bias). The K value at which the test error stabilizes at a low value is considered the optimal value of K.

• From the above error curve we can choose K = 8 for our KNN implementation (see the sketch below).
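A minimal sketch of producing such train/test error curves for different K with scikit-learn (made-up data, so the chosen K will differ from the K = 8 quoted above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Made-up two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(1.5, 1.0, (150, 2))])
y = np.repeat([0, 1], 150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_err = 1 - knn.score(X_tr, y_tr)
    test_err = 1 - knn.score(X_te, y_te)
    print(f"K={k:2d}  train error={train_err:.3f}  test error={test_err:.3f}")
# Plotting the two error columns against K and picking the K where the test
# error stabilises at a low value reproduces the procedure described above.
```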
 Choosing the right value for K

1. Using error curves

2. Also, domain knowledge is very useful in choosing the K value.

3. It is seen empirically that the K value should be odd when considering binary (two-class) classification, so that ties are avoided.
A Comparison of Classification Methods

1.Logistic Regression
2.LDA
3.QDA
4.KNN
A Comparison of Classification Methods

• Though motivations differ, the logistic regression and LDA methods are
closely connected.

• Consider the two-class setting with p = 1 predictor, and let p1(x) and p2(x)
= 1−p1(x) be the probabilities that the observation X = x belongs to class 1
and class 2, respectively.

• In the LDA framework, the log odds is given by

log[ p1(x) / (1 − p1(x)) ] = c0 + c1·x

where c0 and c1 are functions of μ1, μ2, and σ^2.


A Comparison of Classification Methods

• We know that in logistic regression, log[ p1(x) / (1 − p1(x)) ] = β0 + β1·x.

• Both equations for LDA and LR are linear functions of x.

• Hence, both logistic regression and LDA produce linear decision boundaries.

• The only difference between the two approaches lies in the fact that β0 and
β1 are estimated using maximum likelihood, whereas c0 and c1 are
computed using the estimated mean and variance from a normal
distribution.

• This same connection between LDA and logistic regression also holds for
multidimensional data with p > 1.
A Comparison of Classification Methods

• Since logistic regression and LDA differ only in their fitting


procedures, one might expect the two approaches to give similar
results. This is often, but not always, the case.

• LDA assumes that the observations are drawn from a Gaussian


distribution with a common covariance matrix in each class, and so
can provide some improvements over logistic regression.

• Conversely, logistic regression can outperform LDA if these Gaussian


assumptions are not met.
A Comparison of Classification Methods

• KNN takes a completely different approach from the classifiers like LR,
LDA and QDA. In order to make a prediction for an observation X = x,
the K training observations that are closest to x are identified. Then X
is assigned to the class to which the plurality of these observations
belong.

• Hence KNN is a completely non-parametric approach: no


assumptions are made about the shape of the decision boundary.

• Therefore, we can expect this approach to dominate LDA and logistic


regression when the decision boundary is highly non-linear.

• On the other hand, KNN does not tell us which predictors are important; it does not provide parameter estimates.
A Comparison of Classification Methods

• QDA serves as a compromise between the non-parametric KNN


method and the linear LDA and logistic regression approaches.

• Since QDA assumes a quadratic decision boundary, it can accurately


model a wider range of problems than the linear methods.

• Though not as flexible as KNN, QDA can perform better in the


presence of a limited number of training observations because it does
make some assumptions about the form of the decision boundary.
Assignment-1 (10 marks)

Download stock market data or caravan insurance data


(from kaggle) and perform Linear regression, Logistic
regression, LDA, QDA and KNN. Also compare the results of
different methodologies

Last date of submission: October 3, 2021 (submit in google


classroom)

Format of report:
1. Title,
2. Code,
3. Result (performance measures),
4. comparison graph
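A minimal starting-point sketch for the assignment; the file name "caravan.csv" and the target column "Purchase" are hypothetical placeholders that must be adapted to the dataset actually downloaded:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical file and target column -- adjust to the dataset actually used.
df = pd.read_csv("caravan.csv")
X = df.drop(columns=["Purchase"])
y = df["Purchase"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test accuracy = {acc:.3f}")
```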
The dashed lines are the Bayes decision boundaries. In other words, they represent the set of values x for which δ_k(x) = δ_l(x); i.e.

x^T·Σ^(−1)·μ_k − (1/2)·μ_k^T·Σ^(−1)·μ_k = x^T·Σ^(−1)·μ_l − (1/2)·μ_l^T·Σ^(−1)·μ_l,  for k ≠ l

The log term from δ_k(x) has disappeared because each of the three classes has the same number of training observations; i.e. P(C_k) is the same for each class.
