ML Module 2
Reference:
• James G., Witten D., Hastie T., Tibshirani R. “An Introduction to
Statistical Learning with Applications in R”, Springer Texts in
Statistics.
• Machine learning by Andrew Ng.
Linear Regression
Goal: to find the straight line that best fits the data.
Linear regression with one variable
$$h_\theta(x) = \theta_0 + \theta_1 x$$
Linear regression with one variable
h(x) = 0.5*x + 1
h(x) = 0.5*x + 0
h(x) = 0*x + 1.5
Linear regression with one variable
$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Cost Function
Linear regression with one variable:
Cost function
For $\theta_1 = 1$:
$$J(1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = 0$$
For $\theta_1 = 0.5$: $J(0.5) = \;?$
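For concreteness, here is a minimal Python (NumPy) sketch of this computation, using a small hypothetical dataset (the numbers are illustrative, not the ones behind the slide's plots):

```python
import numpy as np

# Hypothetical training data (for illustration only): y = x exactly
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def cost_J(theta1):
    """J(theta1) = (1/2m) * sum((h(x) - y)^2) for h(x) = theta1 * x."""
    h = theta1 * x
    return np.sum((h - y) ** 2) / (2 * m)

print(cost_J(1.0))   # 0.0 -> this hypothesis fits the toy data perfectly
print(cost_J(0.5))   # > 0 -> a worse fit gives a larger cost
```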
Linear regression with one variable:
Cost function
Evaluating the cost at $\theta_1 = 0$, $\theta_1 = 0.5$, and $\theta_1 = 1$ traces out the curve $J(\theta_1)$; for this data, $J(0.5) = 2.3$.
Linear regression with one variable:
Cost function
For this data the cost is minimized at $\theta_1 = 1$.
Example with two parameters: $\theta_0 = 50$, $\theta_1 = 0.06$.
• When we have a single parameter $\theta_1$, the cost function $J(\theta_1)$ is bowl-shaped.
• With two parameters $(\theta_0, \theta_1)$, the bowl-shaped surface $J(\theta_0, \theta_1)$ is usually visualized with a contour plot.
Linear regression with one variable:
Cost function
Linear regression with one variable:
Cost function
[Figure: example hypotheses h(x) (e.g. with parameters 800 and −0.15) plotted against the data, each corresponding to a point on the contour plot of J(θ0, θ1).]
Linear regression with one variable:
Cost function
Cost function:
Gradient descent algorithm
Start with some values of $(\theta_0, \theta_1)$ and keep changing them to reduce $J(\theta_0, \theta_1)$ until the algorithm converges to a local minimum. Starting from a different initial point, gradient descent may converge to a different local minimum.
Derivative term: $\frac{\partial}{\partial \theta_1} J(\theta_1)$ (the slope of the cost curve) determines the direction of each step.
• Gradient descent can converge to a local minimum, even with the learning rate α fixed.

$$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1), \qquad \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$
(update $\theta_0$ and $\theta_1$ simultaneously)
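A minimal sketch of these simultaneous updates for one-variable linear regression, assuming a small hypothetical dataset and learning rate:

```python
import numpy as np

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
m = len(x)

alpha = 0.05                 # learning rate
theta0, theta1 = 0.0, 0.0

for _ in range(2000):
    h = theta0 + theta1 * x                 # current predictions
    grad0 = np.sum(h - y) / m               # dJ/dtheta0
    grad1 = np.sum((h - y) * x) / m         # dJ/dtheta1
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)        # approaches the least-squares line for this data
```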
Multivariate Linear
Regression
Linear regression with multiple variables (multiple features): multiple linear regression
Linear Regression with multiple variables
Training set with $m$ training examples, $n = 4$ features $X_1, X_2, X_3, X_4$, and target $Y$.
Notation: $x^{(i)} \in \mathbb{R}^n$ is the (column) feature vector of the $i$-th training example, e.g. $X^{(3)}$ for the 3rd example, and $x_j^{(i)}$ is the value of feature $j$ in the $i$-th example.
Linear Regression with multiple variables
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
Example: $h_\theta(x) = \theta_0 + 3x_1 + 0.01x_2 + \dots + 2x_n$.
In vector form: $h_\theta(x) = \theta^T x$, where $\theta = [\theta_0\ \theta_1\ \dots\ \theta_n]^T$ and $x = [x_0\ x_1\ \dots\ x_n]^T$.
Gradient descent for multiple variables
With $x_0 = 1$, the cost function is
$$J(\theta) = J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Gradient descent for multiple variables
repeat {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$    (simultaneously update for every $j = 0, 1, \dots, n$)
}
$J(\theta)$ should decrease after every iteration.
• For sufficiently small α, J(θ) should decrease on every iteration.
• If α is too large, J(θ) may not decrease on every iteration and may fail to converge.
• Choose the value of α carefully (try, e.g., α = 0.001, 0.003, 0.01, 0.03, 0.3, 0.5).
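As a sketch of this advice, one might sweep several candidate values of α and check whether J(θ) decreases on every iteration (the data, features, and α grid below are illustrative assumptions):

```python
import numpy as np

def gradient_descent(X, y, alpha, n_iters=200):
    """Vectorized batch gradient descent; X already contains a column of ones (x0 = 1)."""
    m, n = X.shape
    theta = np.zeros(n)
    J_history = []
    for _ in range(n_iters):
        err = X @ theta - y
        J_history.append(np.sum(err ** 2) / (2 * m))   # cost before this update
        theta -= alpha * (X.T @ err) / m               # simultaneous update of all theta_j
    return theta, J_history

# Hypothetical data with two features plus the intercept column x0 = 1
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 2, 50), rng.uniform(0, 2, 50)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.1, 50)

for alpha in [0.001, 0.003, 0.01, 0.03, 0.3, 0.5]:
    theta, J_hist = gradient_descent(X, y, alpha)
    decreasing = all(b <= a for a, b in zip(J_hist, J_hist[1:]))
    print(f"alpha={alpha}: final J={J_hist[-1]:.4f}, monotonically decreasing={decreasing}")
```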
Non-linear regression
(Polynomial regression)
House Price Prediction
Example: a new feature can be created from existing ones, e.g. $X = X_1 \cdot X_2$ (using features $X_1$ and $X_2$ of the house).
Polynomial features: including powers of a feature, e.g. $X_1 = x$, $X_2 = x^2$, $X_3 = x^3$, gives
$$h_\theta(X) = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_3,$$
which fits a nonlinear (here cubic) curve with the same linear-regression machinery.
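A brief sketch of constructing polynomial features by hand and fitting them with ordinary least squares; the house sizes, prices, and scaling choice are illustrative assumptions:

```python
import numpy as np

# Hypothetical house sizes (in 1000 sq. ft.) and prices (illustration only)
size = np.array([0.8, 1.0, 1.5, 2.0, 2.5, 3.0])
price = np.array([150, 200, 260, 300, 320, 330], dtype=float)

# Polynomial features: x, x^2, x^3, plus the intercept column x0 = 1.
# Scaling each column keeps the columns comparable when the powers differ in magnitude.
X = np.column_stack([size, size**2, size**3])
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.column_stack([np.ones(len(size)), X])

# Least-squares fit (minimizes the same squared-error cost as gradient descent would)
theta = np.linalg.lstsq(X, price, rcond=None)[0]
print(theta)
```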
Regularization
Bias & Variance
Logistic Regression
• Logistic regression: A method that is used to predict qualitative
responses (known as classification).
Classification methods:
• Logistic Regression
• Linear discriminant analysis (LDA)
• K-nearest neighbor (KNN)

Logistic Regression
Classification boundary
Two examples of malignant tumors are misclassified.
We want $0 \le h_\theta(x) \le 1$. Logistic regression uses
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}}$$
Logistic regression model
Logistic regression: Decision boundary
Predict $Y = 1$ when $h_\theta(x) \ge 0.5$ (i.e. $\theta^T x \ge 0$); predict $Y = 0$ otherwise.
Logistic regression: Decision boundary
Nonlinear function:
$$h_\theta(x) = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}}$$
Logistic regression: Cost function
$$h_\theta(x) = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}}
\;\;\Rightarrow\;\;
\frac{h_\theta(x)}{1 - h_\theta(x)} = e^{\theta^T x}
\;\;\Rightarrow\;\;
\ln\!\left(\frac{h_\theta(x)}{1 - h_\theta(x)}\right) = \theta^T x$$
The left-hand side is called the log-odds or logit. We see that the
logistic regression model has a logit that is linear in X.
Logistic regression: Cost function
Linear Regression cost: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$.
Logistic Regression: plugging the sigmoid hypothesis into this squared-error cost makes $J(\theta)$ non-convex, so we want a convex cost function that gradient descent can minimize reliably.
Logistic regression: Cost function
Cost for a single training example:
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$
Logistic regression: Cost function
Putting it together:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\!\left(1 - h_\theta(x^{(i)})\right)\right]$$
Logistic regression: Cost function
To get $\theta$: minimize $J(\theta)$, e.g. with gradient descent $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$.
Logistic regression: Cost function
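A compact sketch putting the sigmoid hypothesis, the cross-entropy cost, and gradient descent together; the toy data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    h = sigmoid(X @ theta)
    eps = 1e-12                              # avoid log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Hypothetical 1-D data: class 1 tends to have larger x
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones(len(x)), x])    # add x0 = 1

theta, alpha = np.zeros(2), 0.1
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of J(theta)
    theta -= alpha * grad

print(cost(theta, X, y), theta)
print(sigmoid(X @ theta) >= 0.5)             # predicted labels (decision boundary at 0.5)
```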
Logistic regression:
Multi-class classification (one-vs.-all): train a separate binary classifier $h_\theta^{(k)}(x)$ for each class $k$ (class $k$ versus all the rest) and predict the class $k$ with the largest $h_\theta^{(k)}(x)$.
1. When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Logistic regression has the peculiar behavior that if a feature separates the classes perfectly, the coefficients go to infinity; it works better when the classes are not well separated. LDA does not suffer from this problem.
Linear Discriminant Analysis (LDA)
Bayes’ Classifier:
• The Bayes theorem states that
$$P(C_k \mid X) = \frac{P(C_k)\,P(X \mid C_k)}{\sum_{j=1}^{K} P(C_j)\,P(X \mid C_j)}$$
• This means the posterior probability $P(C_k \mid X)$ can be evaluated once we have an expression for the class-conditional probability density function (PDF) $P(X \mid C_k)$, as in the naïve Bayes classifier.
Linear Discriminant Analysis (LDA)
Naïve Bayes’ Classifier:
Expression of the probability density function (PDF) used in the naïve Bayes classifier:
$$P(x_i \mid C_k) = \frac{1}{\sigma_k\sqrt{2\pi}}\,\exp\!\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$$
where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters for the $k$-th class. For now, let us further assume that $\sigma_1^2 = \dots = \sigma_K^2$: that is, there is a shared variance term across all K classes, which for simplicity we can denote by $\sigma^2$.
LDA for p = 1
Substituting
$$P(x \mid C_k) = \frac{1}{\sigma_k\sqrt{2\pi}}\,\exp\!\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$$
into Bayes' theorem (stated earlier), we get
$$P(C_k \mid X = x) = \frac{P(C_k)\,\frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x - \mu_k)^2}{2\sigma^2}\right)}{\sum_{j=1}^{K} P(C_j)\,\frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x - \mu_j)^2}{2\sigma^2}\right)}$$
• Taking the log of the above eqn. and rearranging terms, it can be shown that this is equivalent to assigning the observation to the class for which
$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln\!\left[P(C_k)\right]$$
is largest.
LDA for p = 1
• Generally, for 2 classes we look at the log ratio of their posterior probabilities:
LDA for p = 1
(Prove it !...)
LDA for p = 1
• The mean and variance parameters for the two density functions are given in the figure.
• The two densities overlap, and so given that X = x, there is some uncertainty
about the class to which the observation belongs.
If we assume that an observation is equally likely to come from either class, i.e. $P(C_1) = P(C_2) = 0.5$, then the Bayes classifier assigns the observation to class 1 if $x < 0$ and to class 2 otherwise.
LDA for p = 1
• In practice, even if we are quite certain of our assumption that X is drawn from a Gaussian distribution within each class, we still have to estimate the parameters $\mu_1, \dots, \mu_K$, $\sigma^2$, and the priors $P(C_1), \dots, P(C_K)$.
• LDA approximates the Bayes classifier by using the following estimates:
$$\hat{P}(C_k) = \frac{n_k}{n}, \qquad \hat{\mu}_k = \frac{1}{n_k}\sum_{i:\,y_i = k} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n - K}\sum_{k=1}^{K}\sum_{i:\,y_i = k}\left(x_i - \hat{\mu}_k\right)^2$$
where $n$ is the total number of training observations and $n_k$ is the number belonging to the $k$-th class.
• Sometimes we have knowledge of the class membership probabilities $P(C_1), \dots, P(C_K)$, which can be used directly. In the absence of any additional information, LDA estimates $P(C_k)$ using the proportion of the training observations that belong to the $k$-th class.
LDA for p = 1
Using $\hat{P}(C_k) = n_k / n$ together with $\hat{\mu}_k$ and $\hat{\sigma}^2$, the estimated discriminant is
$$\hat{\delta}_k(x) = x\,\frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \ln\!\left[\hat{P}(C_k)\right]$$
and the LDA classifier assigns an observation $X = x$ to the class for which $\hat{\delta}_k(x)$ is largest.
• The word ‘linear’ in the classifier’s name stems from the fact
that the discriminant functions in the above are linear functions
of x.
• Summary (p = 1): the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance $\sigma^2$, and plugging estimates for these parameters into the Bayes classifier.
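A minimal sketch of LDA with p = 1 using the estimates and discriminant above; the two-class toy data are an illustrative assumption:

```python
import numpy as np

# Hypothetical 1-D training data for two classes (illustration only)
x = np.array([-2.1, -1.3, -0.4, -1.8, 0.9, 1.5, 2.2, 1.1])
y = np.array([ 0,    0,    0,    0,   1,   1,   1,   1 ])

classes = np.unique(y)
n, K = len(x), len(classes)

priors = np.array([np.mean(y == k) for k in classes])       # P_hat(C_k) = n_k / n
mus    = np.array([x[y == k].mean() for k in classes])      # mu_hat_k
sigma2 = sum(((x[y == k] - x[y == k].mean()) ** 2).sum()    # pooled variance sigma_hat^2
             for k in classes) / (n - K)

def delta(x0):
    """Linear discriminant delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2 sigma^2) + ln P(C_k)."""
    return x0 * mus / sigma2 - mus**2 / (2 * sigma2) + np.log(priors)

for x0 in [-1.0, 0.0, 1.0]:
    print(x0, "-> class", classes[np.argmax(delta(x0))])
```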
LDA for p > 1
• We now extend the LDA classifier to the case of multiple
predictors.
• However, the bell shape will be distorted if the predictors are correlated or have unequal variances, as illustrated in the right-hand panel of the figure. In this situation, the base of the bell will have an elliptical, rather than circular, shape.
Multivariate Gaussian distribution
• To indicate that a p-dimensional random variable X has a multivariate
Gaussian distribution, we write X ∼ N(μ, Σ). Here E(X) = μ is the mean of
X (a vector with p components), and Cov(X) = Σ is the p × p covariance
matrix of X.
Covariance Matrix
In the top row, all bivariate Gaussian distributions have ρ = 0 and look like a circle when the standard deviations are of equal size. The top-middle plot is stretched along X2, giving it an elliptical shape. The middle and bottom rows show how the distribution changes for negative (ρ = −0.3) and positive (ρ = 0.7) correlations.
LDA for p > 1
• In the case of p > 1 predictors, the LDA classifier assumes that the
observations in the k’th class are drawn from a multivariate Gaussian
distribution N(,Σ), where is a class-specific mean vector, and Σ is a
covariance matrix that is common to all K classes.
• Plugging the density function for the $k$-th class into Bayes' theorem and performing a little bit of algebra reveals that the Bayes classifier assigns an observation $X = x$ to the class for which
$$\delta_k(x) = x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln\!\left[P(C_k)\right]$$
is largest.
The Bayes classifier will classify an observation according to the region in which it
is located.
LDA for p > 1
• Once again, we need to estimate the unknown parameters and Σ; the
formulas are similar to those used in the 1D case (LDA with p = 1).
$$\delta_k(x) = x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln\!\left[P(C_k)\right]$$
• Note that $\delta_k(x)$ in the above eqn. is a linear function of $x$; i.e. the LDA decision rule depends on $x$ only through a linear combination of its elements. This is the reason for the word linear in LDA.
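A short sketch of this multivariate discriminant with a pooled covariance estimate; the simulated two-class data are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D data: two Gaussian classes sharing the same covariance (illustration only)
Sigma_true = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal([-1.5, 0.0], Sigma_true, size=50)
X1 = rng.multivariate_normal([ 1.5, 0.5], Sigma_true, size=50)
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

classes = np.unique(y)
priors = np.array([np.mean(y == k) for k in classes])
mus = np.array([X[y == k].mean(axis=0) for k in classes])
# Pooled (common) covariance estimate
Sigma = sum((X[y == k] - X[y == k].mean(axis=0)).T @ (X[y == k] - X[y == k].mean(axis=0))
            for k in classes) / (len(y) - len(classes))
Sigma_inv = np.linalg.inv(Sigma)

def lda_predict(x):
    """Assign x to the class with the largest linear discriminant delta_k(x)."""
    deltas = [x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(p)
              for mu, p in zip(mus, priors)]
    return classes[int(np.argmax(deltas))]

print(lda_predict(np.array([-1.0, 0.0])), lda_predict(np.array([2.0, 0.5])))
```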
20 observations drawn from each of the three classes are displayed, and the
resulting LDA decision boundaries are shown as solid black lines. Overall, the LDA
decision boundaries are pretty close to the Bayes decision boundaries, shown
again as dashed lines. The test error rates for the Bayes and LDA classifiers are
0.0746 and 0.0770, respectively. This indicates that LDA is performing well on this
data.
The ROC (receiver operating characteristic) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
ROC Curves
• The overall performance of a classifier is given by the area under the ROC
curve or AUC.
• Along the y-axis the true positive rate (TPR) and along the x-axis the false positive rate (FPR) are plotted. These are also called the sensitivity (plotted along the Y-axis) and 1 − specificity (plotted along the X-axis) of our classifier.
• An ideal ROC curve will touch the top left corner, so the larger the AUC the
better the classifier.
• For this data the AUC is 0.95, which is close to the maximum of 1, so it would be considered very good. We expect a classifier that performs no better than chance to have an AUC of 0.5 (when evaluated on an independent test set not used in model training).
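A small sketch of tracing a ROC curve by sweeping thresholds and approximating the AUC with the trapezoid rule; the scores and labels are hypothetical:

```python
import numpy as np

# Hypothetical classifier scores (e.g. estimated P(C_1 | x)) and true labels
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3, 0.2])
labels = np.array([1,    1,   1,   0,   1,    0,   1,   0,   0,   0 ])

thresholds = np.sort(np.unique(np.concatenate([[0.0, 1.0], scores])))[::-1]
tpr, fpr = [], []
P, N = labels.sum(), (1 - labels).sum()
for t in thresholds:
    pred = (scores >= t).astype(int)
    tpr.append(((pred == 1) & (labels == 1)).sum() / P)   # sensitivity (y-axis)
    fpr.append(((pred == 1) & (labels == 0)).sum() / N)   # 1 - specificity (x-axis)

tpr, fpr = np.array(tpr), np.array(fpr)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)     # trapezoid rule
print(f"AUC = {auc:.3f}")
```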
Like LDA, the QDA classifier results from assuming that the
observations from each class are drawn from a Gaussian distribution,
and plugging estimates for the parameters into Bayes’ theorem in
order to perform prediction.
Unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the $k$-th class is of the form $X \sim N(\mu_k, \Sigma_k)$, where $\Sigma_k$ is a covariance matrix for the $k$-th class.
Quadratic Discriminant Analysis (QDA)
• Under this assumption, the Bayes classifier assigns an observation $X = x$ to the class for which
$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\ln\!\left|\Sigma_k\right| + \ln \pi_k$$
is largest.
(Note: $\pi_k$ is equivalent to $P(C_k)$.)
• So the QDA classifier involves plugging estimates for $\Sigma_k$, $\mu_k$, and $\pi_k$ into this expression, and then assigning an observation $X = x$ to the class for which this quantity is largest.
Q. Why does it matter whether or not we assume that the K classes share
a common covariance matrix? In other words, why would one prefer LDA
to QDA, or vice-versa?
• Consequently, LDA is a much less flexible classifier than QDA, and so has
substantially lower variance. This can potentially lead to improved prediction
performance.
• But there is a trade-off: if LDA's assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias and give poor performance.
• LDA tends to be a better choice than QDA if there are relatively few training observations, so that reducing variance is crucial.
• In contrast, QDA is recommended if the training set is very large, so that the
variance of the classifier is not a major concern.
LDA Vs. QDA
• The figure illustrates the performance of LDA and QDA in two scenarios. In the left-
hand panel, the two Gaussian classes have a common correlation of 0.7 between
X1 and X2. As a result, the Bayes decision boundary is linear and is accurately
approximated by the LDA decision boundary. The QDA decision boundary is
inferior, because it suffers from higher variance without a corresponding decrease
in bias.
• In contrast, the right-hand panel displays a situation in which the orange class has
a correlation of 0.7 between the variables and the blue class has a correlation of
−0.7. Now the Bayes decision boundary is quadratic, and so QDA more accurately
approximates this Bayes decision boundary than does LDA.
K-nearest neighbor (KNN)
• The k-nearest neighbors algorithm (k-NN) is a non-parametric method that can be used for both classification and regression.
2. Lazy learning: no model is learned from the training data in advance; the learning process is postponed until a prediction is requested for a new instance.
• Let us say we have plotted the data points from our training set in a 2D feature space. Say we have a total of 6 data points (3 red and 3 blue).
• Red data points belong to ‘class1’ and blue data points belong to
‘class2’.
• The yellow data point in the feature space represents the new point for which a class is to be predicted. Obviously, we say it belongs to ‘class1’ (red points).
• Why?
Because its nearest neighbors belong to that class!
Then:
(i) For classification: the output is a class membership; the object is assigned to the class most common among its k nearest neighbors (majority vote).
(ii) For regression: the output is the property value for the object, for example the mean or median of the values of its k nearest neighbors.
KNN classification approach
• Classified by “majority vote” of its neighbors’ classes.
4. Sort the collection of distances and indices in ascending order by distance.
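A compact sketch of k-NN classification by majority vote with Euclidean distance; the toy training points (mirroring the red/blue example above) are hypothetical:

```python
import numpy as np
from collections import Counter

# Hypothetical training set: 3 points of 'class1' and 3 of 'class2' in 2-D
X_train = np.array([[1.0, 1.0], [1.5, 1.8], [1.2, 0.8],    # class1 ("red")
                    [4.0, 4.2], [4.5, 3.8], [3.8, 4.5]])   # class2 ("blue")
y_train = np.array(["class1", "class1", "class1", "class2", "class2", "class2"])

def knn_predict(x_new, k=3):
    # Compute the Euclidean distance from x_new to every training example.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Sort the distances (and their indices) in ascending order and keep the k nearest.
    nearest = np.argsort(dists)[:k]
    # Classify by majority vote among the k nearest neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([1.4, 1.2])))   # expected: class1
```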
To select the K that’s right for your data, we run the KNN algorithm
several times with different values of K and choose the K that reduces
the number of errors we encounter while maintaining the algorithm’s
ability to accurately make predictions on test data.
Choosing the right value for K
1. Using error curves: The figure below shows error curves for different
values of K for training and test data.
Choosing the right value for K
1. Using error curves: at low K values the model overfits the data (high variance), so the test error is high while the training error is low. At K = 1 the training error is always zero, because the nearest neighbor of each training point is the point itself. Thus, although training error is low, test error is high at low K values; this is overfitting. As we increase K, the test error is reduced.
Choosing the right value for K
But after a certain K value, bias (underfitting) is introduced and the test error rises again. So the test error is initially high (due to variance), then decreases and stabilizes, and with a further increase in K it increases again (due to bias). The K value at which the test error stabilizes at a low value is taken as the optimal K.
• From the above error curve we can choose K=8 for our KNN algorithm
implementation.
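A sketch of choosing K by comparing training and test error across several K values; the synthetic data and train/test split are illustrative assumptions, so the best K here need not be 8:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)

# Synthetic 2-class data in 2-D (illustration only)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
idx = rng.permutation(len(y))
train, test = idx[:140], idx[140:]

def knn_error(K, X_ref, y_ref, X_eval, y_eval):
    errors = 0
    for x, target in zip(X_eval, y_eval):
        nearest = np.argsort(np.linalg.norm(X_ref - x, axis=1))[:K]
        pred = Counter(y_ref[nearest]).most_common(1)[0][0]
        errors += int(pred != target)
    return errors / len(y_eval)

for K in [1, 3, 5, 8, 15, 30]:
    tr = knn_error(K, X[train], y[train], X[train], y[train])   # training error
    te = knn_error(K, X[train], y[train], X[test], y[test])     # test error
    print(f"K={K:2d}  train error={tr:.2f}  test error={te:.2f}")
```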
Choosing the right value for K
1.Logistic Regression
2.LDA
3.QDA
4.KNN
A Comparison of Classification Methods
• Though motivations differ, the logistic regression and LDA methods are
closely connected.
• Consider the two-class setting with p = 1 predictor, and let p1(x) and p2(x)
= 1−p1(x) be the probabilities that the observation X = x belongs to class 1
and class 2, respectively.
• Hence, both logistic regression and LDA produce linear decision boundaries.
• The only difference between the two approaches lies in the fact that β0 and
β1 are estimated using maximum likelihood, whereas c0 and c1 are
computed using the estimated mean and variance from a normal
distribution.
• This same connection between LDA and logistic regression also holds for
multidimensional data with p > 1.
A Comparison of Classification Methods
• KNN takes a completely different approach from the classifiers like LR,
LDA and QDA. In order to make a prediction for an observation X = x,
the K training observations that are closest to x are identified. Then X
is assigned to the class to which the plurality of these observations
belong.
• On the other hand, KNN does not tell us which predictors are important; it does not provide parameter estimates.
A Comparison of Classification Methods
Format of report:
1. Title,
2. Code,
3. Result (performance measures),
4. Comparison graph
The dashed lines are the Bayes decision boundaries. In other words, they represent the set of values $x$ for which $\delta_k(x) = \delta_l(x)$; i.e.
$$x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k = x^T \Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T \Sigma^{-1}\mu_l \quad \text{for } k \ne l.$$
The log term from $\delta_k(x)$ has disappeared because each of the three classes has the same number of training observations; i.e. $P(C_k)$ is the same for each class.