Ch2: Regression and Regularization
Normal Equation
❖ Here we will see the normal equation, which for some linear regression problems gives a much better way to solve for the optimum value of the parameters θ.
❖ Gradient descent needs a number of iterations to reach the optimum value, whereas the normal equation, being an analytical method, takes one step to get the optimum value.
❖ We know how to determine the minimum value of a function from calculus, and the same principle is applied here.
$$J(\theta_0, \theta_1, \theta_2, \ldots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \theta_2, \ldots, \theta_n) = 0 \quad \text{for every } j, \text{ and solve for } \theta_0, \theta_1, \theta_2, \ldots, \theta_n$$
Normal Equation
❖ Example: m = 4 training examples.
❖ To apply the normal equation, take this data set, add an extra column for x₀ (all set to 1), and then we will find θ as shown below.
Normal Equation
❖ Next, construct a matrix X which contains all of the features and a vector y from the outputs. Then:

$$\theta = (X^T X)^{-1} X^T y$$
Normal Equation
❖ To generalize: for m training examples (x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾) and n features:
Normal Equation
❖ Then the matrix X, called the design matrix, will be:

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$
❖ And thus, after setting these up, we can evaluate the following equation:

$$\theta = (X^T X)^{-1} X^T y$$

❖ The inverse and transpose of a matrix are built into MATLAB/Octave, so this expression can be evaluated there in a single line.
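❖ The Octave one-liner from the slide is not reproduced in this text; below is a minimal NumPy sketch of the same computation. The numeric values are hypothetical placeholders standing in for the slide's m = 4 table.

```python
import numpy as np

# Hypothetical m = 4 data set (placeholder for the slide's table):
# two features per example and one output value each.
features = np.array([[2104.0, 3.0],
                     [1416.0, 2.0],
                     [1534.0, 3.0],
                     [ 852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# Add the extra x0 = 1 column to form the m x (n + 1) design matrix X.
X = np.hstack([np.ones((features.shape[0], 1)), features])

# theta = (X^T X)^{-1} X^T y; the pseudo-inverse is used for numerical safety.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # optimal parameters, no iterations or learning rate needed
```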
❖ We used feature scaling for the gradient descent method, but it is not necessary for the normal equation method.
Normal Equation
❖ Let's compare the advantages and disadvantages of the gradient descent and normal equation methods, for m training examples and n features.
Normal Equation
❖ Normal equation and non-invertibility: XᵀX can be non-invertible (singular), most commonly because of redundant (linearly dependent) features or because there are too many features (m ≤ n); deleting redundant features or using the pseudo-inverse (pinv) addresses this.
Logistic Regression
❖ If we want to predict an employee's salary increment based on their performance, we can use linear regression.
❖ But if we want to know whether an employee will get a promotion or not, there has to be a threshold value to decide between the two outcomes.
❖ Takes a probabilistic approach to learning discriminative functions (i.e., a
classifier)
❖ Instead of just predicting the class, give the probability of the instance
being that class, i.e., learn p(y | x).
Logistic Regression
❖ Comparison to the perceptron:
❖ The perceptron does not produce a probability estimate
❖ The perceptron is only interested in producing a discriminative model
❖ We know that 0 < p(event) < 1
❖ hθ(x) should give p(y = 1 | x; θ)
❖ The logistic regression model therefore wants 0 ≤ hθ(x) ≤ 1.
Hypothesis Representation
❖ The logistic regression model defines hθ(x) as:

$$h_\theta(x) = g(\theta^T x)$$

❖ where

$$g(z) = \frac{1}{1 + e^{-z}}$$

❖ This g(z) is called the sigmoid function (or logistic function), which is why the model is called logistic regression.
❖ Putting these together, hθ(x) becomes:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
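❖ A minimal NumPy sketch of this hypothesis (the function names are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is expected to include the x0 = 1 entry."""
    return sigmoid(theta @ x)
```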
Logistic Regression
❖ If we plot the sigmoid function g(z), it looks as shown below.
[Figure: the logistic/sigmoid function g(z), rising from 0 toward 1 and crossing 0.5 at z = 0.]
❖ Here we can see that g(z) is between 0 and 1, and the same is true for hθ(x).
❖ Given hθ(x), what we need to do is fit the parameters θ to our data, i.e., determine the values of the parameters θ from the given data set.
Interpretation of the Hypothesis Output hθ(x)
❖ hθ(x) = estimated probability that y = 1 on input x
❖ More formally:

$$h_\theta(x) = P(y = 1 \mid x; \theta)$$

❖ This reads as "the probability that y = 1, given x, parameterized by θ".
❖ y has only two possible values, 0 or 1, so:

$$P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$$
$$P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)$$
Decision Boundary
❖ If we want to predict y = 1 versus y = 0, here is how we can decide:
❖ Predict y = 1 if hθ(x) ≥ 0.5
❖ Predict y = 0 if hθ(x) < 0.5
❖ Looking at the plot of the sigmoid function, g(z) ≥ 0.5 whenever z ≥ 0.
❖ Similarly, hθ(x) = g(θᵀx) ≥ 0.5 whenever θᵀx ≥ 0, and hθ(x) = g(θᵀx) < 0.5 whenever θᵀx < 0.
❖ Thus we predict y = 1 whenever θᵀx ≥ 0 and y = 0 whenever θᵀx < 0.
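❖ The threshold rule above can be written directly in code; a tiny sketch (the example θ values are illustrative only):

```python
import numpy as np

def predict(theta, x):
    """Predict y = 1 exactly when theta^T x >= 0, i.e. when h_theta(x) >= 0.5."""
    return 1 if float(np.dot(theta, x)) >= 0.0 else 0

# Illustrative theta = [-3, 1, 1]: predicts y = 1 whenever x1 + x2 >= 3.
print(predict(np.array([-3.0, 1.0, 1.0]), np.array([1.0, 2.0, 2.0])))  # -> 1
```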
Decision Boundary
❖ Let's consider a training set and hypothesis as shown below.
Decision Boundary
❖ Such a line, which divides the data set, is known as a decision boundary.
❖ Let's consider another example with a non-linear decision boundary, as shown below.
Decision Boundary
❖ For this example we can have a hypothesis as given below:
hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)
❖ If we choose θ₀ = −1, θ₁ = 0, θ₂ = 0, θ₃ = 1 and θ₄ = 1, then we predict y = 1 whenever −1 + x₁² + x₂² ≥ 0.
❖ Setting −1 + x₁² + x₂² = 0, i.e., x₁² + x₂² = 1, gives a circle, which is the decision boundary.
Decision Boundary
❖ We can even have more complex non-linear decision boundaries, as shown below; in such cases hθ(x) will be a function of a higher-order polynomial.
Decision Boundary
❖ Example: given the following dataset, answer the questions below using logistic regression as a classifier.

Study Hours    Pass (1) / Fail (0)
28             0
15             0
33             1
27             1
38             1

❖ Calculate the probability of passing for the student who studied 33 hours.
❖ At least how many hours should a student study to pass the course with a probability of more than 94%?
❖ Use the model (one way to fit it is sketched below).
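❖ One possible way to work the exercise with scikit-learn is sketched below. The single-feature model hθ(x) = g(θ₀ + θ₁·hours) and the very large C (to approximate an unregularized fit) are my assumptions, not prescribed by the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[28.0], [15.0], [33.0], [27.0], [38.0]])   # study hours
passed = np.array([0, 0, 1, 1, 1])                           # pass (1) / fail (0)

# A very large C means almost no regularization, so the fit approximates a
# plain maximum-likelihood logistic regression.
model = LogisticRegression(C=1e6, max_iter=10000).fit(hours, passed)
theta0, theta1 = model.intercept_[0], model.coef_[0, 0]

# (a) Probability of passing after 33 study hours.
p_33 = 1.0 / (1.0 + np.exp(-(theta0 + theta1 * 33.0)))

# (b) Hours h such that P(pass) > 0.94, i.e. theta0 + theta1*h > log(0.94/0.06).
h_min = (np.log(0.94 / 0.06) - theta0) / theta1
print(p_33, h_min)
```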
Cost Function
❖ Given a training set {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), (x⁽³⁾, y⁽³⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}, where xᵀ = [x₀ x₁ … xₙ], x₀ = 1, y ∈ {0, 1} and θᵀ = [θ₀ θ₁ … θₙ],

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

❖ Then, how do we choose the parameters θ?
❖ The logistic regression cost function is shown below.
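❖ The cost-function formula itself sits in the slide's figure; the standard logistic-regression (cross-entropy) cost is J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log hθ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − hθ(x⁽ⁱ⁾)) ], and a NumPy sketch of it follows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost for logistic regression.

    X is the m x (n + 1) design matrix (first column all ones), y is in {0, 1}.
    """
    m = y.size
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```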
Cost Function
❖ For our understanding, let's plot this cost function for the case y = 1, as shown below.
Cost Function
❖ Again for our understanding, let's plot the cost function for the case y = 0, as shown below.
Gradient Descent
❖ The algorithm looks identical to that of linear regression, except for the difference in the expression for hθ(x).
❖ Previously we discussed how to make sure gradient descent for linear regression converges correctly; the same method can be applied to check that gradient descent for logistic regression converges.
❖ In the gradient descent implementation of logistic regression, all of the parameters θⱼ must be updated simultaneously, either with a for loop or with a single vectorized update (see the sketch below).
❖ The idea of feature scaling we applied for linear regression can also be used in logistic regression for faster convergence.
❖ Besides gradient descent, there are advanced optimization algorithms which can run logistic regression much more quickly and also enable the algorithm to scale up when we have a large number of features.
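❖ A minimal NumPy sketch of batch gradient descent for logistic regression; the single vectorized line updates every θⱼ simultaneously (the default learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression.

    Each iteration performs the simultaneous update
    theta := theta - (alpha / m) * X^T (h_theta(X) - y).
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        theta -= (alpha / m) * (X.T @ (h - y))
    return theta
```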
Gradient Descent
❖ Advantages of the advanced algorithms over gradient descent include:
❖ No need to manually pick α
❖ Often faster than gradient descent
❖ Disadvantage:
❖ More complex
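❖ As a sketch of one such advanced optimizer, scipy.optimize.minimize with BFGS can minimize the logistic cost given the cost and its gradient, with no learning rate α to pick. The tiny data set below is made up purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return the logistic cost and its gradient for use by the optimizer."""
    m = y.size
    h = sigmoid(X @ theta)
    J = -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    grad = (1.0 / m) * (X.T @ (h - y))
    return J, grad

# Tiny made-up data set: one feature plus the x0 = 1 column.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0],
              [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

res = minimize(cost_and_grad, np.zeros(X.shape[1]), args=(X, y),
               jac=True, method="BFGS")
theta_opt = res.x
```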
Multi-class Classification: One-vs-all
❖ Here we will look at the multi-class classification problem, and in particular at an algorithm called one-vs-all classification.
❖ What is a multi-class classification problem? Here are some examples:
❖ Email foldering/tagging: Work, Friends, Family, Hobby
❖ Medical diagnosis: Not ill, Cold, Flu
❖ Weather: Sunny, Cloudy, Rain, Snow
❖ Data sets for binary and multi-class classification are shown below for comparison.
Logistic Regression
[Figure: two scatter plots in the (x₁, x₂) plane, one for binary classification and one for multiclass classification.]
❖ Classification in the multi-class case can be done by applying one-vs-all, also called one-vs-rest, as follows.
One-vs-all (one-vs-rest)
[Figure: a three-class training set (Class 1, Class 2, Class 3) in the (x₁, x₂) plane, and the three binary classifiers hθ⁽¹⁾(x), hθ⁽²⁾(x), hθ⁽³⁾(x), each separating one class from the rest.]

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \qquad (i = 1, 2, 3)$$
One-vs-all (one-vs-rest)
❖ Train a logistic regression classifier hθ⁽ⁱ⁾(x) for each class i to predict the probability that y = i.
❖ Given a new input x, pick the class i that maximizes

$$\max_i \; h_\theta^{(i)}(x)$$
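❖ A sketch of one-vs-all built on a binary trainer such as the gradient-descent sketch shown earlier; the helper names here are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, train_binary):
    """Fit one binary logistic classifier per class.

    train_binary(X, y01) is any trainer returning theta for 0/1 labels,
    e.g. the gradient_descent sketch shown earlier.
    """
    return [train_binary(X, (y == i).astype(float)) for i in range(num_classes)]

def predict_one_vs_all(thetas, x):
    """Pick the class i whose classifier output h_theta^(i)(x) is largest."""
    probs = [sigmoid(theta @ x) for theta in thetas]
    return int(np.argmax(probs))
```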
Regularization
❖ The problem of overfitting
❖ We discussed both linear and logistic regression, which work well for many problems, but applied to certain machine learning problems they can lead to a problem called overfitting, which causes very poor performance of the algorithm.
❖ What is overfitting?
❖ Let's reconsider housing price prediction using linear regression for the following training examples.
[Figure: training examples of price ($, in 1000's) versus size (in feet²).]
Regularization
❖ For this prediction we can fit a linear function, a quadratic function, or a higher-order polynomial function, as shown below.
[Figure: three fits of price ($, in 1000's) versus size (in feet²).]
Underfitting (high bias): hθ(x) = θ₀ + θ₁x
Correct fit: hθ(x) = θ₀ + θ₁x + θ₂x²
Overfitting (high variance): hθ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴ + ⋯
❖ Overfitting: if we have too many features, the learned hypothesis may fit the training set very well (J(θ) ≈ 0) but fail to generalize to new examples (e.g., fail to predict prices for new houses).
Slide credit: Andrew Ng
Regularization
❖ Similarly, overfitting occurs with logistic regression if we use a high-order polynomial hypothesis such as
hθ(x) = g(θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴)
instead of a simpler one such as
hθ(x) = g(θ₀ + θ₁x + θ₂x²).
❖ If the higher-order parameters are forced to be very small, the overfitted higher-order polynomial effectively becomes a quadratic plus very small terms, and we obtain a better-fitting classifier, as shown below.
Slide credit: Andrew Ng
Cost Function with Regularization
[Figure: housing data, price ($, in 1000's) versus size (in feet²), with an overfitted fourth-order polynomial fit.]

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$

❖ The idea is to add penalty terms to the cost function so that the higher-order parameters (here θ₃ and θ₄) must stay small, forcing the fourth-order fit to behave essentially like a quadratic.
Regularization
❖ What if λ is set to an extremely large value (say λ = 10¹⁰)? Consider the options: the algorithm fails to eliminate overfitting; the algorithm results in underfitting (fails to fit even the training data well); gradient descent fails to converge.
❖ In this case we penalize θ₁, θ₂, θ₃, θ₄ so heavily that we end up with θ₁ ≈ 0, θ₂ ≈ 0, θ₃ ≈ 0, θ₄ ≈ 0, so the hypothesis

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$$

reduces to roughly hθ(x) ≈ θ₀: the algorithm results in underfitting.
[Figure: the resulting flat line through the price ($, in 1000's) versus size (in feet²) data.]
Regularization
❖ So, to get a good fit while also keeping the parameters small, we have to choose the regularization parameter λ with care.
Regularized linear regression:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$$

Goal: $\min_\theta J(\theta)$
n: number of features
θ₀ is not penalized
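❖ A NumPy sketch of this regularized cost; note that θ₀ (the first entry of theta) is excluded from the penalty:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) [ sum((X theta - y)^2) + lambda * sum(theta_j^2, j >= 1) ]."""
    m = y.size
    err = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)      # theta_0 is not penalized
    return (np.sum(err ** 2) + penalty) / (2.0 * m)
```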
Gradient Descent (Previously)
Repeat {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad (j = 1, 2, 3, \ldots, n)$$

}
Regularized Gradient Descent
Repeat {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_j := \theta_j - \alpha \left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1, 2, 3, \ldots, n)$$

}
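❖ A vectorized NumPy sketch of one such regularized update step (again θ₀ is excluded from the penalty term):

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    """One step of regularized gradient descent for linear regression."""
    m = y.size
    grad = (X.T @ (X @ theta - y)) / m     # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                           # do not regularize theta_0
    return theta - alpha * (grad + reg)
```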
Comparison
❖ Regularized linear regression update, written in an equivalent form:

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

❖ The factor (1 − αλ/m) is slightly less than 1, so on every iteration regularization shrinks θⱼ a little before the usual gradient step is applied.
Regularized Normal Equation
$$X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ \vdots \\ (x^{(m)})^\top \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$$

• Goal: $\min_\theta J(\theta)$

$$\theta = \left(X^\top X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\right)^{-1} X^\top y$$

where the matrix multiplying λ is (n + 1) × (n + 1).
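❖ A NumPy sketch of the regularized normal equation; with λ > 0 and the x₀ = 1 column present, the matrix in parentheses is invertible even when XᵀX alone is not:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * L)^{-1} X^T y, with L the (n+1)x(n+1) identity
    whose top-left entry is zeroed so that theta_0 is not penalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```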
Regularized logistic regression
hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂ + θ₆x₁³x₂ + θ₇x₁x₂³ + ⋯)
[Figure: tumor data plotted as age versus tumor size, with a non-linear decision boundary.]
❖ Cost function:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
Regularized Gradient Descent
Repeat {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_j := \theta_j - \alpha \left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1, 2, \ldots, n)$$

}
where now

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
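❖ A vectorized sketch of one regularized gradient-descent step for logistic regression, mirroring the update above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gd_step(theta, X, y, alpha, lam):
    """theta_j := theta_j - alpha * [ (1/m) sum((h - y) x_j) + (lambda/m) theta_j ],
    with theta_0 left unregularized."""
    m = y.size
    grad = (X.T @ (sigmoid(X @ theta) - y)) / m
    reg = (lam / m) * theta
    reg[0] = 0.0
    return theta - alpha * (grad + reg)
```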