Ch2: Regression and Regularization

This document discusses the normal equation method for linear regression and logistic regression for classification. It explains how the normal equation finds the optimal parameters in one step by inverting the XᵀX matrix and multiplying by Xᵀy. It then covers logistic regression, interpretation of the hypothesis output, decision boundaries, the cost function, gradient descent, multi-class classification, and regularization.


Logistic Regression

Electrical and Computer Engineering

Normal Equation
❖ Here we will see the normal equation, which for some linear regression problems gives a much better way to solve for the optimal value of the parameter θ.
❖ Gradient descent needs a number of iterations to reach the optimum, whereas the normal equation is an analytical method that reaches the optimum in one step.
❖ We know how to determine a minimum from calculus, and the same principle is applied here: set the partial derivatives to zero and solve.
$$J(\theta_0, \theta_1, \theta_2, \dots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \theta_2, \dots, \theta_n) = 0 \quad \text{for every } j, \text{ and solve for } \theta_0, \theta_1, \theta_2, \dots, \theta_n$$
Normal Equation
❖ Example: m = 4 training examples.
❖ To apply the normal equation, take these data sets and add an extra column for x₀ (set to 1 for every example); then we will construct the matrices X and y shown below.
Normal Equation
❖ Next, construct a matrix X which contains all the features and a vector y formed from the outputs.
❖ Here X is an m × (n+1) matrix and y is an m-dimensional column vector.


❖ Finally, take Xᵀ times X, invert the result, and multiply by Xᵀy; this gives the value of θ that minimizes the cost function:

$$\theta = (X^\top X)^{-1} X^\top y$$
Normal Equation
❖ To generalize, consider m training examples $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$ and n features.

❖ Then the matrix X, called the design matrix, will be:
Normal Equation
❖ Then the matrix X, called the design matrix, will be:

$$X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ (x^{(3)})^\top \\ \vdots \\ (x^{(m)})^\top \end{bmatrix}$$
❖ After constructing X and y, we can evaluate the following equation:

$$\theta = (X^\top X)^{-1} X^\top y$$
❖ The matrix inverse and transpose can be computed directly in MATLAB/Octave; a NumPy sketch is shown below.
❖ We used feature scaling for the gradient descent method, but it is not necessary for the normal equation method.
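For illustration, here is a minimal NumPy sketch of the normal equation; the data values and variable names are hypothetical, not taken from the slides.

```python
import numpy as np

# Hypothetical toy training set: m = 4 examples, one feature, plus the x0 = 1 column.
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])              # design matrix, shape (m, n+1)
y = np.array([460.0, 232.0, 315.0, 178.0])  # outputs, shape (m,)

# Normal equation: theta = (X^T X)^(-1) X^T y.
# The pseudo-inverse is used so the computation still works if X^T X is singular.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```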

Normal Equation
❖ Let's see the advantages and disadvantages of the gradient descent and normal equation methods.
❖ Assume m training examples and n features.

Gradient Descent:
▪ Need to choose α
▪ Needs many iterations
▪ Works well even when n is large

Normal Equation:
▪ No need to choose α
▪ Don't need to iterate
▪ Need to compute (XᵀX)⁻¹
▪ Slow if n is very large

❖ The normal equation method is feasible for n up to a few thousand; for larger n it is better to use gradient descent.

Normal Equation
❖ Normal Equation and Non-invertibility

❖ There are two common conditions which cause XᵀX to be non-invertible: redundant (linearly dependent) features, and too many features relative to training examples (m ≤ n); in practice the pseudo-inverse can still be used.
Logistic Regression
❖ If we want to predict an employee's salary increment based on their performance, we can use linear regression.
❖ But if we want to know whether an employee will get a promotion or not, there has to be a threshold value to decide between the two outcomes; this is a classification problem.
❖ Logistic regression takes a probabilistic approach to learning a discriminative function (i.e., a classifier).
❖ Instead of just predicting the class, it gives the probability of the instance being that class, i.e., it learns p(y | x).

Logistic Regression
❖ Comparison to the perceptron:
❖ The perceptron does not produce a probability estimate.
❖ The perceptron is only interested in producing a discriminative decision (a hard class label).
❖ We know that 0 ≤ p(event) ≤ 1.
❖ h(x) should give p(y = 1 | x; θ).
❖ So the logistic regression model requires 0 ≤ hθ(x) ≤ 1.

Hypothesis Representation
❖ The logistic regression model defines hθ(x) as:

$$h_\theta(x) = g(\theta^\top x)$$

❖ where

$$g(z) = \frac{1}{1 + e^{-z}}$$

❖ This g is called the sigmoid function / logistic function, which is why the method is called logistic regression.
❖ Putting these together, hθ(x) becomes:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
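As a small illustration, a NumPy sketch of the sigmoid and the logistic hypothesis; the parameter and input values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), the estimated P(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

theta = np.array([-3.0, 1.0, 1.0])   # hypothetical parameters
x = np.array([1.0, 2.0, 2.0])        # input with x0 = 1 prepended
print(hypothesis(theta, x))          # ~0.73, so we would predict y = 1
```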

Logistic Regression
❖ If we plot the sigmoid function g(z), it looks as shown below: an S-shaped curve that approaches 0 for large negative z, approaches 1 for large positive z, and passes through g(0) = 0.5. (Figure: logistic / sigmoid function g(z).)
❖ Here we can see that g(z) is between 0 and 1, and the same is true for hθ(x).
❖ Given this form of hθ(x), what we need to do is fit the parameters θ to our data, i.e., estimate the values of the parameters θ for the given data set.

Interpretation of the Hypothesis Output hθ(x)
❖ ℎ𝜃 𝑥 = estimated probability that y = 1 on input x

❖ More formally,

$$h_\theta(x) = P(y = 1 \mid x; \theta)$$

❖ This reads as "the probability that y = 1, given x, parameterized by θ".
❖ y has two possible values, 0 or 1, so

$$P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$$
$$P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)$$
Decision Boundary
❖ To predict y = 1 or y = 0, we decide as follows:
❖ Predict y = 1 if hθ(x) ≥ 0.5
❖ Predict y = 0 if hθ(x) < 0.5
❖ Looking at the plot of the sigmoid function, g(z) ≥ 0.5 when z ≥ 0.
❖ Similarly, hθ(x) = g(θᵀx) ≥ 0.5 whenever θᵀx ≥ 0, and hθ(x) = g(θᵀx) < 0.5 whenever θᵀx < 0.
❖ Thus we predict y = 1 whenever θᵀx ≥ 0 and y = 0 whenever θᵀx < 0.
Decision Boundary
❖ Let's consider a training set and hypothesis as shown below.
❖ Let the parameter vector be θ₀ = −3, θ₁ = 1 and θ₂ = 1.
❖ Predict "y = 1" if −3 + x₁ + x₂ ≥ 0.
❖ Moving the 3 to the other side of the equality gives the straight line x₁ + x₂ = 3, which divides the data set as follows.
❖ Similarly, predict "y = 0" if x₁ + x₂ < 3, i.e., for points below the straight line that divides the data set.
Decision Boundary

❖ Such a line, which divides the data set, is known as the decision boundary.
❖ Let's consider another example with a non-linear decision boundary, as shown below.
Decision Boundary
❖ For this example we can use the hypothesis
❖ hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)
❖ If we choose θ₀ = −1, θ₁ = 0, θ₂ = 0, θ₃ = 1 and θ₄ = 1, then we predict y = 1 whenever −1 + x₁² + x₂² ≥ 0.
❖ Setting x₁² + x₂² = 1 then gives a circle, which is the decision boundary.
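A minimal sketch of prediction with this non-linear hypothesis, assuming the parameter values above; the test points are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # theta0 ... theta4 from the example

def h(x1, x2):
    """g(theta0 + theta1*x1 + theta2*x2 + theta3*x1^2 + theta4*x2^2)."""
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return sigmoid(theta @ features)

print(h(0.2, 0.2))   # inside the unit circle:  h < 0.5, predict y = 0
print(h(1.5, 1.0))   # outside the unit circle: h > 0.5, predict y = 1
```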

Decision Boundary
❖ We can even have a more complex non-linear decision boundary, as shown below; in such a case hθ(x) will be a higher-order polynomial.
Decision Boundary
❖ Example: given the following dataset, answer the following questions using logistic regression as a classifier.
❖ Calculate the probability of passing for the student who studied 33 hours.
❖ At least how many hours should a student study so that they pass the course with a probability of more than 94%?
❖ Use the model.

Study Hours   Pass (1) / Fail (0)
28            0
15            0
33            1
27            1
38            1
Cost Function
❖ Given a training set {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), (x⁽³⁾, y⁽³⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}, where xᵀ = [x₀ x₁ … xₙ], x₀ = 1, y ∈ {0, 1} and θᵀ = [θ₀ θ₁ … θₙ], with

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$

❖ Then, how do we choose the parameters θ?
❖ The logistic regression cost for a single example is

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

❖ Rewriting Cost(hθ(x), y) as a one-line equation:

$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$$
Cost Function
❖ For our understanding, let's plot this cost function for y = 1. (Figure: −log(hθ(x)) for y = 1.)
❖ Cost = 0 if y = 1 and hθ(x) = 1; this is where we want to be: if we predict correctly, the cost is zero.
❖ But as hθ(x) → 0, Cost → ∞.
❖ If hθ(x) = 0 (we predict p(y = 1 | x; θ) = 0) but y = 1, we penalize the learning algorithm with a very large cost.
Cost Function
❖ Again, let's plot the cost function, this time for y = 0. (Figure: −log(1 − hθ(x)) for y = 0.)
❖ Similarly, cost = 0 if y = 0 and hθ(x) = 0; this is where we want to be: if we predict correctly, the cost is zero.
❖ But as hθ(x) → 1, Cost → ∞.
❖ If hθ(x) = 1 (we predict p(y = 1 | x; θ) = 1) but y = 0, we penalize the learning algorithm with a very large cost.
Cost Function
❖ The overall cost function for logistic regression is:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$$

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right) \right]$$

❖ Why this function? Because it is convex.
❖ This cost function is also the right one from a statistical point of view (it follows from maximum likelihood estimation), and it is used to find θ.
❖ Given this cost function, in order to fit the parameters θ:
❖ Goal: $\min_\theta J(\theta)$
❖ We want to find the values of θ that minimize J(θ). Good news: J(θ) is a convex function. Bad news: there is no analytical (closed-form) solution.
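For illustration, a vectorized NumPy sketch of this cost function; the variable names and the small epsilon guard are assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) for design matrix X of shape (m, n+1) and labels y in {0, 1}."""
    m = len(y)
    h = sigmoid(X @ theta)          # predicted probabilities for all m examples
    eps = 1e-12                     # guards against log(0)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)) / m
```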
❖ The usual template of gradient descent is given as follows.
Gradient Descent
Gradient descent for linear regression:
Repeat {
$$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \qquad h_\theta(x) = \theta^\top x$$
}

Gradient descent for logistic regression:
Repeat {
$$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
}

❖ Prediction: once θ has been determined, predict y for a given new x using:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
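A minimal sketch of batch gradient descent for logistic regression; the learning rate and iteration count are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """X is the (m, n+1) design matrix with a leading column of ones; y holds labels in {0, 1}."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)            # h_theta(x) for every example
        grad = X.T @ (h - y) / m          # vectorized gradient over all theta_j
        theta -= alpha * grad             # simultaneous update of every theta_j
    return theta
```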

Gradient Descent
❖ The algorithm looks identical to that for linear regression, except for the different expression for hθ(x).
❖ Previously we discussed how to verify that gradient descent for linear regression converges correctly; the same method can be applied to gradient descent for logistic regression.
❖ In the gradient descent implementation of logistic regression, all of the parameters θⱼ must be updated simultaneously, which can be done with a loop over j or with a vectorized update (as in the sketch above).
❖ The idea of feature scaling that we applied for linear regression can also be used in logistic regression for faster convergence.
❖ Other than gradient descent, there are advanced optimization algorithms which can run logistic regression much more quickly and also enable the algorithm to scale to a large number of features.
Gradient Descent

❖ Conjugate gradient, BFGS and L-BFGS are examples of more sophisticated, advanced optimization algorithms. They still require a way to compute J(θ) and its derivatives, but they apply more sophisticated strategies (such as automatically choosing the step size) to minimize J(θ).
❖ Discussing the details of these algorithms is beyond the scope of this course.
Gradient Descent
❖ Advantages of the advanced algorithms over gradient descent include:
❖ No need to manually pick α
❖ Often faster than gradient descent
❖ Disadvantage:
❖ More complex
Multi-class Classification: One-vs-all
❖ Here we will look at the multi-class classification problem, and in particular an algorithm called one-vs-all classification.
❖ What is a multi-class classification problem? Here are some examples:
❖ Email foldering/tagging: Work, Friends, Family, Hobby
❖ Medical diagnosis: Not ill, Cold, Flu
❖ Weather: Sunny, Cloudy, Rain, Snow
❖ Data sets for binary and multi-class classification are shown below for comparison.
Logistic Regression
(Figures: a binary classification data set and a multiclass classification data set, plotted in the x₁–x₂ plane.)

❖ Classification of the multi-class problem can be done by applying one-vs-all, also called one-vs-rest, as follows.
One-vs-all (one-vs-rest)

(Figure: the three-class data set is split into three binary problems; for each class i, a separate classifier h_θ⁽ⁱ⁾(x) is trained for class i versus the rest of the classes.)

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta), \qquad i = 1, 2, 3$$
One-vs-all (one-vs-rest)
❖ Train a logistic regression classifier h_θ⁽ⁱ⁾(x) for each class i to predict the probability that y = i.
❖ Given a new input x, pick the class i that maximizes

$$\max_i \; h_\theta^{(i)}(x)$$
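A sketch of one-vs-all training and prediction using the gradient descent update shown earlier; the function names, learning rate and iteration count are assumptions, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.1, num_iters=1000):
    """Train one logistic regression classifier per class.

    X: (m, n+1) design matrix; y: integer labels 0..num_classes-1.
    Returns a (num_classes, n+1) array with one parameter vector per row.
    """
    m, n_plus_1 = X.shape
    all_theta = np.zeros((num_classes, n_plus_1))
    for c in range(num_classes):
        y_c = (y == c).astype(float)          # 1 for class c, 0 for "all the rest"
        theta = np.zeros(n_plus_1)
        for _ in range(num_iters):
            h = sigmoid(X @ theta)
            theta -= alpha * (X.T @ (h - y_c)) / m
        all_theta[c] = theta
    return all_theta

def predict_one_vs_all(all_theta, X):
    """For each example, pick the class whose classifier reports the highest probability."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)
```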

Regularization
❖ The problem of overfitting
❖ We have discussed both linear and logistic regression, which work well for many problems, but when applied to certain machine learning problems they can suffer from a problem called overfitting, which causes very poor performance.
❖ What is overfitting?
❖ Let's reconsider housing price prediction using linear regression for the following training examples. (Figure: price in $1000's vs. size in feet².)
Regularization
❖ For this prediction we can fit a linear function, a quadratic function, or a higher-order polynomial, as shown below. (Figures: price in $1000's vs. size in feet² for the three fits.)

hθ(x) = θ₀ + θ₁x  (underfitting, high bias)
hθ(x) = θ₀ + θ₁x + θ₂x²  (correct fit)
hθ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴ + ⋯  (overfitting, high variance)

❖ Overfitting: if we have too many features, the learned hypothesis may fit the training set very well (J(θ) ≈ 0), but fail to generalize to new examples (e.g., fail to predict prices for new houses).
Slide credit: Andrew Ng
Regularization
❖ Similarly, if we consider logistic regression. (Figures: age vs. tumor size for the three fits.)

hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂)  (underfitting, high bias)
hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂)  (correct fit)
hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂ + θ₆x₁³x₂ + θ₇x₁x₂³ + ⋯)  (overfitting, high variance)

Slide credit: Andrew Ng


Addressing overfitting
• x₁ = size of house
• x₂ = no. of bedrooms
• x₃ = no. of floors
• x₄ = age of house
• x₅ = kitchen size
• ⋮
• x₁₀₀
(Figure: price in $1000's vs. size in feet².)
1. Reduce the number of features
❖ Manually select which features to keep.
❖ Use a model selection algorithm (to be discussed later).
❖ The main disadvantage of this is losing some information.
2. Regularization
❖ Keep all the features, but reduce the magnitude/values of the parameters θⱼ.
❖ Works well when we have a lot of features, each of which contributes a bit to predicting y.
Slide credit: Andrew Ng
Cost Function with Regularization
(Figures: price in $1000's vs. size in feet²; a quadratic fit and a fourth-order fit.)

hθ(x) = θ₀ + θ₁x + θ₂x²
hθ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴
❖ Suppose we penalize θ₃ and θ₄ and make them really small (approximately zero):

$$\min_\theta \; \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$$

❖ Doing so, the overfitted higher-order polynomial above effectively becomes a quadratic plus very small terms, and we find a better-fitting hypothesis, as shown below.
Slide credit: Andrew Ng
Cost Function with Regularization
(Figure: price in $1000's vs. size in feet², with the regularized fourth-order fit.)

hθ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴

❖ Here is the idea behind regularization.


❖ Small values for parameters 𝜃0 , 𝜃1 , 𝜃2 , … , 𝜃𝑛
❖ Simpler hypothesis
❖ Less prone to overfitting

Slide credit: Andrew Ng


Regularization
❖ In regularization, we take the cost function and add an extra regularization term which shrinks every parameter:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

$$\min_\theta J(\theta), \qquad \lambda: \text{regularization parameter}$$

❖ λ is called the regularization parameter, and its main objective is to control the trade-off between two different goals:
❖ 1st goal: fit the training data well; this is achieved by the first term of J(θ).
❖ 2nd goal: keep the parameters small, and thus the hypothesis simple; this is achieved by the second term of J(θ).
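A minimal NumPy sketch of this regularized cost for linear regression; the names and shapes are assumptions made for the sketch.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1 / 2m) * (sum of squared errors + lam * sum of theta_j^2 for j >= 1)."""
    m = len(y)
    errors = X @ theta - y
    reg = lam * np.sum(theta[1:] ** 2)     # theta_0 is not penalized
    return (np.sum(errors ** 2) + reg) / (2 * m)
```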

Regularization
❖ What if λ is set to an extremely large value (say λ = 10¹⁰)?
❖ Does the algorithm fail to eliminate overfitting? Does gradient descent fail to converge? No: what happens is that the algorithm results in underfitting (it fails to fit even the training data well).
❖ In this case we penalize θ₁, θ₂, θ₃, θ₄ heavily and we end up with θ₁ ≈ 0, θ₂ ≈ 0, θ₃ ≈ 0, θ₄ ≈ 0, so hθ(x) ≈ θ₀.
(Figure: price in $1000's vs. size in feet², with a nearly flat fit.)

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^\top x$$
Regularization
❖ So, to choose the best fit while also keeping the parameters small, we have to choose the regularization parameter carefully.

Regularized linear regression:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

Goal: $\min_\theta J(\theta)$
n: number of features
θ₀ is not penalized.
Gradient Descent (Previously)
Repeat {
$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad (j = 1, 2, 3, \dots, n)$$
}
Regularized Gradient Descent
Repeat {
$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1, 2, 3, \dots, n)$$
}
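For illustration, one regularized gradient descent step written with NumPy; the bracketed λ/m term from the update above is applied to every θⱼ except θ₀. Names are assumptions.

```python
import numpy as np

def regularized_step(theta, X, y, alpha, lam):
    """One regularized gradient descent step for linear regression (theta_0 unregularized)."""
    m = len(y)
    errors = X @ theta - y                 # h_theta(x) - y for every example
    grad = X.T @ errors / m                # unregularized gradient
    grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j, skipping theta_0
    return theta - alpha * grad
```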
Comparison
❖ Regularized linear regression:

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

❖ Un-regularized linear regression:

$$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

❖ The factor (1 − αλ/m) is slightly less than 1, so on every iteration regularization first shrinks θⱼ a little and then applies the usual gradient step.
Regularized Normal Equation
$$X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ \vdots \\ (x^{(m)})^\top \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$$

• $\min_\theta J(\theta)$

$$\theta = \left(X^\top X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\right)^{-1} X^\top y$$

where the matrix multiplying λ is (n + 1) × (n + 1).
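A sketch of the regularized normal equation in NumPy; np.linalg.solve is used here instead of forming the inverse explicitly, which is an implementation choice rather than something the slides prescribe.

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lam * L)^(-1) X^T y, where L is the identity with its
    top-left entry set to zero so that theta_0 is not regularized."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0                              # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```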

Regularized logistic regression

(Figure: age vs. tumor size with a non-linear decision boundary.)

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$$
❖ Cost function:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
Regularized Gradient descent
Repeat {
$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1, 2, 3, \dots, n)$$
}

where now

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
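Finally, a sketch of the same update for logistic regression; it differs from the linear regression step only in using the sigmoid hypothesis. Names are assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_step(theta, X, y, alpha, lam):
    """One regularized gradient descent step for logistic regression (theta_0 unregularized)."""
    m = len(y)
    errors = sigmoid(X @ theta) - y
    grad = X.T @ errors / m
    grad[1:] += (lam / m) * theta[1:]          # regularize every theta_j except theta_0
    return theta - alpha * grad
```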

