Ch2: Regression and Regularization
Normal Equation
❖ Here we will see the normal equation, which for some linear regression problems gives a much better way to solve for the optimum value of the parameters θ.
❖ Gradient descent needs a number of iterations to reach the optimum value, whereas the normal equation, being an analytical method, takes one step to get the optimum value.
❖ We know how to determine the minimum value of a function from calculus, and the same principle is applied here.
$$J(\theta_0, \theta_1, \theta_2, \ldots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \theta_2, \ldots, \theta_n) = 0 \quad \text{for every } j, \text{ and solve for } \theta_0, \theta_1, \theta_2, \ldots, \theta_n$$
Normal Equation
❖ Example: m = 4 training examples.
❖ To apply the normal equation, take this data set, add an extra column for x₀ (all set to 1), and then we will find θ as shown below.
Normal Equation
❖ Next, construct a matrix X which contains all of the features and a vector y from the outputs. Then:

$$\theta = (X^T X)^{-1} X^T y$$
Normal Equation
❖ To generalize: for m training examples (x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾) and n features:
Normal Equation
❖ Then the matrix X, called the design matrix, will be:

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$
❖ And thus, after setting these up, we can evaluate the following equation:

$$\theta = (X^T X)^{-1} X^T y$$

❖ The inverse and transpose of a matrix are built into MATLAB/Octave, so this expression can be evaluated there in a single line.
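❖ The Octave one-liner from the slide is not reproduced in this text; below is a minimal NumPy sketch of the same computation. The numeric values are hypothetical placeholders standing in for the slide's m = 4 table.

```python
import numpy as np

# Hypothetical m = 4 data set (placeholder for the slide's table):
# two features per example and one output value each.
features = np.array([[2104.0, 3.0],
                     [1416.0, 2.0],
                     [1534.0, 3.0],
                     [ 852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# Add the extra x0 = 1 column to form the m x (n + 1) design matrix X.
X = np.hstack([np.ones((features.shape[0], 1)), features])

# theta = (X^T X)^{-1} X^T y; the pseudo-inverse is used for numerical safety.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # optimal parameters, no iterations or learning rate needed
```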
❖ We used feature scaling for the gradient descent method, but it is not necessary for the normal equation method.
Normal Equation
❖ Let's compare the advantages and disadvantages of the gradient descent and normal equation methods, for m training examples and n features.
Normal Equation
❖ Normal equation and non-invertibility: XᵀX can be non-invertible (singular), most commonly because of redundant (linearly dependent) features or because there are too many features (m ≤ n); deleting redundant features or using the pseudo-inverse (pinv) addresses this.
Logistic Regression
❖ If we want to predict an employee's salary increment based on their performance, we can use linear regression.
❖ But if we want to know whether an employee will get a promotion or not, there has to be a threshold value to decide between the two outcomes.
❖ Takes a probabilistic approach to learning discriminative functions (i.e., a
classifier)
❖ Instead of just predicting the class, give the probability of the instance
being that class, i.e., learn p(y | x).
Logistic Regression
❖ Comparison to the perceptron:
❖ The perceptron does not produce a probability estimate
❖ The perceptron is only interested in producing a discriminative model
❖ We know that 0 < p(event) < 1
❖ hθ(x) should give p(y = 1 | x; θ)
❖ The logistic regression model therefore wants 0 ≤ hθ(x) ≤ 1.
Hypothesis Representation
❖ The logistic regression model defines hθ(x) as:

$$h_\theta(x) = g(\theta^T x)$$

❖ where

$$g(z) = \frac{1}{1 + e^{-z}}$$

❖ This g(z) is called the sigmoid function (or logistic function), which is why the model is called logistic regression.
❖ Putting these together, hθ(x) becomes:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
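❖ A minimal NumPy sketch of this hypothesis (the function names are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is expected to include the x0 = 1 entry."""
    return sigmoid(theta @ x)
```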
Logistic Regression
❖ If we plot the sigmoid function g(z), it looks as shown below.
[Figure: the logistic/sigmoid function g(z), rising from 0 toward 1 and crossing 0.5 at z = 0.]
❖ Here we can see that g(z) is between 0 and 1, and the same is true for hθ(x).
❖ Given hθ(x), what we need to do is fit the parameters θ to our data, i.e., determine the values of the parameters θ from the given data set.
Interpretation of the Hypothesis Output hθ(x)
❖ hθ(x) = estimated probability that y = 1 on input x
❖ More formally:

$$h_\theta(x) = P(y = 1 \mid x; \theta)$$

❖ This reads as "the probability that y = 1, given x, parameterized by θ".
❖ y has only two possible values, 0 or 1, so:

$$P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$$
$$P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)$$
Decision Boundary
❖ If we want to predict y = 1 versus y = 0, here is how we can decide:
❖ Predict y = 1 if hθ(x) ≥ 0.5
❖ Predict y = 0 if hθ(x) < 0.5
❖ Looking at the plot of the sigmoid function, g(z) ≥ 0.5 whenever z ≥ 0.
❖ Similarly, hθ(x) = g(θᵀx) ≥ 0.5 whenever θᵀx ≥ 0, and hθ(x) = g(θᵀx) < 0.5 whenever θᵀx < 0.
❖ Thus we predict y = 1 whenever θᵀx ≥ 0 and y = 0 whenever θᵀx < 0.
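❖ The threshold rule above can be written directly in code; a tiny sketch (the example θ values are illustrative only):

```python
import numpy as np

def predict(theta, x):
    """Predict y = 1 exactly when theta^T x >= 0, i.e. when h_theta(x) >= 0.5."""
    return 1 if float(np.dot(theta, x)) >= 0.0 else 0

# Illustrative theta = [-3, 1, 1]: predicts y = 1 whenever x1 + x2 >= 3.
print(predict(np.array([-3.0, 1.0, 1.0]), np.array([1.0, 2.0, 2.0])))  # -> 1
```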
Decision Boundary
❖ Let's consider a training set and hypothesis as shown below.
Decision Boundary
❖ Such a line, which divides the data set, is known as a decision boundary.
❖ Let's consider another example with a non-linear decision boundary, as shown below.
Decision Boundary
❖ For this example we can have a hypothesis as given below:
hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)
❖ If we choose θ₀ = −1, θ₁ = 0, θ₂ = 0, θ₃ = 1 and θ₄ = 1, then we predict y = 1 whenever −1 + x₁² + x₂² ≥ 0.
❖ Setting −1 + x₁² + x₂² = 0, i.e., x₁² + x₂² = 1, gives a circle, which is the decision boundary.
Decision Boundary
❖ We can even have more complex non-linear decision boundaries, as shown below; in such cases hθ(x) will be a function of a higher-order polynomial.
Decision Boundary
❖ Example: given the following dataset, answer the questions below using logistic regression as a classifier.

Study Hours    Pass (1) / Fail (0)
28             0
15             0
33             1
27             1
38             1

❖ Calculate the probability of passing for the student who studied 33 hours.
❖ At least how many hours should a student study to pass the course with a probability of more than 94%?
❖ Use the model (one way to fit it is sketched below).
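❖ One possible way to work the exercise with scikit-learn is sketched below. The single-feature model hθ(x) = g(θ₀ + θ₁·hours) and the very large C (to approximate an unregularized fit) are my assumptions, not prescribed by the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[28.0], [15.0], [33.0], [27.0], [38.0]])   # study hours
passed = np.array([0, 0, 1, 1, 1])                           # pass (1) / fail (0)

# A very large C means almost no regularization, so the fit approximates a
# plain maximum-likelihood logistic regression.
model = LogisticRegression(C=1e6, max_iter=10000).fit(hours, passed)
theta0, theta1 = model.intercept_[0], model.coef_[0, 0]

# (a) Probability of passing after 33 study hours.
p_33 = 1.0 / (1.0 + np.exp(-(theta0 + theta1 * 33.0)))

# (b) Hours h such that P(pass) > 0.94, i.e. theta0 + theta1*h > log(0.94/0.06).
h_min = (np.log(0.94 / 0.06) - theta0) / theta1
print(p_33, h_min)
```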
Cost Function
❖ Given a training set {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), (x⁽³⁾, y⁽³⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}, where xᵀ = [x₀ x₁ … xₙ], x₀ = 1, y ∈ {0, 1} and θᵀ = [θ₀ θ₁ … θₙ],

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

❖ Then, how do we choose the parameters θ?
❖ The logistic regression cost function is shown below.
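❖ The cost-function formula itself sits in the slide's figure; the standard logistic-regression (cross-entropy) cost is J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log hθ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − hθ(x⁽ⁱ⁾)) ], and a NumPy sketch of it follows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost for logistic regression.

    X is the m x (n + 1) design matrix (first column all ones), y is in {0, 1}.
    """
    m = y.size
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```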
Cost Function
❖ For our understanding, let's plot this cost function for the case y = 1, as shown below.
Cost Function
❖ Again for our understanding, let's plot the cost function for the case y = 0, as shown below.
Gradient Descent
❖ The algorithm looks identical to that of linear regression, except for the difference in the expression for hθ(x).
❖ Previously we discussed how to make sure gradient descent for linear regression converges correctly; the same method can be applied to check that gradient descent for logistic regression converges.
❖ In the gradient descent implementation of logistic regression, all of the parameters θⱼ must be updated simultaneously, either with a for loop or with a single vectorized update (see the sketch below).
❖ The idea of feature scaling we applied for linear regression can also be used in logistic regression for faster convergence.
❖ Besides gradient descent, there are advanced optimization algorithms which can run logistic regression much more quickly and also enable the algorithm to scale up when we have a large number of features.
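❖ A minimal NumPy sketch of batch gradient descent for logistic regression; the single vectorized line updates every θⱼ simultaneously (the default learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression.

    Each iteration performs the simultaneous update
    theta := theta - (alpha / m) * X^T (h_theta(X) - y).
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        theta -= (alpha / m) * (X.T @ (h - y))
    return theta
```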
Gradient Descent
❖ Advantages of the advanced algorithms over gradient descent include:
❖ No need to manually pick α
❖ Often faster than gradient descent
❖ Disadvantage:
❖ More complex
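❖ As a sketch of one such advanced optimizer, scipy.optimize.minimize with BFGS can minimize the logistic cost given the cost and its gradient, with no learning rate α to pick. The tiny data set below is made up purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return the logistic cost and its gradient for use by the optimizer."""
    m = y.size
    h = sigmoid(X @ theta)
    J = -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    grad = (1.0 / m) * (X.T @ (h - y))
    return J, grad

# Tiny made-up data set: one feature plus the x0 = 1 column.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0],
              [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

res = minimize(cost_and_grad, np.zeros(X.shape[1]), args=(X, y),
               jac=True, method="BFGS")
theta_opt = res.x
```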
Multi-class Classification: One-vs-all
❖ Here we will look at the multi-class classification problem, and in particular at an algorithm called one-vs-all classification.
❖ What is a multi-class classification problem? Here are some examples:
❖ Email foldering/tagging: Work, Friends, Family, Hobby
❖ Medical diagnosis: Not ill, Cold, Flu
❖ Weather: Sunny, Cloudy, Rain, Snow
❖ Data sets for binary and multi-class classification are shown below for comparison.
Logistic Regression
[Figure: two scatter plots in the (x₁, x₂) plane, one for binary classification and one for multiclass classification.]
❖ Classification in the multi-class case can be done by applying one-vs-all, also called one-vs-rest, as follows.
One-vs-all (one-vs-rest)
[Figure: a three-class training set (Class 1, Class 2, Class 3) in the (x₁, x₂) plane, and the three binary classifiers hθ⁽¹⁾(x), hθ⁽²⁾(x), hθ⁽³⁾(x), each separating one class from the rest.]

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \qquad (i = 1, 2, 3)$$
One-vs-all (one-vs-rest)
❖ Train a logistic regression classifier hθ⁽ⁱ⁾(x) for each class i to predict the probability that y = i.
❖ Given a new input x, pick the class i that maximizes

$$\max_i \; h_\theta^{(i)}(x)$$
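❖ A sketch of one-vs-all built on a binary trainer such as the gradient-descent sketch shown earlier; the helper names here are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, train_binary):
    """Fit one binary logistic classifier per class.

    train_binary(X, y01) is any trainer returning theta for 0/1 labels,
    e.g. the gradient_descent sketch shown earlier.
    """
    return [train_binary(X, (y == i).astype(float)) for i in range(num_classes)]

def predict_one_vs_all(thetas, x):
    """Pick the class i whose classifier output h_theta^(i)(x) is largest."""
    probs = [sigmoid(theta @ x) for theta in thetas]
    return int(np.argmax(probs))
```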
Regularization
❖ The problem of overfitting
❖ We discussed both linear and logistic regression, which work well for many problems, but applied to certain machine learning problems they can lead to a problem called overfitting, which causes very poor performance of the algorithm.
❖ What is overfitting?
❖ Let's reconsider housing price prediction using linear regression for the following training examples.
[Figure: training examples of price ($, in 1000's) versus size (in feet²).]
Regularization
❖ For this prediction we can fit a linear function, a quadratic function, or a higher-order polynomial function, as shown below.
[Figure: three fits of price ($, in 1000's) versus size (in feet²).]
Underfitting (high bias): hθ(x) = θ₀ + θ₁x
Correct fit: hθ(x) = θ₀ + θ₁x + θ₂x²
Overfitting (high variance): hθ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴ + ⋯
❖ Overfitting: if we have too many features, the learned hypothesis may fit the training set very well (J(θ) ≈ 0) but fail to generalize to new examples (e.g., fail to predict prices for new houses).
Slide credit: Andrew Ng
Regularization
❖ Similarly, overfitting occurs with logistic regression if we use a high-order polynomial hypothesis such as
hθ(x) = g(θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴)
instead of a simpler one such as
hθ(x) = g(θ₀ + θ₁x + θ₂x²).
❖ If the higher-order parameters are forced to be very small, the overfitted higher-order polynomial effectively becomes a quadratic plus very small terms, and we obtain a better-fitting classifier, as shown below.
Slide credit: Andrew Ng
Cost Function with Regularization
[Figure: housing data, price ($, in 1000's) versus size (in feet²), with an overfitted fourth-order polynomial fit.]

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$

❖ The idea is to add penalty terms to the cost function so that the higher-order parameters (here θ₃ and θ₄) must stay small, forcing the fourth-order fit to behave essentially like a quadratic.
Regularization
❖ What if λ is set to an extremely large value (say λ = 10¹⁰)? Consider the options: the algorithm fails to eliminate overfitting; the algorithm results in underfitting (fails to fit even the training data well); gradient descent fails to converge.
❖ In this case we penalize θ₁, θ₂, θ₃, θ₄ so heavily that we end up with θ₁ ≈ 0, θ₂ ≈ 0, θ₃ ≈ 0, θ₄ ≈ 0, so the hypothesis

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$$

reduces to roughly hθ(x) ≈ θ₀: the algorithm results in underfitting.
[Figure: the resulting flat line through the price ($, in 1000's) versus size (in feet²) data.]
Regularization
❖ So, to get a good fit while also keeping the parameters small, we have to choose the regularization parameter λ with care.
Regularized linear regression:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$$

Goal: $\min_\theta J(\theta)$
n: number of features
θ₀ is not penalized
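❖ A NumPy sketch of this regularized cost; note that θ₀ (the first entry of theta) is excluded from the penalty:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) [ sum((X theta - y)^2) + lambda * sum(theta_j^2, j >= 1) ]."""
    m = y.size
    err = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)      # theta_0 is not penalized
    return (np.sum(err ** 2) + penalty) / (2.0 * m)
```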
Gradient Descent (Previously)
Repeat {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad (j = 1, 2, 3, \ldots, n)$$

}
Regularized Gradient Descent
Repeat {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_j := \theta_j - \alpha \left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1, 2, 3, \ldots, n)$$

}
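❖ A vectorized NumPy sketch of one such regularized update step (again θ₀ is excluded from the penalty term):

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    """One step of regularized gradient descent for linear regression."""
    m = y.size
    grad = (X.T @ (X @ theta - y)) / m     # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                           # do not regularize theta_0
    return theta - alpha * (grad + reg)
```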
Comparison
❖ Regularized linear regression update, written in an equivalent form:

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

❖ The factor (1 − αλ/m) is slightly less than 1, so on every iteration regularization shrinks θⱼ a little before the usual gradient step is applied.
Regularized Normal Equation
$$X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ \vdots \\ (x^{(m)})^\top \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$$

• Goal: $\min_\theta J(\theta)$

$$\theta = \left(X^\top X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\right)^{-1} X^\top y$$

where the matrix multiplying λ is (n + 1) × (n + 1).
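❖ A NumPy sketch of the regularized normal equation; with λ > 0 and the x₀ = 1 column present, the matrix in parentheses is invertible even when XᵀX alone is not:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * L)^{-1} X^T y, with L the (n+1)x(n+1) identity
    whose top-left entry is zeroed so that theta_0 is not penalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```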
Regularized logistic regression
hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂ + θ₆x₁³x₂ + θ₇x₁x₂³ + ⋯)
[Figure: tumor data plotted as age versus tumor size, with a non-linear decision boundary.]
❖ Cost function:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
Regularized Gradient Descent
Repeat {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_j := \theta_j - \alpha \left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1, 2, \ldots, n)$$

}
where now

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
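❖ A vectorized sketch of one regularized gradient-descent step for logistic regression, mirroring the update above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gd_step(theta, X, y, alpha, lam):
    """theta_j := theta_j - alpha * [ (1/m) sum((h - y) x_j) + (lambda/m) theta_j ],
    with theta_0 left unregularized."""
    m = y.size
    grad = (X.T @ (sigmoid(X @ theta) - y)) / m
    reg = (lam / m) * theta
    reg[0] = 0.0
    return theta - alpha * (grad + reg)
```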