The document covers logistic regression, focusing on its application as a binary classifier, the formulation of its likelihood, and the derivation of its gradient and Hessian. It discusses the importance of regularization to prevent overfitting in models, emphasizing methods to reduce the number of features or penalize model parameters. Additionally, it outlines the cost function used in logistic regression, which is the cross-entropy loss, and provides insights into the implications of regularization on model performance.

Logistic Regression - Regularization

Mourad Gridach
Department of Computer Science
High Institute of Technology - Agadir
Last Lecture

• Gradient descent for linear regression

• Linear regression with multiple variables

2
Today’s Lecture
• We will cover classification

• Binary classifiers using a technique called Logistic Regression:

– Apply logistic regression to discriminate between two classes.

– Formulate the logistic regression likelihood.

– Derive the gradient and Hessian of logistic regression.

– How to do logistic regression with the softmax link.

3
Notes on Notation
• In linear regression we used w to refer to weights or
parameters of the model

• In this lecture, we will use θ to refer to the model parameters, since it is the most common notation in probability theory.

4
McCulloch-Pitts Model of a Neuron

5
Sigmoid Function

• We want 0 ≤ h_θ(x) ≤ 1

• Logistic regression: h_θ(x) = f(θᵀx)

• Where: f(z) = 1 / (1 + e^(−z))

• "f" is called the sigmoid function, also known as the logistic function

• Question: what are the max and min values that this function can take? (see the sketch below)
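
A minimal sketch in Python/NumPy (the names sigmoid and hypothesis are illustrative, not from the lecture) showing the function and answering the question numerically:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # Logistic regression hypothesis h_theta(x) = sigmoid(theta^T x)
    return sigmoid(np.dot(theta, x))

# The sigmoid approaches 0 for very negative inputs and 1 for very positive ones,
# so its values stay strictly between 0 and 1.
print(sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0))  # ~0.000045, 0.5, ~0.999955
```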
Example of Data in Logistic Regression

x1      x2      y
34.1    10.12   0
30.11   43.21   1
35.1    72.12   1
60.2    86.78   1
79.23   75.23   0
45.08   96.67   1
75.89   46.23   0
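
As a quick sanity check (not part of the lecture), here is a hedged sketch that fits an off-the-shelf logistic regression to the seven examples above using scikit-learn, assuming it is installed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The (x1, x2) -> y examples from the table above
X = np.array([[34.1, 10.12], [30.11, 43.21], [35.1, 72.12], [60.2, 86.78],
              [79.23, 75.23], [45.08, 96.67], [75.89, 46.23]])
y = np.array([0, 1, 1, 1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
# Predicted probabilities P(y=0) and P(y=1) for a new, made-up point
print(clf.predict_proba([[50.0, 60.0]]))
```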
Applications

• Spam vs Not Spam

• Sentiment classification

• Medical diagnosis

8
Probabilistic Interpretation
• Logistic regression: h_θ(x) = P(y = 1 | x; θ)
• Consequently: P(y = 0 | x; θ) = 1 − h_θ(x)
Logistic Regression Hypothesis

Formula: h_θ(x) = 1 / (1 + e^(−θᵀx))
Linear Separating Hyper-planes

• θᵀx = 0 is the equation of the separating hyperplane (the decision boundary where h_θ(x) = 0.5)


12
Next
what is the Cost/Loss Function?

13
Bernoulli Distribution: a model of coins

• A Bernoulli random variable (r.v.) X takes values in {0, 1}

• P(X = 1) = θ, where θ ∈ [0, 1]. We can write this probability as follows:

  P(X = x) = θ if x = 1, and 1 − θ if x = 0
  (equivalently, P(X = x) = θ^x (1 − θ)^(1 − x))
14
Entropy
• In information theory, entropy H is a measure of the uncertainty
  associated with a random variable. It is defined as:

  H(X) = − Σ_x P(X = x) log P(X = x)

• For a Bernoulli variable X with parameter θ, the entropy is:

  H(X) = − θ log θ − (1 − θ) log(1 − θ)
15
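
A small sketch of the Bernoulli entropy in Python/NumPy (base-2 logarithms are assumed here so the result is in bits; the slides may use natural log instead):

```python
import numpy as np

def bernoulli_entropy(theta):
    # H(X) = -theta*log2(theta) - (1-theta)*log2(1-theta), with 0*log(0) taken as 0
    if theta == 0.0 or theta == 1.0:
        return 0.0
    return -theta * np.log2(theta) - (1 - theta) * np.log2(1 - theta)

# Entropy is largest (1 bit) for a fair coin (theta = 0.5) and 0 for a deterministic one
for theta in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(theta, round(bernoulli_entropy(theta), 4))
```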
Logistic Regression
• The logistic regression model specifies the probability of a binary output
  y_i ∈ {0, 1} given the input x_i as follows:

  p(y_i | x_i, θ) = h_θ(x_i)^(y_i) (1 − h_θ(x_i))^(1 − y_i),  where h_θ(x_i) = 1 / (1 + e^(−θᵀx_i))

16
Logistic Regression – Cost Function

Cross-Entropy loss will be the cost function:

  J(θ) = − (1/m) Σ_{i=1..m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

Let us prove that, starting from the likelihood of the training data.

17
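
A minimal vectorized sketch of this cost function in Python/NumPy (the small eps term guards against log(0) and is a numerical safeguard, not part of the formula on the slide):

```python
import numpy as np

def cross_entropy_loss(theta, X, y):
    # Binary cross-entropy J(theta), averaged over the m training examples.
    # X has shape (m, n), y has shape (m,) with entries in {0, 1}.
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for every example
    eps = 1e-12                            # avoid log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```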
(Binary) Cross Entropy Loss - Intuition
Summary:
Correct answer → low loss
Wrong answer → high loss

Compute the loss for each example (a worked check follows the table):

y    y_pred
1    0.2
1    0.8
0    0.1
0    0.9
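
A worked check of the table above, assuming natural logarithms and the per-example loss −[y log ŷ + (1 − y) log(1 − ŷ)]:

```python
import numpy as np

def example_loss(y, y_pred):
    # Per-example binary cross-entropy loss
    return -(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

for y, y_pred in [(1, 0.2), (1, 0.8), (0, 0.1), (0, 0.9)]:
    print(y, y_pred, round(example_loss(y, y_pred), 3))
# (1, 0.2) -> 1.609  confident but wrong: high loss
# (1, 0.8) -> 0.223  correct: low loss
# (0, 0.1) -> 0.105  correct: low loss
# (0, 0.9) -> 2.303  confident but wrong: high loss
```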
Gradient of binary logistic regression – Exercise

Cost function:

  J(θ) = − (1/m) Σ_{i=1..m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

Prove that the gradient of the loss is:

  ∂J(θ)/∂θ_j = (1/m) Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) x_j^(i),   i.e.   ∇J(θ) = (1/m) Xᵀ ( h_θ(X) − y )

19
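
One way to implement (and numerically verify) this gradient, as a sketch in Python/NumPy:

```python
import numpy as np

def gradient(theta, X, y):
    # Gradient of the cross-entropy loss: (1/m) * X^T (h_theta(X) - y)
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    return (X.T @ (h - y)) / m
```

A quick way to check the derivation is to compare this against a finite-difference approximation of the cost function on random data.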
Hessian of binary logistic regression - Exercise

Cost function:

  J(θ) = − (1/m) Σ_{i=1..m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

Prove that the Hessian (second derivative) of logistic regression is:

  H = (1/m) Xᵀ S X,   where S = diag( h_θ(x^(i)) (1 − h_θ(x^(i))) )

• One can show that H is positive definite;

• Therefore, the NLL is convex → it has a unique global minimum.
20
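
A corresponding sketch of the Hessian in Python/NumPy, with a numerical convexity check (the notation follows the formula above; this is a sketch, not the lecture's reference implementation):

```python
import numpy as np

def hessian(theta, X):
    # Hessian of the cross-entropy loss: (1/m) * X^T S X,
    # where S = diag(h_i * (1 - h_i)) and h_i = sigmoid(theta^T x_i)
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    S = np.diag(h * (1 - h))
    return (X.T @ S @ X) / m

# Since every h_i * (1 - h_i) > 0, the eigenvalues of H are non-negative;
# numerically: np.all(np.linalg.eigvalsh(hessian(theta, X)) >= -1e-10)
```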
Summary so far
Hypothesis of a logistic regression model:

  h_θ(x) = f(θᵀx) = 1 / (1 + e^(−θᵀx))

OR, written as a probability: P(y = 1 | x; θ) = h_θ(x)

Cost/Loss/Objective function, called the Cross-Entropy Loss:

  J(θ) = − (1/m) Σ_{i=1..m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

21
Regularization:

The Problem of Overfitting

22
Training vs. Testing

• Students vs Exams

• Training: students learn new concepts during the lectures → Training data (training set)

• Testing: the professor tests them during exams → Test data (test set)
23
Linear Regression Revisited
[Figure: three plots of output (y) vs. input (x), showing fits of increasing complexity]

• See white board

24
Linear Regression Revisited
[Figure: three plots of output (y) vs. input (x), showing fits of increasing complexity]

Overfitting: if we have too many features, the learned hypothesis may fit the
training set very well but fail to generalize to new examples (e.g., predicting
prices for examples it has not seen).
25
Logistic Regression and Overfitting

26
Underfitting

27
Overfitting

28
Best Solution

29
How to Solve Overfitting
1. Reduce the number of features.
   – Manually select which features to keep.
   – Model selection algorithms (out of the scope of this course).

2. Regularization
   – Keep all the features, but reduce the magnitude/values of the parameters θ_j.
   – Works well when we have a lot of features, each of which
     contributes a bit to predicting y.

30
So let us apply the second solution:

Regularization

31
Intuition

[Figure: two plots of output (y) vs. input (x)]

32
Intuition

[Figure: two plots of output (y) vs. input (x)]

• Suppose we penalize θ3 and θ4 and make them really small by adding large penalty terms on θ3² and θ4² to the cost function.

33
Regularization
• Small values for parameters
– “Simpler” hypothesis
– Smooth function
– Less prone to Overfitting
• In the last example, we penalize θ3 and θ4
• Let us take the last example of car prices
– Features: x1, x2, …, xn
– Parameters: θ0, θ1, …, θn
• Question: how to choose which parameters to penalize ?
34
Regularization – General Mathematical Formula
• How to choose which parameters to penalize ?

• Solution: shrink all the parameters without focusing on specific ones

• We will try to keep all the parameters small

• The general cost function will be:

  J(θ) = (1/(2m)) [ Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )² + λ Σ_{j=1..n} θ_j² ]

35
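
A minimal sketch of this regularized cost in Python/NumPy, following the common convention (assumed here) that the intercept θ0 is not penalized:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    # J(theta) = (1/(2m)) * [ sum_i (h_theta(x^(i)) - y^(i))^2 + lam * sum_{j>=1} theta_j^2 ]
    m = X.shape[0]
    errors = X @ theta - y                  # h_theta(x) = theta^T x for linear regression
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 (the intercept) is not penalized
    return (np.sum(errors ** 2) + penalty) / (2 * m)
```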
Remark
• What if λ (the regularization term) is very large (e.g., λ = 10^20)?
• Let us take this example with five parameters θ0, θ1, θ2, θ3, θ4
  – With such a large λ, minimizing the cost forces θ1 ≈ θ2 ≈ θ3 ≈ θ4 ≈ 0,
    leaving essentially only θ0
  – What will happen?

[Figure: plot of output (y) vs. input (x) showing an almost flat fit h_θ(x) ≈ θ0]

  – Answer: underfitting

36
Regularization for Linear/Logistic Regression

• The new Linear Regression cost function after adding the regularization term will be:

  J(θ) = (1/(2m)) [ Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )² + λ Σ_{j=1..n} θ_j² ]

• Recall: between linear and logistic regression, the only differences are the hypothesis
  h_θ(x) and the cost function J(θ); the gradient descent update has the same form.
37
Gradient Descent in Action
• Add the derivative of the regularization term to the gradient descent update for linear regression:

Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) x_0^(i)
  θ_j := θ_j − α [ (1/m) Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) x_j^(i) + (λ/m) θ_j ]
}  for j = 1, …, n

38
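
A sketch of one regularized gradient descent step for linear regression in Python/NumPy, matching the update above (θ0 left unregularized, as in the j = 1, …, n loop):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    # theta_j := theta_j - alpha * [ (1/m) * sum_i (h(x^(i)) - y^(i)) x_j^(i) + (lam/m) * theta_j ]
    m = X.shape[0]
    grad = (X.T @ (X @ theta - y)) / m   # unregularized gradient for all j
    reg = (lam / m) * theta              # regularization term
    reg[0] = 0.0                         # do not shrink the intercept theta_0
    return theta - alpha * (grad + reg)
```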
Summary

• Logistic Regression

• Deep understanding of Classification problems

• Hypothesis for Logistic regression

• Cost function for Logistic regression

• Regularization problem
39
Questions

40
