Logistic Regression


CLASSIFICATION

• So far, the methods we have looked at are aimed at predicting the value of a continuous variable.
• When the dependent variable is discrete/categorical, it is no longer a regression problem but a classification problem.
• The goal of classification is to place each observation into a category based on a set of predictor variables.
• The following are some common examples of where classification would be used:

– Deciding where to set the cut-off for some diagnostic test (pregnancy tests, prostate or breast cancer screening tests, etc.)
– Determining whether a cancer has gone into remission based on treatment and various other indicators
– Categorizing photos into distinct categories using a multinomial classification model.
Why can’t we use linear regression for classification?

• Let’s consider a simple binary classification problem where y takes only two values.
• Say we want to classify whether a tumor is malignant or benign.

• Let’s see what happens when we try to fit a regression line.
• Threshold the classifier output at 0.5:
If the output ≥ 0.5, predict y = 1
If the output < 0.5, predict y = 0

Does it work?
What happens to the regression line when we get more samples?
Logistic Regression
What??

• Sometimes in machine learning, the dependent variable is non-continuous, i.e. it falls within a finite set of categories.
• We may need a model that gives us a probability as output, e.g. a coin toss.
• A logistic regression model is an efficient mechanism for predicting the likelihood of an event or choice being made.
How does it work??

• The probability output of a logistic regression model can be used in two ways:
– ‘As is’
– Converted to a binary category
• For example, let’s say we want to create a logistic regression model to predict the probability of getting heads in a coin toss.
• Let’s call that probability p(heads). If the model predicts that p(heads) = 0.4, then over n = 20 trials we would expect heads about p(heads) × 20 = 8 times.
• In many cases, you’ll map the logistic regression output onto a binary classification problem, in which the goal is to correctly predict one of two possible labels, e.g. spam or not spam.
• Let’s say the model returns a probability of 0.9995 that an email is spam; this means it is highly likely that the email is spam.
• Likewise, a probability of 0.003 would suggest that the email is very likely not spam.
• However, how do we determine the category for an email with, say, a 0.45 probability of being spam?
• Ans: We have to set a cut-off value for the probability, a process known as thresholding.
• A logistic regression model ensures that its output always falls between 0 and 1 by making use of the sigmoid function:
y′ = 1 / (1 + e^(−z)),   where z = b + w1·x1 + w2·x2 + … + wN·xN

• y′ is the output of the logistic regression model for a particular example.
• The w values are the model’s learned weights and b is the bias.
• The x values are the feature values for that particular example.
• The sigmoid function yields the following plot

• z is also called the log odds, because it is the log of the probability of the 1 label (e.g. heads) divided by the probability of the 0 label (e.g. tails): z = log(y′ / (1 − y′)).
• Let’s say we have a logistic regression model that was trained using three features and that learned the following bias and weights:
b = 1
w1 = 2
w2 = −1
w3 = 5
Suppose the model is given the following feature values:
x1 = 0
x2 = 10
x3 = 2
• Then z = b + w1·x1 + w2·x2 + w3·x3 = 1 + (2)(0) + (−1)(10) + (5)(2) = 1
• And y′ = 1 / (1 + e^(−1)) ≈ 0.731
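
A minimal sketch of this calculation in Python/NumPy (the variable names are only illustrative):

import numpy as np

# Bias, weights and feature values from the worked example above
b = 1.0
w = np.array([2.0, -1.0, 5.0])
x = np.array([0.0, 10.0, 2.0])

# Linear part of the model (the log odds)
z = b + np.dot(w, x)               # 1 + 0 - 10 + 10 = 1

# Sigmoid squashes z into the range (0, 1)
y_prime = 1.0 / (1.0 + np.exp(-z))
print(z, round(y_prime, 3))        # 1.0 0.731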
Cost function

• Linear regression used MSE as the loss function.

• What happens when we try to use the MSE loss function for classification problems?
• Well, due to the non-linear nature of the sigmoid function, plotting the MSE loss against the weights for logistic regression gives a non-convex surface, and it would be hard for gradient descent to converge to the global minimum.
• For logistic regression, we instead use the following cost function for binary classification:
Cost(y′, y) = −log(y′)       if y = 1
Cost(y′, y) = −log(1 − y′)   if y = 0
• Combining the two functions we get:
Cost(y′, y) = −[y·log(y′) + (1 − y)·log(1 − y′)]
• So for n examples:
J(w, b) = −(1/n) Σᵢ [yᵢ·log(y′ᵢ) + (1 − yᵢ)·log(1 − y′ᵢ)]
• So how can we obtain the optimal parameters that give us the minimum value for our cost function?
• The gradient descent algorithm works the same way for logistic regression as it does for linear regression.
• Hence, we have to take the cost function and differentiate it w.r.t. each of the weights (H/W).
Logistic Regression using sklearn
• There are mainly two ways of implementing classification using sklearn:
– the SGDClassifier class
– the LogisticRegression class
• These two models aim to achieve the same goal but use different optimization techniques.
• For example, using SGDClassifier(loss='log_loss') will result in a model equivalent to LogisticRegression, but fitted via stochastic gradient descent.
• LogisticRegression uses a full-batch solver by default (lbfgs), processing the whole dataset at every step.
• So for larger datasets, it would be better to use SGDClassifier for two main reasons:
– too many full-batch steps would be required
– each full-batch step is computationally expensive
A short sketch comparing the two is shown below.
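
A minimal sketch (with illustrative synthetic data) of the two equivalent ways of fitting a logistic regression model in sklearn:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Full-batch solver (lbfgs by default)
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The same model, fitted with stochastic gradient descent
sgd_clf = SGDClassifier(loss="log_loss", max_iter=1000).fit(X_train, y_train)

print(log_reg.score(X_test, y_test), sgd_clf.score(X_test, y_test))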

Class presentation: Describe the differences between gradient descent and stochastic gradient descent.
Implementation
We are going to train a classifier using the LogisticRegression class on the Titanic dataset. The main steps, followed by a rough sketch, are:

1. Import the required libraries
2. Load the data
3. Feature engineering (one-hot encoding)
4. Train the model
5. Make predictions
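
A rough sketch of these steps. The way the data is loaded (seaborn's bundled copy of the Titanic dataset) and the columns used are assumptions, so adapt them to the actual file used in class:

import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 2. Load the data (assumed source: seaborn's built-in titanic dataset)
df = sns.load_dataset("titanic")[["survived", "pclass", "sex", "age", "fare"]].dropna()

# 3. Feature engineering: one-hot encode the categorical 'sex' column
X = pd.get_dummies(df.drop(columns="survived"), drop_first=True)
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 4. Train the model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 5. Make predictions and check test accuracy
print(model.score(X_test, y_test))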
pd.get_dummies()
• pandas.get_dummies() is used for data manipulation: it converts categorical data into dummy or indicator variables.
• Machine learning models cannot interpret raw categorical data, so it needs to be translated into numerical data.
• This process is referred to as dummy variable encoding.
• In our previous example, a random sample of five rows looks like this:

• Then, after dummy variable encoding…

• Looking at the dummy-encoded dataframe, we can see that there is a lot of redundancy.
• Our data can be represented using fewer columns by using the drop_first parameter, as in the sketch below.
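
A small illustrative example of dummy encoding (the dataframe below is made up):

import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "female"],
                   "embarked": ["S", "C", "S"]})

# Full dummy encoding: one indicator column per category value
print(pd.get_dummies(df))

# drop_first=True drops one redundant column per original feature
print(pd.get_dummies(df, drop_first=True))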
Regularization
• A common problem that one can encounter when training models is overfitting.
• Overfitting is a modeling error that occurs when our model fits exactly or too closely to the training data.
• As a result, an overfit model performs well on the training data but not as well on unseen examples.

[It’s like memorising past exam questions and failing the exam loool]

• Underfitting occurs when our model fails to capture the relationship between the input and output variables, resulting in a high error rate on both the training set and unseen examples.
• Regularization does not improve the performance of our model on the training dataset; however, it helps the model fit unseen examples better.
• The following three strategies are used to reduce model complexity:
– Early stopping, i.e. limiting the number of training iterations
– L1 regularization
– L2 regularization
• A linear regression model that implements the L1 norm for regularisation is called lasso regression, and one that implements the (squared) L2 norm for regularisation is called ridge regression. To implement these two, note that the linear regression model itself stays the same:

ŷ = b + w1·x1 + w2·x2 + … + wN·xN

• It is the calculation of the loss function that includes the regularisation term, i.e.

• Loss = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |wⱼ|   with L1 regularization.

• Loss = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ wⱼ²   with L2 regularization.

• Therefore, apart from minimizing the error between y and ŷ, the optimization algorithm now also has to keep the regularization term small in order to minimize the cost function.
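
A quick numeric illustration (with made-up numbers) of how the L1 and L2 penalties are added to the squared-error loss:

import numpy as np

y = np.array([3.0, 5.0, 7.0])        # true values
y_hat = np.array([2.5, 5.5, 6.0])    # model predictions
w = np.array([0.8, -0.3, 2.0])       # model weights
lam = 0.1                            # regularization strength (lambda)

sse = np.sum((y - y_hat) ** 2)
l1_loss = sse + lam * np.sum(np.abs(w))  # lasso-style loss
l2_loss = sse + lam * np.sum(w ** 2)     # ridge-style loss
print(sse, l1_loss, l2_loss)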

• Let’s look at how L1 and L2 regularization work using a simple linear regression model.
• To demonstrate the effect of L1 and L2 regularization, let’s fit our model using three different loss functions:
– L (with no regularization)
– L1
– L2
• With no regularization:
L = (y − ŷ)²
NB: we are assuming that our model will be overfitted using this loss function.
• With L1 regularization:
L1 = (y − ŷ)² + λ|w|
• With L2 regularization:
L2 = (y − ŷ)² + λw²
• If you recall, using gradient descent the weight update is:
w_new = w − η · ∂L/∂w
• Substituting L, L1 and L2 we get:
L:   w_new = w − η · ∂L/∂w                        … (0)
L1:  w_new = w − η · (∂L/∂w + λ)   for w > 0      … (1.1)
     w_new = w − η · (∂L/∂w − λ)   for w < 0      … (1.2)
L2:  w_new = w − η · (∂L/∂w + 2λw)                … (2)

Let η = 1 and λ = 1; then:

L:   w_new = w − ∂L/∂w
L1:  w_new = w − ∂L/∂w − 1  (if w > 0),  or  w − ∂L/∂w + 1  (if w < 0)
L2:  w_new = w − ∂L/∂w − 2w
• Assuming that equation (0) gives a value of w that leads to overfitting, equations (1.1), (1.2) and (2) will reduce the chances of overfitting by shifting w away from that value.
• L1 regularization helps with feature selection by eliminating the features that are less important, i.e. we are left with a smaller number of features that explain most of the variance.
• Variance is the amount by which the estimate of the target function would change if it were fitted on different portions of the training dataset.
• L2 regularization reduces the chances of overfitting by forcing the weights to be very close to zero (but not exactly zero).
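
A tiny sketch of why L1 can drive a weight exactly to zero while L2 only shrinks it towards zero. It repeatedly applies just the penalty part of the update rules above (the starting weight, learning rate and lambda are made-up values):

eta, lam = 0.1, 1.0
w_l1, w_l2 = 2.0, 2.0

for _ in range(30):
    # L1 penalty gradient is lam * sign(w): constant-size steps towards zero
    if w_l1 != 0:
        step = eta * lam * (1 if w_l1 > 0 else -1)
        w_l1 = 0.0 if abs(step) >= abs(w_l1) else w_l1 - step
    # L2 penalty gradient is 2 * lam * w: the steps shrink as w shrinks
    w_l2 = w_l2 - eta * 2 * lam * w_l2

print(w_l1, w_l2)   # L1 reaches exactly 0.0; L2 is small but non-zero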
• Now let’s look at how to implement L1 and L2 regularization using sklearn.

sklearn.linear_model.Ridge
• This is an extension of LinearRegression() that adds a penalty term equivalent to the square of the magnitude of the coefficients, i.e.
Loss function = OLS + alpha * summation (squared coefficient values)

• Our job is to select alpha. A low alpha value can lead to overfitting, whereas a high alpha value can lead to underfitting.
1. Import libraries
2. Preprocess and load the data
3. Train the model and make a prediction
4. View the calculated weights and MSE (see the sketch below)
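
A rough sketch of ridge regression in sklearn. The dataset (sklearn's built-in diabetes toy data) and the alpha value are assumptions, not necessarily what was used in the slides:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 2. Preprocess and load the data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 3. Train the model and make a prediction
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
y_pred = ridge.predict(X_test)

# 4. View the calculated weights and MSE
print(ridge.coef_)
print(mean_squared_error(y_test, y_pred))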

Change the value of alpha and see how it affects the weights.
Now try Lasso regression ;-)
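
A matching sketch with Lasso (same assumed dataset; alpha is again an illustrative value):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print(lasso.coef_)                  # with a large enough alpha, some weights become exactly 0
print(lasso.score(X_test, y_test))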
