Unit 2
Logistic regression
What is logistic regression?
Logistic regression is a statistical method used for binary classification. It predicts the probability that a given input belongs to a particular category. The output is a probability value between 0 and 1, which is then mapped to one of two classes (e.g., 0 or 1).
Logistic regression uses the logistic function (also known as the sigmoid function) to model the
probability of the default class. The logistic function is defined as:
P(X) = 1 / (1 + e^(−(β0 + β1·X)))
where:
( P(X) ) is the probability of the outcome being 1 (or the positive class).
( \beta_0 ) is the intercept and ( \beta_1 ) is the coefficient of the input ( X ).
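As a quick sketch of the formula, the following Python snippet evaluates the logistic function; the coefficient values are the illustrative β0 = -4, β1 = 1 used in the worked example below:

import math

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1 / (1 + math.exp(-z))

# Illustrative coefficients taken from the worked example below
beta0, beta1 = -4, 1
x = 4  # hours studied
print(sigmoid(beta0 + beta1 * x))  # 0.5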
Scenario: We want to predict whether a student will pass an exam based on the number of hours
they studied.
Data: We have data from 20 students, including the number of hours they studied and whether they
passed the exam (Pass = 1, Fail = 0).
Hours Studied   Pass
0               0
1               0
2               0
3               0
4               1
Steps:
1. Model Specification:
We model the probability of passing as
P(Pass) = 1 / (1 + e^(−(β0 + β1·Hours)))
2. Parameter Estimation:
Using the data, we estimate the parameters ( \beta_0 ) and ( \beta_1 ). Let’s assume we get:
( \beta_0 = -4 )
( \beta_1 = 1 )
3. Prediction:
For a student who studied 4 hours:
P(Pass) = 1 / (1 + e^(−(−4 + 1·4))) = 1 / (1 + e^0) = 1/2 = 0.5
Interpretation:
Coefficients: ( \beta_0 ) is the intercept, and ( \beta_1 ) is the coefficient for the number of hours
studied.
Probability: The logistic function transforms the linear combination of inputs into a probability
between 0 and 1.
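A minimal sketch of fitting this model with scikit-learn, assuming only the five example rows shown above as training data (the fitted coefficients will not exactly match the illustrative β0 = -4, β1 = 1):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied (feature) and pass/fail labels from the table above
X = np.array([[0], [1], [2], [3], [4]])
y = np.array([0, 0, 0, 0, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of passing for a student who studied 4 hours
print(model.predict_proba([[4]])[0, 1])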
Ridge regression
Ridge regression is a regularization technique used in linear regression. Here are the key points:
1. Objective:
o Ridge regression aims to prevent overfitting by adding a penalty term to the
ordinary least squares (OLS) cost function.
o The penalty term encourages the model to have smaller coefficient values.
2. Cost Function:
o The ridge cost function is:
Cost = RSS(w) + λ Σ_{j=1}^{D} w_j²
where RSS(w) is the residual sum of squares and λ ≥ 0 controls the strength of the penalty.
Implementation Example:
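A minimal sketch using scikit-learn’s Ridge, where the alpha parameter plays the role of λ in the cost function above (the data here is made up for illustration):

import numpy as np
from sklearn.linear_model import Ridge

# Made-up data: y ≈ 3·x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=50)

# alpha corresponds to the penalty strength λ; larger alpha shrinks coefficients more
model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.coef_, model.intercept_)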
Lasso regression
1. Objective:
o Lasso aims to find a balance between model simplicity and accuracy.
o It adds a penalty term to the traditional linear regression model.
2. How It Works:
o The linear regression equation is:
y = w_0 + w_1·x_1 + … + w_D·x_D
o Lasso regression adds a penalty term to the linear regression cost function:
Cost = RSS(w) + λ Σ_{j=1}^{D} |w_j|
Ridge vs. Lasso:

Ridge regression                             Lasso regression
Adds a penalty term proportional to the      Adds a penalty term proportional to the
sum of squared coefficients                  sum of absolute values of coefficients
Suitable when all features are important     Suitable when some features are irrelevant
                                             or redundant
Performs better when there are many small    Performs better when there are a few large
to medium-sized coefficients                 coefficients
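A minimal Lasso sketch in the same style (again, scikit-learn’s alpha plays the role of λ; the data is made up, with one irrelevant feature that the L1 penalty should drive toward zero):

import numpy as np
from sklearn.linear_model import Lasso

# Made-up data: y depends only on the first feature; the second is irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + rng.normal(0, 0.5, size=100)

model = Lasso(alpha=0.1)
model.fit(X, y)
print(model.coef_)  # the coefficient of the irrelevant feature is shrunk toward 0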
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model.
Example: the concept of overfitting can be understood from the graph of the linear regression output below:
[Figure: overfitted linear regression output]
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting in the model, the feeding of training data can be stopped at an early stage, due to which the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it reduces the accuracy and produces unreliable predictions.
Example: we can understand underfitting using the output of the linear regression model below:
[Figure: underfitted linear regression output, in which the fitted line fails to capture the data points present in the plot]
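A minimal sketch contrasting underfitting and overfitting with polynomial fits of different degrees (plain NumPy; the quadratic data and the degree choices are made up for illustration):

import numpy as np

# Made-up data: a quadratic trend plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x**2 + rng.normal(0, 0.05, size=20)

for degree in (1, 2, 15):  # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    # The degree-15 fit drives training error near zero by chasing the noise,
    # while the degree-1 fit cannot capture the underlying trend at all
    print(degree, mse)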
Standardization
Data collection
Our data can come in various formats, i.e., numbers (integers) and words (strings); for now, we’ll consider only the numbers in our dataset.
Assume our dataset has random numeric values in the range of 1 to 95,000, in random order. Just for our understanding, consider a small dataset of barely 10 values with numbers in the given range and randomized order.
1) 99
2) 789
3) 1
4) 541
5) 5
6) 6589
7) 94142
8) 7
9) 50826
10) 35464
If we just look at these values, their range is so wide that training the model with 10,000 such values will take a lot of time. That’s where the problem arises.
Understanding standardization
We have a solution to the problem just described: standardization. It helps us by
downscaling the values to a common scale, with most values falling in the range -1 to +1.
So, how do we do that? Well, there’s a mathematical formula for it:
Z-Score = (Current_value − Mean) / Standard Deviation
Using this formula, we replace every input value with its Z-Score.
Hence we get values mostly ranging from -1 to +1, while keeping the relative differences between values intact.
The transformed values always have Mean = 0 and S.D. = 1: subtracting the mean centres the data at 0, and dividing by the standard deviation rescales the spread to 1.
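A minimal sketch computing the Z-Scores for the 10 sample values above (plain NumPy; scikit-learn’s StandardScaler performs the same transformation):

import numpy as np

# The 10 sample values from the dataset above
values = np.array([99, 789, 1, 541, 5, 6589, 94142, 7, 50826, 35464], dtype=float)

# Z-Score = (Current_value − Mean) / Standard Deviation
z_scores = (values - values.mean()) / values.std()

print(z_scores)         # standardized values
print(z_scores.mean())  # ≈ 0
print(z_scores.std())   # 1.0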