Unit 2

Linear regression

Logistic regression
What is logistic regression? Logistic regression is a statistical method used for binary classification. It
predicts the probability that a given input belongs to a particular category. The output is a
probability value between 0 and 1, which is then mapped to one of two classes (e.g., 0 or 1).

How does logistic regression work?

Logistic regression uses the logistic function (also known as the sigmoid function) to model the
probability of the default class. The logistic function is defined as:

P(X) = 1 / (1 + e^(−(β0 + β1X)))

where:

P(X) is the probability of the outcome being 1 (the positive class).

β0 is the intercept.

β1 is the coefficient for the predictor X.

e is the base of the natural logarithm.
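As an illustration, here is a minimal Python sketch of the logistic (sigmoid) function; NumPy is assumed, and the β values used are arbitrary examples rather than fitted parameters.

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Example: P(X) for illustrative parameters beta0 = -2, beta1 = 0.5 and X = 4
beta0, beta1, x = -2.0, 0.5, 4.0
print(sigmoid(beta0 + beta1 * x))  # sigmoid(0) = 0.5
```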

Example: Predicting Whether a Student Passes an Exam

Scenario: We want to predict whether a student will pass an exam based on the number of hours
they studied.

Data: We have data from 20 students, including the number of hours they studied and whether they
passed the exam (Pass = 1, Fail = 0). A few rows are shown below:

Hours Studied | Pass (1) / Fail (0)
0 | 0
1 | 0
2 | 0
3 | 0
4 | 1
Steps:

1. Model Specification:

We use the logistic regression model:

P(Pass) = 1 / (1 + e^(−(β0 + β1 · Hours Studied)))

2. Parameter Estimation:

Using the data, we estimate the parameters β0 and β1. Let's assume we get:

β0 = -4

β1 = 1

3. Prediction:

To predict the probability of passing for a student who studied 4 hours:

P(Pass) = 1 / (1 + e^(−(−4 + 1·4))) = 1 / (1 + e^0) = 1/2 = 0.5

So, the probability of passing is 0.5, or 50%.

Interpretation:

Coefficients: β0 is the intercept, and β1 is the coefficient for the number of hours studied.

Probability: The logistic function transforms the linear combination of inputs into a probability
between 0 and 1.
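The same example can be reproduced in Python. Below is a minimal sketch using scikit-learn; note that the fitted coefficients will not exactly match the assumed β0 = -4 and β1 = 1, since scikit-learn estimates them (with some regularization by default) from the five data points above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied and pass/fail outcome from the table above
hours = np.array([[0], [1], [2], [3], [4]])
passed = np.array([0, 0, 0, 0, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Estimated intercept (beta0) and coefficient (beta1)
print(model.intercept_, model.coef_)

# Predicted probability of passing for a student who studied 4 hours
print(model.predict_proba([[4]])[0, 1])
```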

Ridge regression is a regularization technique used in linear regression. Here are the key
points:

1. Objective:
o Ridge regression aims to prevent overfitting by adding a penalty term to the
ordinary least squares (OLS) cost function.
o The penalty term encourages the model to have smaller coefficient values.
2. Cost Function:
o The ridge cost function is (a code sketch follows this list):

Cost = RSS(w) + λ Σ wj²  (sum over j = 1, …, D)

 RSS(w) represents the residual sum of squares (similar to OLS).
 wj are the regression coefficients.
 λ controls the strength of regularization (higher λ means stronger
regularization).
3. Effect on Coefficients:
o Ridge regression shrinks the coefficients towards zero.
o It doesn’t force any coefficient to be exactly zero (unlike Lasso).
o All features are retained, but their impact is reduced.
4. Benefits:
o Helps stabilize model performance when dealing with multicollinearity
(highly correlated features).
o Reduces the risk of overfitting.
5. Trade-off:
o Choosing the right λ (hyperparameter) balances model fit and regularization.
o Cross-validation is often used to find the optimal value.
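Here is a minimal sketch of the ridge cost function above, assuming NumPy arrays X (features), y (targets), and w (coefficients); the toy values are illustrative, and for simplicity the penalty covers every coefficient, whereas in practice the intercept is usually left unpenalized.

```python
import numpy as np

def ridge_cost(w, X, y, lam):
    # Residual sum of squares (the ordinary least squares term)
    residuals = y - X @ w
    rss = np.sum(residuals ** 2)
    # L2 penalty: lambda times the sum of squared coefficients
    penalty = lam * np.sum(w ** 2)
    return rss + penalty

# Toy example: 3 observations, 2 features, illustrative weights
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([5.0, 4.0, 11.0])
w = np.array([1.0, 2.0])
print(ridge_cost(w, X, y, lam=0.5))
```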

Problem Scenario:

 Imagine we're analyzing a supply chain delivery dataset.
 Long-distance deliveries often contain a high number of items, while short-distance
deliveries have smaller inventories.
 Delivery distance and item quantity are linearly correlated.

Ridge Regression Twist:

 Ridge regression adds a penalty term to the cost function:

Cost = RSS(w) + λ Σ wj²  (sum over j = 1, …, D)

o RSS(w) represents the residual sum of squares (similar to OLS).
o wj are the regression coefficients.
o λ controls the strength of regularization (higher λ means stronger
regularization).

Benefits of Ridge Regression:

 Stabilizes the model by shrinking coefficients.


 Reduces overfitting by discouraging large coefficient values.
 Retains all features but with smaller impact.

Implementation Example:

 Suppose we have a dataset with “YearsExperience” and “Salary” columns.
 We’ll train a ridge regression model to predict salaries based on experience.
 You can implement ridge regression from scratch in Python or use libraries like
scikit-learn; a minimal sketch is shown below.
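Here is a minimal sketch with scikit-learn; the salary figures are made up for illustration, and alpha plays the role of λ in the cost function above.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative YearsExperience / Salary data
years_experience = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
salary = np.array([35000, 40000, 48000, 52000, 60000, 65000, 72000, 80000])

model = Ridge(alpha=1.0)  # alpha is the regularization strength (lambda)
model.fit(years_experience, salary)

print(model.coef_, model.intercept_)
print(model.predict([[10]]))  # predicted salary for 10 years of experience
```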

Lasso regression, also known as L1 regularization, is a powerful technique used in
statistical modeling and machine learning. Here’s what you need to know:

1. Objective:
o Lasso aims to find a balance between model simplicity and accuracy.
o It adds a penalty term to the traditional linear regression model.
2. How It Works:
o The linear regression equation is:

y = β0 + β1x1 + β2x2 + … + βpxp + ε

 y is the dependent variable (target).
 β0, β1, …, βp are the coefficients (parameters).
o Lasso encourages sparse solutions by forcing some coefficients to be exactly
zero.
o It automatically identifies and discards irrelevant or redundant variables.
3. Use Cases:
o Feature selection: Lasso helps choose relevant features.
o High multicollinearity: Useful when features are highly correlated.
o Automation of model selection.
4. Problem Scenario:
o Imagine you’re trying to predict house prices based on features such
as location, square footage, and the number of bedrooms.

Lasso Regression Twist:

 Lasso regression adds a penalty term to the linear regression cost function:

Cost = RSS(w) + λ Σ |wj|  (sum over j = 1, …, D)

o RSS(w) represents the residual sum of squares (similar to OLS).
o wj are the regression coefficients.
o λ controls the strength of regularization (higher λ means stronger
regularization).
 Lasso encourages sparse models by forcing some coefficients to be exactly zero.

Feature Importance:

 Lasso helps identify which features are more important.


 In our example, it might reveal that location and square footage play a major role in
determining house prices.
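As a rough illustration, here is a minimal scikit-learn sketch for the house-price scenario; the feature values are invented, location is replaced by a numeric stand-in (distance to the city centre), and increasing alpha pushes more coefficients to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Columns: square footage, number of bedrooms, distance to city centre (km)
X = np.array([
    [1400, 3, 10],
    [1600, 3, 8],
    [1700, 4, 12],
    [1875, 4, 5],
    [1100, 2, 15],
    [1550, 3, 7],
])
y = np.array([245000, 312000, 279000, 308000, 199000, 289000])

# Standardize features so the L1 penalty treats them on a comparable scale
X_scaled = StandardScaler().fit_transform(X)

model = Lasso(alpha=1000.0)  # larger alpha -> stronger penalty, sparser coefficients
model.fit(X_scaled, y)

# Coefficients that are exactly zero correspond to features lasso has dropped
print(model.coef_)
```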
Difference between Ridge and Lasso Regression

Ridge Regression | Lasso Regression
Shrinks the coefficients toward zero | Encourages some coefficients to be exactly zero
Adds a penalty term proportional to the sum of squared coefficients | Adds a penalty term proportional to the sum of absolute values of coefficients
Does not eliminate any features | Can eliminate some features
Suitable when all features are important | Suitable when some features are irrelevant or redundant
More computationally efficient | Less computationally efficient
Requires setting a hyperparameter | Requires setting a hyperparameter
Performs better when there are many small to medium-sized coefficients | Performs better when there are a few large coefficients

Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more than
the required data points, present in the given dataset. Because of this, the model starts capturing the noise
and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy
of the model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of overfitting can be understood from the graph of
the linear regression output below:
Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training data can be
stopped at an early stage, but as a result the model may not learn enough from the
training data and may fail to find the best fit for the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training
data, and hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: We can understand underfitting using the output of the linear regression
model below. As we can see from the diagram, the model is unable to capture the data points
present in the plot.
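The effect can be reproduced with a small experiment. The sketch below assumes scikit-learn and uses synthetic data, comparing a degree-1 polynomial (underfits), a degree-4 polynomial (reasonable fit), and a degree-15 polynomial (typically overfits: very low training error but a larger test error).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: a sine curve with added noise
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```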

Standardization

The steps to be followed are :

Data collection

Our data can be in various formats, i.e., numbers (integers) and words (strings); for now, we’ll consider
only the numbers in our dataset.

Assume our dataset has random numeric values in the range of 1 to 95,000 (in random order). Just
for our understanding, consider a small dataset of barely 10 values, with numbers in the given range
and in randomized order.

1) 99

2) 789

3) 1

4) 541

5) 5

6) 6589

7) 94142
8) 7

9) 50826

10) 35464

If we just look at these values, their range is so wide that training a model on 10,000 such
values would take a lot of time. That’s where the problem arises.

Understanding standardization

We have a solution to this problem: standardization. It helps by:

Scaling the values down to a common scale centred around 0, with most values typically falling roughly between -1 and +1.

Keeping the relative spacing between the values intact.

So, how do we do that? Well, there’s a mathematical formula for it: Z-Score =
(Current_value – Mean) / Standard Deviation.

Using this formula, we replace each input value with its Z-score. This gives us values on a small,
common scale while keeping the relative spacing between them intact.

Standardization performs the following:

Converts the mean (μ) to 0

Converts the standard deviation (σ) to 1

This follows directly from the formula: subtracting the mean centres the values around 0, and dividing
by the standard deviation rescales their spread, so the transformed data always has mean 0 and
standard deviation 1.
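Here is a minimal Python sketch of z-score standardization, applied to the 10 values above.

```python
import numpy as np

# The 10 values from the small dataset above
values = np.array([99, 789, 1, 541, 5, 6589, 94142, 7, 50826, 35464], dtype=float)

# Z-score: (value - mean) / standard deviation
z_scores = (values - values.mean()) / values.std()

print(z_scores)
print(round(z_scores.mean(), 10))  # ~0 after standardization
print(round(z_scores.std(), 10))   # 1 after standardization
```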
