
UNIT 3

INTRODUCTION
1. Linear Regression
Linear Regression

1. Problem Statement
2. Assumptions
3. Equations
4. Cost Function
5. Gradient Descent
6. Multivariate Linear Regression
7. Bayesian Linear Regression
Problem Statement

Linear regression is a widely used statistical technique to model the relationship between a dependent variable and one or more independent variables.

● The goal of linear regression is to find the best fit line that describes the relationship between the dependent variable and the independent variable(s). The best fit line is the one that minimizes the distance between the predicted values and the actual values of the dependent variable.
Assumptions

1. Linearity
2. Independence
3. Homoscedasticity
4. Normality
5. No Multicollinearity
Assumptions in Linear Regression
1. Linearity: The relationship between the dependent variable and the independent variable(s) is linear.
This means that as the value of the independent variable changes, the value of the dependent variable
changes at a constant rate. For example, the relationship between the number of hours worked and the
amount of money earned should be linear.
2. Independence: The observations are independent of each other. In other words, the value of the
dependent variable for one observation does not affect the value of the dependent variable for another
observation. For example, the height of one person should not affect the height of another person.
3. Homoscedasticity: The variance of the errors is constant across all levels of the independent
variable(s). This means that the spread of the residuals is the same for all values of the independent
variable. For example, the variance of the errors in a model that predicts house prices should be the
same for houses of different sizes.
4. Normality: The errors are normally distributed. This means that the distribution of the residuals should
be symmetric around zero, with most of the residuals falling close to zero and fewer residuals falling
farther away from zero. For example, the distribution of the residuals in a model that predicts student
test scores should be normal.
5. No Multicollinearity: The independent variables are not highly correlated with each other. This means
that there should be no strong linear relationships among the independent variables. For example, in a
model that predicts employee salaries, there should be no strong correlation between the years of
education and the number of years of work experience.
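Several of these assumptions can be checked empirically from the fitted residuals and the predictor correlations. Below is a minimal sketch in Python; the data here is synthetic and the printed summaries are informal checks, not formal statistical tests:

```python
import numpy as np

# Illustrative synthetic data: X holds the independent variables
# (one column each), y holds the dependent variable.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Fit by ordinary least squares and compute the residuals.
X1 = np.column_stack([np.ones(len(X)), X])      # add an intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
residuals = y - X1 @ beta

# Normality / homoscedasticity: residuals should be roughly symmetric
# around zero, with a spread that does not depend on the predictions.
print("mean residual:", residuals.mean())
print("residual std :", residuals.std())

# No multicollinearity: off-diagonal correlations among the
# independent variables should not be close to +/-1.
print("predictor correlation matrix:\n", np.corrcoef(X, rowvar=False))
```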
Equations

The regression line is written as y = mx + b, where m is the slope and b is the intercept.

● Observed Value - Data specified in the given dataset.
● Expected Value - Value calculated using the regression line.
● Residual - Difference between the observed value and the expected value.
● Need for squaring - When the expected value is greater than the observed value the residual is negative, so summing raw residuals would let positive and negative errors cancel out and understate the overall error; squaring makes every residual contribute positively.
● Slope - The slope of a line is the change in the y coordinate with respect to the change in the x coordinate.
● Intercept - The point where the line or curve crosses an axis of the graph is called an intercept. If the line crosses the x-axis, it is called the x-intercept; if it crosses the y-axis, it is called the y-intercept.

Example dataset:

Hours of study | Marks
1              | 15
2              | 70
4              | 60
8              | 95
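With the sample table above, the slope m and intercept b of the best fit line y = mx + b can be computed directly from the standard least-squares formulas. A minimal sketch (the numbers come from the table; the variable names are our own):

```python
import numpy as np

hours = np.array([1.0, 2.0, 4.0, 8.0])      # independent variable (hours of study)
marks = np.array([15.0, 70.0, 60.0, 95.0])  # dependent variable (marks)

# Least-squares slope and intercept for y = m*x + b:
#   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b = mean(y) - m * mean(x)
dx = hours - hours.mean()
m = np.sum(dx * (marks - marks.mean())) / np.sum(dx ** 2)
b = marks.mean() - m * hours.mean()

expected = m * hours + b       # expected values on the regression line
residual = marks - expected    # observed minus expected

print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
print("residuals:", residual)
```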
Key points of Linear Regression

Linear regression is a widely used statistical technique to model the relationship between a dependent variable
and one or more independent variables.

● The goal of linear regression is to find the best fit line that describes the relationship between the
dependent variable and the independent variable(s). The best fit line is the one that minimizes the
distance between the predicted values and the actual values of the dependent variable.
● The dependent variable is the outcome variable, while the independent variable(s) are the predictor
variable(s).
● The parameters of the linear regression model are estimated using the method of least squares. The
method of least squares minimizes the sum of the squared errors between the predicted and actual
values of the dependent variable.
● Linear regression can be used for both simple linear regression, where there is only one independent
variable, and multiple linear regression, where there is more than one independent variable.
● Linear regression can handle both continuous and categorical independent variables. However,
categorical variables need to be encoded as dummy variables, as shown in the sketch below.
● In simple linear regression, the relationship between the dependent variable and the independent
variable is modeled using a straight line. In multiple linear regression, the relationship between the
dependent variable and the independent variables is modeled using a plane or a hyperplane.
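As a brief illustration of the dummy-variable point above, here is one common way to encode a categorical predictor with pandas; the "city" column and its values are made up for this sketch:

```python
import pandas as pd

# Hypothetical dataset with one categorical and one numeric predictor.
df = pd.DataFrame({
    "city": ["Tambaram", "Egmore", "Velachery", "Tambaram"],
    "size_sqft": [1200, 1000, 2000, 1500],
})

# One 0/1 dummy column per category; drop_first avoids perfect
# multicollinearity between the dummies and the intercept.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded)
```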
Cost Function
Mean Squared Error

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for
estimating an unobserved quantity) measures the average of the squares of the errors - that is, the average squared
difference between the estimated values and the actual values.
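In symbols, with y_i the actual values, ŷ_i the values predicted by the regression line, and n the number of observations, this standard definition reads:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

For the line y = mx + b used in these notes, ŷ_i = m x_i + b, so the cost is a function MSE(m, b) of the two parameters, which is what gradient descent (next section) minimizes.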
How to find the best values for m and b?
Gradient Descent
Key Points of Gradient Descent
Gradient descent is a popular optimization algorithm used to find the optimal values of the parameters in
a linear regression model.

The objective of linear regression is to find the best fit line that describes the relationship between the
independent variable(s) and the dependent variable.

The best fit line is characterized by the parameters, which are estimated using the method of least
squares. However, finding the optimal values of the parameters can be computationally expensive,
especially when the dataset is large.

Gradient descent is a useful technique to find the optimal parameters efficiently.


Key Points of Gradient Descent - Continued

1. Initialize the parameters: Start by initializing the parameters (weights) to small random values.
2. Compute the cost function: The cost function measures the difference between the predicted and
actual values of the dependent variable. In linear regression, the cost function is typically the mean
squared error (MSE) or the sum of squared errors (SSE). Compute the cost function using the current
values of the parameters.
3. Compute the gradients: The gradients are the partial derivatives of the cost function with respect to
the parameters. Compute the gradients using the current values of the parameters.
4. Update the parameters: Update the parameters by subtracting the gradients multiplied by the
learning rate. The learning rate determines how big the steps are in the direction of the steepest
descent. The learning rate is a hyperparameter that needs to be tuned.
5. Repeat steps 2 to 4: Repeat steps 2 to 4 until the cost function converges or until a maximum number
of iterations is reached (see the sketch below).
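A minimal NumPy sketch of the five steps above, fitting y = m*x + b to the hours/marks data from the Equations section; the learning rate and iteration count are arbitrary choices that would normally be tuned:

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, iters=1000):
    """Fit y = m*x + b by minimizing the MSE with gradient descent."""
    m, b = 0.0, 0.0                        # step 1: initialize parameters
    n = len(x)
    for _ in range(iters):                 # step 5: repeat until done
        pred = m * x + b
        error = pred - y
        cost = np.mean(error ** 2)         # step 2: cost function (MSE)
        dm = (2.0 / n) * np.sum(error * x) # step 3: gradients of MSE
        db = (2.0 / n) * np.sum(error)
        m -= lr * dm                       # step 4: update parameters
        b -= lr * db
    return m, b, cost

x = np.array([1.0, 2.0, 4.0, 8.0])       # hours of study (from the table)
y = np.array([15.0, 70.0, 60.0, 95.0])   # marks
m, b, cost = gradient_descent(x, y)
print(f"m = {m:.2f}, b = {b:.2f}, final MSE = {cost:.2f}")
```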
Multivariate Regression

Example dataset:

Area         | Size      | Number of Bedrooms | Price of House
Tambaram     | 1200 sqft | 2                  | 50 lakhs
Egmore       | 1000 sqft | 1                  | 80 lakhs
Velachery    | 2000 sqft | 3                  | 1 crore
Chengalpattu | 2400 sqft | 2                  | 60 lakhs
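As an illustration, the plane price = b + w1*size + w2*bedrooms can be fitted to the four rows above by least squares. A minimal sketch (prices are converted to lakhs, so 1 crore = 100 lakhs; with only four observations the fit is purely illustrative):

```python
import numpy as np

size = np.array([1200.0, 1000.0, 2000.0, 2400.0])   # sqft
bedrooms = np.array([2.0, 1.0, 3.0, 2.0])
price = np.array([50.0, 80.0, 100.0, 60.0])         # lakhs (1 crore = 100 lakhs)

# Design matrix with an intercept column; solve by least squares.
X = np.column_stack([np.ones_like(size), size, bedrooms])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
b, w_size, w_bed = coef

print(f"price ~ {b:.2f} + {w_size:.4f}*size + {w_bed:.2f}*bedrooms")
```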


Computing Parameters

● Generally, when it comes to multivariate linear regression, we do not include all the independent
variables at once and start minimizing the error function.
● First, one should focus on selecting the best possible independent variables, i.e. those that contribute
well to the dependent variable.
● For this, we construct a correlation matrix for all the independent variables and the dependent
variable from the observed data.
● The correlation value gives us an idea of which variables are significant and by what factor.
● From this matrix we pick independent variables in decreasing order of correlation value and run the
regression model to estimate the coefficients by minimizing the error function.
● We stop when there is no prominent improvement in the estimation function from including the next
independent feature (a sketch of this procedure follows this list).
● This method can still get complicated when there is a large number of independent features that
contribute significantly to the dependent variable.
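A rough sketch of this selection procedure. The feature data here is synthetic, and "no prominent improvement" is implemented as a fixed MSE-improvement threshold, which is our own simplification:

```python
import numpy as np

def fit_mse(X, y):
    """OLS fit on the given columns; return the resulting MSE."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.mean((y - X1 @ beta) ** 2)

def forward_select(X, y, tol=1e-3):
    """Add features in decreasing order of |correlation with y|,
    stopping when the MSE stops improving by more than tol."""
    corrs = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
    order = np.argsort(-np.abs(corrs))      # decreasing |correlation|
    chosen, best = [], np.var(y)            # baseline: intercept-only model
    for j in order:
        mse = fit_mse(X[:, chosen + [j]], y)
        if best - mse > tol:                # prominent improvement: keep it
            chosen.append(j)
            best = mse
        else:                               # no prominent improvement: stop
            break
    return chosen, best

# Illustrative data: 3 candidate features, only two of them informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
print(forward_select(X, y))
```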
Key points

Multivariate regression is an extension of simple linear regression. It is used when we want to predict the
value of a variable based on the values of two or more other variables.

The variable we want to predict is called the Dependent Variable, while those used to calculate the
dependent variable are termed Independent Variables.
