Unit 3 Notes
Linear Regression
1. Problem Statement
2. Assumptions
3. Equations
4. Cost Function
5. Gradient Descent
6. Multivariate Linear Regression
7. Bayesian Linear Regression
Problem Statement
● The goal of linear regression is to find the best fit line that describes the
relationship between the dependent variable and the independent variable(s).
The best fit line is the one that minimizes the distance between the predicted
values and the actual values of the dependent variable.
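Written out with m as the slope and b as the intercept (the parameter names these notes use later), "minimizing the distance" means choosing the line \hat{y} = m x + b that minimizes the sum of squared residuals:

\[
\min_{m,\, b} \; \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
\]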
Assumptions
1. Linearity
2. Independence
3. Homoscedasticity
4. Normality
5. No Multicollinearity
Assumptions in Linear Regression
1. Linearity: The relationship between the dependent variable and the independent variable(s) is linear.
This means that as the value of the independent variable changes, the value of the dependent variable
changes proportionally. For example, the relationship between the number of hours worked and the
amount of money earned should be linear.
2. Independence: The observations are independent of each other. In other words, the value of the
dependent variable for one observation does not affect the value of the dependent variable for another
observation. For example, the height of one person should not affect the height of another person.
3. Homoscedasticity: The variance of the errors is constant across all levels of the independent
variable(s). This means that the spread of the residuals is the same for all values of the independent
variable. For example, the variance of the errors in a model that predicts house prices should be the
same for houses of different sizes.
4. Normality: The errors are normally distributed. This means that the distribution of the residuals should
be symmetric around zero, with most of the residuals falling close to zero and fewer residuals falling
farther away from zero. For example, the distribution of the residuals in a model that predicts student
test scores should be normal.
5. No Multicollinearity: The independent variables are not highly correlated with each other. This means
that there should be no strong linear relationships among the independent variables. For example, in a
model that predicts employee salaries, there should be no strong correlation between the years of
education and the number of years of work experience. (A sketch for checking assumptions 4 and 5 follows this list.)
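As a quick illustration of how the last two assumptions can be checked in practice, the sketch below uses SciPy's Shapiro-Wilk test for normality of the residuals and statsmodels' variance inflation factor (VIF) for multicollinearity. The data here is synthetic and purely illustrative; substitute your own design matrix and residuals.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative stand-ins: a design matrix X of independent variables and
# residuals (y - y_hat) from an already-fitted regression model.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
residuals = rng.normal(size=100)

# Normality (assumption 4): Shapiro-Wilk test on the residuals.
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p:.3f}")  # a small p-value suggests non-normal errors

# No multicollinearity (assumption 5): variance inflation factors.
for i, col in enumerate(X.columns):
    vif = variance_inflation_factor(X.values, i)
    print(col, round(vif, 2))  # VIFs well above ~10 are a common warning sign
```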
Equations
Linear regression is a widely used statistical technique to model the relationship between a dependent variable
and one or more independent variables.
● The dependent variable is the outcome variable, while the independent variable(s) are the predictor
variable(s).
● The parameters of the linear regression model are estimated using the method of least squares. The
method of least squares minimizes the sum of the squared errors between the predicted and actual
values of the dependent variable.
● Linear regression can be used for both simple linear regression, where there is only one independent
variable, and multiple linear regression, where there is more than one independent variable.
● Linear regression can be used for both continuous and categorical independent variables. However,
categorical variables need to be encoded as dummy variables.
● In simple linear regression, the relationship between the dependent variable and the independent
variable is modeled using a straight line. In multiple linear regression, the relationship between the
dependent variable and the independent variables is modeled using a plane or a hyperplane (see the
equations below).
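In the usual notation, with \beta_0 the intercept, \beta_1, \ldots, \beta_p the coefficients, and \varepsilon the error term, the two forms are:

\[
y = \beta_0 + \beta_1 x + \varepsilon \qquad \text{(simple linear regression)}
\]
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \qquad \text{(multiple linear regression)}
\]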
Cost Function
Mean Squared Error
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (a procedure for
estimating an unobserved quantity) measures the average of the squares of the errors: that is, the average squared
difference between the estimated values and the actual values.
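For a dataset of n observations with actual values y_i and predictions \hat{y}_i, the standard form is:

\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]

With \hat{y}_i = m x_i + b, this is the cost function minimized in the next section.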
How do we find the best values for m and b?
Gradient Descent
Key Points of Gradient Descent
● Gradient descent is a popular optimization algorithm used to find the optimal values of the parameters
in a linear regression model.
● The objective of linear regression is to find the best fit line that describes the relationship between the
independent variable(s) and the dependent variable.
● The best fit line is characterized by its parameters, which can be estimated using the method of least
squares. However, finding the optimal values of the parameters can be computationally expensive,
especially when the dataset is large; gradient descent instead improves the parameters iteratively
(a worked sketch follows).
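The update rules follow from differentiating the MSE with respect to m and b, where \alpha is the learning rate:

\[
m \leftarrow m - \alpha \cdot \frac{\partial \mathrm{MSE}}{\partial m}, \qquad
b \leftarrow b - \alpha \cdot \frac{\partial \mathrm{MSE}}{\partial b}
\]

Below is a minimal sketch of batch gradient descent for simple linear regression. The learning rate, epoch count, and toy data are illustrative assumptions, not values from these notes.

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=5000):
    """Fit y ≈ m*x + b by minimizing MSE with batch gradient descent."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + b
        # Partial derivatives of MSE = (1/n) * sum((y - y_pred)^2)
        dm = (-2.0 / n) * np.sum(x * (y - y_pred))
        db = (-2.0 / n) * np.sum(y - y_pred)
        m -= lr * dm  # step against the gradient
        b -= lr * db
    return m, b

# Toy data generated from y = 3x + 2 plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 1, size=100)
m, b = gradient_descent(x, y)
print(m, b)  # should come out close to 3 and 2
```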
● Generally, when it comes to multivariate linear regression, we do not throw in all the independent
variables at once and start minimizing the error function.
● First, one should focus on selecting the independent variables that contribute most to explaining
the dependent variable.
● For this, we go on and construct a correlation matrix for all the independent variables and the
dependent variable from the observed data.
● The correlation value gives us an idea about which variable is significant and by what factor.
● From this matrix we pick independent variables in decreasing order of correlation value and run the
regression model to estimate the coefficients by minimizing the error function.
● We stop when there is no prominent improvement in the estimation function by inclusion of the next
independent feature.
● This method can still get complicated when there is a large number of independent features that
contribute significantly to the dependent variable. (A sketch of this forward-selection procedure
appears below.)
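Below is a minimal sketch of this correlation-driven forward selection. The DataFrame, column names (x1, x2, x3, target), synthetic data, and stopping threshold are illustrative assumptions, not part of these notes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset: 'target' is the dependent variable, the rest are candidates.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "x3": rng.normal(size=200),
})
df["target"] = 2.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(scale=0.1, size=200)

# Correlation matrix; rank candidates by |correlation| with the target.
corr = df.corr()["target"].drop("target").abs().sort_values(ascending=False)

selected, best_mse = [], np.inf
for feature in corr.index:  # decreasing order of correlation value
    trial = selected + [feature]
    model = LinearRegression().fit(df[trial], df["target"])
    mse = np.mean((df["target"] - model.predict(df[trial])) ** 2)
    if best_mse - mse < 1e-3:  # no prominent improvement: stop
        break
    selected, best_mse = trial, mse

print(selected, best_mse)
```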
Key points
The variable we want to predict is called the Dependent Variable, while the variables used to
predict it are termed Independent Variables.