Linear Regression
Introduction to machine learning
b. The term "linear" in the name "linear regression" refers to the fact that the
method models data with a linear combination of the explanatory variables.
d. In the case of linear regression with a single explanatory variable, the linear
combination used in linear regression can be expressed as:
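One common way to write this line, as a sketch that assumes the slope m and intercept C used in the error formula later in these slides:

```latex
y = m x + C
```

Here y is the response variable, x is the single explanatory variable, m is the slope and C is the intercept.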
e. In its most basic form, linear regression fits a straight line to the response variable. The model is
designed to fit a line that minimizes the squared differences (also called errors) between the observed and predicted values.
Linear Regression Models -
d. Coefficient of correlation: Pearson's coefficient r = Cov(x, y) / (StdDev(x) × StdDev(y))
e. Fitting a linear model when r is near 0 makes little sense; the model will
not be reliable. For a given value of X, there can be many values of Y. Nonlinear
models may be better in such cases.
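As a minimal sketch of the formula above, the following computes Pearson's coefficient directly from its definition using only the Python standard library. The function name `pearson_r` and the two small data sets are made up for illustration; they are not from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson's coefficient: Cov(x, y) / (StdDev(x) * StdDev(y))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and standard deviations. The 1/n factors cancel
    # in the ratio, so the sample (1/(n-1)) versions would give the same r.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / (std_x * std_y)

# Illustrative, made-up data.
x = [1, 2, 3, 4, 5]
y_linear = [2, 4, 6, 8, 10]   # exactly linear in x
y_curved = [4, 1, 0, 1, 4]    # y = (x - 3)^2, a clear but nonlinear relation

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
print(r_linear, r_curved)
```

The curved data has a strong relationship between x and y, yet r comes out at zero, which is the situation where a straight-line model is the wrong tool.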
Linear Regression Models (Recap) -
f. Coefficient of correlation: Pearson's coefficient r = Cov(x, y) / (StdDev(x) × StdDev(y))
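[Figure: scatter plot split into four quadrants (top-left: -ve, top-right: +ve, bottom-left: +ve, bottom-right: -ve), with cases labeled "= 0" and "> 0".]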
http://www.socscistatistics.com/tests/pearson/Default2.aspx
Linear Regression Models -
g. Given Y = f(X), where the scatter plot shows an apparent correlation between X and Y,
let's fit a line to the scatter; this line will be our model.
h. But there are infinitely many lines that can be fit to the scatter. Which one
should we consider as the model?
i. This and many other algorithms use gradient descent, or variants of the gradient descent method, to find the best model.
j. Gradient descent methods use partial derivatives with respect to the parameters (slope and intercept) to minimize the sum of squared errors.
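The idea can be sketched in a few lines of plain Python. This is one simple variant (batch gradient descent on the mean of the squared errors, which has the same minimizer as the sum), not necessarily the exact procedure any particular library uses. The function name, learning rate, step count and data are all assumptions chosen for illustration.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.01, steps=5000):
    """Fit y ~ m*x + c by batch gradient descent on the mean squared error."""
    n = len(xs)
    m, c = 0.0, 0.0  # start from a flat line through the origin
    for _ in range(steps):
        # Errors T - (m*x + c) for the current slope and intercept
        errors = [y - (m * x + c) for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) with respect to m and c
        grad_m = (-2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (-2.0 / n) * sum(errors)
        # Step against the gradient
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

# Made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, c = fit_line_gradient_descent(xs, ys)
print(m, c)
```

With a learning rate that is too large the updates diverge, and with one that is too small many more steps are needed, so both values usually have to be tuned for the data at hand.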
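Error = T – (mx + C), where T is the target (observed) value
The sum of all errors can cancel out and give 0, which is why the errors are squared before they are summed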
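A small numeric sketch of this cancellation, using made-up points and a deliberately poor flat line (both are assumptions for illustration):

```python
# Made-up targets lying on y = 2x + 1, compared with the flat line y = 5.
xs = [1.0, 2.0, 3.0]
targets = [3.0, 5.0, 7.0]
m, c = 0.0, 5.0

# Error = T - (m*x + C) for each point
errors = [t - (m * x + c) for x, t in zip(xs, targets)]

sum_of_errors = sum(errors)                          # positive and negative errors cancel
sum_of_squared_errors = sum(e ** 2 for e in errors)  # squaring removes the sign
print(errors, sum_of_errors, sum_of_squared_errors)
```

The plain sum reports no error at all for a line that misses two of the three points, while the sum of squared errors does not.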
n. Coefficient of determination: measures the goodness of fit of a linear model. The closer the
points get to the line, the closer R^2 (the coefficient of determination) gets to 1 and the better the model is.
[Figure: scatter plot with fitted line; the means Xbar and Ybar are marked.]
o. Coefficient of determination (contd.)
I. There are a variety of errors for all those points that don't fall exactly on the line.
II. It is important to understand these errors to judge the goodness of fit of the model, i.e.
how representative the model is likely to be in general.
III. Let us look at point P1, which is one of the given data points, and the associated errors due to
the model.
1. P1: original y data point for the given x
4. SST: total sum of squared errors, the variance of P1 from Ybar, (Y – Ybar)^2
5. SSR: regression error, (P2 – Ybar)^2 (the portion of SST captured by the regression model)
[Figure: fitted line with P2, Xbar and Ybar marked, and the distances SSE, SSR and SST labeled.]
[Figure: fitted line with P2 and Ybar marked, and the distances SSE, SSR and SST labeled.]
2. Hence SSR should be equal to SST, i.e. SSR/SST should be 1.
3. A poor fit will mean a large SSE; SSR/SST will be close to 0.
[Figure: two panels showing Point A and Point B relative to the fitted line.]
In the case of point “A”, the line explains the variance of the point,
whereas for point “B” there is a small area (light grey) which the line does not represent.
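The quantities above can be computed for a small example. This is a sketch that assumes P2 is the point on the fitted line for the same x (as the expression (P2 – Ybar)^2 suggests) and that takes R^2 as SSR/SST. The helper names and the data are made up for illustration, and the line is fitted with the closed-form least-squares solution rather than gradient descent.

```python
def least_squares_line(xs, ys):
    """Closed-form least-squares slope and intercept for one explanatory variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    c = y_bar - m * x_bar
    return m, c

def sums_of_squares(xs, ys, m, c):
    """Return (SST, SSR, SSE) for the line y = m*x + c."""
    y_bar = sum(ys) / len(ys)
    predicted = [m * x + c for x in xs]                   # the P2 points on the line
    sst = sum((y - y_bar) ** 2 for y in ys)               # total: (Y - Ybar)^2
    ssr = sum((p - y_bar) ** 2 for p in predicted)        # regression: (P2 - Ybar)^2
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # residual: (P1 - P2)^2
    return sst, ssr, sse

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

m, c = least_squares_line(xs, ys)
sst, ssr, sse = sums_of_squares(xs, ys, m, c)
r_squared = ssr / sst
print(m, c, sst, ssr, sse, r_squared)
```

For a least-squares line with an intercept, SST splits into SSR + SSE, so a larger SSE necessarily pushes SSR/SST toward 0.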
1. Assumption of linearity: linear regression assumes a linear relation between the dependent (target) variable
and the independent (predictor) variables.
Advantages –
1. Simple to implement, and the output coefficients are easy to interpret
Disadvantages -
2. Assumes a linear relationship between dependent and independent variables. That
is, it assumes there is a straight-line relationship between them
3. Outliers can have huge effects on the regression
4. Linear regression assumes independence between attributes
5. Linear regression looks at a relationship between the mean of the dependent variable
and the independent variables.
6. Just as the mean is not a complete description of a single variable, linear regression
is not a complete description of relationships among variables
7. Boundaries are linear
The dataset has 9 attributes listed below that define the quality
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Thank you