Lecture 6: Regression
What is Regression
• Regression analysis is a statistical method that helps us analyze and
understand the relationship between two or more variables of interest.
• It helps us understand which factors are important, which factors can be
ignored, and how they influence each other. In other words, it analyzes
the specific relationships between the independent variables and the
dependent variable.
• In regression, we normally have one dependent variable and one or more
independent variables, and we forecast the value of the dependent
variable (Y) from the values of the independent variables (X1, X2, …, Xk).
• We try to “regress” the value of the dependent variable Y with the help of
the independent variables.
Types of Regression approaches
• There are many types of regression approaches; we will study some of
them here:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector for Regression (SVR)
• Decision Tree Regression
• Random Forest Regression
Simple linear regression
• In statistics, simple linear regression is a linear regression model with
a single explanatory variable.
• It concerns two-dimensional sample points with one independent
variable and one dependent variable (conventionally, the x and y
coordinates in a Cartesian coordinate system).
• It finds a linear function (a non-vertical straight line) that, as
accurately as possible, predicts the dependent variable values as a
function of the independent variable.
• Simply put, simple linear regression is used to estimate the
relationship between two quantitative variables.
What can simple linear regression be used for?
• You can use simple linear regression when you want to know:
• How strong the relationship is between two variables (e.g. the relationship
between rainfall and soil erosion).
• The value of the dependent variable at a certain value of the independent
variable (e.g. the amount of soil erosion at a certain level of rainfall).
Model for simple linear regression
• Consider the equation of a line:
Ŷ = b0 + b1X
• Here Y is the dependent variable, X is the independent variable, b0
is the y-intercept, and b1 is the slope of the line.
• We need to find b0 and b1 to estimate Y from X, such that the error Ɛ
between the predicted value of Y and the original value of Y
is minimized.
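A minimal sketch of estimating b0 and b1 by least squares; the helper name and the toy data are assumptions for illustration, using the slide's house-cost figures ($25,000 lot plus $75 per square foot):

```python
# Sketch: closed-form least-squares estimates for simple linear regression.
# b1 = cov(X, Y) / var(X); b0 = mean(Y) - b1 * mean(X)
def fit_simple_linear(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Toy data generated from House Cost = 25000 + 75 * Size (no noise),
# so the fit should recover the intercept 25000 and slope 75 exactly.
sizes = [1000, 1500, 2000, 2500]
costs = [25000 + 75 * s for s in sizes]
b0, b1 = fit_simple_linear(sizes, costs)
print(b0, b1)  # 25000.0 75.0
```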
The Model

[Figure: house cost vs. house size. Building a house costs about $75 per
square foot and most lots sell for $25,000, so
House Cost = 25000 + 75(Size), i.e. Ŷ = b0 + b1X.]

However, house costs vary even among same-size houses! Since cost
behaves unpredictably, we add a random component.

[Figure: scatter of house costs around the line.
Question: What should be considered a good line?]
General linear model: Y = b0 + b1X + Ɛ
Working concept of simple linear regression
• The ordinary least squares (OLS) method is usually used to implement
simple linear regression.
• A good line is one that minimizes the sum of squared differences
between the points and the line.
• The accuracy of each predicted value is measured by its squared
residual (the vertical distance between the data point and the fitted
line), and the goal is to make the sum of these squared deviations as
small as possible.

[Figure: scatter of points with a fitted line and vertical residuals.]
Let us compare two lines over the points (1, 2), (2, 4), (3, 1.5), (4, 3.2).
The second line is the horizontal line Ŷ = 2.5.

Sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The second line has the smaller sum of squared differences, so it fits
these points better.
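The comparison above can be checked with a short sketch; the first candidate line is assumed to be Ŷ = X (it predicts 1, 2, 3, 4 at the four points), which is an inference from the arithmetic, not stated on the slide:

```python
# Sketch: sum of squared differences for two candidate lines.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, predict):
    # Squared vertical distance from each point to the line, summed.
    return sum((y - predict(x)) ** 2 for x, y in points)

line1 = sum_sq_diff(points, lambda x: x)    # assumed first line: Y_hat = x
line2 = sum_sq_diff(points, lambda x: 2.5)  # horizontal line: Y_hat = 2.5
print(line1, line2)  # approx. 7.89 and 3.99 -> the horizontal line wins here
```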
From the first three assumptions we have: Y is normally distributed
with mean E(Y) = b0 + b1X and a constant standard deviation σε.
Assessing the Model
• The least squares method will produce a regression line whether or
not there is a linear relationship between X and Y.
• Consequently, it is important to assess how well the linear model fits
the data.
• Several methods are used to assess the model. All are based on the
sum of squares for errors, SSE.
Sum of Squares for Errors
• This is the sum of squared differences between the points and
the regression line.
• It can serve as a measure of how well the line fits the
data. SSE is defined by

SSE = Σᵢ₌₁ⁿ (Yi − Ŷi)²

• A shortcut formula, in terms of the sample variances s_X², s_Y² and
the sample covariance s_XY:

SSE = (n − 1)(s_Y² − s_XY² / s_X²)
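A small sketch comparing the definition of SSE with the variance/covariance shortcut; the data points are made up for illustration:

```python
# Sketch: SSE from the definition vs. the (n-1)(sY^2 - sXY^2/sX^2) shortcut.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least-squares fit
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = my - b1 * mx

# Definition: sum of squared residuals around the fitted line
sse_def = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Shortcut: sample variances and covariance (divisor n - 1)
s2x = sxx / (n - 1)
s2y = sum((yi - my) ** 2 for yi in y) / (n - 1)
cov = sxy / (n - 1)
sse_short = (n - 1) * (s2y - cov ** 2 / s2x)

print(sse_def, sse_short)  # the two values agree up to rounding
```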
Standard Error of Estimate
• The mean error is equal to zero.
• If σε is small, the errors tend to be close to zero (close to
the mean error), and the model fits the data well.
• Therefore, we can use σε as a measure of the
suitability of using a linear model.
• An estimator of σε is given by sε = √(SSE / (n − 2)).
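A sketch computing the standard error of estimate from its formula, sε = √(SSE / (n − 2)); the data points are invented for illustration:

```python
import math

# Sketch: fit a line, compute SSE, then the standard error of estimate.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
      / sum((a - mx) ** 2 for a in x))
b0 = my - b1 * mx

# Sum of squared errors around the fitted line
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

# Standard error of estimate: small values mean residuals near zero,
# i.e. the linear model fits the data well.
s_e = math.sqrt(sse / (n - 2))
print(s_e)
```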