
Regression

Lecture 6
What is Regression?
• Regression analysis is a statistical method that helps us analyze and
understand the relationship between two or more variables of interest.
• It helps us understand which factors are important, which factors can be
ignored, and how they influence each other; in other words, it analyzes
the specific relationships between the independent variables and the
dependent variable.
• In regression, we normally have one dependent variable and one or more
independent variables, and we forecast the value of the dependent variable (Y)
from the values of the independent variables (X1, X2, …, Xk).
• We try to “regress” the value of the dependent variable Y with the help of
the independent variables.
Types of Regression approaches
• There are many types of regression approaches; we will study some of
them here:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector for Regression (SVR)
• Decision Tree Regression
• Random Forest Regression
Simple linear regression
• In statistics, simple linear regression is a linear regression model with
a single explanatory variable.
• It concerns two-dimensional sample points with one independent
variable and one dependent variable (conventionally, the x and y
coordinates in a Cartesian coordinate system)
• It finds a linear function (a non-vertical straight line) that, as
accurately as possible, predicts the dependent variable values as a
function of the independent variable.
• Simply put, simple linear regression is used to estimate the
relationship between two quantitative variables.
What can simple linear regression be used
for?
• You can use simple linear regression when you want to know:
• How strong the relationship is between two variables (e.g. the relationship
between rainfall and soil erosion).
• The value of the dependent variable at a certain value of the independent
variable (e.g. the amount of soil erosion at a certain level of rainfall).
Model for simple linear regression
• Consider the equation of a line, given as

  Ŷ = b0 + b1X

• where Ŷ is the predicted value of the dependent variable, X is the independent
variable, b0 is the y-intercept, and b1 is the slope of the line.
• We need to find b0 and b1 to estimate Y using X, such that the error ε
between the predicted value of Y and the original value of Y is minimized.
The Model

• The model has a deterministic and a probabilistic component.
• Deterministic component: Ŷ = b0 + b1X. For example, if most lots sell for
$25,000 and building a house costs about $75 per square foot, the cost of
a house is

  House cost = 25000 + 75(Size)

[Figure: house cost vs. house size, with the line House cost = 25000 + 75(Size)]

• However, house costs vary even among houses of the same size! Since cost
behaves unpredictably, we add a random component:

  House cost = 25000 + 75(Size) + e
• The first-order linear model:

  Y = β0 + β1X + ε

  Y = dependent variable
  X = independent variable
  β0 = Y-intercept
  β1 = slope of the line
  ε = error variable

• β0 and β1 are unknown population parameters and are therefore
estimated from the data.

[Figure: the line Y = β0 + β1X, with intercept β0 and slope β1 = Rise/Run]
Estimating the Coefficients
• The estimates are determined by
• drawing a sample from the population of interest,
• calculating sample statistics.
• producing a straight line that fits the data.

[Figure: scatter plot of sample points. Question: what should be considered a good line?]
General linear model
Working concept of simple linear regression
• The ordinary least squares (OLS) method is usually used to implement
simple linear regression.
• A good line is one that minimizes the sum of squared differences between
the points and the line.
• The accuracy of each predicted value is measured by its squared residual
(the vertical distance between the data point and the fitted line), and the
goal is to make the sum of these squared deviations as small as possible.

[Figure: scatter plot with a fitted line and the vertical residuals]
Let us compare two lines for the four points (1, 2), (2, 4), (3, 1.5) and (4, 3.2).
The first line predicts the values 1, 2, 3 and 4 at these points; the second line
is the horizontal line y = 2.5.

Sum of squared differences (line 1) = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Sum of squared differences (line 2) = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

[Figure: the four points with the two candidate lines]

The smaller the sum of squared differences, the better the fit of the
line to the data.
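These sums can be checked with a few lines of Python. Consistent with the differences shown, the first line is taken to be y = x and the second the horizontal line y = 2.5; computed term by term, the first sum comes to 7.89, and the conclusion is unchanged since 7.89 > 3.99:

```python
# The four sample points from the example, as (x, y) pairs
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, predict):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - predict(x)) ** 2 for x, y in points)

line1 = lambda x: x      # the line y = x (predicts 1, 2, 3, 4)
line2 = lambda x: 2.5    # the horizontal line y = 2.5

print(round(sum_sq_diff(points, line1), 2))  # 7.89
print(round(sum_sq_diff(points, line2), 2))  # 3.99
```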
The Simple Linear Regression Line
Example
The values of x and their corresponding values of y are
shown in the table below:

x: 0  1  2  3  4
y: 2  3  5  4  6

Find the least squares regression line y = ax + b.

Solution:
We use a table to calculate the sums needed for a and b.

x     y     xy     x²
0     2     0      0
1     3     3      1
2     5     10     4
3     4     12     9
4     6     24     16
Σx = 10   Σy = 20   Σxy = 49   Σx² = 30

We now calculate a and b using the least squares regression formulas:
a = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²) = (5·49 − 10·20) / (5·30 − 10²) = 45/50 = 0.9
b = (Σy − a·Σx) / n = (20 − 0.9·10) / 5 = 2.2

We now have the least squares regression line y = 0.9x + 2.2.
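The table and the least squares formulas can be reproduced directly in Python:

```python
# Data from the worked example
xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]
n = len(xs)

sum_x  = sum(xs)                             # 10
sum_y  = sum(ys)                             # 20
sum_xy = sum(x * y for x, y in zip(xs, ys))  # 49
sum_x2 = sum(x * x for x in xs)              # 30

# Least squares formulas for the line y = a*x + b
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / n

print(a, b)  # 0.9 2.2
```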
Example 2
a) The least squares coefficients are:
a = 8.4
b = 11.6
b) For 2012, t = 2012 − 2005 = 7.
The estimated sales in 2012 are y = 8.4 × 7 + 11.6 = 70.4
million dollars.
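The slide's problem statement is not reproduced here, but the computation in part (b) can be checked; it assumes a fitted sales model y = 8.4t + 11.6 (in million dollars), with t counted in years since 2005:

```python
a, b = 8.4, 11.6        # least squares coefficients from part (a)
t = 2012 - 2005         # years since 2005
sales = a * t + b       # estimated sales, in million dollars
print(round(sales, 1))  # 70.4
```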
Error Variable: Required Conditions for
better performance of simple linear regression
• The error ε is a critical part of the regression model.
• Four requirements involving the distribution of ε must
be satisfied:
• The probability distribution of ε is normal.
• The mean of ε is zero: E(ε) = 0.
• The standard deviation of ε is σε, the same for all values of X.
• The errors associated with different values of Y are all
independent.
The Normality of ε

[Figure: at each of X1, X2, X3 the distribution of Y is normal, centered at
μ1 = β0 + β1X1, μ2 = β0 + β1X2, μ3 = β0 + β1X3. The standard deviation remains
constant, but the mean value changes with X.]

From the first three assumptions we have: Y is normally distributed
with mean E(Y) = β0 + β1X and a constant standard deviation σε.
Assessing the Model
• The least squares method will produce a regression line whether or
not there is a linear relationship between X and Y.
• Consequently, it is important to assess how well the linear model fits
the data.
• Several methods are used to assess the model. All are based on the
sum of squares for errors, SSE.
Sum of Squares for Errors
• This is the sum of squared differences between the points and
the regression line.
• It can serve as a measure of how well the line fits the
data. SSE is defined by

  SSE = Σ (Yi − Ŷi)²,  summed over i = 1, …, n

– A shortcut formula:

  SSE = (n − 1) [ sY² − cov(X, Y)² / sX² ]
Standard Error of Estimate
• The mean error is equal to zero.
• If σε is small, the errors tend to be close to zero (close to
the mean error), and the model fits the data well.
• Therefore, we can use σε as a measure of the
suitability of using a linear model.
• An estimator of σε is given by sε, the standard error of estimate:

  sε = √( SSE / (n − 2) )
Assumptions of simple linear regression
• Simple linear regression is a parametric test, meaning that it makes certain
assumptions about the data. These assumptions are:
• Homogeneity of variance (homoscedasticity): the size of the error in our prediction
doesn’t change significantly across the values of the independent variable.
• Independence of observations: the observations in the dataset were collected using
statistically valid sampling methods, and there are no hidden relationships among
observations.
• Normality: The data follows a normal distribution.
• The relationship between the independent and dependent variable is linear: the line of
best fit through the data points is a straight line (rather than a curve or some sort of
grouping factor).
• If your data do not meet the assumptions of homoscedasticity or normality, you
may be able to use a nonparametric test instead, such as the Spearman rank test.
Example: Data that doesn’t meet the
assumptions
• You think there is a linear relationship between meat consumption
and the incidence of cancer in the U.S.
• However, you find that much more data have been collected at high
rates of meat consumption than at low rates,
• with the result that there is much more variation in the estimate of
cancer rates at the low range than at the high range.
• Because the data violate the assumption of homoscedasticity, they
do not work for linear regression.
Implementing simple linear regression in
Python
1. Import the packages and classes.
2. Import the data.
3. Visualize the data.
4. Handle missing values and clean the data.
5. Split the data into training and test sets.
6. Build the regression model and train it.
7. Check the results of model fitting, using plots, to know whether the
model is satisfactory.
8. Make predictions on unseen data.
Importing packages and data
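The code for this slide is not reproduced in the export, so the following is a minimal sketch. The lecture presumably reads a CSV file (e.g. a salary dataset) with pandas; here a tiny stand-in dataset is inlined so the sketch is self-contained, and the column names are assumptions:

```python
import io
import pandas as pd

# In the lecture the data would come from a file, e.g.:
#   data = pd.read_csv("salary_data.csv")   # hypothetical filename
# Inline stand-in data (note one missing salary and one negative salary):
csv = io.StringIO(
    "YearsExperience,Salary\n"
    "1.1,39343\n2.0,43525\n3.2,54445\n"
    "4.0,\n5.1,-1000\n6.8,91738\n"
)
data = pd.read_csv(csv)
print(data.shape)  # (6, 2)
```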
Visualize the data
Handle missing values and clean the data

• Missing data are present.
• Data cleaning is required, as salary cannot be negative.
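The cleaning step can be sketched with pandas (the dataset and column names are stand-ins, not the lecture's file): drop rows with missing values, then drop rows with a negative salary:

```python
import io
import pandas as pd

# Tiny stand-in dataset with one missing and one negative salary
data = pd.read_csv(io.StringIO(
    "YearsExperience,Salary\n"
    "1.1,39343\n2.0,43525\n3.2,54445\n"
    "4.0,\n5.1,-1000\n6.8,91738\n"
))

data = data.dropna()                # drop rows with missing values
data = data[data["Salary"] >= 0]    # salary cannot be negative
print(len(data))  # 4
```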
Visualizing the processed data
Split the data into training and test sets
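A common way to do this split, and presumably what the lecture's code uses, is scikit-learn's train_test_split helper; the data values below are assumptions for the sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: X = years of experience, y = salary in $1000s
X = np.array([1.1, 2.0, 3.2, 4.0, 5.1, 6.8, 7.9, 8.7]).reshape(-1, 1)
y = np.array([39, 44, 54, 57, 66, 92, 101, 109])

# Hold out 25% of the rows as a test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))  # 6 2
```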
Build the regression model and train it.

• Import the linear regression class from the linear model module.
• Make an instance of the linear regression class.
• Then train the model using the training data.
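The three steps above can be sketched with scikit-learn (which the slide's wording suggests); the training data here are synthetic, generated from the assumed line y = 3x + 5:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data following y = 3x + 5 (an assumption for the sketch)
X_train = np.arange(10, dtype=float).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + 5.0

model = LinearRegression()    # an instance of the linear regression class
model.fit(X_train, y_train)   # train the model on the training data

# The fitted slope and intercept recover the line used to generate the data
print(round(model.coef_[0], 2), round(model.intercept_, 2))  # 3.0 5.0
```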
Check the results of model fitting to know
whether the model is satisfactory using plots.
Make predictions using unseen data.
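Continuing the sketch, a trained model predicts y for x-values it never saw during training (same synthetic y = 3x + 5 data as above, an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train on synthetic data generated from y = 3x + 5
model = LinearRegression().fit(
    np.arange(10, dtype=float).reshape(-1, 1),
    3.0 * np.arange(10, dtype=float) + 5.0,
)

# Predict for x-values outside the training range
X_unseen = np.array([[12.0], [20.0]])
y_pred = model.predict(X_unseen)
print(np.round(y_pred, 2))  # [41. 65.]
```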
Another Example
Import dataset and visualize
Data cleaning
Visualize the data
Splitting the data
Build model and train it
Predicting the output for unseen data
