Introduction To Linear Regression Analysis

This document provides an introduction to linear regression analysis using SAS. It discusses identifying the dependent and independent variables, estimating the regression model, and using the model to make predictions. Key steps covered include checking assumptions like linearity, normality, and independence of errors. Diagnostic tools like residuals, influence statistics, and multicollinearity measures are also summarized. The goal is to fit a linear regression model to predict jet engine thrust based on various operational variables.

Uploaded by

Nikhil Gandhi
Copyright
© All Rights Reserved

Introduction to linear regression analysis

with applications in SAS


COMP-STAT GROUP
The aim of this presentation is to explain the important steps
involved in a linear regression setup.
We will proceed through the process in its logical order:
identification, estimation and prediction
Introduction
The study of dependence
Example: does changing the class size affect the success of students?
Regression explains the dependent variable mathematically in terms of
a set of independent variables
Regression Models
The Model
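$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$
Y is the dependent variable
The Xs are the independent variables
$\varepsilon$ is the error term
Observe that the model is linear in the coefficients $\beta_j$.
What does linearity mean? Each coefficient enters only as a multiplier of a term; the Xs themselves may be transformed (for example $X^2$ or $\log X$).
Simple linear regression: a model with only one predictor
Estimation: least squares and/or maximum likelihood estimators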
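As a brief sketch of the least squares estimation mentioned above, written in matrix notation that does not appear in the slides (the symbols $\mathbf{y}$, $\mathbf{X}$ and $\hat{\boldsymbol\beta}$ are introduced here for illustration): let $\mathbf{y}$ be the $n \times 1$ vector of responses and $\mathbf{X}$ the $n \times (p+1)$ matrix of the $p$ regressors with a leading column of ones. Least squares picks the coefficients that minimize the sum of squared errors:

```latex
\hat{\boldsymbol\beta}
  = \arg\min_{\boldsymbol\beta}\,
    (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^{\top}
    (\mathbf{y} - \mathbf{X}\boldsymbol\beta)
  = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}
```

The closed form requires $\mathbf{X}^{\top}\mathbf{X}$ to be invertible. When the errors are normal with constant variance, the maximum likelihood estimator of the coefficients coincides with this one.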
Assumptions
Linearity
Normality
Homoscedasticity
Independence (of explanatory variables, of error terms)
(the four above are the main assumptions)
Number of cases
Data accuracy
Missing data
Outliers
What do they mean?
Assumptions (contd.)
Number of cases
The ratio of cases to independent variables should ideally be 20:1 (minimum 5:1)
Accuracy of data
Check that the data points entered are valid
Missing data
Their treatment is necessary
Outliers
Objectives of analysis
Estimation
Hypothesis testing
Confidence intervals
Prediction of new observations
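A minimal sketch of how each of these objectives might be requested in PROC REG. The dataset name test and the variables y and x1-x4 follow the code used elsewhere in these slides; the output dataset name pred, the output variable names and the particular hypothesis tested are illustrative assumptions.

```sas
proc reg data=test;
   /* Estimation: coefficient estimates and t tests are printed by default.
      CLB adds confidence limits for the coefficients,
      CLM confidence limits for the mean response,
      CLI prediction limits for an individual observation. */
   model y = x1 x2 x3 x4 / clb clm cli alpha=0.05;

   /* Hypothesis testing: joint test that the coefficients
      of x3 and x4 are both zero (illustrative hypothesis) */
   test x3 = 0, x4 = 0;

   /* Prediction: save predicted values with 95% prediction limits */
   output out=pred p=yhat lcl=pred_lower ucl=pred_upper;
run;
quit;
```

To predict new observations, rows with the regressors filled in and y missing can be appended to the input dataset: PROC REG leaves them out of the fit but still computes predicted values for them.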
Let us take a real-life problem and then
proceed further
An example
We have data on jet engine thrust as the response variable, with primary
speed of rotation, secondary speed of rotation, fuel flow rate,
pressure, exhaust temperature and ambient temperature at time of
test as regressor variables
The objective is to fit a linear regression model, check whether the
model satisfies all the underlying assumptions, and check whether it
can predict future observations correctly
Variable selection
Important algorithms:
Forward selection
Backward elimination
Stepwise regression (preferred)
Always start with your domain knowledge. It will guide you through
the selection of variables from a set of candidate variables.
Don't rely too much on variable selection algorithms, since they
depend too heavily on the computer rather than on subject knowledge.
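The algorithms listed above are available through the SELECTION= option of the MODEL statement in PROC REG. The sketch below assumes the dataset test with response y and candidate regressors x1-x4, as in the other code on these slides; the entry and stay significance levels shown are the PROC REG defaults for stepwise selection, written out only to make them visible.

```sas
/* Stepwise selection among the candidate regressors x1-x4 */
proc reg data=test;
   model y = x1 x2 x3 x4
         / selection=stepwise
           slentry=0.15    /* significance level required to enter */
           slstay=0.15;    /* significance level required to stay  */
run;
quit;
```

Replacing stepwise with forward or backward requests the other two algorithms. The selected model should still be checked against domain knowledge.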
Categorical independent variables
How to incorporate qualitative variables in the analysis
The concept of dummy variables
We include k-1 dummies for k categories
One category is set as the base category
They act like ordinary variables in the linear regression setup
Suppose we have three categories of TV: A, B and C. Then we will
include 2 dummies.
Let the dummies be X and Y; they take the following values (A is the base category):
Category  X  Y
A         0  0
B         1  0
C         0  1
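The coding in the table above can be sketched in a DATA step. Everything named here is hypothetical: a dataset tv with a character variable brand taking the values 'A', 'B' and 'C', and a numeric response sales. The dummies are called dum_b and dum_c (the slide's X and Y) to avoid clashing with a response named y.

```sas
/* Hypothetical input: dataset TV with brand ('A', 'B' or 'C') and sales.
   Category A is the base category, so it gets no dummy of its own. */
data tv_dummies;
   set tv;
   dum_b = (brand = 'B');   /* the slide's X: 1 for B, 0 otherwise */
   dum_c = (brand = 'C');   /* the slide's Y: 1 for C, 0 otherwise */
run;

proc reg data=tv_dummies;
   model sales = dum_b dum_c;
run;
quit;
```

With this coding the intercept estimates the mean response for category A, and each dummy coefficient estimates the difference between its category and A.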
Post estimation concerns
We have seen the model outputs and analyzed them
Once the model is estimated, the next step is to check
whether it satisfies all the assumptions stated
If all the assumptions are satisfied we are done; otherwise
corrections and modifications must be made to make the
model ready for use
Regression Diagnostics
Residuals
$e_i = y_i - \hat{y}_i$
The smaller the residuals, the better the model fits.
Types - Standardized residuals (Std.R)
- Studentized residuals (Stdnt.R)
- PRESS residuals
- R-Student residuals
|Std.R| > 3 indicates a potential outlier;
it is better to look at Stdnt.R
PRESS (prediction error sum of squares) residuals
Also called deleted residuals
Estimate the model with that observation deleted, then calculate the
predicted value for that observation. The residual so obtained is the PRESS
residual
A high value indicates a high-influence point
SAS code
proc reg data=test;
model y=x1 x2 x3 x4;
/* each OUTPUT keyword needs a name for the variable it creates */
output out=dataset student=student rstudent=rstudent press=press;
run;
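The PRESS residual described above does not require refitting the model once per observation. A standard identity gives it directly from the ordinary residual. The notation is introduced here and is not from the slides: $h_{ii}$ is the leverage of observation $i$ (the $i$-th diagonal element of the hat matrix), and $\hat{y}_{(i)}$ is the prediction for observation $i$ from the model fitted without it.

```latex
e_{(i)} = y_i - \hat{y}_{(i)} = \frac{e_i}{1 - h_{ii}},
\qquad
\mathrm{PRESS} = \sum_{i=1}^{n} e_{(i)}^{2}
```

For example, a point with leverage $h_{ii} = 0.5$ has a PRESS residual twice as large as its ordinary residual, which is why a large PRESS residual signals a high-influence point.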
Residual plots
Normal probability plots
Plot of normal quantiles against residual
quantiles
A straight line supports the normality
assumption for the residuals
Highly sensitive to non-normality
near the two tails
Can be helpful in outlier detection
Statistical tests
Kolmogorov-Smirnov test
Anderson-Darling test
Shapiro-Wilk test
SAS code
proc univariate data=residuals normal; /*normal option for normality tests*/
var r;
qqplot r / normal(mu=est sigma=est);
/* est estimates the mean and variance from the data itself */
run;
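The PROC UNIVARIATE step above assumes a dataset residuals that holds the residuals in a variable r. One way such a dataset might be produced, assuming the dataset test and the variables y and x1-x4 used in the earlier PROC REG code (the name p for the predicted values is an illustrative choice):

```sas
/* Fit the model and save residuals (r) and predicted values (p) */
proc reg data=test;
   model y = x1 x2 x3 x4;
   output out=residuals r=r p=p;
run;
quit;
```

The output dataset contains all the original variables plus r and p, so it can also be used for a plot of residuals against predicted values.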
Residual Plots (contd.)
Homogeneity of error variance
Plot residuals against predicted values to check the
homoscedasticity assumption
If the assumption holds, the plot of
residuals against predicted values should show a
random pattern
The plot also reveals unusually large residuals,
which of course are potential outliers
If the plot is not random you may need to apply
a transformation to the response or the regressors
White test
Tests the null hypothesis that the variance of the
residuals is homogeneous
Use the SPEC option in the MODEL statement
Remedy
Resort to generalized least squares estimators
SAS Code
proc reg data=dataset;
model y=x1 x2 x3 / spec;
plot r.*p.; /* plot residuals vs. predicted values */
run;
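Weighted least squares is the simplest special case of the generalized least squares remedy named above, and PROC REG supports it through the WEIGHT statement. The sketch below assumes, purely for illustration, that the error standard deviation is proportional to x1 (with x1 > 0), so each observation is weighted by the inverse of x1 squared. The dataset name weighted and the variable w are hypothetical; in practice the weights must come from knowledge of how the variance behaves.

```sas
/* Illustrative assumption: error variance proportional to x1**2, x1 > 0 */
data weighted;
   set dataset;
   w = 1 / (x1**2);   /* weight = inverse of the assumed error variance */
run;

proc reg data=weighted;
   weight w;
   model y = x1 x2 x3;
run;
quit;
```

Observations with a missing or nonpositive weight are excluded from the fit.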
Outlier Treatment
An outlier is an extreme observation
Outliers are data points that are not typical of the rest of the data
Residuals considerably larger in absolute value than the others, say 3 or 4 standard
deviations from the mean, indicate potential y-space outliers
Residual plots and normal probability plots are helpful in identifying outliers
Studentized or R-Student residuals can also be used
An outlier should be removed before estimating the model only if it is a bad value
(for example a recording or measurement error)
There should be strong non-statistical evidence that the outlier is a bad value before
it is discarded
Sometimes outliers are exactly what the analysis is looking for (points of high yield or, say, low cost)
Diagnostics for Leverage and influence
Leverage
o An observation with an extreme value on a
predictor variable is called a point with high
leverage
o Leverage is a measure of how far an observation's
independent variable values deviate from their means
o These leverage points can have an effect on the
estimates of the regression coefficients
o Rule of thumb: leverage > (2p+2)/n marks a high-leverage point
(p regressors, n observations)
Influential Observations
o An observation is said to be influential if removing
it substantially changes the estimates
of the coefficients
o Influence can be thought of as the product of
leverage and outlyingness
o Not all leverage points are influential on
the regression coefficients
o It is desirable to consider both the location of the point
in the x-space and its response value when
measuring influence
o Measures (with rule-of-thumb cutoffs):
Cook's D (> 1), |DFFITS| (> $2\sqrt{p/n}$), |DFBETAS| (> $2/\sqrt{n}$)
SAS Code
Use
COOKD=name1
DFFITS=name2
H=name3 /* H is the leverage */
in the OUTPUT statement of PROC REG
(you can also use the INFLUENCE option in the
MODEL statement for a detailed analysis)
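Put together, the keywords above might be used as follows. This is a sketch under stated assumptions: the dataset test with response y and regressors x1-x4 (so p = 4) comes from the other code on these slides, while the names infl, flagged, cooksd, dffit and lev are illustrative. The cutoffs are the commonly quoted rules of thumb: leverage above (2p+2)/n, Cook's D above 1, and |DFFITS| above 2*sqrt(p/n).

```sas
/* Fit the model and save Cook's D, DFFITS and leverage */
proc reg data=test;
   model y = x1 x2 x3 x4 / influence;
   output out=infl cookd=cooksd dffits=dffit h=lev;
run;
quit;

/* Flag observations that exceed the rule-of-thumb cutoffs.
   n is the number of rows in INFL; it equals the number of cases
   used in the fit only if no row was dropped for missing values. */
data flagged;
   set infl nobs=n;
   p = 4;                                        /* regressors in the model */
   high_leverage = (lev > (2*p + 2)/n);
   high_cooksd   = (cooksd > 1);
   high_dffits   = (abs(dffit) > 2*sqrt(p/n));
run;

proc print data=flagged;
   where high_leverage or high_cooksd or high_dffits;
run;
```

DFBETAS are not available as an OUTPUT keyword; they appear in the table printed by the INFLUENCE option.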
Multicollinearity
Occurs when explanatory variables are not independent (a near-perfect
linear relationship exists among them)
Reasons
Faulty data collection method
Constraints on the model or in the population
Model specification
An over-defined model
Effects
Unstable coefficient estimates
Inflated standard errors of the coefficient estimates
Tools to detect
Examine the correlation matrix of the independent variables
Variance inflation factor (VIF > 10); tolerance is 1/VIF
Condition indices (> 1000 when defined as ratios of eigenvalues; SAS reports their square roots, for which about 30 is the usual warning level)
Variance decomposition proportions
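To make the VIF cutoff concrete: the variance inflation factor of regressor $j$ is defined from $R_j^2$, the coefficient of determination obtained when that regressor is regressed on all the other regressors. The symbol $R_j^2$ is notation introduced here, not taken from the slides.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}},
\qquad
\mathrm{VIF}_j > 10 \iff R_j^{2} > 0.9
```

So the cutoff of 10 flags a regressor that is more than 90% explained by the others, which is the same as a tolerance $1 - R_j^2$ below 0.1.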
SAS code
proc reg data=test;
model y=x1 x2 / vif tol collinoint;
/* COLLINOINT gives a detailed collinearity analysis with the intercept
adjusted out. The COLLIN option gives the same analysis with the
intercept included. */
run;
Remedies
Collecting additional data
Model respecification
Redefining the regressors
Variable elimination
Linearity
Scatter plot or matrix plot
Plots the variables against each other
A linear relationship shows up as a straight-line trend
SAS Code
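proc sgscatter data=test;
/* y is included so the panels show the response against each regressor;
group= colors the points by the variable name */
matrix y x1 x2 x3 x4 / group=name;
run;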
Independence of error terms
We assume that the error terms are independent of each
other
Dependence can arise when observations are collected over time:
the problem of autocorrelation
It can also arise from clustering: students of the same school
tend to be more alike than students of different schools
Durbin-Watson test (statistic ~ 2 when the error terms are uncorrelated)
Use the DW option in the MODEL statement of PROC REG to calculate the Durbin-Watson statistic
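The Durbin-Watson statistic is $d = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$, which is approximately $2(1 - r)$ where $r$ is the lag-one autocorrelation of the residuals; that is why values near 2 indicate uncorrelated errors. A minimal sketch of requesting it, assuming the dataset test with response y and regressors x1-x4 used elsewhere in these slides, and assuming the rows are already sorted in time order:

```sas
/* The DW statistic is only meaningful if the rows are in time order */
proc reg data=test;
   model y = x1 x2 x3 x4 / dw;
run;
quit;
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.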