This document provides an introduction to linear regression analysis using SAS. It discusses identifying the dependent and independent variables, estimating the regression model, and using the model to make predictions. Key steps covered include checking assumptions like linearity, normality, and independence of errors. Diagnostic tools like residuals, influence statistics, and multicollinearity measures are also summarized. The goal is to fit a linear regression model to predict jet engine thrust based on various operational variables.
COMP-STAT GROUP

The aim of this presentation is to explain the important steps involved in a linear regression setup. We will proceed in the logical flow of the process: identification, estimation and prediction.

Introduction
Regression is the study of dependence. Does changing the class size affect the success of students? We explain the dependent variable mathematically in terms of a set of independent variables.

Regression Models

The Model
Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
Y is the dependent variable, the Xs are the independent variables and ε is the error term. Observe that the model is linear in the coefficients. What does linearity mean here?
Simple linear regression: a model with only one predictor.
Estimation: least squares and/or maximum likelihood estimators.

Assumptions
Main assumptions: linearity, normality, homoscedasticity, independence (of explanatory variables and of error terms).
Practical concerns: number of cases, data accuracy, missing data, outliers. What do they mean?

Assumptions (contd.)
Number of cases: the cases-to-independent-variables ratio should ideally be 20:1 (minimum 5:1).
Accuracy of data: the data you have entered should consist of valid data points.
Missing data: their treatment is necessary.
Outliers: discussed under regression diagnostics below.

Objectives of analysis
Estimation, hypothesis testing, confidence intervals and prediction of new observations.
Let us take a real-life problem and then proceed further.

An example
We have data on jet engine thrust as the response variable, and primary speed of rotation, secondary speed of rotation, fuel flow rate, pressure, exhaust temperature and ambient temperature at the time of test as regressor variables. The objective is to fit a linear regression model, check whether the model satisfies all the underlying assumptions, and use it to predict future observations correctly.

Variable selection
Important algorithms: forward selection, backward elimination, stepwise regression (preferred).
Always start with your domain knowledge; it will guide you through the selection of variables from the set of candidate variables. Don't rely too heavily on variable-selection algorithms, since they are purely data driven. A minimal sketch of stepwise selection follows.
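As a sketch of stepwise selection with PROC REG, assuming the dataset is named test and using hypothetical variable names for the jet engine data (thrust, primary_rpm, secondary_rpm, fuel_flow, pressure, exhaust_temp and ambient_temp are placeholders, not names taken from the original data):

proc reg data=test;
   /* stepwise selection: regressors enter and leave at the stated significance levels */
   model thrust = primary_rpm secondary_rpm fuel_flow pressure exhaust_temp ambient_temp
         / selection=stepwise slentry=0.15 slstay=0.15;
run;

Whatever model the algorithm selects should still be reviewed against domain knowledge before it is used.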
Categorical independent variables
How do we incorporate qualitative variables in the analysis? Through the concept of dummy variables. We include k-1 dummies for k categories, with one category set as the base category; the dummies then act like usual variables in the linear regression setup.
Suppose we have three categories of TV, A, B and C. Then we include 2 dummies; if the dummies are X and Y, they take values as follows:

Category   X   Y
A          0   0
B          1   0
C          0   1

Post-estimation concerns
We have seen the model outputs and analyzed them. Once the model is estimated, the next step is to check whether it satisfies all the assumptions stated above. If all the assumptions are satisfied we are good; otherwise corrections and modifications must be made before the model is ready for use.

Regression Diagnostics

Residuals
e_i = y_i - ŷ_i. The smaller the residuals, the better the model.
Types: standardized residuals (Std.R), studentized residuals (Stdnt.R), PRESS residuals and R-student residuals.
|Std.R| > 3 indicates a potential outlier; it is better to also look at Stdnt.R.
PRESS (prediction error sum of squares) residuals, also called deleted residuals: estimate the model with that observation deleted and then calculate the predicted value for that observation. The residual so obtained is the PRESS residual; a high value indicates a high-influence point.
SAS code:
proc reg data=test;
   model y = x1 x2 x3 x4;
   output out=dataset student=stdres rstudent=rstud press=pressres;
run;

Residual plots
Normal probability plots: a plot of normal quantiles against residual quantiles; a straight line confirms the normality assumption for the residuals. The plot is highly sensitive to non-normality near the two tails and can also be helpful in outlier detection.
Statistical tests: Kolmogorov-Smirnov test, Anderson-Darling test, Shapiro-Wilk test.
SAS code:
proc univariate data=residuals normal; /* the normal option requests normality tests */
   var r;
   qqplot r / normal(mu=est sigma=est); /* est estimates the mean and variance from the data itself */
run;

Residual plots (contd.): homogeneity of error variance
This checks the homoscedasticity assumption on the error variance. If the assumption holds, the plot of residuals against predicted values should show a random pattern. The plot can also reveal one or more unusually large residuals, which of course are potential outliers. If the plot is not random, you may need to apply some transformation to the regressors.
White test: tests the null hypothesis that the variance of the residuals is homogeneous; use the SPEC option in the MODEL statement.
Remedy: resort to generalized least squares estimators.
SAS code:
proc reg data=dataset;
   model y = x1 x2 x3 / spec;
   plot r.*p.; /* plot of residuals vs. predicted values */
run;

Outlier treatment
An outlier is an extreme observation. Residuals that are considerably larger in absolute value than the others, say 3 or 4 standard deviations from the mean, indicate potential y-space outliers. Outliers are data points that are not typical of the rest of the data. Residual plots and the normal probability plot are helpful in identifying outliers; studentized or R-student residuals can also be used. An outlier should be removed from the data before estimating the model only if it is a bad value, and there should be strong non-statistical evidence that it is a bad value before it is discarded. Sometimes outliers are actually desired in the analysis (you may want points of high yield or, say, low cost).

Diagnostics for leverage and influence
Leverage:
o An observation with an extreme value on a predictor variable is called a point with high leverage.
o Leverage is a measure of how far an independent variable deviates from its mean.
o Leverage points can have an effect on the estimates of the regression coefficients.
o A common cutoff is leverage > (2p+2)/n.
Influential observations:
o An observation is said to be influential if removing it substantially changes the estimates of the coefficients.
o Influence can be thought of as the product of leverage and outlyingness.
o Not all leverage points are going to be influential on the regression coefficients, so it is desirable to consider both the location of the point in the x-space and the response variable when measuring influence.
o Measures: Cook's D (> 1), DFFITS (|DFFITS| > 2*sqrt(p/n)), DFBETAS (|DFBETAS| > 2/sqrt(n)).
SAS code: use COOKD=name1 DFFITS=name2 H=name3 /* H is for leverage */ in the OUTPUT statement of PROC REG (you can also use the INFLUENCE option in the MODEL statement for a detailed analysis). A sketch that flags observations against the cutoffs above follows.
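A minimal sketch of pulling these statistics into an output dataset and flagging observations against the cutoffs, assuming four regressors and 40 observations (the dataset name, variable names and the values of p and n are placeholders to be replaced with your own):

proc reg data=test;
   model y = x1 x2 x3 x4 / influence;            /* detailed influence analysis in the listing */
   output out=diag cookd=cd dffits=dfits h=lev;  /* h= stores the leverage values */
run;

data flagged;
   set diag;
   p = 4;  n = 40;                               /* number of regressors and observations */
   flag_cook     = (cd > 1);                     /* Cook's D cutoff */
   flag_dffits   = (abs(dfits) > 2*sqrt(p/n));   /* DFFITS cutoff */
   flag_leverage = (lev > (2*p + 2)/n);          /* leverage cutoff (2p+2)/n */
run;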
Multicollinearity
Multicollinearity arises when the explanatory variables are not independent (there is a near-perfect linear relationship among them).
Reasons: faulty data collection method, constraints on the model or in the population, model specification, an over-defined model.
Effects: unstable coefficient estimates and inflated standard errors of the coefficient estimates.
Tools to detect it:
- Examine the correlation matrix of the independent variables.
- Variance inflation factor, VIF (> 10); tolerance is 1/VIF.
- Condition number and condition indices (> 1000 indicates severe multicollinearity).
- Variance decomposition proportions.
SAS code:
proc reg data=test;
   model y = x1 x2 / vif tol collinoint;
   /* COLLINOINT gives a detailed collinearity analysis with the intercept adjusted out;
      the COLLIN option gives the same analysis including the intercept */
run;
Remedies: collecting additional data, model respecification, redefining the regressors, variable elimination.

Linearity
A scatter plot or matrix plot plots the variables against each other; a linear relationship is confirmed by observing a straight-line trend.
SAS code:
proc sgscatter data=test;
   matrix x1 x2 x3 x4 / group=name; /* name is an optional grouping variable */
run;

Independence of error terms
We assume that the error terms are independent of each other. Dependence can arise when observations are collected over time (the problem of autocorrelation) or when observations are clustered; for example, students of the same school tend to be more alike than students of other schools.
Durbin-Watson test: the statistic is approximately 2 when the error terms are uncorrelated. Use the DW option in the MODEL statement of PROC REG to calculate the Durbin-Watson statistic, as in the sketch below.
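A minimal sketch of requesting the Durbin-Watson statistic (the dataset and variable names are placeholders):

proc reg data=test;
   model y = x1 x2 x3 / dw; /* prints the Durbin-Watson statistic for first-order autocorrelation */
run;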